The Library of Congress
The American Library of Congress, online, is a wonderful place to be. The treasures it contains and freely shares are an example of how these things should be done, and done right. Among those treasures is a section called Chronicling America, which is a large collection of old scanned newspapers. From their own site:
Chronicling America is a Website providing access to information about historic newspapers and select digitized newspaper pages, and is produced by the National Digital Newspaper Program (NDNP). NDNP, a partnership between the National Endowment for the Humanities (NEH) and the Library of Congress (LC), is a long-term effort to develop an Internet-based, searchable database of U.S. newspapers with descriptive information and select digitization of historic pages. Supported by NEH, this rich digital resource will be developed and permanently maintained at the Library of Congress. An NEH award program will fund the contribution of content from, eventually, all U.S. states and territories.
This is akin to saying the Sistine Chapel is a bit of Dulux (tm) slapped on the ceiling.
The long way there is to go here: http://www.loc.gov/library/libarch-digital.html
[caption id="" align="aligncenter" width="800" caption="http://www.loc.gov/library/libarch-digital.html"][/caption]
Then click on Historic Newspapers, which gives us the following.
[caption id="" align="aligncenter" width="800" caption="http://chroniclingamerica.loc.gov/"][/caption]
From this page you can carry out searches via the little box at the top, restricting your search by date and geographical location. The search is made possible by the magic of Optical Character Recognition (OCR), which is not brilliant but seems to be good enough.
The advanced search page looks like this:
[caption id="" align="aligncenter" width="800" caption="http://chroniclingamerica.loc.gov/#tab_advanced_search"][/caption]
This gives you finer control over how your search is carried out. The search results tend to be returned in order of potential usefulness, so the pages higher up are more likely to be what you were looking for. As you see above, I am looking for the words Olympic and Games when they occur within 5 words of one another. The Olympics are coming to my country next year, so I might be able to find an interesting snippet of Olympic history or a nice picture of something to put on a Zazzle mug.
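The same proximity search can probably be reached directly by address, which is handy for bookmarking. The sketch below only builds the address and does not fetch anything. The parameter names (proxtext, proxdistance, dateFilterType, date1, date2) are my reading of what the advanced search form sends, so treat them as assumptions and check them against the address bar after running a search by hand. The function name search_url is made up for this example.

```shell
# Build a Chronicling America proximity search address.
# ASSUMPTION: the parameter names mirror what the advanced search form submits.
search_url() {
  # $1 = words to find, $2 = maximum distance in words,
  # $3 = first year, $4 = last year
  terms=$(printf '%s' "$1" | tr ' ' '+')   # spaces become + in a query string
  printf '%s?proxtext=%s&proxdistance=%s&dateFilterType=yearRange&date1=%s&date2=%s\n' \
    "http://chroniclingamerica.loc.gov/search/pages/results/" \
    "$terms" "$2" "$3" "$4"
}

search_url "Olympic Games" 5 1912 1912
```

Pasting the printed address into a browser should bring up roughly the same results page as filling in the form.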
Papers can also be searched by browsing through individual titles, like browsing through the giant books at your local library, except without a stern schoolmarmish librarian staring at you in case you sneeze in the pages.
[caption id="" align="aligncenter" width="800" caption="http://chroniclingamerica.loc.gov/newspapers/"][/caption]
The list of papers covered can be downloaded as a simple plain text file.
[caption id="" align="aligncenter" width="800" caption="http://chroniclingamerica.loc.gov/newspapers.txt"][/caption]
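Once the list is saved locally it can be searched without a browser. This is a minimal sketch that makes no assumption about the column layout of the file; it simply prints every line containing a keyword, ignoring case. The function name titles_matching and the local file name are made up for this example.

```shell
# Print every line of the downloaded list that mentions a keyword.
# To fetch the list first, something like this should do:
#   curl -o newspapers.txt http://chroniclingamerica.loc.gov/newspapers.txt
titles_matching() {
  # $1 = keyword, $2 = path to the saved list
  # -i ignores case, -F treats the keyword as plain text, not a pattern
  grep -i -F -- "$1" "$2"
}

# Usage: titles_matching herald newspapers.txt
```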
So let's do a quick search and check the kind of quality we get. As you see above, I ran a basic search for the words Olympic and Games appearing within 5 words of one another. This gave me the following page.
[caption id="" align="aligncenter" width="247" caption="Search - Olympic Games"][/caption]
What you see is a list of pages on which the words appear, according to the specified rules. The red highlights mark the words searched for; the red only shows up on the web pages, not in any of the downloads. A cursory glance seems to indicate that it has got it mostly right. I like the look of the image at the top left.
One click later.
[caption id="" align="aligncenter" width="499" caption="The Terrible Yank! - http://chroniclingamerica.loc.gov/lccn/sn83045487/1912-07-08/ed-1/seq-11/"][/caption]
There are many ways of interacting with the above image. The mouse wheel can be used to zoom in and out. The button to the right of the home symbol makes the web page full screen. The home symbol resets the zoom level. Various links let you browse other issues of the paper and connecting pages. The clip image link lets you copy whatever part of the page is shown on the screen and download it as a jpeg file. The PDF link allows the page to be downloaded as a PDF (funny that; example: http://www.scribd.com/doc/63940385/The-Terrible-Yankee).
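The download links seem to follow a predictable pattern based on the address of the page itself, so they can be worked out without clicking. The sketch below derives the PDF, JPEG 2000 and plain text addresses from a page address. The pattern is what I observe on the site rather than anything I have seen promised, so treat it as an assumption; the function name download_urls is made up for this example.

```shell
# Derive the download addresses for a single newspaper page.
# ASSUMPTION: the site keeps serving .pdf, .jp2 and /ocr.txt next to the page.
download_urls() {
  page=${1%/}                    # drop the trailing slash, if any
  printf '%s.pdf\n' "$page"      # the page as a PDF
  printf '%s.jp2\n' "$page"      # the page as a JPEG 2000 image
  printf '%s/ocr.txt\n' "$page"  # the extracted OCR text
}

download_urls "http://chroniclingamerica.loc.gov/lccn/sn83045487/1912-07-08/ed-1/seq-11/"
```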
The jp2 link allows the page to be downloaded as a JPEG 2000 image file, which should probably be converted quickly to PNG format, as for some reason the jp2 format often crashed my file browser. On Linux I use the following:
mogrify -verbose -format png *.jp2 && rm *.jp2
Entirely up to you though. Many public domain image sites on the web give you very poor quality images, then encourage you to buy their large DVDs full of the higher resolution images, which is fair enough. The Library of Congress, though, lets the original file be downloaded.
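If you have gone to the trouble of downloading the originals, a slightly more cautious conversion may be worth having. This sketch converts one file at a time with ImageMagick's mogrify and deletes a jp2 file only after its own conversion has succeeded, so one bad file does not decide the fate of the rest. The function name convert_jp2_dir is made up for this example.

```shell
# Convert every .jp2 file in a directory to PNG, one file at a time.
# Each original is removed only if mogrify reports success for that file.
convert_jp2_dir() {
  dir=${1:-.}                      # default to the current directory
  for f in "$dir"/*.jp2; do
    [ -e "$f" ] || continue        # the pattern matched nothing
    if mogrify -format png "$f"; then
      rm -- "$f"
    else
      echo "conversion failed, keeping $f" >&2
    fi
  done
}

# Usage: convert_jp2_dir ~/Downloads/newspapers
```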
Here is the image shrunk to fit and at actual size. Certainly high enough detail to be used as part of a public domain artwork.
[caption id="" align="aligncenter" width="335" caption="The Terrible Yank - Zoomed In to 100%"][/caption]
[caption id="" align="aligncenter" width="335" caption="The Terrible Yank - Zoomed to 24%"][/caption]
A Thousand Words Can Tell A Picture
What about the text? http://chroniclingamerica.loc.gov/lccn/sn83045487/1912-07-03/ed-1/seq-11/
[caption id="" align="aligncenter" width="641" caption="Cartoon - Not Olympic Related - http://chroniclingamerica.loc.gov/lccn/sn83045487/1912-07-03/ed-1/seq-11/"][/caption]
The text is perfectly legible, but is it good enough to be recognised well by OCR? At the top of the page there is a text link. This should take us to a copy of the page with the text extracted.
[caption id="" align="aligncenter" width="611" caption="Library of Congress OCR"][/caption]
Not brilliant but could be a lot worse. Now we try Google's Tesseract.
[caption id="" align="aligncenter" width="568" caption="Google OCR"][/caption]
Looks worse than the LoC OCR attempt. Your mileage may vary with different OCR packages.
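Judging two OCR attempts by eye is a bit hand-wavy, so here is a rough way to put a number on how much they agree. The sketch counts the distinct words that two text files have in common, ignoring case and punctuation. It is a crude measure that I am using for illustration, not a proper OCR accuracy metric, and the function and file names are made up for this example.

```shell
# Print the distinct lower-case words in a file, one per line, sorted.
words() {
  tr -cs '[:alpha:]' '\n' < "$1" | tr '[:upper:]' '[:lower:]' \
    | sed '/^$/d' | LC_ALL=C sort -u
}

# Count the distinct words that two OCR text files share.
shared_word_count() {
  a=$(mktemp); b=$(mktemp)
  words "$1" > "$a"
  words "$2" > "$b"
  # comm -12 keeps only the lines present in both sorted lists
  LC_ALL=C comm -12 "$a" "$b" | wc -l | tr -d ' '
  rm -f "$a" "$b"
}

# Usage, assuming the LoC text was saved as loc.txt and Tesseract was run
# with "tesseract page.png tess", which writes tess.txt:
#   shared_word_count loc.txt tess.txt
```

A higher count only means the two packages agree with each other more often; they could of course both be wrong about the same word.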
The clipping from the above page can be seen here:
[caption id="" align="aligncenter" width="557" caption="Cartoon - Clipped From Web Page"][/caption]
I don't really get it. As a web clipping, the quality is not high enough for print resolution.
I am sure there is much more for me to discover from Chronicling America. Just a shame that I am unable to find a UK equivalent.
Hope this might be helpful to someone.