Author Archives: Janet Gertz

Recreating a lost Yiddish database: The LCAAJ Project

The Language and Culture Archive of Ashkenazic Jewry (LCAAJ) is an extraordinary resource for research in Yiddish studies.  It consists of field interviews with Yiddish-speaking informants, conducted between 1959 and 1972 by Columbia University’s Department of Linguistics, which donated the Archive to Columbia University Libraries in 1995.

The Archive presents an interesting preservation challenge, since the original researchers created not only the audiotapes and large quantities of paper documents, but also computer data that has not survived the test of time.

The interviews were collected from people who originally lived in 603 different locations in Central and Eastern Europe, to create a sample that reflected the distribution of the Yiddish-speaking population on the eve of World War II.  The informants answered questions on a wide variety of topics concerning Yiddish language and culture during interviews lasting anywhere from 2.5 to 16 hours.  In all, the project produced 5,755 hours of audiotaped sessions with the native speakers and ca. 100,000 pages of questionnaires.  The documents are covered with hand-written linguistic field notes that were taken during the interviews in a mix of English, Yiddish, and a linguistic notation system developed for the project that uses only characters that the computers of the day could handle.  No verbatim transcriptions of the interviews were ever made.

Examples of questionnaire pages, with linguistic field notes in English, Yiddish and a special linguistic notation system developed for the project.

In the late 1960s and early 1970s, about half of the data collected by the project was coded onto punch cards and read onto computer tapes in order to create lists that would facilitate creation of maps of linguistic features.  These were later published in the multi-volume Language and Culture Atlas of Ashkenazic Jewry.

Current scholars want to manipulate the data for further study, but the original punch cards and computer tapes vanished decades ago.  No one thought of preserving them.  Had they survived, they would have presented an interesting challenge for digital archaeologists.  Instead, all we have left are printouts of the data on the green-and-white striped pin-fed paper that evokes memories for people of a certain age.

Example of a printout from the original computer database on green-and-white striped pin-fed paper.

CUL obtained a grant from the National Endowment for the Humanities in 2015 to start recreating the database.  We scanned each printout page to create TIFF images, then put them through OCR (optical character recognition) and mark-up, using ABBYY FineReader software, to generate new machine-readable tables.  The pages were first zoned and analyzed to identify the tables of data on each page, and the software was then given a few hours of “training” on the text in each series to enhance accuracy.  After full machine reading and some cleanup, all of the pages were exported as MS Excel spreadsheets and put through additional cleanup processes.  Scholars can now search and manipulate the data once again.
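For readers curious what an "additional cleanup" pass on OCR output can look like, here is a minimal Python sketch.  It is not the project's actual procedure, which is not described in detail here.  It assumes, purely for illustration, that some table columns should contain only digits and that the OCR occasionally reads a digit as a look-alike letter (O for 0, l or I for 1).  The function names, the look-alike mapping, and the sample row are all hypothetical.

```python
import re

# Hypothetical mapping of letters that OCR engines often confuse with digits.
DIGIT_LOOKALIKES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})


def clean_cell(text):
    """Trim a cell and collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def clean_numeric_cell(text):
    """Clean a cell that should hold digits only.

    Returns (value, needs_review): needs_review is True when the cell
    still contains non-digit characters (or is empty) after the swap.
    """
    value = clean_cell(text).replace(" ", "").translate(DIGIT_LOOKALIKES)
    return value, not value.isdigit()


def clean_row(row, numeric_columns):
    """Clean one table row; return the cleaned row and the column indexes to review."""
    cleaned, review = [], []
    for index, cell in enumerate(row):
        if index in numeric_columns:
            value, needs_review = clean_numeric_cell(cell)
            if needs_review:
                review.append(index)
        else:
            value = clean_cell(cell)
        cleaned.append(value)
    return cleaned, review


if __name__ == "__main__":
    # Invented sample row, not real LCAAJ data: two numeric codes and a text cell.
    row = ["5O2l ", " 0I4", "  example   response "]
    print(clean_row(row, numeric_columns={0, 1}))
```

A pass like this only handles the mechanical cases.  Cells that still fail the digit check are flagged rather than guessed at, so that a person can compare them against the scanned page image.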

The handwritten notes that served as the input to the computer database contain additional information that was never coded in.  We have also digitized those as page images.  Our new site allows scholars to move between the tables and the questionnaire pages to make sure they have all the information relevant to their research. (See the user’s guide http://guides.library.columbia.edu/lcaaj.)

Luckily, the original audiotapes were preserved.

CUL digitized the tapes some years ago in a multi-year effort with generous support from NEH, private foundations, the New York State Conservation/Preservation Program, and EYDES (Evidence of Yiddish Documented in European Societies, a project of the German Förderverein für Jiddische Sprache und Kultur).  The audio files are available online on the EYDES site (www.eydes.de).

One of our next aims is to raise money to link the audio files and the digital data.  Columbia’s LCAAJ site will continue to evolve and add more information and more functionality to keep this re-created database relevant for new researchers.

Links cited in this post:

  • Language and Culture Atlas of Ashkenazic Jewry https://clio.columbia.edu/catalog/1231536
  • LCAAJ in Columbia Digital Library Collections https://dlc.library.columbia.edu/lcaaj
  • LCAAJ User’s Guide http://guides.library.columbia.edu/lcaaj
  • Evidence of Yiddish Documented in European Societies www.eydes.de

Is Your Google Book Incomplete? We May Be Able To Help.

As many people know, Google has digitized hundreds of thousands of books from libraries around the world, including Columbia University Libraries, and they’ve created Google Books, a wonderful resource for readers and researchers.  Subsequently Columbia and many other libraries have contributed their Google digital versions to HathiTrust to assure that the e-books are preserved into the future.

It’s also well known that some Google books have problems – for instance, because Google didn’t open out folded pages when the books were digitized, those pages are not visible to readers.  Recently HathiTrust and its member libraries have developed a process to fix some of those problems.

Let’s look at The Royal Land Com’y of Virginia, published in 1877 and digitized by Google in 2009 from a copy owned by Columbia University Libraries.  Until a few weeks ago, anyone trying to read it on Google or HathiTrust would have found unreadable folded plates, including this one that follows page 72.

Someone reading the book on HathiTrust discovered the folded plates and reported them by using the Feedback button at the bottom of the page display.

HathiTrust staff then notified Columbia, because it is our copy that Google digitized.  We received messages of the form “the plate following page 72 of this title is folded and cannot be read”.  That alerted us to the need for new digital images of the foldouts.

When we looked at the volume, we discovered that the foldouts were torn.  Conservation treated the damage, and then our Imaging Lab digitized the unfolded plates.

We sent the images to Google, and they inserted the new images in place of the faulty ones.  They then loaded the new version into HathiTrust to replace the incomplete copy there.  Today the corrected e-book is available to everyone through Google and HathiTrust, and preserved for anyone to use in the future.

Now that everyone has the ability to search and view millions of books online in a matter of seconds, libraries are taking time and effort to collaborate with HathiTrust and Google to solve problems.  Behind the digital images that appear to be an easy click away, teams of library professionals are dedicated to digitizing physical books and improving the e-book experience.

Hearing Voices from a Broken Disc

Hearing the voices of people who lived in another century brings them close to us, but early recording technology makes hearing them a challenge. In the first half of the 20th century a common recording method was to use discs with a lacquer surface. Sound waves caused a stylus to vibrate and cut grooves into the lacquer while the disc turned. The recording was played back by running another stylus through the grooves and amplifying the sound. The inner core of the discs was metal, cardboard, or even glass. Playing these old recordings is a problem – the lacquer deteriorates over time, developing cracks and sometimes detaching from the core, and of course glass is easily broken.

Until a few years ago, a broken record was a lost cause – while conservators can repair many types of damage, they cannot put broken glass recordings back together again. But scientists from Lawrence Berkeley National Laboratory have developed IRENE (Image, Reconstruct, Erase Noise, Etc.), a digital imaging system that can make a picture of the grooves on a disc and then transform the images into digital sound files. Carl Haber, the lead scientist and a Columbia graduate, won a MacArthur Fellowship in 2013 for his work. (For more on Haber and how he developed IRENE, see this article in Columbia College Today.)

Glass disc, WNEW Join the News Reel, 10 February 1944, American Bureau for Medical Aid to China 1937-2005, Rare Book & Manuscript Library, Columbia University

Like many other libraries and archives, Columbia has its share of glass and other fragile recordings. When IRENE became available from the Northeast Document Conservation Center, we sent off this disc from 1944 to test the new service. The disc had shattered and small fragments along the edges of the breaks had been completely lost. Using IRENE, each surviving fragment was separately imaged, and then the entire recording was digitally reassembled. Pops and clicks can be heard where bits of the lacquer were missing, but this recording of WNEW’s Join the News Reel from 10 February 1944, broken decades ago, now speaks once more.

Listen here:

Learn more about IRENE at NEDCC.

The IRENE system at the Northeast Document Conservation Center, mounted on a vibration-damping pneumatic air table. Photo courtesy of Northeast Document Conservation Center.