Monthly Archives: February 2018

Recreating a lost Yiddish database: The LCAAJ Project

The Language and Culture Archive of Ashkenazic Jewry (LCAAJ) is an extraordinary resource for research in Yiddish studies. It consists of field interviews with Yiddish-speaking informants, conducted and recorded between 1959 and 1972 by Columbia University’s Department of Linguistics, which donated the Archive to Columbia University Libraries in 1995.

The Archive presents an interesting preservation challenge, since the original researchers created not only the audiotapes and large quantities of paper documents, but also computer data that has not survived the test of time.

The interviews were collected from people who originally lived in 603 different locations in Central and Eastern Europe, to create a sample that reflected the distribution of the Yiddish-speaking population on the eve of World War II.  The informants answered questions on a wide variety of topics concerning Yiddish language and culture during interviews lasting anywhere from 2.5 to 16 hours.  In all, the project produced 5,755 hours of audiotaped sessions with the native speakers and ca. 100,000 pages of questionnaires.  The documents are covered with hand-written linguistic field notes that were taken during the interviews in a mix of English, Yiddish, and a linguistic notation system developed for the project that uses only characters that the computers of the day could handle.  No verbatim transcriptions of the interviews were ever made.

Examples of questionnaire pages, with linguistic field notes in English, Yiddish and a special linguistic notation system developed for the project.

In the late 1960s and early 1970s, about half of the data collected by the project was coded onto punch cards and read onto computer tapes in order to create lists that would facilitate creation of maps of linguistic features.  These were later published in the multi-volume Language and Culture Atlas of Ashkenazic Jewry.

Current scholars want to manipulate the data for further study, but the original punch cards and computer tapes vanished decades ago. No one thought of preserving them. If they had, recovering the data would have presented an interesting challenge for digital archaeologists. Instead, all we have left are printouts of the data on the green-and-white striped pin-fed paper that evokes memories for people of a certain age.

Example of a printout from the original computer database on green-and-white striped pin-fed paper.

CUL obtained a grant from the National Endowment for the Humanities in 2015 to start recreating the database. We scanned each printout page to create TIF images, then put them through OCR (optical character recognition) and mark-up to generate new machine-readable tables, using Abbyy FineReader OCR software. The pages were first zoned and analyzed to identify the tables of data on each page, and the software was then given a few hours of “training” on the text in each series to enhance accuracy. After full machine reading and some cleanup, all of the pages were exported as MS Excel spreadsheets and put through additional cleanup processes. Scholars can now search and manipulate the data once again.
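To give a flavor of what post-OCR cleanup of tabular data can involve, here is a minimal Python sketch. It is an illustration only, not the project's actual process: the function names, the table of look-alike characters, and the sample row are all hypothetical. It assumes that some columns are known to hold only digits, so letters that OCR software often confuses with digits can be mapped back safely in those columns.

```python
import re

# Hypothetical table of letter/digit confusions that OCR often produces.
DIGIT_LOOKALIKES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)


def clean_cell(text):
    """Collapse runs of whitespace and trim the ends of one OCR'd cell."""
    return re.sub(r"\s+", " ", text).strip()


def repair_numeric(text):
    """Map look-alike letters to digits in a cell expected to be numeric.

    If the repaired value is still not all digits, return the cleaned
    cell unchanged so that a person can review it later.
    """
    cleaned = clean_cell(text)
    candidate = cleaned.replace(" ", "").translate(DIGIT_LOOKALIKES)
    return candidate if candidate.isdigit() else cleaned


def clean_row(row, numeric_columns):
    """Clean every cell in a row, repairing only the numeric columns."""
    return [
        repair_numeric(cell) if index in numeric_columns else clean_cell(cell)
        for index, cell in enumerate(row)
    ]


# Invented sample row: columns 0 and 2 are assumed to be numeric.
sample = ["  5l0O3 ", "gefilte   fish", "4 2"]
print(clean_row(sample, numeric_columns={0, 2}))
```

Restricting the character repair to columns known to be numeric keeps it from corrupting text columns, where a letter such as "l" or "O" is usually correct as read.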

The handwritten notes that served as the input to the computer database contain additional information that was never coded in.  We have also digitized those as page images.  Our new site allows scholars to move between the tables and the questionnaire pages to make sure they have all the information relevant to their research. (See the user’s guide.) For more information on this project, check out this interview with Michelle Chesner, Norman E. Alexander Librarian for Jewish Studies at Columbia University.

Luckily, the original audiotapes were preserved.

CUL digitized the tapes some years ago in a multi-year effort with generous support from NEH, private foundations, the New York State Conservation/Preservation Program, and EYDES (Evidence of Yiddish Documented in European Societies, a project of the German Förderverein für Jiddische Sprache und Kultur). The audio files are available online on the EYDES site.

One of our next aims is to raise money to link the audio files and the digital data.  Columbia’s LCAAJ site will continue to evolve and add more information and more functionality to keep this re-created database relevant for new researchers.

Links cited in this post:

  • Language and Culture Atlas of Ashkenazic Jewry
  • LCAAJ in Columbia Digital Library Collections
  • LCAAJ User’s Guide
  • In Geveb Journal of Yiddish Studies
  • Evidence of Yiddish Documented in European Societies