Category Archives: Uncategorized

Python Open Labs S.2018: How it all went down

Lab Structure and Technicalities:

This semester I returned to running the Python Open Labs with another student intern, and upon starting we discussed our ideas for structuring the labs over the coming months. We decided to stay consistent with the format of the labs: we started the semester with the starter kit (Python fundamentals) and continued to build on those fundamentals every week (see weekly blogs here).

The first change we implemented was switching to Jupyter notebooks instead of running the labs in the console itself. For me, this was quite a challenge! I had never used Jupyter notebooks; it seemed like a strange and abstract way to code, one clearly built with user experience in mind but unfamiliar to me on all fronts. I spent a few weeks playing around with its functionality – the heading and commenting features, as well as common errors (e.g. forgetting to run every cell to ensure the code works). Once I got the hang of it, though, everything changed! I have tasted and enjoyed the Jupyter notebook kool-aid and there is no going back.

One of the best features of Jupyter notebooks is the simplicity of their layout, which makes the code that much easier to parse and build upon. Organizing the code – or in our case, an entire lesson – in a Jupyter notebook meant that we could share the lessons with the class at the end. Prior to each lesson, we would come up with the content and create the notebook in full. Then we would make a second copy without the completed cell blocks so that we could use the prompts and live-code. At the end of the lesson, we shared the full version with the class so that students could review the examples and problems with the complete code (i.e. all answers) readily available.

That being said, we also changed how we shared the lessons with the students. This semester, we maintained a Google Drive folder with all the Jupyter notebook lessons and PDFs and shared it with the students who came. A welcome change considering the amount of paper it took last semester to print and share each lesson! We also received great feedback on the organization and sharing of the Jupyter lessons, so that is definitely something we hope to continue next semester.


This semester, the range of programs represented by the students who attended the labs was incredibly diverse. Students from the School of Professional Studies and the School of International and Public Affairs were the most consistent attendees, but we also welcomed students from Journalism, Economics, and Latin American Studies. Although it is a challenge to encourage students to attend on a regular basis, we did see some faces week after week, and sharing the lessons in an accessible Drive folder ensured that those who could not make it in person but were interested in continuing to expand their coding horizons could keep up.

Most enjoyable Lab:

The lab we held on Python classes in early April was my favorite – partly because I taught the entire session on my own, but mostly because I structured the lesson around fewer, more intensive practice problems. Instead of going through quick, short sample problems, I tried to create problem sets that incorporated functions as well, to keep things interesting and to offer students a challenge. The class was well received, and you can find the lesson on the DSSC blog if you want to check it out!
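
A practice problem in that style – a class whose method and a free-standing function have to work together – might look like the sketch below. This is an illustrative example only, not the actual lesson, and the book page counts are made up:

```python
class Book:
    """A small class of the kind used for practice problems."""

    def __init__(self, title, pages):
        self.title = title
        self.pages = pages

    def is_long(self):
        """A method: does this book exceed 300 pages?"""
        return self.pages > 300


def longest_book(books):
    """A free-standing function that works with the class:
    return the Book with the most pages."""
    return max(books, key=lambda b: b.pages)


# Page counts here are invented for the exercise.
shelf = [Book("O Pioneers!", 308), Book("Alexander's Bridge", 176)]
print(longest_book(shelf).title)  # → O Pioneers!
```

Exercises like this force students to combine two separately taught ideas (object attributes and higher-order functions) in one solution.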

Ideas for future labs:

To conclude this post, I will underline two suggestions for future lab lessons:

  1. Plan out lessons before the labs

It would be great, in my opinion, to post a description of each lab before it happens, outlining its structure and the concepts covered. This was the route we took for the R Open Labs towards the end of the semester and it worked really well – I am excited to try it out for Python too!

  2. Continue to market to a diverse group of students

Before commencing in the fall, I would like to spend some time strategizing how to market to different departments. The open labs are such a great way to learn a coding language: they are free (!!!), and more importantly, the communal vibe is optimistic and welcoming – a great space to learn.

I have learned so much this year in preparing and leading labs and now that they have wound down for the summer, I feel motivated to continue to market the space and engage students across all departments.

Computationally Detecting Similar Books in Project Gutenberg

As one of the first digital libraries, Project Gutenberg has lived through a few generations of computers, digitization techniques, and textual infrastructures. It’s not surprising, then, that the corpus is fairly messy. Early transcriptions of some electronic texts, hand-keyed using only uppercase letters, were succeeded by better transcriptions, but without replacing the early versions. As such, working with the corpus as a whole means working with a soup of duplicates. To make matters worse, some early versions of text were broken into many parts, presumably as a means to mitigate historical bandwidth limitations. Complete versions were then later created, but without removing the original parts. I needed a way to deduplicate Project Gutenberg books.

To do this, I used a suggestion from Ben Schmidt and vectorized each text using the new Python-based natural language processing suite spaCy. spaCy creates document vectors by averaging word vectors from its model, which contains about 1.1M 300-dimensional vectors. These document vectors can then be compared using cosine similarity to determine the semantic similarity of the documents. It turns out that this is a fairly good way to identify duplicates, but it has some interesting side effects.
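
The idea can be sketched without spaCy itself: average the word vectors into a document vector, then compare documents with cosine similarity. (In spaCy this is just `nlp(text).vector` and `doc1.similarity(doc2)`; the tiny 3-dimensional vectors below are made up for illustration, where the real ones are 300-dimensional.)

```python
import math

def average_vector(word_vectors):
    """Average a list of word vectors into one document vector."""
    dims = len(word_vectors[0])
    return [sum(v[d] for v in word_vectors) / len(word_vectors)
            for d in range(dims)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "word vectors"; doc1 and doc2 stand in for near-duplicate texts.
doc1 = average_vector([[1.0, 0.0, 0.2], [0.8, 0.1, 0.3]])
doc2 = average_vector([[0.9, 0.1, 0.25], [0.85, 0.05, 0.3]])
doc3 = average_vector([[0.0, 1.0, 0.9], [0.1, 0.9, 1.0]])

print(cosine_similarity(doc1, doc2))  # near-duplicates score close to 1.0
print(cosine_similarity(doc1, doc3))  # an unrelated document scores far lower
```

A high cutoff on this score (99.99% in the results below) is what surfaces the duplicates.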

Here, for instance, are high-ranking similarities (99.99% vector similarity or above) for the first 100 works in Project Gutenberg. The numbers are the Project Gutenberg book IDs (see, for instance, this index of the first 768 works).

1.  The King James Version of (10) -similar to- The Bible, King James Ver (30)
2.  Alice's Adventures in Won (11) -similar to- Through the Looking-Glass (12)
3.  Through the Looking-Glass (12) -similar to- Alice's Adventures in Won (11)
4.  The 1990 CIA World Factbo (14) -similar to- The 1992 CIA World Factbo (48)
5.  Paradise Lost             (20) -similar to- Paradise Lost             (26)
6.  O Pioneers!               (24) -similar to- The Song of the Lark      (44)
7.  O Pioneers!               (24) -similar to- Alexander's Bridge        (91)
8.  Paradise Lost             (26) -similar to- Paradise Lost             (20)
9.  The 1990 United States Ce (29) -similar to- The 1990 United States Ce (37)
10. The Bible, King James Ver (30) -similar to- The King James Version of (10)
11. The 1990 United States Ce (37) -similar to- The 1990 United States Ce (29)
12. The Strange Case of Dr. J (42) -similar to- The Strange Case of Dr. J (43)
13. The Strange Case of Dr. J (43) -similar to- The Strange Case of Dr. J (42)
14. The Song of the Lark      (44) -similar to- O Pioneers!               (24)
15. The Song of the Lark      (44) -similar to- Alexander's Bridge        (91)
16. Anne of Green Gables      (45) -similar to- Anne of Avonlea           (47)
17. Anne of Green Gables      (45) -similar to- Anne of the Island        (50)
18. Anne of Avonlea           (47) -similar to- Anne of Green Gables      (45)
19. Anne of Avonlea           (47) -similar to- Anne of the Island        (50)
20. The 1992 CIA World Factbo (48) -similar to- The 1990 CIA World Factbo (14)
21. The 1992 CIA World Factbo (48) -similar to- The 1993 CIA World Factbo (84)
22. Anne of the Island        (50) -similar to- Anne of Green Gables      (45)
23. Anne of the Island        (50) -similar to- Anne of Avonlea           (47)
24. A Princess of Mars        (60) -similar to- The Gods of Mars          (62)
25. A Princess of Mars        (60) -similar to- Warlord of Mars           (65)
26. The Gods of Mars          (62) -similar to- A Princess of Mars        (60)
27. The Gods of Mars          (62) -similar to- Warlord of Mars           (65)
28. Warlord of Mars           (65) -similar to- A Princess of Mars        (60)
29. Warlord of Mars           (65) -similar to- The Gods of Mars          (62)
30. Adventures of Huckleberry (73) -similar to- Tom Sawyer Abroad         (88)
31. Tarzan of the Apes        (75) -similar to- The Return of Tarzan      (78)
32. The Return of Tarzan      (78) -similar to- Tarzan of the Apes        (75)
33. The Beasts of Tarzan      (82) -similar to- Tarzan and the Jewels of  (89)
34. The 1993 CIA World Factbo (84) -similar to- The 1992 CIA World Factbo (48)
35. Tom Sawyer Abroad         (88) -similar to- Adventures of Huckleberry (73)
36. Tarzan and the Jewels of  (89) -similar to- The Beasts of Tarzan      (82)
37. Alexander's Bridge        (91) -similar to- O Pioneers!               (24)
38. Alexander's Bridge        (91) -similar to- The Song of the Lark      (44)

The first pair here is a set of duplicates: both are King James Versions of the Bible. The same is true of lines 5 and 8, and lines 12–13: they’re just duplicates. All the other pairs are members of a novelistic series. Lines 2 and 3 are Alice in Wonderland and its sequel. Lines 6 and 7 are Willa Cather novels of the Great Plains trilogy. Lines 16–19 and 22–23 identify the Anne of Green Gables novels. Lines 24–29 are a cluster of Edgar Rice Burroughs novels from the Mars series, and there is another cluster of Burroughs novels, the Tarzan series, at 31–33. Line 35 shows Mark Twain novels from the Tom Sawyer and Huck Finn world. The algorithm even identifies the 1990s CIA World Factbook editions as parts of a series.

When I lower the cutoff similarity score, I get even more interesting pairs. Less obviously related pairs, like Paradise Lost and Paradise Regained, have similarity scores of around 97%. At that level, completely unrelated novels with the same settings, or written in around the same time period (Victorian novels, for instance), begin to cluster together.

The chart below shows a PCA-reduced 2D vector space approximating the similarity between books 1-20. There are interesting clusters here: the American Constitution and Bill of Rights cluster together, along with the Declaration of Independence, the Federalist Papers, and Abraham Lincoln’s first inaugural address. Lincoln’s second address, however, clusters rather with his Gettysburg Address, and John F. Kennedy’s inaugural address.

PCA of PG Books 1-20


Non-fiction seems to be clustered at the bottom of the chart, whereas fiction is at the top. Quasi-fictional works, like the Book of Mormon and the Bible, are in between. Similarly, Moby Dick, a work of fiction that nonetheless contains long encyclopedic passages of non-fiction, lies in the same area. The most fantastical works, which are also the three children’s books, Peter Pan and the two Carroll novels, cluster together in the upper left.
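
The projection behind the chart is ordinary PCA: center the document vectors, take the top two eigenvectors of their covariance matrix, and project onto them. A minimal numpy sketch, run here on toy 4-dimensional vectors standing in for the 300-dimensional spaCy ones:

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their top two principal components."""
    X = np.asarray(vectors, dtype=float)
    X = X - X.mean(axis=0)                # center the data
    cov = np.cov(X, rowvar=False)         # covariance between dimensions
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Columns of eigvecs for the two largest eigenvalues:
    top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]
    return X @ top2                       # n x 2 coordinates for plotting

# Two pairs of similar toy "documents"; PCA keeps the pairs close together.
docs = [[1.0, 0.1, 0.0, 0.2],
        [0.9, 0.2, 0.1, 0.1],
        [0.1, 1.0, 0.9, 0.0],
        [0.0, 0.9, 1.0, 0.1]]
coords = pca_2d(docs)
print(coords.shape)  # (4, 2): one (x, y) point per document
```

Scattering those (x, y) pairs, labeled with book titles, produces a chart like the one above.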

As always, the code used to create all of this is on GitHub. I welcome your comments and suggestions below!

>>> myPythonOpenLab = "A Holistic Approach to Python"

Overview of the Fall 2017 Python Open Labs

Python Open Labs reconvened in September with three new DSSC interns. Unlike my colleagues studying Computer Science and Computer Engineering, I am a human rights master’s student at SIPA/ISHR. While we come from different backgrounds and levels of expertise in Python, this semester has been both productive and challenging as we worked to leverage our abilities and share our experience with Python in an accessible and comprehensible format.

Teaching Style:

Our approach to the workshops was simple: teach in a lecture-style format week after week – a slow progression that builds on concepts introduced the previous session. We collectively agreed that Python is a coding language that is relatively easy to grasp, given the right tools. Moreover, given its relevance to a variety of academic disciplines and careers, we strove to elicit a positive reception from the students attending these sessions. The latter point was quickly reciprocated by learners, who responded really well to the linear format of the labs. We welcomed students from a wide variety of schools, including Teachers College, SIPA, Urban Planning, and Journalism.

There was always too much material (or too much ambition – call it what you may), and though we tried our best to manage time, sessions often ran over their allotted two hours and continued the following week. Our pseudo-lectures always included practicing concepts (e.g. classes, dictionaries, loops) a few times over the course of the two hours to see how students interpreted and applied them. We were inspired by the variety of solutions students shared! As we delved deeper into the language, it was incredibly rewarding to watch students grow more confident in their ability to write code and apply their fluency in Python to solve a given exercise or problem.


As a student not engaged with Python in my day-to-day studies, I found myself pushing my fellow interns to simplify material and slow down! This has definitely been one of our most obvious challenges, as moving through material too quickly caused confusion and an influx of post-lab questions via email. It has been difficult at times to gauge exactly how students are responding to a specific concept, such as list methods (a particularly complex lesson), as we receive little to no participatory feedback during the lab. I hope to address this next semester by pausing more often for feedback and by creating a space where active dialogue between interns and students allows us to work in sync.

Looking Beyond:

Among the three interns, we have had great fun amalgamating our skills, strengths, and weaknesses in and around the lab to optimize students’ experiences. In doing so, we have learned more than we imagined about our own approaches to the language, as well as our teaching habits. Though my teammates will be graduating and moving on from Columbia, I am very much looking forward to continuing the labs next semester in the format described above.

One of the ways I hope to further enhance the workshop format is to focus on team- or group-based learning by way of small projects or discussions that accompany the lessons. Collaborative learning not only promotes learning by bringing people of different skill levels together; it also replicates the kind of environment (think career) in which one would work as a professional with coding expertise.

I also hope to replace the weekly handouts I created for lessons with an electronic format. I plan to share lesson structures and practice problems in a blog post or a Jupyter notebook to reduce the environmental impact of printing paper.

Already looking forward to the Spring semester! All questions, comments, and feedback are strongly encouraged.

Please visit the DSSC blog for Fall 2017 Python Open Lab weekly summaries and materials.

R Open Lab: Looking Back and Moving Forward

In the past academic year, I worked as a teaching intern with the Digital Social Science Center and Digital Science Center, hosting R Open Labs and workshops. Most of the people at the R Open Lab use R for their research projects, so over the past semester I tried out several different teaching practices in search of the best way to help participants harness the power of R as a research tool.

New Teaching Practices

  1. Peer learning: This was the first method I tried, and it seemed to be the most helpful. By talking with people with similar research interests and learning from each other’s experiences, many participants expanded their professional networks and found better ways to apply R to their research. Exchanging learning experiences with people from different academic backgrounds also helped participants gain a broader understanding of R’s functionality. However, creating a stimulating environment for peer learning without making participants feel pressured can be challenging.
  2. Group discussion: I tried to encourage group discussion by posing open-ended questions to the participants during instruction. The problem with this method is that, more often than not, the discussions were shallow, probably because the questions were not interesting enough. Looking forward, it might be worth preparing a thought-provoking question for each R Open Lab session and making group discussion a standalone part of the lab.

Challenges and Solutions

Based upon my experience with both workshops and open labs, and the feedback from the attendees, the challenges facing the R Open Labs and possible ways to improve them are as follows:

  1. Due to the lack of a clear syllabus for the whole semester and the labs’ self-paced nature, most participants are not motivated to attend R Open Labs regularly. To give participants a better idea of the labs’ progression, we could post topics for each session a month in advance on the workshop list. We could also maintain a GitHub repository and upload the materials used for instruction after each session so that other interested individuals can easily access them.
  2. Although the Swirl package is an immensely helpful starter kit, most people new to coding and statistics still struggle with very basic R operations. Furthermore, the huge gap between basic operations and being able to fully implement a research project can be frustrating and overwhelming. Therefore, instead of Swirl, we could provide participants with sample code and links to GitHub repositories, which they could use as starting points for their own projects while learning best practices for building a project with R.
  3. Some questions come up from the participants again and again, so we could provide a list of FAQs and general answers to maximize efficiency.

To conclude, this internship motivated me to think more deeply about teaching and to better understand people’s needs. It was an incredibly challenging and rewarding experience. As data analysis becomes increasingly important in a multitude of research areas ranging from biology to history, R is becoming an essential research tool. I hope that by continuing to improve the content and structure of the R Open Lab, we can make it a platform that introduces R as a useful digital tool and promotes collaboration between scholars interested in R.


Python Open Labs – Spring 2017


This post details my experience as a Digital Science Center Teaching Intern for the Spring ’17 semester, during which I hosted the weekly Python Open Labs to teach programming with Python. This internship was my first full-fledged teaching experience, one in which I had complete freedom to choose the topics introduced during the Python Open Labs and the way I conducted the weekly sessions and a couple of Python workshops through the semester. It continued the year-long Digital Science Centers Teaching Internship that I secured in Fall 2016; a detailed post about last semester’s experience can be found here: Muggles Speak English. Pythonistas Speak Python

This semester I continued introducing new topics in Python, building on the programming basics covered in the open labs held in Fall 2016. The broad range of topics included object-oriented programming, web scraping, and file and data handling, all of which involved applying basic concepts from past sessions in a cohesive manner to solve relatively complex problems. I had two goals for the sessions this semester: 1) to be as inclusive as possible, keeping the open labs generic and catering to attendees from various Columbia schools such as Law, Journalism, Medicine, and SIPA rather than restricting ourselves to a particular domain like data science or scientific programming; and 2) to introduce Python as a helping tool that would make day-to-day research, academic, and professional tasks easier for the attendees, a majority of whom had little or no prior programming experience. I wanted to ensure that everyone who attended these labs had something to take away that would ease their future encounters with programming.

In the first half of the semester I covered some advanced topics in Python, such as object-oriented programming, file I/O, and some data structures, continuing from the programming basics covered in the previous semester’s open labs. The main challenge at this point was ensuring that attendees grasped the concepts well and could tie them together as we moved towards more complex applications and problem solving. This required extended practice problems and discussions during the two-hour weekly open labs. This approach meant fewer slides and more examples, which gave everyone better perspective, and it also helped me get better at explaining concepts and engaging in detailed discussions.

The second half of the semester focused on topics requested by the people attending the Python Open Labs. As a result, we covered a good number of Python libraries and topics: BeautifulSoup for web scraping, the csv module for CSV file handling, the lxml parser for XML parsing, and the requests module for handling web requests. These are topics I had initially thought too complex to introduce in the open labs, but it was a pleasant surprise that many attendees specifically asked for them because they related to their academic and professional work, and combining them with the basics of programming made the material easier to connect with. This ensured that attendees could practically apply the programming concepts covered throughout the open labs, and it also gave me an opportunity to learn many new things about Python.
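
A scraping exercise of the kind we covered might look like the minimal sketch below, assuming the bs4 package is installed. The HTML here is an inline stand-in invented for illustration; in the lab, the requests module would fetch it from a real URL:

```python
from bs4 import BeautifulSoup

# In the lab this HTML would come from requests.get(url).text;
# an inline snippet keeps the sketch self-contained and offline.
html = """
<ul>
  <li><a href="/lesson1">Python basics</a></li>
  <li><a href="/lesson2">Dictionaries and loops</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
# Collect each link's text and target into (text, href) pairs.
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]
print(links)  # → [('Python basics', '/lesson1'), ('Dictionaries and loops', '/lesson2')]
```

Swapping the inline string for a `requests.get(...).text` call turns this into a working scraper for a live page.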

To conclude this post, I must mention that the past year as a teaching intern with the Digital Science Center, Columbia University Libraries was a wonderful, enriching experience that gave me good insight into teaching a self-designed open course, an opportunity to interact with people from various backgrounds, and a chance to brush up my Python skills. I am thankful to Columbia Libraries for providing this great opportunity for students.

Muggles Speak English. Pythonistas Speak Python.

Python is one of the fastest growing programming languages, owing to its simple syntax and its varied applications, which range from simple tasks such as a Hello World program to crunching huge amounts of data for multi-million-dollar companies and numerous research and academic projects. As more and more fields integrate computer science into their regular workflows, the importance of programming knowledge – especially Python – is set to grow.

In the Fall 2016 semester, I had an awesome opportunity to host weekly open labs and share my knowledge of the Python programming language with everyone at Columbia University, under the Digital Centers Internship Program with Columbia Libraries. This was quite a humbling experience for me as a teaching intern: I enjoyed the freedom and challenge of creating my own course content, closely interacting with lots of people, and introducing them to programming through Python.

As a Digital Centers Teaching Intern, my most prominent takeaway was my first teaching experience, as I learned along the way how to grasp things in a crystal-clear way myself and then explain them simply to others. Initially I designed the open labs based on the Python course taught by Dr. Charles Severance on Coursera, and gradually I learned to create my own content to explain the concepts better. Throughout the semester-long internship I concentrated on introducing the basic concepts of programming and computation through Python, without limiting the scope to just Python.

During this teaching internship in Fall 2016, the biggest challenge was making the teaching inclusive for all attendees, most of whom had little or no programming experience. The attendees of the Python Open Labs came from various schools within Columbia, including Journalism, CUMC, SIPA, and Columbia Law, which made it a challenge to keep the content simple and basic yet informative enough to enable them to take on whatever programming tasks might come their way. Hence, I concentrated on basics such as conditional statements, string processing, file handling, loop statements, breaking solutions into small individual tasks, defining functions, and organising code into packages and modules. This is the part I enjoyed most: getting to choose and design my own course from scratch in my first teaching experience, while trying to understand the same task or problem from the various perspectives of the attendees.

The Python Open Labs this semester also included an experimental component: introducing a powerful, popular open-source library in each session, such as the csv library for reading large comma-separated files for data processing. Given most attendees’ inexperience with programming, however, this turned out to be a difficult task. It made me more aware of the knowledge gap facing people who want to use programming but lack the basics that most open-source libraries assume on the user’s end. I therefore put this component on hold for the rest of the semester and continued with the basics of programming through Python.
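
The csv idea from those sessions fits in a few lines of standard-library Python. The data below is invented for illustration, and an in-memory buffer stands in for a real file on disk:

```python
import csv
import io

# A small in-memory stand-in for a large comma-separated data file.
data = io.StringIO("name,school\nAda,Journalism\nGrace,SIPA\n")

# DictReader turns each row into a dict keyed by the header line.
reader = csv.DictReader(data)
rows = list(reader)
print(rows[0]["school"])  # → Journalism
```

Replacing `io.StringIO(...)` with `open("data.csv", newline="")` applies the same pattern to a file of any size, one row at a time.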

As the semester comes to a close, we have had 10 open lab sessions, in which I covered programming topics ranging from simple tasks like printing and basic arithmetic to more complex ones like processing large amounts of input data from text files. Next semester I will continue the teaching internship and plan once again to introduce open-source libraries for tasks such as data processing and analysis, along with a simple web development framework – the tools that truly unleash the power of Python as a modern programming language capable of accomplishing a great deal with minimal effort.

Digital Archives and Music Scores: Analysis, Manipulation, and Display

My project concerns the retrieval and display of digitally encoded music scores in existing archives and databases. While a great number of scores are currently available in digital format, those formats differ, affecting not only their utility but also their portability.

I recently attempted to work around this portability issue while devising a demonstration for an undergraduate music humanities class this past semester. I wanted to isolate and play back small groupings of instruments playing separate dances in different meters, and show how Mozart was able to weave them together for the Act I climax of Don Giovanni. Though I ended up with a successful and visually appealing demonstration, it was a labor-intensive process, not feasible to repeat with other pieces without vast reserves of preparation time. Below is the interface of an audio program, Logic Pro, that allowed me to arrange the different instrumental and vocal parts into separate tracks and assign each group to its own fader. I could then isolate and play the individual dance tunes and show how they blend together.

What made the process so time-consuming was the lack of a machine-readable copy of the score. Creating this example required entering the notes from the score into a music notation program and then translating that information into a MIDI file, which could then be read and played back by Logic Pro. Consumer software programs that produce formatted sheet music are readily available and are well suited to their primary task of creating and editing scores. They are not so good at importing scores from an outside source or at playing anything back. One of these notation programs, Sibelius, can import direct scans of sheet music and export them in a variety of formats. For import, it uses an optical music recognition plugin, comparable to the more familiar optical character recognition technology used for text files.

That any program can even approximate the job of converting lines, dots, symbols and text of an engraved music score into a digitally editable format is a minor miracle. But more often than not, the results using the Sibelius plug-in are just too flawed to use, making the time consuming task of manual entry the only reasonable way to access Sibelius’s processing and document translation features.

Hundreds of hours of work could be saved if previously encoded music scores could be used for such demonstration purposes. It would also be useful if there were a program that could play back those scores so that a separate audio file of a performance of the piece would not be needed. (Sibelius does have a playback feature; however, the program itself can be slow and somewhat unwieldy, using such a large portion of the computer’s resources that its performance is too sluggish to make it a useful tool for dynamic classroom demonstrations.)

The availability, then, of a repository of public domain scores in digital format would be a highly prized resource—not only for instructors, but researchers as well. To be able to quickly locate individual instances of musical structures in a given corpus, identify their context, and tabulate their frequency, would grant a degree of rigor and generalizability to analytic observations that can be difficult to achieve in music theory.

Such archives of digital scores do exist, but accessing those scores and being able to use them are two different tasks. In addition, different modes of digitization allow for varying degrees of analytic access. Some archives, like the International Music Score Library Project (IMSLP), store image files of score pages. These image files are not analyzable by machine unless further coded into some hierarchical data structure. As such, they present the same import difficulties noted above: they need to be scanned, and the optical technology does not yet exist to do that efficiently. Collections of midi files also exist, which contain the information necessary for producing an audio simulation of the notated music, but do not necessarily contain all the information indicated by a score (for example, expressive markings, articulation symbols, and other instructions to the performer).

MEI—the Music Encoding Initiative—is emerging as a standard format that will allow analytic processing. Any discussion of computer-assisted musicology needs to take this project into account. However, as this is a newer format, readily available instructional materials are few. Training is accomplished primarily through professional workshops; user-friendly editing and display software is proprietary. I decided, therefore, to begin by looking at another widely used format, MusicXML, and the set of programming tools designed by and for academic researchers to work with that format, Music21.

MusicXML is a file format that allows a representation of the elements in a score to be shared among various programs and platforms. And conveniently, Music21 comes packaged with a sizable corpus of over 500 scores to begin working with. Though designed to be easy to use, Music21 assumes a fairly sophisticated computing background on the part of the user, as difficulties in installation, use, and troubleshooting often arise. So while the package is well supported with online documentation, learning to use it is not the same as starting out with a consumer software product, safe in the knowledge that a little thoughtful clicking around, along with some methodical exploration of drop-down menus, will eventually yield the desired results. Music21 is for coders, and its main drawback is that the package requires a more than passing familiarity with its coding language: Python.

Python is a programming language that is relatively quick to learn. Nevertheless, it does require a certain amount of time and practice to get up to speed. Fortunately, the Digital Social Science Center offers programming workshops designed specifically to help scholars analyze big data. The weekly sessions, led by a computer science graduate student, allow participants to get their feet wet and encourage outside practice. Its programs are aimed squarely at scholars in my position: researchers needing to develop project-specific tools that will enable them to take advantage of the ever-growing body of digital data that is made available to the public.

Below is an example of my latest programming efforts.

Two things are notable about this.

  1. Often, learning is best accomplished through actual hands-on work. While I consider this program a success in many ways—most significantly, in that it didn’t just return an error message—it surprisingly did not return any notes with accidentals. That is because of the encoding strategy used by MusicXML: a note’s identity as sharp or flat is stored separately from its letter name. This is crucial information to know; in fact, a note’s pitch in MusicXML is represented by a letter name, plus an octave designation, plus an optional “alter” value of +1 or -1 (sharp and flat, respectively). Although explained in the documentation on the website for MusicXML, it’s programming experiences like these that really allow the data structure to be internalized.
  2. Note how easy it would be to fall into the trap of devising research questions based on the capabilities for information retrieval. Though perhaps a bit obvious, this provides a very clear example of how the organization of the data and the design of the programming language facilitate the accumulation of numeric facts, which can end up directing further inquiry: “Bach’s Brandenburg Concerto No. 5 has 10,539 notes! I wonder if No. 6 has even more?” In order to take advantage of the depth of detail that is represented in the MusicXML format, it is important to guard against this tendency and instead continue to develop more refined programming skills.
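The pitch encoding described above can be sketched with Python’s standard library alone. The XML fragment below is an illustrative, hypothetical snippet in the shape of MusicXML’s pitch markup (letter name, optional alteration, octave), not an excerpt from an actual score file:

```python
# A minimal sketch of how MusicXML stores pitch: a letter name (<step>),
# an octave (<octave>), and an optional <alter> of +1/-1 for sharp/flat.
# The fragment below is a hypothetical illustration, not a real score file.
import xml.etree.ElementTree as ET

snippet = """
<note>
  <pitch>
    <step>B</step>
    <alter>-1</alter>
    <octave>4</octave>
  </pitch>
</note>
"""

note = ET.fromstring(snippet)
step = note.findtext('pitch/step')      # letter name, e.g. 'B'
octave = note.findtext('pitch/octave')  # octave designation, e.g. '4'
alter = note.findtext('pitch/alter')    # None when the note has no accidental

# The accidental only emerges by combining <step> with <alter>,
# which is why a query on letter names alone misses sharps and flats.
accidental = {'1': '#', '-1': 'b'}.get(alter, '')
print(step + accidental + octave)  # → Bb4
```

A program that reads only the letter name sees a plain “B” here; the flat lives in a separate field, which is exactly the behavior that surprised me above.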

The learning curve for Python, as is true for any language, is long and shallow—the programming results presented above may look rather paltry when compared to those attainable by experts in computer-assisted musicology. But for a scholar working independently to acquire new skills and gain access to recently developed research tools, they represent a lot of digging, exploring, and evaluating the programs, standards, and methods involved in storing music as a text file—in addition to the basics of learning a programming language. The analogy of language learning here is especially apt. Deciding to work with a corpus of computer-encoded music scores for a research project is like deciding to work with a community of musicians in another region of the world. A new set of communication skills needs to be acquired, often from scratch, and a significant amount of time must be set aside for getting up to speed in the language.

While I continue to use this project as an opportunity to develop my own coding skills so I can make use of existing digitized corpora, I also intend to lay out exactly:

  1.  what resources need to be made available to other graduate student researchers,
  2.  what skills they will need to acquire in order to make use of those resources, and
  3.  how much time they should plan on devoting to acquiring those skills.


Map Club: Reflections on Teaching Self-Teaching in Digital Scholarship

This academic year, through my internship with the Center for Spatial Research and the Digital Social Science Center, I aspired to demystify digital mapping. I developed a series of fast-paced hack sessions focused on play, exploration, and the rapid acquisition of skills. To evoke a spirit of exploration and inclusivity, I named the series Map Club.

Map Club represents an approach to learning. It seeks to hone the capacity to adapt to change, to encourage fearlessness in the face of new technology, and to nourish the value of experimentation without a specific goal. I color this description with a rhetorical intrepidity because I believe humility, determination, and bravery are the best traits to muster when digging into unfamiliar modes of making. Through Map Club, I wanted to leverage individual autonomy and creativity to teach attendees how to be self-taught. I hoped to achieve this by creating a space for collective, unstructured exploration, within which attendees could teach themselves.

Since its inception this summer, Map Club has met for 14 sessions and has explored 10 different mapping and visualization tools. Attendees have written code in JavaScript, Python, CartoCSS, and a bit of GLSL. We have fostered a space of creativity, collaboration, and digital empowerment, while continuing to grapple with the roadblocks that surface in new, less structured endeavors.

At the same time, this model has been neither unanimously fulfilling nor consistently easy. Map Club has suffered from low attendance, mixed feedback, and inconsistent interest. Here, I would like to examine some of the reasons behind its irregular reception, as well as suggest some ideas for addressing them.


In an effort to combat disorientation, each Map Club session this semester was loosely divided into three sections:

  • (20 minutes) Setup. Downloading a text editor, setting up a local server, and ensuring that example files display properly in the browser.
  • (60 minutes) Self-paced making. Unstructured time for attendees to experiment with the tool or library of the day.
  • (10 minutes) Sharing. Go-around to exhibit screenshots, cool glitches, and creative compositions.

While this schedule does help to divide up session time, it does not supplant the comforting sense of structure provided by a knowledgeable workshop leader. Though some students regularly stayed for entire sessions, others left early and never returned.


Based on attendee feedback, as well as my own observation, I believe the inefficacy of the initial Map Club model has three consistent causes.

  1. Attendees new to code have a harder time adopting it as a medium. Everybody learns differently. In the absence of prior experience, jumping into a new programming language without a guided tutorial can be confusing and disorienting.
  2. Unstructured time is not necessarily productive. Sometimes, the biggest challenge is figuring out what to do. Even for attendees who do have experience with code, determining how to spend the hour can become its own obstacle.
  3. An undefined goal is not the best stimulus. In choosing to attend a scheduled meeting, many attendees hope to avoid the obstacles and glitches that come from figuring out new platforms or libraries on their own. To them, an unguided workshop can seem pointless.

📍Looking forward

Future Map Club sessions can improve by providing certain types of guidance for attendees, without encroaching upon the self-paced nature of learning-by-hacking.

  1. Provide a starter kit for new Map Club members. A bundled tutorial introducing basic digital mapping concepts gives new attendees the material they need to spend the session productively.
  2. Provide basic templates for download (when applicable). Even experienced attendees benefit from the time saved.
  3. Provide a list of tool-specific challenges. To make the session as productive as possible, put together a list of potential ideas, or challenges, for members to independently explore.
  4. Be available for questions. Even though these sessions are self-driven, nobody should be left in the dark. Leverage other attendees’ knowledge, too.
  5. Emphasize the value of mistakes. Some of the coolest visual output this semester came from members who took novel approaches to producing digital maps — Ruoran’s GeoJSON/Cartogram mashup, for instance, or Rachael’s vibrant approach to tiling. Encourage attendees to relish the proverbial journey and focus on editing, manipulating, and experimenting. De-emphasizing an end goal helps to relieve the pressure to finish something.
  6. Include some guided workshops. To combat fatigue induced by Map Club’s ambiguous structure, I inserted several guided workshops into the series throughout the semester. Aside from keeping the momentum going, certain tools or frameworks (such as D3.js or QGIS) benefit from a step-by-step introduction.

📍Final thoughts

As an alternate workshop model, I believe that Map Club has the capacity to position technology as an ephemeral means to an end rather than a capability to master. By emphasizing what is pliant, inessential, and surprising about digital platforms, instead of what is inaccessible and opaque, my hope is that this series can foreground the process of learning as an end in itself.

To view the full repository of Map Club materials, sessions, and tutorials, click here. For recaps of each session, visit the “map club” tag on the Digital Social Science Center blog.

A Medium-Scale Approach to the History of Literary Criticism: Machine-Reading the Review Column, 1866-1900

Book reviews in nineteenth-century periodicals seem like the perfect data for doing computer-assisted disciplinary history. The body of the review gives information about the words used by early generations of literary critics, while the paratext provides semi-structured information about how these early literary critics read, evaluated, and classified: it includes section headings labeling the topic or genre of books under review alongside bibliographic information. Such material, when studied in aggregate, could provide valuable insight into the long history of literary criticism. Yet there’s a significant obstacle to this work: important metadata created by nineteenth-century authors and editors is captured erratically (if at all) within full-text databases and the periodical indexes that reference them.

My project aims to tackle this dilemma and develop a method for doing this kind of disciplinary history. To do so, I’m constructing a medium-sized collection of metadata that draws on both unsupervised and supervised models of reading. Working with a corpus of three key nineteenth-century British periodicals over a 35-year period (1866–1900), this project collects metadata on the reviewed works––capturing the review metadata as it exists in current databases and indexes, and using more granular data extraction to capture section headings like “new books,” “new novels,” or “critical notices.” I then pair this metadata with computer-assisted readings of the full texts, generating “topic models” of frequently co-occurring word clusters using MALLET, a toolkit for probabilistic modeling. While the topic models offer the possibility of reading over a larger number of unlabeled texts, the metadata provides a way of slicing up these topic models based on the way these reviews were originally labeled and organized. The end goal here is to create a set of metadata that might be navigated in an interface or downloaded (as flat CSV files).
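The kind of slicing described here, sorting reviews by the section headings their original editors gave them, can be sketched with Python’s standard csv module. The column names and rows below are hypothetical stand-ins for the project’s actual flat CSV files, not real data from the corpus:

```python
# A hedged sketch of grouping review metadata by its original section
# headings. The columns (journal, year, section, title) and the rows are
# hypothetical placeholders for the project's actual flat CSV exports.
import csv
import io
from collections import Counter

flat_csv = io.StringIO("""journal,year,section,title
The Contemporary Review,1887,new novels,Review A
The Contemporary Review,1888,new novels,Review B
The Westminster Review,1888,critical notices,Review C
""")

reviews = list(csv.DictReader(flat_csv))

# Count reviews per section heading: the same grouping lets topic-model
# output be partitioned by how the reviews were originally labeled.
by_section = Counter(row['section'] for row in reviews)
print(by_section['new novels'])  # → 2
```

The same `Counter` (or a groupby) keyed on journal or year would produce the other slices mentioned above, so the metadata, rather than the model, drives how the topics are compared.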

Though the case study will be of practical use for Victorianists, the project aims to address questions of interest to literary historians more generally. What patterns emerge when we look at an early literary review’s subject headings over time? What can we learn from using machine learning to sift through a loose, baggy category like “contemporary literature” as it was used by reviewers during the decades of specialization and discipline formation at the end of the century? Critical categories, and vocabularies about them, present a particularly thorny problem for literary interpretation and for the classification of “topics” (see work by Andrew Goldstone and Ted Underwood or John Laudun and Jonathan Goodman). I hope to assuage some of these anxieties by leveraging the information already provided by nineteenth-century review section headings, which themselves index, organize, and classify their contents.

Much of the first phase of this project is already underway: I’ve collected 418 review sections in three prominent Victorian periodicals––The Fortnightly Review, The Contemporary Review, and The Westminster Review––with a total of nearly 1,230 individual reviews. I’ve extracted and stored the bibliographic metadata in Zotero, and I’m in the process of batch-cleaning the texts of the reviews so as to prepare them for topic modeling and for further extraction of bibliographic citations. I’ve also begun topic modeling a subsection of the “fiction” section of the Contemporary Review. Some of the preliminary results are exciting––for instance, the relatively late emergence of “fiction” as its own separate category within the broader category of “literature” reviews in The Contemporary Review.

The next phases will require further data wrangling as I prepare the corpus of metadata and the full texts for modeling. In the immediate future, I plan to improve my script for extracting the section headers and the titles of reviewed works. Once this is done, I’ll generate a set of topic models for the entire corpus, then use the enriched metadata to sort and analyze the results in sub-sets (by journal, review section or genre title, and date). Most of the work of the project comes in pre-processing the data for the topic models; running the topic models themselves will be a relatively quick process. This will give me time to refine the topic models––disambiguating “topics,” refining the stopwords list––and to work out the best method for collating the topic results with the existing metadata. Finally, I plan to spend the last stages of the project experimenting with the best ways to visualize this topic model and metadata collection. Goldstone and Goodman have created tools for visualizing topic models of academic journals that I’ll be building off of in displaying my data from the Victorian reviews.

While relatively small in scale (3 periodicals, a 35-year period), this narrower scope, I hope, will make this an achievable project and a test case for how topic modeling could be used more strategically when paired with curated metadata. For my own research, this work is essential. My goal with the project, however, is not just to provide a way to read and study the review section over time, but to provide a portable methodology useful for intellectual historians, genre and narrative theorists, and literary sociologists. By structuring the project around metadata and methodology, I hope also to make a small bid for treating the accessibility and re-usability of data as just as important as the models made from it.

A Reflection on My Internship with DCIP

This fall semester I joined the Digital Center Intern Program (DCIP) as an instructor intern. My internship is primarily focused on developing lesson plans for and hosting weekly R Open Labs. This internship allowed me to try different teaching approaches and explore different topics in R. It was an intellectually challenging and rewarding experience. The highlight of my experience was discussing with people from diverse academic backgrounds how to use R to help their research. I learned a lot about applications of statistical analysis from these discussions, and it felt wonderful to help people.

At the beginning of the semester, I started R Open Lab as a very structured instructional session and covered the basic usage of R. Later on, after talking with other librarians, I decided to make the open lab more free-flowing and put more emphasis on discussion instead of instruction. I found that, by getting participants more engaged in conversation, I was able to better understand their needs and help them with their research.

The internship offered a great opportunity for me to see for myself how R and statistics could be used as tools for research. For example, one of the open lab’s regular participants used R to conduct sentiment analysis to gain insights about stress measurement and management in medical research. Another participant used R to extract information from Russian literature and conduct text analysis to understand the political situation at different times. It never occurred to me that statistics is so broadly used across different fields until I talked with these people.

Considering the participants’ interests and needs, I am planning to talk more about plotting, data cleaning, and data scraping next semester. Since people coming to the open lab often have completely different levels of understanding of R, I am also hoping to encourage more peer learning at the open lab.

This internship motivated me to gain a deeper understanding of R and enhanced my teaching skills. It is an amazing program, and I had an incredibly fulfilling experience. I look forward to future work in the program and hope to do even better next semester.