R Open Lab Summary Report – Fall 2018

This semester, I returned to holding R open labs with another student intern, Hanying Ji. We made some adjustments to the structure of the open labs. Time flies, and our R open labs have now reached their end for this semester. I would like to summarize our journey and offer some insights, both for myself and, I hope, for future interns.

A quick overview

We covered 9 topics in total this semester, including the starter kit, which is almost twice as many as last semester. By the last open lab, we had even run out of major topics. I would not have imagined this, and I take it as proof of success. We conducted open labs on the fundamentals, the apply family, exploratory data analysis, character strings, data visualization, data manipulation, Shiny, and randomness. These cover pretty much everything a beginner needs to learn to use R without going too theoretical.

The flow of teaching is the same as before. I first introduce the function and quickly work through an example. Then, I show attendees the help page of the function so that they get a sense of its arguments and results. Finally, I talk about additional important arguments and do some examples with those arguments changed. For example, when I was introducing the function matrix, I demonstrated examples with the argument byrow = T and byrow = F. By seeing the output of both, students can understand that argument better.


We made several changes compared to last semester, and judging by our program’s performance, all of them were positive.

  1. We started using a GitHub repository to share code and data sets. Sharing files through Google Drive was messy, and people had to come back to me to ask for access. Using GitHub solves this problem. Plus, I think it is worth introducing GitHub to attendees, since it is useful for managing code and projects.
  2. We promoted and advertised this program at the beginning of this semester. A lot of students told us that they wished they had known about this program earlier. So I asked my friends to help me spread the word in the SIPA and QMSS communities. My supervisors also got the word out by sending emails to different departments. As a result, a decent number of people have attended our open labs consistently.
  3. We stuck with using examples and letting the functions demonstrate themselves. Introducing and describing code in the abstract can be boring and dry. Having examples lets students visualize the process in their heads, so they can get familiar with new functions quickly.

Potential improvement

  1. The fundamental concepts, especially loops, should have more examples and practice problems. I noticed that people who had coding experience grasped the basic concepts easily, but for those with zero coding experience, loops and conditionals were the first obstacle. Having enough practice on this part will help students get used to coding in R.
  2. The pace could be a little slower. As I mentioned above, we ran out of topics to cover by the end of this semester. I also felt that students didn’t fully understand some of the topics. In the future, the pace of the open labs should be adjusted continually according to the situation. More diverse examples could also be added to help attendees learn important topics such as data manipulation.

This is my last semester at Columbia University, and I am honored to have participated in this program again. It has been a fulfilling experience. I have been a TA for many years, starting in my undergraduate years. I enjoy helping fellow students overcome their difficulties and serving the university community. It is a real pleasure to see that this program performed better this semester. For those who take this position in the future, I am more than happy to discuss with you how to be a helpful TA and how to run this fantastic program.

R Open Lab Reflections

Sharing is always good!

I am really happy to have had the opportunity to work for the Digital Center in my last semester at Columbia as an R Open Lab intern in the Digital Center Internship Program. R is a really powerful tool for statistics and data science, which I love most. Holding the weekly R Open Lab and daily consulting gave me good experience in teaching and in sharing my understanding of data and R programming. In the process, I also refreshed my own ideas by interacting with people from different academic backgrounds.

Teaching and sharing are quite different from simply reciting knowledge. When I was deciding where to start and how to explain each idea clearly, I thought back to when I was new to R and asked myself questions such as “which foundations are important but tricky?” and “what mistakes do we make when we are confused about a concept?” Most importantly, these questions guided how I taught throughout the semester. There is plenty of material online for getting familiar with a programming language, but I think our Open Lab is a great place for (1) getting an idea of where to start and (2) improving one’s understanding of R.

In the R Open Lab, I followed a roadmap that started with the fundamentals and moved toward more applied knowledge: variable types, data structures, functions and environments, data manipulation, exploratory data analysis and visualization, the apply family, text analysis, and R Shiny.

Whenever something new relied on the fundamentals, I went back over them, and I gave enough exercises after each new concept. This not only reminded attendees of the fundamentals but also gave them the satisfaction of knowing that they had really learned the material. Discussion and questions were always encouraged during the lab, so every attendee was engaged and thought for themselves.

In the consulting part, I have to say it is really worthwhile to take the time to listen to people’s questions, clarify them, and solve them together. During this semester, I worked on different problems involving coding in R, statistical modeling, and data analysis. I realized that people often come with an error but have no idea what their question really is, so I always talk with them, dig into their problems, and try to find the real question. This is a skill for problem solving. I also improved my own knowledge, for example by building boosting models and writing SQL in R to simplify data manipulation.

I recommend this internship program. Sharing what we have and learning from people are what we need to do all the time.


Hanying Ji
M.A. Statistics (2018)

Python Open Lab Reflections

This fall semester I got an excellent opportunity to work as a Digital Center Intern, as part of the Digital Centers Internship Program. My work involved data consulting, as well as assisting in leading Python OpenLabs. I am a statistician, so R is more intuitive to me. However, with the rise of machine learning and neural networks, Python is “the” language.

I believe that one cannot build anything great on a weak foundation. This belief drove all the Python OpenLab sessions. We spent considerable time teaching variables, variable assignment, operators, data types and how to manipulate them, writing small functions, and looping over different data types, and only then introduced the NumPy and pandas packages. I tried to draw parallels with R whenever I could, both to help R users learn better and to give students unfamiliar with R some insight into a very helpful tool.

Practice makes perfect, so after the midterm we started each session with the first half hour devoted to assessment worksheets. This seemed to be very helpful: it not only helped the students understand where they stood and what they needed to work on, but also gave us an idea of which topics needed to be repeated. It also led to more engaging discussions, in which every student came up with their own way to solve a problem. Discussion definitely helps people understand things better, and we did see it work during our OpenLab sessions. We found the students to be more involved in learning together.

The consulting part of my job was very rewarding indeed. During this one semester, I got to work on very diverse problems that people were trying to solve, be it coding in R, Python, machine learning, or basic statistics. I even got to learn a bit about generalized linear models while trying to help somebody understand their model better. I would strongly recommend that anybody take part in this wonderful internship. I believe I have become a better student and learner by being a part of it, and I do hope all the students who came for help found us useful.

Anshuma Chandak
M.A. Statistics (2018)

Python Open Lab mid-term reflection (Fall 2018)

This semester I am glad to participate in the Python Open Lab of the Digital Center. Being with students reminds me of myself. When I learned my first programming language, I struggled a lot and made tons of mistakes. As a result, I gained a lot of experience, and I am really happy to share it with others.

Of the programming languages I have learned, Python is the most concise. With Python, many complicated things can be simplified. For most students who join our open lab, Python is the first advanced programming language they learn. This is great because Python can help students learn basic programming skills and data structures without spending much time on irrelevant details. It helps students focus on their task and gives them the confidence to explore the unknown.

I started with the basic ideas of types and operations in Python and later introduced data structures such as lists and dictionaries. Other topics, such as functions, conditional statements, and file I/O, were also taught. I found that the sequence of the materials could be organized better, and that more examples could be shown to give students a better understanding of the materials.

While teaching, I learned a lot too. I began to see things from a different perspective. Although I have knowledge of programming languages, it still takes great effort to convey the ideas to students and support them. The most important thing I learned is that I need to focus on the big picture and skip the specific tricks, so that students get a complete view of the materials. This is where I need to improve.


Mid-Semester Reflection – R Open Labs

Now that the semester is halfway done, I think it is a good time to review the progress of this coding program. The R open lab’s performance has improved compared to last semester. We have a consistent group of attendees, and we have already covered more topics than last semester. I am confident that our program is heading in the right direction.

The people attending the open labs have diverse backgrounds: SIPA, SPS, GSAS, etc. I’m glad to see many people actively showing up to learn how to apply coding techniques to their own scenarios. Another pleasant surprise is that we have had a decent number of attendees in each open lab so far, and people are showing up consistently. One of the biggest problems in the past was that we did not have enough attendees and that people did not attend continuously. That led to huge gaps in attendees’ coding levels: some were complete beginners, and others were already familiar with R. Now, attendees reach about the same level of R proficiency after a few workshops, which allows us to move on to new topics.

Also, using GitHub seems to have been a good idea. In the past, sharing my code and data sets in open labs was troublesome and inconvenient. Now, students can always go to my GitHub repository to look at past materials without asking me for access.

I am happy to see that our program is developing and that we are able to help more fellow students step into a new field. I invite everyone to our open labs regardless of level; we welcome you to take a journey to explore the facts behind data.

R Open Lab Summary Report – Spring 2018

This was the first semester I worked as a teaching intern at the Digital Center, and I was in charge of running the R workshops. As a great semester reaches its end, I would like to summarize this journey and reflect on it.

A Quick Summary

I conducted 5 workshops besides the starter kit. Four of them covered new topics, and the other was an intermediate version of the starter kit that explored the fundamentals of R in more depth. For all the labs, I prepared the scripts ahead of time but left out the input parameters, because I wanted to fill them in as examples during the labs. I have always believed that the first step of learning is imitation. You don’t have to understand the content to be able to imitate, and by imitating again and again, you will be able to observe the pattern and comprehend the nature of it. The usual flow of my labs is:

  1. Introduce the function briefly.
  2. Show the help page for the function so that students can see what the inputs look like and what the results are.
  3. Enter the input arguments and run it.

I believe this way of demonstrating code is better than having it already written in the script file and running it in the blink of an eye; students actually see me coding, and it becomes easier for them to imitate.

I also started a new teaching practice in the middle of this semester: I began adding practice questions at the end of every lab. The purpose is to help fix the code in memory. No matter how well you understand new content, it won’t become part of your skills unless you use it again and again; that’s how we learn: imitate and repeat. I figured that applying the functions to a real-world problem would leave a mark in students’ memory and make them more likely to recall the functions in the future. I even used the same dataset for several workshops. That way, they are working with something they are already familiar with, and hopefully, it will help them connect everything we learned.

Challenges and Possible Solutions

Of course, there were quite a few challenges. I will list them below and provide possible solutions to them from my perspective.

  • Attendance could be better. Because R is less widely used and is difficult at the entry level, fewer people choose to learn it. However, judging from the positive feedback from attendees, I still see a large group of people for whom R would be helpful. The Digital Center could market the open labs (both R and Python) to specific communities that would be interested, such as Economics programs and SIPA. If you are reading this post and you want to learn coding in R and Python, remember to check out the open labs this fall!
  • Attendees don’t attend labs regularly. This has been troublesome. I didn’t design the labs to build on each other because I didn’t want to review earlier material every time; I think that’s inefficient. But I think that didn’t encourage students to come back, and it formed a negative loop. One solution is to post the topics ahead of time. I’m still undecided whether I should have a syllabus for the whole semester, since the labs depend so heavily on who attends, but I think for Fall 2018 we should post topics at least 2 weeks ahead and make sure students know there will be continuous content.
  • Across all the labs this semester, I found that students who were new to R needed more time than I expected to fully understand the syntax and logic. As mentioned above, R has a steep learning curve for beginners, and the starter kit is not enough. I should prepare richer materials on the fundamentals, and I will bring more examples to demonstrate. I may also run one lab focusing on the swirl package, since some students thought it was a good tutorial.

In a nutshell, this experience was exciting and informative. I now have a better understanding of how to teach. To be honest, I don’t really consider myself an instructor or a teacher. I am just a student who tries to help others with what I know. This internship showed me new ways to pass on useful knowledge and to exchange information with others. I am looking forward to continuing my internship at the Digital Center and serving the Columbia community and fellow students.

Python Open Labs S.2018: How it all went down

Lab Structure and Technicalities:

This semester I returned to running the Python Open Labs with another student intern, and when we started, we discussed our ideas for structuring the labs over the next few months. We decided to stay consistent with the format of the lab: we started the semester with the starter kit (Python fundamentals) and continued to build on those fundamentals every week (see weekly blogs here).

The first change we implemented was switching to Jupyter notebooks as opposed to running the labs on the console itself. For me, this was quite a challenge! I had never used Jupyter notebooks, and they seemed like a strange and abstract way to code, one that was definitely built with user experience in mind and that was unfamiliar to me on all fronts. I spent a few weeks playing around with their functionality – the headings and commenting features as well as common errors that can happen (e.g. running every cell to ensure the code works). Once I got the hang of it though, everything changed! I have tasted and enjoyed the Jupyter notebook kool-aid and there is no going back.

One of the best features of the Jupyter notebooks is the UX of its layout. The simplicity of its layout makes the code that much easier to parse out and build upon. Organizing the code, or in our case, entire lesson in the Jupyter notebooks meant that we could share the lessons with the class at the end. Prior to the lesson, we would come up with the content and create the notebook in full. Then, we would go back and recreate another lesson without the completed cell blocks so that we could use the prompts and live code. At the end of the lesson, we shared the lessons in full with the class to ensure that students could spend time going over and reviewing the examples and problems with the full code (i.e. all answers) readily available.

That being said, we also changed how we shared the lessons with the students. This semester, we maintained a Google Drive folder with all the Jupyter notebook lessons and PDFs and shared it with the students who came. It was a welcome change considering the amount of paper it took last semester to print and share each lesson! We also received great feedback on the organization and sharing of the Jupyter lessons, so that’s definitely something we will keep in mind and hope to continue next semester.


This semester, the range of programs represented by the students who attended the labs was incredibly diverse. Students from the School of Professional Studies and the School of International and Public Affairs were the most consistent; however, we did encounter students from Journalism, Economics, and Latin American Studies as well. Although it’s a challenge to encourage students to attend on a regular basis, we were able to see some faces week after week, and sharing the lessons in an accessible Drive folder ensured that those who could not make it in person but wanted to keep expanding their coding horizons could keep up.

Most enjoyable Lab:

The lab we held on Python classes in early April was my favorite lab – partly because I taught the entire session on my own, but mostly because I structured the lesson around fewer, more intense practice problems. Instead of going through quicker and shorter sample problems, I thought I would try creating problem sets that incorporated functions as well, to keep things interesting and to offer students a challenge. The class was well received, and you can find the lesson on the DSSC blog if you want to check it out!

Ideas for future labs:

To conclude this post, I will underline two suggestions for future lab lessons:

  1. Plan out lessons before the labs

It would be great, in my opinion, to post a description of each lab before it happens to outline the structure of the lab and the concepts covered. This was the route we took for the R open labs towards the end of the semester, and it worked really well – I am excited to try it out for Python as well!

  2. Continue to market to a diverse group of students

Before commencing in the fall, I would like to spend some time strategizing on how to market to different departments. The open labs are such a great way to learn a coding language – they are free (!!!), but more importantly the communal vibe is optimistic and welcoming and a great space to learn.

I have learned so much this year in preparing and leading labs and now that they have wound down for the summer, I feel motivated to continue to market the space and engage students across all departments.

End-of-Semester Reflection (Python Open Labs – Spring 2018)

It’s hard to believe that the end of the semester has arrived and that Python Open Lab sessions for Spring 2018 have come to an end. Instead of writing a sappy post about “the end,” I’d like to share five things I learned while teaching my students Python this semester.

# 1 – Teach with Examples – Different programming languages vary in syntax, but they all share similar concepts such as variable usage, conditionals, and loops. Explaining such concepts to students unfamiliar with programming is certainly helpful, but can probably only get them so far. To show how to use a language to creatively solve problems, examples – especially multiple examples showcasing the same concept – are a must. I would also encourage instructors to create examples that reflect the demographics of their students when given the opportunity (i.e. initialize a list of more diverse names versus solely American names).

# 2 – Teach with the Right Tools – The agile method in software development encourages iteration, and I like to encourage my students to think in a similar manner when writing code. The easiest way to test whether your code works is to run it and see the output. As an instructor, I live-coded each lesson. I wanted my students to see me run my code block, examine my output, and fix my errors if need be. Using Jupyter Notebooks really allowed me to do this in a clear manner. I was able to isolate each example within a code block, which was especially helpful. Another IDE would also work for teaching a programming language, but I would not recommend teaching via a Google Doc or a PowerPoint presentation for a non-lecture-style session.

# 3 – Incorporate Wait Time – In addition to studying computer science, I also studied (English) education as an undergraduate. I learned a lot about teaching methodologies and one concept that has stuck with me is the idea of wait time. Wait time is the time an instructor waits after asking a question before calling on a student. Sometimes, it can be easy to answer your own question right away if no one has raised their hands, but waiting gives students time to think about their answer. If no hands are raised after some amount of waiting, then you can possibly provide a hint or make the judgement call whether or not to answer your own question.

# 4 – Have a Positive Attitude – If you are not excited about the material you’re teaching or the lessons you’re crafting, it may be a little harder to get your students excited about the subject as well. I like to use varying examples to keep things fresh as well as think back to my earlier days when I started learning about Python for the first time – and how incredibly fulfilling it is now to be able to code on my own with no instruction. When I imagine and see my students feeling the same way, I feel all the more positive. As Jim Henson says, “[students] don’t remember what you try to teach them. They remember what you are,” and I’d like to be remembered as someone who was wildly passionate about computer science education.

# 5 – Ask Students for Feedback – Not every lesson you create is something your students are satisfied with. Ask them what they’d specifically like to see more of or less of. From student feedback, I learned to spend more time coming up with examples for loops and functions and less time reviewing classes. Students also wanted to see more of a workshop-style lesson towards the end and with feedback, I created a data visualization lesson that ended up being quite well-received. Always be sure to ask for your students’ inputs – you are not the only one in control of the class and its structure.

I hope you found these takeaways valuable and can apply them to your own lessons if you are an instructor. I’ve greatly enjoyed serving as one this semester and hope to take on more teaching-related opportunities in my spare time after I graduate this May. Working as a teaching intern at the Digital Social Science Center for Columbia University Libraries has been an incredibly fulfilling experience – I would do it again in a heartbeat. If you are encouraged by my post and love teaching as well, I hope that you apply to be an intern for the upcoming semester!

Navie Narula

Mid-Semester Reflection (Python Open Labs – Spring 2018)

Stuart Walesh, an author and consultant, once said: “The computer is incredibly fast, accurate, and stupid. Man is unbelievably slow, inaccurate, and brilliant. The marriage of the two is a challenge and opportunity beyond imagination.”

Many of us use computers. Sometimes, the time we spend on them consumes the majority of our day. Whether this is a good or bad thing can be debated in another blog post, but the fact is…technology is an overwhelming part of our diet.

Taking my first computer science class as an undergraduate made it apparent to me that learning about how code and algorithms work was a really important thing, especially if I wanted to solve problems on my own. I declared my major in computer science and focused on learning more about how code could be used to analyze large amounts of text more efficiently. I have not regretted it since, and am beyond happy to see a good number of students show up to the Python Open Labs to learn more about how to write code to perhaps automate their own tasks.

The people who show up to our class are diverse in terms of major – coming from backgrounds ranging from education to international affairs to pure math/analytics. It’s been really nice to see people actively show up to our labs with a desire to learn how to code and truly curious about how to solve problems. It’s proven to me again and again that anyone can learn how to code, and it’s been wildly encouraging to see people who think they cannot do it actually do it!

This is my first semester helping to lead the Python Open Labs. I find that lessons introducing a new programming language or new programming concepts are best taught in a step-by-step manner. Jupyter Notebooks have allowed me to accomplish this very well, giving me space to write comments in Markdown and run code in cell blocks. The students in class love this medium as well, and at the end of the lesson, they can easily look back over the notebook and remember what we learned.

I’ve really enjoyed helping out with the labs so far and answering so many questions from the students who show up. Anyone is welcome to stop by the Python Open Labs – even if you have never written a line of code before in your life. I look forward to learning more from my students as the semester goes on.

Navie Narula

Computationally Detecting Similar Books in Project Gutenberg

As one of the first digital libraries, Project Gutenberg has lived through a few generations of computers, digitization techniques, and textual infrastructures. It’s not surprising, then, that the corpus is fairly messy. Early transcriptions of some electronic texts, hand-keyed using only uppercase letters, were succeeded by better transcriptions, but without replacing the early versions. As such, working with the corpus as a whole means working with a soup of duplicates. To make matters worse, some early versions of text were broken into many parts, presumably as a means to mitigate historical bandwidth limitations. Complete versions were then later created, but without removing the original parts. I needed a way to deduplicate Project Gutenberg books.

To do this, I used a suggestion from Ben Schmidt and vectorized each text, using the new Python-based natural language processing suite SpaCy. SpaCy creates document vectors by averaging word vectors from its model, which contains about 1.1M 300-dimensional vectors. These document vectors can then be compared using cosine similarity to estimate the semantic similarity of the documents. It turns out that this is a fairly good way to identify duplicates, but it has some interesting side effects.
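To make the idea concrete, here is a minimal sketch of averaging word vectors into a document vector and comparing documents with cosine similarity. It is not the code used for this post and does not call SpaCy: the three-dimensional vectors in TOY_VECTORS and the function names are made up for illustration, standing in for a real model's 300-dimensional vectors.

```python
import math

# Toy 3-dimensional "word vectors". These values are invented for
# illustration and are NOT taken from SpaCy's model.
TOY_VECTORS = {
    "whale": [0.9, 0.1, 0.0],
    "sea":   [0.8, 0.2, 0.1],
    "ship":  [0.7, 0.3, 0.0],
    "tea":   [0.0, 0.9, 0.4],
    "party": [0.1, 0.8, 0.5],
}


def document_vector(tokens, vectors=TOY_VECTORS):
    """Average the vectors of all tokens that have a vector."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no known tokens in document")
    dims = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dims)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


doc_a = document_vector(["whale", "sea", "ship"])
doc_b = document_vector(["sea", "ship", "whale"])  # same words, reordered
doc_c = document_vector(["tea", "party"])

sim_ab = cosine_similarity(doc_a, doc_b)  # essentially 1: same bag of words
sim_ac = cosine_similarity(doc_a, doc_c)  # noticeably lower
```

Because averaging ignores word order, two transcriptions with the same vocabulary in roughly the same proportions end up with nearly identical vectors, which is why duplicates score so close to 1.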

Here, for instance, are high-ranking similarities (99.99% vector similarity or above) for the first 100 works in Project Gutenberg. The numbers are the Project Gutenberg book IDs (see, for instance, this index of the first 768 works).

1.  The King James Version of (10) -similar to- The Bible, King James Ver (30)
2.  Alice's Adventures in Won (11) -similar to- Through the Looking-Glass (12)
3.  Through the Looking-Glass (12) -similar to- Alice's Adventures in Won (11)
4.  The 1990 CIA World Factbo (14) -similar to- The 1992 CIA World Factbo (48)
5.  Paradise Lost             (20) -similar to- Paradise Lost             (26)
6.  O Pioneers!               (24) -similar to- The Song of the Lark      (44)
7.  O Pioneers!               (24) -similar to- Alexander's Bridge        (91)
8.  Paradise Lost             (26) -similar to- Paradise Lost             (20)
9.  The 1990 United States Ce (29) -similar to- The 1990 United States Ce (37)
10. The Bible, King James Ver (30) -similar to- The King James Version of (10)
11. The 1990 United States Ce (37) -similar to- The 1990 United States Ce (29)
12. The Strange Case of Dr. J (42) -similar to- The Strange Case of Dr. J (43)
13. The Strange Case of Dr. J (43) -similar to- The Strange Case of Dr. J (42)
14. The Song of the Lark      (44) -similar to- O Pioneers!               (24)
15. The Song of the Lark      (44) -similar to- Alexander's Bridge        (91)
16. Anne of Green Gables      (45) -similar to- Anne of Avonlea           (47)
17. Anne of Green Gables      (45) -similar to- Anne of the Island        (50)
18. Anne of Avonlea           (47) -similar to- Anne of Green Gables      (45)
19. Anne of Avonlea           (47) -similar to- Anne of the Island        (50)
20. The 1992 CIA World Factbo (48) -similar to- The 1990 CIA World Factbo (14)
21. The 1992 CIA World Factbo (48) -similar to- The 1993 CIA World Factbo (84)
22. Anne of the Island        (50) -similar to- Anne of Green Gables      (45)
23. Anne of the Island        (50) -similar to- Anne of Avonlea           (47)
24. A Princess of Mars        (60) -similar to- The Gods of Mars          (62)
25. A Princess of Mars        (60) -similar to- Warlord of Mars           (65)
26. The Gods of Mars          (62) -similar to- A Princess of Mars        (60)
27. The Gods of Mars          (62) -similar to- Warlord of Mars           (65)
28. Warlord of Mars           (65) -similar to- A Princess of Mars        (60)
29. Warlord of Mars           (65) -similar to- The Gods of Mars          (62)
30. Adventures of Huckleberry (73) -similar to- Tom Sawyer Abroad         (88)
31. Tarzan of the Apes        (75) -similar to- The Return of Tarzan      (78)
32. The Return of Tarzan      (78) -similar to- Tarzan of the Apes        (75)
33. The Beasts of Tarzan      (82) -similar to- Tarzan and the Jewels of  (89)
34. The 1993 CIA World Factbo (84) -similar to- The 1992 CIA World Factbo (48)
35. Tom Sawyer Abroad         (88) -similar to- Adventures of Huckleberry (73)
36. Tarzan and the Jewels of  (89) -similar to- The Beasts of Tarzan      (82)
37. Alexander's Bridge        (91) -similar to- O Pioneers!               (24)
38. Alexander's Bridge        (91) -similar to- The Song of the Lark      (44)

The first pair here is of duplicates: both are King James Versions of the Bible. The same is true of lines 5 and 8, and lines 12-13: they’re just duplicates. All the other works are members of a novelistic series. Lines 2 and 3 are Alice in Wonderland and its sequel. Lines 6 and 7 are Willa Cather novels of the Great Plains trilogy. Lines 16-19 and 22-23 identify the Anne of Green Gables novels. Lines 24-29 are a cluster of Edgar Rice Burroughs novels from the Mars series, and there is another cluster of Burroughs novels, the Tarzan series, at lines 31-33 and 36. Lines 30 and 35 show Mark Twain novels of the Tom Sawyer and Huck Finn world. The algorithm even identifies the 1990s CIA World Factbooks as part of a series.

When I lower the cutoff similarity score, I can get even more interesting pairs. Less-recognizable series, like Paradise Lost and Paradise Regained, have similarity scores of around 97%. At that level, completely unrelated novels with the same settings, or written in around the same time period (Victorian novels, for instance), begin to cluster together.
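The effect of the cutoff can be sketched as follows. This is an illustration under assumptions, not the code behind the list above: the function name similar_pairs, the book IDs 1-4, and their vectors are all hypothetical, and each unordered pair is reported once rather than in both directions.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similar_pairs(doc_vectors, cutoff):
    """Return (id_a, id_b, score) for each pair scoring at or above cutoff,
    sorted from most to least similar."""
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(doc_vectors.items()), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= cutoff:
            pairs.append((id_a, id_b, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Hypothetical document vectors keyed by made-up book IDs.
toy = {
    1: [1.0, 0.0, 0.0],
    2: [1.0, 0.001, 0.0],  # near-duplicate of book 1
    3: [0.9, 0.3, 0.0],    # related to 1 and 2, but not a duplicate
    4: [0.0, 0.0, 1.0],    # unrelated
}

strict = similar_pairs(toy, cutoff=0.9999)  # only the near-duplicate pair
loose = similar_pairs(toy, cutoff=0.94)     # also picks up the related book
```

Comparing every pair is quadratic in the number of books, which is fine for a toy example but would probably need batching or an approximate nearest-neighbor index for the whole corpus.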

The chart below shows a PCA-reduced 2D vector space approximating the similarity between books 1-20. There are interesting clusters here: the American Constitution and Bill of Rights cluster together, along with the Declaration of Independence, the Federalist Papers, and Abraham Lincoln’s first inaugural address. Lincoln’s second address, however, clusters rather with his Gettysburg Address, and John F. Kennedy’s inaugural address.

PCA of PG Books 1-20


Non-fiction seems to be clustered at the bottom of the chart, whereas fiction is at the top. Quasi-fictional works, like the Book of Mormon and the Bible, are in between. Similarly, Moby Dick, a work of fiction that nonetheless contains long encyclopedic passages of non-fiction, lies in the same area. The most fantastical works, which are also the three children’s books, Peter Pan and the two Carroll novels, cluster together in the upper left.
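For readers curious how high-dimensional document vectors can be reduced to two plottable coordinates, here is a small PCA sketch in plain Python. It is an assumption-laden illustration rather than the code behind the chart: the four 4-dimensional rows are invented, the function names are hypothetical, and the principal components are found by simple power iteration with deflation instead of a library routine.

```python
import math


def center(rows):
    """Subtract the column means from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_component(matrix, iterations=500):
    """Largest eigenvalue and eigenvector by power iteration.
    The fixed start vector keeps the sketch deterministic; it would fail
    if it happened to be exactly orthogonal to the top eigenvector."""
    d = len(matrix)
    v = [1.0 / (i + 1) for i in range(d)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return eigenvalue, v


def pca_2d(rows):
    """Project rows onto their first two principal components."""
    centered = center(rows)
    cov = covariance(centered)
    components = []
    for _ in range(2):
        eigenvalue, v = top_component(cov)
        components.append(v)
        # Deflate: remove the component just found from the matrix.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(len(v))]
               for i in range(len(v))]
    return [[sum(r[j] * c[j] for j in range(len(c))) for c in components]
            for r in centered]


# Invented document vectors: the first two rows resemble each other,
# as do the last two.
rows = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.9],
    [0.0, 0.1, 0.9, 1.0],
]
coords = pca_2d(rows)  # one [pc1, pc2] pair per row
```

In the toy data, the first principal component separates the two groups of rows, much as the chart's axes separate fiction from non-fiction. The sign of each component is arbitrary, so only relative positions are meaningful.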

As always, the code used to create all of this is on GitHub. I welcome your comments and suggestions below!