Sente for PDF Management on the Mac and iPad (1): Capturing and Organizing PDFs

This post will focus on the business of capturing, categorizing, and organizing your PDFs in a coherent library using Sente for Mac.

Sente Loading Screen

In my last post, PDF Chaos? Digital Workflow Basics, I discussed the chaos that can ensue without a coherent filing system for PDF documents, and illustrated it with a chaotic demo library. I then walked through some “dos and don’ts” of file naming, splitting, and OCRing PDFs in a library staging inbox. Here we will start to transform this disorganized library, and you will see how you can simplify and organize your PDF Chaos while also exploring how Sente can help you with the rest of your Academic Workflow.

So for the first post (of several) on Sente, I will focus chiefly on setting up libraries and introducing Sente’s key features. I assume you will have read my first post, Introducing Digital Workflows for Academic Research, and my latest post, where I give some basic principles for workflows with PDFs.

 Subsequent posts will explore:

  1. Annotation and notetaking on PDFs, including tagging, and also annotating with your iPad on the go.
  2. Sente’s automated research and document collecting capabilities; smart collections; bibliographic formatting; and other selected advanced functions.
  3. How to make use of Sente Assistant and some amazing free AppleScripts to integrate the power of OPML into your workflow so that you can move your Sente annotations into Devonthink Pro Office, Scrivener, and other software, for writing.

But, why Sente?

Some people will argue that it is not worth paying for Sente when other software, like Mendeley, does similar things for free. This is a complex issue, not as simple as free versus paid, and “similar” can hide substantial differences. Since it cannot be discussed appropriately here, I will post a separate, companion piece on this issue, as it also offers the opportunity to discuss some key considerations about privacy and academic work, the pros and cons of paid versus unpaid software, and a more holistic view of the various trade-offs (including functionalities like social networking) that users should consider in selecting the core application component of their Digital Workflow among software like Sente, Papers, Mendeley, Zotero, and others.

For now, let’s just say it’s my opinion that Sente really shines over software like Zotero, for example, which does not offer an integrated cloud-based synchronization system for large PDF libraries and bibliographies along with a professional solution for serious annotation and idea collection during the review and thinking phase of your research. Again, with Devonthink, OPML scripts, and Sente Assistant, I’ll show you how you can use, search, tag, organize, and analyze your collected quotes and comments from reading your PDFs, and even how to drop them into outlining software and use them as fungible material for the writing and production phase of your work. But Sente itself is amazing as a one-stop-shop tool. Sente’s versatility is what makes it so effective for maintaining control in the research process, leaving you ready to mold and create the research product you want, and it is why I have chosen it as my staple workflow application.

In fact, according to President of Third Street Software, Inc. and Sente creator, Michael Cinkosky–with whom I’ve had the pleasure of discussing Sente and his long term goals for its development in some detail in preparing this and future posts–the name “Sente” derives from Japanese:

the name “Sente” is a Japanese term from the ancient game of Go. A player is said to have sente when they are controlling the direction of the game through the force of their moves. The other player is said to have gote (go-tay) because they have little choice but to respond passively to the player with sente. My goal with Sente was to help people feel more in control of their literature research and less like they were simply struggling to keep up.

As he elaborates, this vision of control drove the development of the software’s various features to where it is today:

My primary goal when I launched this company was to make it easier for people involved in research to acquire, organize and keep abreast of the literature most relevant to their research. I had already spent many years building software systems in support of scientific research (mostly biology) but I regularly heard complaints from users about how hard it was for them to stay current with the literature. All of the reference managers at the time were focused on formatting bibliographies, not on facilitating literature discovery and organization. I asked people what tools they were using to stay current with the literature, but they never had any. Programs like EndNote were (accurately, I think) seen as formatting tools, but not research tools.

So that is the main problem we have been focused on. For the first couple of years, we did not even do bibliography formatting, but people obviously want their reference manager to format bibliographies, so we eventually added this capability. But our primary focus was, and remains, on search, acquisition, organization, understanding, etc. Thus, we have devoted significant effort to features like: hierarchical tagging (what we call QuickTags); the ability to automatically capture quotations when highlighting text in a PDF; and transparent sync that lets you have your library up-to-date at all times, across all your Macs and iPads (and, soon, iPhones). We understand that people involved in academic research never really stop thinking about their research and they want to know they are not missing anything important in their field, and that once they find something, they don’t want to lose it. Our vision for Sente is that it be integral to the day-to-day activities of becoming, and remaining, an expert in each user’s chosen area of study.

I also asked Michael about how he sees Sente in relation to “free” tools like Mendeley, especially with regards to privacy and monetization, and he was kind enough to give me some details of his vision for the future of Sente. As I already said, I will discuss these issues in a later post.

Download Sente

Now, if you’re new to Sente, head on over to Third Street Software and download it; if you’re not, you might already know the basics covered here. The free license allows you a limited library, but the $59.99 academic license is completely worth it, giving you unlimited libraries and as much cloud synchronization space as you need. If you’re not convinced, use the free version until you are.

Considerations before we begin

Sente is a powerful piece of software that includes many functions. The next few posts are merely designed to demonstrate what you can do with it in some elementary ways, but I insist that, as with anything else worthwhile, if you like what you see in trying it here, you will eventually need to spend some time reading the Sente manual, especially with regard to its more complex cite and scan features, the use and modification of citation styles, the integration of Sente with Microsoft Word, Scrivener, and Mellel, setting up autolinks, etc. I realize that many people might balk at this initially, because learning new software can interfere with our work and involves an investment of precious time. But the truth is that Sente is such an amazing program because it combines and streamlines several functions and processes that used to belong to multiple applications, and as such it is worth spending some time (beyond reading these posts) to learn to use it properly.

Capture and Organization

Sente’s first amazing feature is its easy interface for capturing PDFs and organizing them, which I focus on in this post, leaving aside its research collection functionalities for later.

Before we start bringing in our PDFs, let’s set it up. I argued previously that it is really important to have a consistent system of filing. David Allen recommends an A-Z file, and I agree. Basically, I think every Sente library (which can be set up as a local or synced library) should be set up with the Chicago Author-Date (or APA) system in mind, and I mean this conceptually: Sente will allow you to format your actual citations and bibliography in all of the standard styles, and thousands of others. All I mean is that we will set up the library so that when you add a PDF to Sente, it will be added to the library bundle, deleted from its previous location, and given a new name based on its Author-Date-Title.

Here is a demo library, DigitalWorkflows, I’ve made for this series.

Screen Shot 2014-04-15 at 11.12.14 AM

For now, I will let the user explore the greater interface. Let’s go immediately to “Library Setup,” and “Attachment Handling.”

Screen Shot 2014-04-14 at 3.15.05 PM

“Attachment Handling” is where you set up your library. “Attachment” handling because Sente is going to allow you to create a reference in your library and attach the document to that reference.

Screen Shot 2014-04-17 at 1.30.14 PM

While there is a separate panel of Sente Preferences, the library settings are mainly here. I recommend the following settings (as pictured above), but the good thing is that no matter what you choose for your file names and filing system, even if you have a thousand items and attachments, Sente will automatically rename and re-file everything for you to fit your needs and whims, even if you change your mind later.

As I indicated, I model mine on the Chicago Manual of Style Author-Date model, which is very similar to the APA bibliographic model and, in any case, makes perfect rational sense: a folder for every Author (Last Name, Forename), a folder for the Year/Date of works, and a folder for individual Titles (which is a good policy, especially if an author has more than one book or article published in the same year).
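To make the folder logic concrete, here is a minimal Python sketch of the Author/Year/Title scheme described above. It is only an illustration of the idea, not Sente's own code: the function name `attachment_path`, the exact file-name pattern, and the character clean-up rule are all my assumptions.

```python
from pathlib import PurePosixPath


def attachment_path(last_name, fore_name, year, title, extension=".pdf"):
    """Build a relative Author/Year/Title/<file> path (hypothetical scheme)."""

    def clean(text):
        # Swap characters that are awkward in file names for a hyphen.
        return "".join("-" if ch in "/:\\" else ch for ch in text).strip()

    author_folder = f"{clean(last_name)}, {clean(fore_name)}"
    title_folder = clean(title)
    file_name = f"{clean(last_name)} {year} {title_folder}{extension}"
    return PurePosixPath(author_folder) / str(year) / title_folder / file_name


print(attachment_path("Allen", "David", 2003, "Getting Things Done"))
```

Because the path is computed from the reference's metadata rather than typed by hand, every file lands in the same kind of place every single time, which is the whole point of the setup.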

With Sente, when the file gets added to a library with this setup, the PDF is automatically renamed too. When we press apply, as the Sente box here shows, the software will now set up this structure for your library, and henceforth automatically move the files to the bundle, rename them, and delete the old files. Again, I advise selecting “file/renaming” instead of making a copy, because it makes little sense to have multiple versions of PDF files loose on your system outside of the library–unless under special circumstances.

Go ahead, press apply. Once you hit apply, you will receive a notification explaining your choice:

Screen Shot 2014-04-14 at 10.06.20 AM

Now, you may be wondering what “inside the library bundle” means. Where will Sente put my stuff? Sente stores files inside a closed library as a Sente library file, aka “bundle” (with a “.sente6lib” file extension), but don’t worry, that doesn’t mean you can’t access it. In fact, it just means that it keeps things filed for you automatically. The bundle is like a package containing all the references, attachments, and other information that comprises your entire library. Thus each library file is a bundle in this set up, and while you can set up libraries without bundles, this is not recommended, because it presents a hazard for breaking libraries and opens the door to losing data and inconsistencies. I make a master library for all my documents, and make new libraries including those and other files–or merely import Zotero or Endnote bibliographies as libraries–for different projects or with specific products in mind.

I want to show what things actually look like in your library bundle, so that you understand the rationality of the organizing principle Sente operates on, and so that you see concretely what the above pictured configuration looks like under the hood.

Library Bundle

To access the library bundle, navigate to the folder in which you keep the Sente library file (.sente6lib file), right-click it, and choose “Show Package Contents.” Here is what my “DigitalWorkflows.sente6lib” bundle looks like. What we see is our author-date framework. Later we will index this structure to our Macintosh Finder and Spotlight, and to DevonThink, but for now just note that while Sente gives you a beautiful interface to experience your files and use them, it is also ordering them and keeping them safe as data that is not proprietarily locked in its system.
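Since a macOS package is just a folder with a special extension, you can also peek inside it from a script. The sketch below walks a bundle folder and lists every PDF it finds. I am not assuming anything about Sente's internal layout here beyond "PDFs live somewhere inside the folder"; the function name `list_bundle_pdfs` is hypothetical, and you should treat the bundle as read-only.

```python
import os


def list_bundle_pdfs(bundle_path):
    """Return the relative paths of all PDFs found inside a bundle folder."""
    found = []
    for folder, _subfolders, files in os.walk(bundle_path):
        for name in files:
            if name.lower().endswith(".pdf"):
                full_path = os.path.join(folder, name)
                # Report paths relative to the bundle so the filing
                # structure is easy to read.
                found.append(os.path.relpath(full_path, bundle_path))
    return sorted(found)
```

Calling `list_bundle_pdfs("DigitalWorkflows.sente6lib")` from the folder that holds the library would print nothing fancy, just the list of relative paths, which is a quick way to confirm that the filing scheme you chose is the one actually on disk.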

Screen Shot 2014-04-17 at 1.33.16 PM

So, now that we have our library set up, have made the settings to our liking, and understand the concept behind the library bundle, let’s add our PDFs.

Adding PDFs to Sente Libraries from your PDF inbox

So, remembering our messy PDF inbox, let’s add the files one by one. Remember the article I mentioned in the last post? Namely, 724707_1.pdf?

Screen-Shot-2014-04-08-at-3.26.19-PM

Screen Shot 2014-04-16 at 1.19.59 PM

To add this file, simply drag it from the folder into the Library. Sente will then open up a “citation lookup” box.

Citation Lookup

Now, since we OCRed the PDF before, everything in it is searchable and highlightable. This is what makes that essential part of the workflow worthwhile. Sente goes the rest of the way: simply highlight the title of the article and right-click (which for us Mac users means control-click). You now get a choice of citation lookups.

Sente gives you two options: the first automatically searches for the selected text on Google Scholar, Google Books, Library of Congress, or WorldCat; the second copies the text, opens a search box in the selected catalog or database, and lets you manually paste it to search for it there. I choose Google Scholar for now:
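If you are curious what an automatic title search amounts to, it is essentially a web search with the highlighted text as the query. Here is a small sketch that builds a Google Scholar search link from a title. This is my own illustration and says nothing about how Sente performs its lookup internally; the function name `scholar_search_url` is hypothetical.

```python
from urllib.parse import urlencode


def scholar_search_url(title):
    """Build a Google Scholar search URL for an exact-phrase title search."""
    # Wrapping the title in double quotes asks for an exact phrase match.
    query = urlencode({"q": f'"{title}"'})
    return "https://scholar.google.com/scholar?" + query


print(scholar_search_url("PDF Chaos"))
```

Pasting the resulting link into a browser gives you the same kind of result list you see in Sente's targeted lookup, minus the one-click import.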

Screen Shot 2014-04-16 at 1.20.46 PM

Once I highlight the title, control click it, and select “Google Scholar,” Sente now opens its targeted citation lookup mode. Voilà, here’s our reference!

Screen Shot 2014-04-16 at 1.20.57 PM

Screen Shot 2014-04-16 at 1.21.19 PM

On the right-hand side, we see a reference box with targets. Upon clicking the target that matches, Sente pulls up the reference editor, which will allow us to edit the reference before adding it to the Library.

Screen Shot 2014-04-16 at 1.21.25 PM

An aside on precise and imprecise metadata 

Now, people should understand here that Sente merely gives you access to a wide variety of options for importing “metadata” about your document. In my experience, Google metadata tends to be inconsistent and sometimes imprecise, because (I assume) they often build the metadata by scanning title and catalog pages and having an algorithm make somewhat accurate guesses, based on large pools of data, about which pieces of information (type of publication, author name, editor name, press, etc.) belong in which fields. Though it is almost always better to select WorldCat (OCLC) or an official academic library catalog for importing metadata, Google is so pervasive that I wanted to show that, while it does work, it exemplifies some pitfalls that you should always look out for when adding metadata, period.

Thou shalt always make sure your metadata is accurate the first time, and save hours and embarrassment later

Just as I have put so much emphasis on coherent filing, so too must we put emphasis on precise metadata, not least because, with every file you add, correct metadata will ensure you can actually find things in your library. Depending on the database you import your information from, your fields will sometimes not be populated completely accurately. If you do not check the first time that the entry looks good and that the data is correct, you will have to spend hours later correcting it all when you go to make your official bibliography and use the cite and scan functions. In the worst-case scenario, your bibliography will have embarrassing errors if you have to use it but don’t have time to fix it when you do notice.

In other words, we still need to exercise rational intelligence in populating the fields (and this is often the case, because full automation is somewhat of an absurd idea). Sente populates the fields correctly only insofar as the data in the originating database is correct to begin with.

As a general rule, WorldCat and academic libraries like Stanford and the University of Wisconsin work quite well within Sente. You can also target from Columbia’s CLIO, which I’ll show in a later post. The point is that since you want to treat this like your real, legitimate library (because it is as real and legitimate as a paper library), you want the information to match up as much as possible the first time.

Checking your entry

Screen Shot 2014-04-16 at 1.21.25 PM

Screen Shot 2014-04-16 at 1.24.22 PM

Look at this example. Does anything look off? First of all, in the original import of metadata, it says “conference proceedings.” In many ways, conference proceedings will function like journal articles, but I prefer in this case, based on the information available to me, to classify this as a journal article. Always check that it’s the correct sort of publication. Sometimes it is not, or it will require you to decide, for the long term, whether you care about the difference between the books and conference proceedings categories, or between certain proceedings and the stand-alone articles that come out of them. My advice is merely to be consistent. Moreover, note that Google Scholar has populated “7247, 724707” in the pages field. This is clearly not the pages, but the volume information for this database publication! (If you click “add new reference” in a hurry, you have just added something you’ll need to correct later, or will inconveniently discover as an error.) We now discover the origins and rationale of 724707_1.pdf as a file name.
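The “7247, 724707” slip is exactly the sort of thing a simple sanity check can catch before it reaches your bibliography. The following sketch flags a pages field that does not look like a single page or a page range. It is a rough heuristic of my own (the name `check_pages` is hypothetical), and it will wrongly flag legitimate oddities such as Roman-numeral pages or article numbers, so treat a warning as a prompt to look, not as a verdict.

```python
import re

# A page range: digits, a hyphen or en dash, then more digits.
PAGE_RANGE = re.compile(r"^\d+\s*[-–]\s*\d+$")


def check_pages(pages):
    """Return a warning string, or None if the field looks plausible."""
    text = pages.strip()
    if not text:
        return "pages field is empty"
    if text.isdigit():
        return None  # a single page number
    if PAGE_RANGE.match(text):
        return None  # something like 101-115
    return f"'{text}' does not look like a page or page range"


print(check_pages("7247, 724707"))
```

Run against the imported record above, the check complains about the pages field, which is your cue to open the reference editor and fix it before clicking “add reference.”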

Here I not only correct the data, but take the opportunity to add the DOI (digital object identifier), and also check to make sure there is nothing out of place. If the item is an edited volume with one or more editors qua authors (i.e. the volume’s editor is the primary citable contributor), you can change the names to “editor” by clicking the drop-down box, and then click the editor category, selecting “make editor primary contributor role.”
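A DOI is worth adding because it resolves to the publication through the doi.org resolver. As a small sketch, the function below checks that a string has the usual DOI shape (a “10.” prefix, a registrant code, a slash, and a suffix) and turns it into a link. The pattern is a common loose check rather than a full validator, the name `doi_url` is hypothetical, and the DOI in the example is a made-up placeholder, not the DOI of the article in the screenshots.

```python
import re

# Loose shape check: "10.", 4 to 9 digits, a slash, then a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_url(doi):
    """Return the doi.org link for a DOI, or raise ValueError if malformed."""
    doi = doi.strip()
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a plausible DOI: {doi!r}")
    return "https://doi.org/" + doi


# Placeholder DOI for illustration only.
print(doi_url("10.1234/example.5678"))
```

Note that DOIs containing unusual characters may need percent-encoding before they work as links; this sketch does not handle that.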

In the “preview” window, which I have set up to preview the citation according to my custom Chicago 16 Author Date bibliographic format rules–a slightly modified Chicago 15 AD–we see what the citation would look like in a bibliography. Once I fix things and click the edit button again, the updated citation will appear in the preview. Everything seems in place. Once we click “add reference,” we now see our document added to our library, and the PDF is readable in the reading window. If you use Sente on a synchronized iPad, this will automatically synchronize.

Screen Shot 2014-04-16 at 1.24.51 PM

Stay tuned for the next post on Sente, and in the meantime get started with your new library!

PDF Chaos? Digital Workflow Basics

Is PDF chaos on your mind? Is your Digital system insane in the membrane?

This is the second post of the series Digital Workflows for Academic Research on the Mac, and it’s, for lack of a better phrase, about taming your wild wild west world of unorganized PDFs, rogue USB sticks, and general lack of an organized digital system. You may not realize how much of a mess your digital world may be, and if you are a pristine PDF organizer already, offer your comments at the end of today’s post on how you keep it all together. I’d like to encourage discussion on this, since my way is not the only way. Here I offer some general principles. But first, let’s think back to the old “analog” world of things.

If your desk looks like that, you may very well be a creative genius. You may very well be that person who can legitimately claim that “it looks messy, but there’s a method to the madness.” If your desk is always perfect, you’re likely to say, “What a mess! That is madness.” Now, even if such a pile is evidence of your genius, and even if this pile did once contribute to a flourish of genius and a final product, in its current state of chaotic “pileness” I hardly believe it would be of any use to you for future consultation without sorting, collating, and long-term organizing. If you were ever to need these files again to generate new ideas, or for reference, you would need to sort this pile of papers, notebooks, printouts, newspapers, and books into some sort of filing system to make it serviceable again for a project. Until that happened and you restored order to the pile, the pile would literally be “on your mind.”

In fact, for David Allen, the Getting Things Done guru, “your mind is about having ideas not holding them,” and one of the first tasks he recommends in implementing his GTD system is for aspirants to stress-free productivity to physically account for every single object in their desk and home office space (check out his TED talk). For Allen, loose papers and chaotic piles of reading material can constitute what he calls “open loops” that drain our mental focus and energy, and the small behaviors that rein them in can feel “awkward,” “unnatural,” and even “unnecessary.” They only cease to be overwhelming “open loops” when we put the things into buckets and containers, consciously decide if they are important for a project, and then decide whether or not they are actionable as a step in realizing that project. In Getting Things Done (2003, 13), Allen describes these generally in terms of “commitments,” but the general behaviors are worth mentioning for us in terms of PDF workflow basics. Perhaps for us we could call them “reading, writing, and research commitments”:

Managing commitments well requires the implementation of some basic activities and behaviors: First of all, if it’s on your mind, your mind isn’t clear. Anything you consider unfinished in any way must be captured in a trusted system outside your mind, or what I call a collection bucket, that you know you’ll come back to regularly and sort through. Second, you must clarify exactly what your commitment is and decide what you have to do, if anything, to make progress toward fulfilling it. Third, once you’ve decided on all the actions you need to take, you must keep reminders of them organized in a system you review regularly.

For Allen, we are foolish to depend on our psyche as a system to manage the mess of projects and things of our creativity. If we do, we lose perspective and the ability to put our focus where we need it; we lose both the control and the stability we need to keep our attention on the processes and tasks necessary to accomplish (realize) our goals.

What system? You mean, like, an operating system?

From Wikipedia: Xerox Star workstation (1981) introduced the first GUI operating system


Ever since the development of the GUI (graphical user interface) and WYSIWYG (what you see is what you get) interface back at the Xerox PARC research center, and even before that, with Ivan Sutherland’s Sketchpad (1963), which produced the concept behind the user interface we are so familiar with in the Microsoft Windows and Apple Macintosh operating systems, the “operating system” has been modeled (even leaving the whole problem of apps aside; see my previous post) entirely on the banal “files,” “folders,” and note “tags” of the analog office process of the pre-digital, “paper pushing” era. Indeed, this is still a valid model for all productivity and workflow (see Merlin Mann’s 43 Folders), especially in academics. In his Laws of Media: The New Science (1992), Marshall McLuhan pointed out that new media technologies, as tool forms, are always an extension of our body or sense organs. As much as a “mouse” is a bodily input device, so the folder is a digital collection bucket. In this sense, the “file,” “folder,” and “tag” concept, which Macademic has recently blogged on, is still very much an old concept (despite the recent and beneficial craze about tagging) and highly relevant to what I say here about bringing order to PDF chaos: just as you would need a system for naming files, filing them, categorizing them, and “tagging” them in an analog system, and would depend on it consistently every single time, so you need one in its digital equivalent, without mistaking the computer for something that it’s not: a semi-autonomous, automatic thing-managing machine. When computer technology appears to our consciousness as a quasi-autonomous interface or interactive experience (as it often does with our proliferation of apps, devices, and “multimedia”) rather than an extension of our bodies, we easily default to the Google sirens and fall into the gadget trap.
We can confuse the “medium” with the “message.” That is, we believe we don’t need to manage our files and apply extensive rational intelligence to them as we would in the paper world. The ability to “Google” or “Spotlight Search” documents creates an illusion or simulacrum of a system, but it is only an operating system, that is, an apparatus of files, folders, etc. that a user must use intelligently and rationally, deploying its tools and structure with regard to process and product. We need to have a process and a deliberate collection system.

PDF Workflow. Making it work.


So, it’s okay if you do not have your PDF situation under control, but often a bad PDF situation doesn’t look like it would if it were a real bungle of paper mess, and you can easily be deluded about how orderly things really are, especially given the predilection (I discussed above) we have for thinking we can always “Google” things back into our conscious realm of creativity.

Two years ago I found myself literally drowning in PDFs: PDFs I’d downloaded from JSTOR or other online databases like EBSCO; PDF excerpts of monographs or chapters of serial publications I’d scanned and OCRed using ABBYY FineReader in the library for my research; complete PDF books from Google Books; and other forms of ebooks. I realized this chaos was the digital analogue of a paper mess all over the place (though it disappeared from view and was easy to hide and consider otherwise), and that it was interfering with my ability to remember, find, organize, think, and act on the ideas I wanted to develop in my papers and research. Let’s be honest: though the amazing new innovations of technology gave me greater access to obscenely useful quantities of information, I could hardly synthesize and keep my documents together in a way that actually helped me squeeze out a presentation or paper. I was miserable!

Since PDFs are the digital document standard, it’s clear that, depending on the equipment used and the number of storage media involved, you can accumulate severe PDF chaos in a short period of time without much effort at all. Just by dint of not thinking of a system for Digital Workflows with PDFs, your digital workspace can begin to turn into a big bunch of PDF research “open loops.” Some scan stations, to give one example, prompt users to save their PDF files to USB sticks, send them by email, or upload them to Dropbox. If you’re like me, then you have normally done this on an ad hoc basis: lots of PDFs are somewhere in your Gmail inbox attached to messages, some of them are scattered in various folders of your computer, and others are here, there, and, well, on three different USB sticks, one of which you may have left in a library terminal never to see again. This is further complicated by the fact that files you scan or download may have very diverse file names, some of them “PharosScan(1),” others the random author or title descriptor you briefly entered in passing, and many simply random digits.

Often, downloading a file from the library begins with a nonsensical file name. For example, on Columbia’s Library website, I find an article I’d like to download and read.

Screen-Shot-2014-04-08-at-2.57.20-PM-300x184

What is the file name when I click to download it? 724707_1.pdf.

Screen-Shot-2014-04-08-at-3.26.19-PM

That’s not entirely helpful, and certainly not anything like PowleyEtAl.2009Enriching.PDF. I doubt I’d remember where it is or that I ever intended to look at it with this name. It’s also not helpful if I have multiple download locations for different browsers. Nor is it helpful if a PDF has not been OCRed and I fail to OCR, label, and file it immediately. In this case, it turns out my PDF is already OCRed. But if it were not, I’d need to open the document in FineReader or Adobe Acrobat Pro (Devonthink Pro Office also has an OCR option, which we will discuss later).

Thou shalt always OCR your PDFs immediately

Every workflow needs to have a system for OCR scanning (optical character recognition), naming, and containing PDFs. Once you have your PDFs scanned, downloaded and OCRed, you also need to have a place where you keep, annotate, and use your PDFs and can name them according to the same parameters every single time (for example, author/date/title);  where they can be updated and stored coherently. It’s also important that they never disappear within the proprietary grasp of an application.

So to illustrate the point (the magnitude of the mess really depends on what system, or lack thereof, you have cultivated and how many files you have), I’ve assembled a smattering of PDF files from an old USB stick, and I will show you, in the next post, how Sente can come to the rescue. The rest of this post will simply illustrate some considerations in preparing files for import into your permanent system. As you see, for this demonstration I’ve merely found and downloaded a couple of other files to go with an old USB stick backup I found on my desk. It perfectly illustrates my pre-workflow lack of a system.

How useful do you find all these random file names?

Screen Shot 2014-04-11 at 2.55.57 PM

Even if you don’t remember what the hell “filetmp_1389908321.pdf” is, Sente will not only help you remember, but will help you capture its bibliographic record, and allow you to set up a coherent library system for containing it.
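If you want a quick inventory of which files in a folder have names that tell you nothing, a few lines of Python can flag them. The sketch below is a crude heuristic of my own devising: it strips a handful of generic filler words (the list is an assumption based on the examples in this post) and calls a name “opaque” if fewer than four letters are left. It will misjudge some names, so use it as a to-do list generator, not an oracle.

```python
from pathlib import Path

# Generic fragments that scanners and browsers like to put in file names.
# This list is an assumption; extend it to match your own mess.
FILLER_WORDS = ("filetmp", "pharosscan", "scan", "tmp", "download", "untitled")


def is_opaque_name(file_name, min_letters=4):
    """Guess whether a file name carries no author/title information."""
    stem = Path(file_name).stem.lower()
    for filler in FILLER_WORDS:
        stem = stem.replace(filler, "")
    letters_left = sum(ch.isalpha() for ch in stem)
    return letters_left < min_letters


for name in ("724707_1.pdf", "filetmp_1389908321.pdf", "PowleyEtAl.2009Enriching.PDF"):
    print(name, "->", "opaque" if is_opaque_name(name) else "descriptive")
```

The first two names come out as opaque and the third as descriptive, which matches what your own eyes tell you.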

But before we go there, let’s go ahead and OCR the files, if necessary. This is a must-do, first of all, because it is impossible to effectively keyword search, read, annotate, and import the bibliographical information of un-OCRed PDFs, even leaving any applications aside. One of the greatest things about Sente is that it will allow you to almost instantly capture the correct bibliographic information of PDF documents from online databases, and to rename and file the PDFs into a usable library. Sure, some other software does something similar, but I promise Sente gives you more power and control. In any case, even if you someday opt not to use Sente, OCRing your texts and renaming them is good practice. Now, remembering our whole discussion above about folders and real files, you also need to create an “inbox” for PDF documents as part of your workflow. (In a later post I’ll show you how Devonthink Pro Office lets you automatically import and OCR PDFs and, additionally, create an intelligent and AI-enhanced filing system.)

Make a folder on your desktop or in your downloads folder where you’ll keep PDFs that need to be OCRed and then processed into Sente. I’ve called this now “To Add Sente,” in my example. Sente allows you to capture and download PDFs directly using its own browser, but you’ll want an inbox for “to add” files too since you will sometimes want to put them in different libraries or add them to Sente later, if needing OCRing, and if you’re not downloading them directly from a database.
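If your PDFs are scattered over several folders and USB sticks, you can gather them into the inbox with a short script rather than by hand. Here is a minimal sketch, assuming you are happy for the files to be moved (not copied) and for name clashes to be resolved by appending a counter. The function name `collect_pdfs` and the folder paths in the usage note are hypothetical; try it on a test folder first.

```python
import shutil
from pathlib import Path


def collect_pdfs(source_folders, inbox):
    """Move every PDF under the source folders into a single inbox folder."""
    inbox = Path(inbox).resolve()
    inbox.mkdir(parents=True, exist_ok=True)
    moved = []
    for folder in source_folders:
        # sorted() takes a snapshot, so moving files does not upset the walk.
        for pdf in sorted(Path(folder).resolve().rglob("*")):
            if not pdf.is_file() or pdf.suffix.lower() != ".pdf":
                continue
            if inbox in pdf.parents:
                continue  # already in the inbox
            target = inbox / pdf.name
            counter = 1
            while target.exists():
                # Never overwrite: add -1, -2, ... to clashing names.
                target = inbox / f"{pdf.stem}-{counter}{pdf.suffix}"
                counter += 1
            shutil.move(str(pdf), str(target))
            moved.append(target)
    return moved
```

A call such as `collect_pdfs(["/Volumes/USBSTICK", "Downloads"], "Desktop/To Add Sente")` would sweep both locations into the inbox; adjust the paths to your own machine.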

If your PDF is not OCRed, I highly recommend using Acrobat Pro. The Academic license is only about $100, and it is well worth the cost.

OCR With Adobe Acrobat

Once in Acrobat, open your document. If the text does not highlight, or pasted text comes out garbled, click Tools, then “text recognition.” Click “in this file.”

For most needs, 300 dpi is the minimum acceptable resolution for grayscale texts, while 600 dpi is archive quality. 600 dpi takes longer, depending on the power of your CPU and the length of the document, but you will want to run it at 600 if your document has particularly small text and apparatus, or more than one language. Other software, like ABBYY FineReader on the PC, offers options like 400, but since there’s no intermediate setting here, I usually choose 600. In the example, I select 600 dpi and choose English as the language (Acrobat allows you to choose from a long list of languages!).
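To get a rough feel for why 600 dpi is slower than 300, it helps to do the arithmetic on how many pixels a page turns into. The sketch below assumes a US Letter page (8.5 x 11 inches), which is my assumption and not something Acrobat reports; it says nothing exact about processing time, only about the amount of image data involved.

```python
def scan_pixels(width_in, height_in, dpi):
    """Return (width_px, height_px, megapixels) for a page scanned at the given dpi."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px / 1_000_000


# Assumed page size: US Letter, 8.5 x 11 inches.
for dpi in (300, 600):
    w, h, mp = scan_pixels(8.5, 11, dpi)
    print(f"{dpi} dpi: {w} x {h} px, about {mp:.1f} megapixels")
```

Doubling the resolution doubles both dimensions, so each page carries four times as many pixels at 600 dpi as at 300.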

In my example folder, I’ve now OCRed this file. I’ve also checked that all my documents are OCRed. They are, but I come across a particular situation.

Thou shalt split your PDFs, and only then OCR them!

Here is a PDF (in German) that I worked from as translator of a monograph.


The problem: every page of the PDF is in horizontal orientation and contains two real book pages. We see here the jump from analog to digital through the scan medium, and the problems it can pose. Most people would find this annoying but think it doesn’t really matter. My opinion is that it does: if you don’t split the document and re-OCR it, you risk:

1) Garbling cited page numbers (John Sidiropoulos has thought about this a lot and it’s important). If you really use and annotate the document in the future, and you cite text on page two that might actually be either real page XXXII or page XXXIII, it’s better if each real page stands for only one page: pg. 2 PDF = XXXII, pg. 3 PDF = XXXIII. This cuts confusion. If you don’t split the PDF now, it is a real pain and gets in the way of your work later.

2) The second risk is that you copy a quote that spills across the entire horizontal page, so for every line of one paragraph on the right, you also get the adjacent line on the left. This is not always the case, but by golly, why bother with that kind of problem?!
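The page-number point in item 1 can be made concrete with a little arithmetic. This is only a sketch: the function names are invented, pages are plain integers rather than Roman numerals, and `first_printed` (the printed page number that appears on the first PDF page) is something you have to look up in your own scan.

```python
def printed_pages_two_up(pdf_page, first_printed):
    """Printed pages shown on one landscape page of an unsplit, two-up scan."""
    left = first_printed + 2 * (pdf_page - 1)
    return left, left + 1


def printed_page_split(pdf_page, first_printed):
    """Printed page shown on one page of a split, one-page-per-page PDF."""
    return first_printed + (pdf_page - 1)


# Unsplit: PDF page 2 is ambiguous, it could be either of two printed pages.
print(printed_pages_two_up(2, first_printed=30))
# Split: every PDF page stands for exactly one printed page.
print(printed_page_split(2, first_printed=31), printed_page_split(3, first_printed=31))
```

With the unsplit file, a citation to “page 2” always needs a second look at the scan; with the split file, a single fixed offset gets you from the PDF page to the printed page.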

It is essential for any workflow to split these dual-paged PDF files. A lot of scan stations do this automatically now, and FineReader in the DHC does too. But if you come across a PDF from before this was common, or otherwise find a scan like this, you just need to split it. When you want to cite the page of a PDF, no software will always number its pagination perfectly anyway. There may still be a mismatch between the PDF page and the real page of the printed text, but splitting gets you closer to solving this problem and saves extra confusion. We want things clean and für ewig, to invoke Goethe, which means something like “for all time” in German. Your PDF library is not an ad hoc matter. Think about it like your real, pristine print library: would you want to use real printed books with messed-up pagination? Not really.

You want to keep things clean and consistent.

Splitting a PDF

While Acrobat can also split PDFs, I recommend always doing this task first, before OCR, with a nifty freeware app called PDF Scissors.

Open your file. Choose “all together.” It will then “stack” the pages of your file and allow you to “crop” them into a single PDF. First, though, you need to drag a rectangle around the borders of the pages. A safe place to crop is just outside the thick overlapped text. You will have to guess exactly where the center of the page is. Usually you will see a clearly darker line in the very center between pages, and in some cases this center is very clear because there is no text in it.

Our new file is now clean, and ready! Once I save the file, here “singleeinleit,” and check it for accuracy, I should delete the double version! Why keep a garbage file you might accidentally use and then have to think, “where is my split file”? This file can then be OCRed in Acrobat and saved.
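The rectangles you drag by hand come down to simple geometry, sketched below in Python. This is not PDF Scissors’ own interface, just an illustration: `split_boxes` is a made-up helper, the coordinates are in PDF points with the origin at the lower left, and the 842 x 595 page size (a landscape A4 spread) is an assumption for the example.

```python
def split_boxes(width, height, gutter=None, trim=0.0):
    """Return (left_box, right_box) as (x0, y0, x1, y1) for a two-up page.

    gutter is the x position of the dark center line (defaults to the middle);
    trim cuts that many points off each side of the gutter.
    """
    if gutter is None:
        gutter = width / 2
    if not 0 < gutter < width:
        raise ValueError("gutter must lie inside the page")
    left_box = (0.0, 0.0, float(gutter - trim), float(height))
    right_box = (float(gutter + trim), 0.0, float(width), float(height))
    return left_box, right_box


# Assumed page size: a landscape A4 spread, 842 x 595 points.
left, right = split_boxes(842, 595)
print(left)
print(right)
```

If the dark line between the pages is off-center in your scan, pass its position as `gutter` instead of relying on the default midpoint.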

With all of our PDFs split and saved, we are ready to build our Sente library.

In the next post of the series, Using Sente for PDF Management on the Mac and iPad (1): Capturing and Organizing PDFs, I’ll show you how we can use Sente to automatically capture the correct bibliographic information of each PDF, automatically re-name and file each by author, date, and title in a permanent system, and how to index that library bundle for broader use within your operating system, while protecting its integrity.

Introducing Digital Workflows for Academic Research on the Mac

Annotating on Sente 6

This is the first, introductory post of what will be a series of posts for the Digital Humanities Center on the topic of Digital Workflows for Academic Research for Mac.

Digital workflows? What does that mean? Does this involve apps and nifty tools, hacks, and tutorials?

Yes. Good news for the huddled masses staring at their device and computer screens night and day, including the e-masses huddled in Butler library: of course this is about apps–plenty of apps. Apps for your iPhone, iPad, and Mac; web-based apps, cross-platform apps, I may even mention some PC apps. “There’s an app for that” is now quasi-proverbial (Thanks Apple!), and I promise to deliver here some great approaches—nifty tools, hacks, and tutorials—to help Columbia Mac users go from the phase of information collection and capture between print and digital media of academic research, to the analysis, annotation, and creativity phase of article and publication production, all while navigating the many questions and time-wasting puzzles that inadvertently arise between our Gmail inboxes, dozens, hundreds, or thousands of (either OCRed or unOCRed) PDFs, our own notes, web snippets, nuggets of data, bibliographic inanities, and the blinking cursor of the word processor in writing projects—before anything arrives at a publisher, conference, or your adviser’s desk (ahem, I mean, Gmail inbox) in the form of a reviewable piece of your own academic writing.

Why Mac? This series focuses nearly exclusively on Mac for two reasons:

Over the last ten years, along with Apple’s rising popularity, the Mac share of the higher and secondary education market has reached a saturation point that dwarfs its previous educator-friendly reputation of the late 1980s and 1990s. According to some studies, as many as 60-70% of undergraduate and graduate students at major institutions use Mac exclusively, and also increasingly iOS devices (iPads and iPhones), as their primary devices for school and study, as well as for recreation and personal use. While the Mac share of the overall market is still nowhere near as large as that of the PC, Mac users are beginning to predominate in education, engineering, and science. iPads are also becoming a serious learning tool. Columbia’s academic Mac users are no exception.

Because of this, there is now an active community of software developers and innovators—including many students, faculty, and scientists at the institutional level—working together to make new tools and approaches for improving methods for “digital” academic research (for example, Macademic) in the humanities, sciences, and social sciences on the Mac platform. Many of the hottest applications and workflows for academic research now are clustering around the Mac. This is not to say that there are not great tools or there is no development being done in the PC/Wintel arena, but this year we decided that we wanted to help the sizable Columbia Mac student/faculty community by introducing them to the possibilities that they may not know of, as well as offer them some support for exploring digital workflows on the Mac at the Butler Digital Humanities center.


Annotating PDFs with Sente for iPad

The fact that there’s an app for that is a good thing, because writing and researching is hard work. Mac software like Sente 6 (with Sente Assistant), Devonthink Pro Office, Scrivener, OmniOutliner, and Novamind 5 Pro, harnessed with the power of open-source Apple Scripts, a conceptual understanding of OPML architecture, and a demonstration of the general principles behind the practice of the digital workflow (which I hope to introduce, albeit not exhaustively, in the series), will present the user who wants to improve his or her digital process with real solutions for accomplishing efficient, organized research, while saving time and frustration and avoiding organizational catastrophes (which can delay or even undermine the realization of creative work) between iPhones, iPads, and the Mac. Examples include how to use Sente to annotate and read PDFs on the go on an iPad and keep them synchronized with cloud-based libraries, and how to use Sente for Mac to import bibliographical information coupled with your PDF libraries cleanly and export your organized annotations and quotations for use in your thesis or book projects.

I’ll offer some very tailored approaches I’ve discovered in the process of doing my own dissertation research and preparing this series as a Digital Humanities Intern this year. Daniel Wessel’s blog, and now book by the same title, Organizing Creativity, however, is a must-see for anyone who is ready to dive into the theoretical and highly detailed practical considerations of realizing creative work in the digital world generally, and has been a major influence on the approach I have developed (Thanks, Daniel!). The highly technical experiments in scripting on the Mac with various applications for academic research by John Sidiropoulos over at OrganoGnosi, moreover, have also been formative in developing this series, and offer a glimpse (for both expert users and novices) into the very many difficulties faced in trying to make and streamline a series of processes, like simply organizing and keeping PDFs, bibliographic information, and annotations together across applications, according to various needs.

But I will not cover every possible app and solution, nor pretend to, because the flip side of the “there’s an app for that” culture is that the proliferation of tools and infrastructure for productivity offers little coherent and cogent guidance on the best-tuned methodologies for actually accomplishing sharable, realized research.

Not only does new scientific research show that true “multitasking” is a myth–yes, you know that checking your email and your Facebook, texting, reading a book for the first time, and writing a term paper on it simultaneously does not really work out–but it also suggests that the multiverse of e-and-iWhatevers which promise us shortcuts and “streamlining” of many tasks may not be as helpful as all the hullabaloo makes them appear, especially not when doing serious intellectual work. As Ryan Kalember pointed out recently over at Quartz (along with many, many others), perhaps “The biggest productivity killer is that there’s an app for that … and that … and that, too.” We are not only drowning in a panoply of devices, apps, and WiFi connections; these rapid advances in technology, and the cultural mindsets and practices which follow, as represented by the app mentality, can also contribute to the illusion of productivity and careful goal-driven work, and at the very least raise serious questions about the app mindset in regard to tackling serious intellectual challenges.

Katherine Xue, in an article on the app youth culture in Harvard Magazine, cites a recent book by Howard Gardner and Katie Davis, The App Generation:

“This is a generation that expects and wants to have applications,” says Gardner. Applications, more commonly known as apps, are shortcuts designed for accomplishing specific tasks. They’re ubiquitous, powerful, and strongly structured, and the authors argue that they’re changing the way we think. “Young people growing up in our time are not only immersed in apps,” they write, “they’ve come to think of the world as an ensemble of apps, to see their lives as a string of ordered apps, or perhaps, in many cases, a single, extended, cradle-to-grave app.” The app mindset, they say, motivates youth to seek direct, quick, easy solutions—the kinds of answers an app would provide—and to shy away from questions, whether large or small, when there’s no “app for that.”

Now, those born in the 1980s or earlier are likely to remember a world in which the entire stuff of academic research—that is, reading and writing—were, well, analog! Doing research, as it had been done, since perhaps the time of Gutenberg–or even earlier in the scholastic libraries of the Middle Ages–meant the following experience (which most graduate students and faculty who work in Butler should be familiar with): going to a physical library, handling dozens of real (often heavy) books, selecting them, and reading them. With pen and paper—or some other form of recording device like vellum (skin)—collecting, analyzing, capturing, recording, and synthesizing information both as quoted and cited from reading texts and from one’s own thoughts and reactions to the quotes, facts, narratives, and textual varia of other “authors” and “texts,” often over long periods of time and on a variety of projects or subject matter. Fast forward to the 20th century, post 1970s, and this perhaps included making “photocopies” of articles, snippets, or the title pages of references on the “Xerox machine” (hot 1980s technology!) or using an electric Microfilm or Microfiche reader to locate archival information or back periodicals. The general principle of organization involved custom systems of note cards, labels, binders with dividers, individual notebooks, etc. Usually some form of handwritten material and a series of sources, with comments, were organized at a desk and eventually, the researcher drafted a creative, new piece of research by hand or on a typewriter. 

A workflow for academic research from around 1984, until around 1996-2000, when the internet began to go mainstream, retained most of the above aspects, and might have been “digital” in so far as it operated on some sort of x86 silicon chip, and involved a “word processor” with bibliographic management software like Endnote. The former was little more than a more versatile typewriter, and the latter a translation, so to speak, of a card catalog model of bibliographic reference keeping to a digital medium. 

Enter Digital Workflows for Academic Research circa 2014. In the world of Web 2.0, there are tools and apps galore, vast databases of digitized books, articles, shared information, websites, etc. The collection of information involves vastly greater quantities of text, and even new media of citable and crucial information, available through the internet and offering incredible new possibilities for research. However, the repository of history and the human sciences still predominantly exists on paper and in the library stacks (how many times have you had to scan something in the Digital Humanities Center? Raise your hands). Research now involves not only a mixture of digital and analog materials, but a confusing complex in which we, the biological and rational creatures, must work between machines and digital media and yet still adhere to the rightly rigorous demands of linear information presentation and scholastic conventions in the production of papers, articles, and dissertations. In short, little has changed about the craft of writing and doing scholarship, but much has changed that makes focused and productive scholarship more challenging when the “knowledge worker” must work between many sources of digital and analog information and synthesize them “online” and “offline,” so to speak.

Using OPML based outlining in Novamind Pro

The situation we face today generally requires students, writers, scholars and academics not only to employ research methodologies appropriate to their disciplines (or even interdisciplinary methods to answer questions which now transcend traditional spheres of academic expertise) but research methods and “workflows” which bridge what Marshall McLuhan, the visionary prophet of the 1960s, called the “hot” and “cold” media of today.

My series will not only offer some solutions to the above challenges for Mac users, but will also implicitly show how the older methods of research known to the pre-1990s bunch of us can benefit Millennial “app” scholars as much as the latter can benefit those of us seeking to integrate new technology into the old business of knowledge production.

Stay tuned for a series of posts through the end of the spring semester on topics such as PDF management, outlining, syncing annotations to PDFs by quote, page number, and comment, searching your own tagged annotations in Sente and Sente Assistant, building outlines and storing information for dissertations and books in Devonthink Pro Office, writing in Scrivener, OPML and outliners, and using your Mac and iPad to succeed in your academic writing pursuits.

Academic Database with Devonthink Pro Office

Database Trial: LGBT Thought and Culture

LGBT Thought and Culture is a new Alexander Street Press database which includes texts, letters, speeches, interviews, and ephemera covering the political evolution of gay rights as well as memoirs, biographies, poetry, and works of fiction that illuminate the lives of lesbians, gays, transgendered, and bisexual individuals and the community.

Our trial subscription to LGBT Thought and Culture ends on April 28, 2014.

Please send comments or questions to Sarah Witte, Gender and Women's Studies Librarian, at

Database Trial: Numerique Premium


We are currently trialing a new database of French e-books, Numérique Premium, through April 12, 2014. The collection contains nearly 850 full-text titles in a variety of fields, including history, religion, philosophy, politics, literature/literary theory, film, and architecture. Publishers include: Belles Lettres, Canadian Scholars Press, CNRS éditions, ENS éditions, Gallimard, Flammarion, Nouveau Monde, Picard, Presses universitaires de McGill, Association française pour la recherche en histoire du cinéma, Association des Professeurs d’Histoire-Géographie, Société des études robespierristes, Fondation Napoléon, Fondation Charles de Gaulle, Institut Napoléon…

This resource is currently only available on-campus until April 12, 2014. Please send any comments or questions to Meredith Levin, Western European Humanities Librarian, at

Bonne Lecture!!!

Comics@Columbia events: “Celebrating Al Jaffee,” on Tuesday March 4, 7 PM.

With the new semester comes new events, and our first of the year is a corker: in honor of Al Jaffee's donation of his papers to our Rare Book and Manuscript Library, we're going to celebrate his life and career with a panel discussion.  Former DC Comics president (and sometime Columbia lecturer) Paul Levitz will moderate, and joining him will be cartoonist Peter Kuper, Mad magazine art director Sam Viviano, and–of course!–Al himself.

So mark your calendars for Tuesday March 4 at 7:00 PM, in room 523, Butler Library. 

A reception will follow the event.  As always, Comics@Columbia events are free and open to the public.

Comics@Columbia: “Brooklyn Comes to Morningside Heights” Weds. 11/13 6pm

In the final comics event of the Fall semester, Butler Library welcomes three of Brooklyn's most talented cartoonists for a discussion of their work, their process, their origins, and their future. Please join Dash Shaw (The Mother's Mouth, Bottomless Belly Button, The Unclothed Man in the 35th Century A.D., BodyWorld, New School), Gabrielle Bell (When I'm Old and Other Stories, Lucky, Cecil and Jordan in New York, San Diego Diary, July Diary, The Voyeurs), and Lisa Hanawalt (My Dirty Dumb Eyes) in an informal discussion with Columbia's graphic novels librarian, Karen Green. There will be book sales and signings following the panel. Light refreshments will be served. We hope to see you there!

Online Music Scores

We're of course delighted when you visit the Music & Arts Library (701 Dodge) to browse and check out items from our extensive collection of printed music scores. But, there are those times that you may need some music in a pinch, or when we're closed. For those times, the availability of online scores can be very useful, and the Libraries make available several collections of online scores (also identified by the term "sheet music").

Here's a listing of the various collections which are currently available to full-time Columbia affiliates through the Libraries (these links will take you to the CLIO record for the database; click on the URL in the record to connect):

Classical Scores Library — "a collection of digitized scores of important classical music, manuscripts, and unpublished material."

Naxos Music Library. Sheet Music — "digital sheet music in all classical genres, spanning music from Medieval to the 21st century and composers from Bach to Arvo Part." This Naxos database also offers a downloadable software utility which can be used to transpose some content from one key into another (a feature often useful for singers), and to adjust printing options.

A-R Editions' Online Music Anthology — "a database of music scores containing representative vocal and instrumental compositions from antiquity through the nineteenth century."

Outside of these online resources available through the Libraries, mention must be made of the important and open online resource International Music Score Library Project (IMSLP). This public project, established in 2006, states its primary goal as:

"… to gather all public domain music scores, in addition to the music scores of all contemporary composers (or their estates) who wish to release them to the public free of charge."

Over the last few years, this project has truly blossomed into a very valuable resource, including not only scores but also performance parts, audio recordings, and some commentary and analysis. It's interesting to note that IMSLP, the open resource, has a far larger volume of content than the subscription services mentioned above – but, not for in-copyright materials (one likely reason for the difference).

Note that there are many options for browsing and searching scores, and recordings can also be browsed, by composer or performer, via a link in the left sidebar. RSS feeds are available to keep track of new additions. Another interesting feature is the "Search By Melody" function, that allows users to input a melody string to search, using a pop-up keyboard. For the adventurous, a "score similarity" algorithm will attempt to match features of scores in the database, to find "similar" works.

Of course, as with any open-source project, there are always concerns with editorial control, and the editions available on the site range from scholarly editions to self-published arrangements with no explicit editorial responsibility or details. So, scrutinize your options carefully when choosing content. That said, there is a wealth of quality material which can be a lifesaver when you just need that score or part, for reference or for performance. And, you'll also find scans of rare or obscure repertoire, both in published editions and in manuscript.

Lastly, many libraries are now mounting extensive collections of digitized scores and sheet music, much of which is in the public domain, for world-wide access. Many of these are concerted, scholarly efforts (for an example, see the Digital Mozart Edition) which warrant their own post, so stay tuned for an overview of those collections in a future post!

EVENTS: Comics@Columbia Supports NY Comic-Con with ElfQuest and X-Men

New York Comic-Con is upon us!  To join in the festivities, Butler Library presents two exciting comics events.

Both events are free and open to the public, but seating is limited.

1) Wednesday, October 9, 6 PM

523 Butler

To celebrate Columbia’s recent acquisition of the archives of Wendy and Richard Pini (aka WaRP Graphics), the creators behind the long-running comic ElfQuest join comics librarian Karen Green and moderator Sabrina Sondhi for a discussion of the origins and innovations of this popular series, and the contributions their archives can make to comics scholarship.

The newest ElfQuest story will be available for sale, and a signing and reception will follow.


2) Monday, October 14, 7 PM

523 Butler

Join long-time X-Men writer Chris Claremont, director Patrick Meaney, and producer Mike Phillips for a screening of their documentary “Comics in Focus: Chris Claremont’s X-Men.”  Claremont’s contributions to the X-Men mythology are wide-ranging and include the story-lines of the recent X-Men films.  A Q&A with the participants will follow.

Light refreshments will be served.

We hope you’ll join us for both of these events, and participate in Columbia University Libraries’ support for this fascinating artistic medium.