I am Rohit Bharadwaj, a graduate student in the Master's in Data Science program. I worked as a software engineer at FactSet for four years, where I used machine learning and natural language processing techniques to solve information retrieval problems. I joined the Digital Social Science Center (DSSC) as a digital centers intern this fall. I work with Jeremiah and Eric at DSSC, and we are combining my technical knowledge with their domain expertise to find the best solutions to a few technical problems the center faces.
Goal: To provide all the digital libraries at Columbia with a comprehensive tool that facilitates access to any API a user wants to use.
Digital centers often receive requests from users for access to a particular API (Application Programming Interface), or a set of APIs, through which they can reach the data they are interested in. These requests are very diverse, ranging from several attributes of a single data point to many data points and all of their associated attributes. Normally, we keep a script for each API, and students are expected to run these scripts on their own machines to get the data. Though this method works, it limits access (for Python scripts, a Python interpreter needs to be installed). Furthermore, the input data must be in a specific format, and the output of the existing scripts is often unclear and cannot be controlled.
As part of this internship, we aim to build a web application that can be accessed from anywhere within the university. The application will provide a web page where a student can enter the mandatory information required to get data from an API. For example, an API that returns geo-coordinates requires the address of the location, and the address must include a few mandatory attributes such as street number, borough code, etc. We aim to accept input in any of three formats: a form, a tab-separated text file, or a comma-separated text file. We also plan to support three output formats: display on the screen, a tab-separated text file, or a comma-separated text file.
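The input and output handling described above can be sketched roughly as follows. This is only an illustration under assumptions, not the application's actual code: the function names (read_records, missing_fields, write_records), the format keys ("csv", "tsv"), and the field names used in the usage note are all hypothetical. It relies only on Python's standard csv module.

```python
import csv
import io

# Hypothetical mapping from a format name to its delimiter (an assumption).
DELIMITERS = {"csv": ",", "tsv": "\t"}


def read_records(text, fmt):
    """Parse delimited text with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter=DELIMITERS[fmt])
    return [dict(row) for row in reader]


def missing_fields(record, required):
    """Return the required field names that are absent or blank in a record."""
    return [name for name in required if not (record.get(name) or "").strip()]


def write_records(records, fmt):
    """Serialize a list of dicts to delimited text with a header row."""
    if not records:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=list(records[0].keys()),
        delimiter=DELIMITERS[fmt],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()
```

With helpers like these, a tab-separated upload could be parsed with read_records(text, "tsv"), each row checked with missing_fields(row, ["street_number", "borough_code"]) before the API is called, and the results returned as a comma-separated file with write_records(rows, "csv"). A form submission would produce the same kind of dict, so the three input formats can share one code path.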
We currently plan to integrate three APIs into the web application. These APIs and their use cases are listed below:
1. Geocode: This API provides the geospatial coordinates of any given address in New York City. Students can potentially use it for location-based data analysis and pattern recognition tasks.
2. Human Rights Web Archive: This API enables access to the human rights database index. Currently, it is used to verify whether the citations in a journal article are valid. It also provides a link to the article in the human rights index for validation purposes.
3. Internet Archive: This API provides access to the Internet Archive database. It is used in a similar context to (2).
Our first aim is to support these three APIs through our web application, after which we will focus on extending the application to support any API that is of interest to a user. We want the application to be as flexible as possible, so that a user can configure it with a new API under very few restrictions and without negatively affecting the application’s usability.
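One way to picture configuring the application with a new API is a small registry that records what each API needs. The sketch below is an assumption about how this could look, not the design we have settled on: ApiConfig, register_api, build_params, the field names, and the example.org URL in the usage note are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ApiConfig:
    """Hypothetical description of one API the application can call."""
    name: str
    base_url: str
    required_fields: list
    optional_fields: list = field(default_factory=list)


# Hypothetical in-memory registry, keyed by API name.
REGISTRY = {}


def register_api(config):
    """Add a new API description; refuse to overwrite an existing one."""
    if config.name in REGISTRY:
        raise ValueError(f"API already registered: {config.name}")
    REGISTRY[config.name] = config


def build_params(api_name, record):
    """Check a user's record against the API's config and keep known fields."""
    config = REGISTRY[api_name]
    missing = [f for f in config.required_fields if not record.get(f)]
    if missing:
        raise ValueError(f"Missing required fields: {', '.join(missing)}")
    allowed = config.required_fields + config.optional_fields
    return {key: record[key] for key in allowed if key in record}
```

For instance, register_api(ApiConfig("geocode", "https://example.org/geocode", ["street_number", "borough_code"])) would describe a geocoding API, and build_params("geocode", row) would then reject rows that lack a street number or borough code. Keeping the per-API details in data rather than in code is what would let a new API be added without touching the rest of the application.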
Current Status: The backend of our web application, which supports access to the first set of APIs, is largely complete, barring a few functions for different types of I/O operations. We are currently designing the user interface for the application and refining the backend codebase.
Rohit Bharadwaj G.