Author Archives: Chubing Tripepi

New Dataset: Annual Survey of Industries Detailed Unit Level Data with Common Factory-ID

Annual Survey of Industries (ASI) Detailed Unit Level Data with Common Factory-ID is now available to Columbia University students.  The Library has data from 1998 to 2014.

The Annual Survey of Industries (ASI) is the principal source of industrial statistics in India. The ASI extends to the entire country except the States of Arunachal Pradesh and Mizoram and the Union Territory of Lakshadweep. It covers all factories registered under Sections 2(m)(i) and 2(m)(ii) of the Factories Act, 1948, i.e., factories employing 10 or more workers using power and those employing 20 or more workers without using power. The survey also covers bidi and cigar manufacturing establishments registered under the Bidi and Cigar Workers (Conditions of Employment) Act, 1966, and electricity undertakings, with coverage as above.
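
The size thresholds in the coverage rule can be written as a one-line check. This is only a sketch: the function name and arguments are hypothetical, and it ignores the registration requirement, the excluded regions, and the separately covered bidi, cigar, and electricity units.

```python
def meets_size_threshold(workers: int, uses_power: bool) -> bool:
    """Size test quoted above: 10+ workers with power, 20+ workers without."""
    threshold = 10 if uses_power else 20
    return workers >= threshold


# A 12-worker factory qualifies only if it uses power.
print(meets_size_threshold(12, uses_power=True))
print(meets_size_threshold(12, uses_power=False))
```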

ASI provides data on various vital aspects of the registered factories for use in the estimation of national income, studies of industrial structure, and policy formulation. Important indicators generated from ASI include the number of factories, employment, wages, invested capital, capital formation, input, output, depreciation, and value added on an annual basis.

With the common factory identifiers in the unit-level data, researchers can do longitudinal research, tracking the same factory across survey years for as long as it is included in the ASI.
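
As a rough illustration of what a common identifier makes possible, the sketch below groups yearly records by ID and lists the survey years in which each unit appears. The field names (`factory_id`, `year`, `workers`) and the sample rows are hypothetical; they are not the actual ASI variable names or data.

```python
from collections import defaultdict

# Hypothetical unit-level records; the real ASI files use different
# variable names and many more fields.
records = [
    {"factory_id": "F001", "year": 1998, "workers": 42},
    {"factory_id": "F002", "year": 1998, "workers": 15},
    {"factory_id": "F001", "year": 1999, "workers": 47},
    {"factory_id": "F001", "year": 2001, "workers": 51},
    {"factory_id": "F002", "year": 1999, "workers": 12},
]


def build_panel(rows):
    """Group yearly records by factory ID, each history sorted by year."""
    panel = defaultdict(list)
    for row in rows:
        panel[row["factory_id"]].append(row)
    for history in panel.values():
        history.sort(key=lambda r: r["year"])
    return dict(panel)


def years_observed(panel):
    """Return the survey years in which each factory appears."""
    return {fid: [r["year"] for r in hist] for fid, hist in panel.items()}


panel = build_panel(records)
coverage = years_observed(panel)
for factory_id, years in sorted(coverage.items()):
    print(factory_id, years)
```

A gap in the list of years, such as the missing 2000 record for the made-up unit F001, simply means the unit was not included in the survey that year.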

For more detailed information about this panel data structure, please contact

Investor Activism data template on Capital IQ!

In partnership with S&P’s Investment Banking team, Capital IQ just released a new template called Investor Activism for the S&P Capital IQ Office plug-in.

Investor Activism data is a popular, high-quality data set that has been available on the CIQ platform for years. Many S&P Capital IQ clients, particularly in the corporate, investment management, and investment banking segments, have asked to have this data available in the Excel plug-in as well. Clients can now enter a subject Company ID or Activist ID and see all the campaigns against or initiated by that company.

Applicable Client Workflows:

Investment Managers, Investment Banks, and Corporate

How Clients Get This Template: To download the template within the S&P CIQ Office plug-in ribbon, go to Templates >> Get/Update Templates >> + New Templates or Transaction >> Investor Activism

Getting Started with TwitteR

Recently, I used the twitteR package to scrape data from Twitter for sentiment analysis. Here is a step-by-step guide on how to get started.

  1. Install the twitteR package and load it in your R session
  2. Set up Twitter account for API access
    • You need to have a twitter account.
    • Go to and sign on with your twitter account.
    • Once you have signed in you should see the following screen, and simply click on the button that says “Create New App”.
    • Once you click on the “Create New App” button you will go to the Create an Application screen. This page has three fields, a check box, and a button. The three fields are Name, Description and Website. The name of the application must be unique, so this may take a few tries. The description needs to be at least 10 characters long, and the Website field needs a URL. If you do not have one you can use Now click the “Yes, I agree” box for the license agreement and click the “Create your Twitter application” button.
    • Once you successfully create an application you will be taken to the application page. Once there, click on the “Keys and Access Tokens” tab. From that page you are going to need four things.
      1. Consumer Key (API Key)
      2. Consumer Secret (API Secret)
        Click the “Create my access token” button.
      3. Access Token
      4. Access Token Secret
  3. Now re-open your R session and enter the following code using those four pieces of information
    library(twitteR)
    # Replace the placeholders with the four values from the application page
    consumer_key <- "your_consumer_key"
    consumer_secret <- "your_consumer_secret"
    access_token <- "your_access_token"
    access_secret <- "your_access_secret"
    setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
  4. Set up authentication
    destfile="C:\\Users\\CHUBING\\Desktop\\Final Project\\text_mining_and_web_scraping\\cacert.pem", method="auto")
    authenticate <- OAuthFactory$new(consumerKey=consumer_key,
    authURL="")
    setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
    save(authenticate, file="twitter authentication.Rdata")
  5. Scrape Twitter data – 60000 raw tweets
    # Latest tweets
    tweets_trump <- searchTwitter("Donald+Trump", lang="en", n=2000, resultType="recent")
    # Loop over tweets and extract text
    donald <- lapply(tweets_trump, function(t) t$getText())
  6. The raw Twitter results are rather messy. First, convert them to a data frame, which is easier to work with in ggplot.
    # Create a data frame from the list of tweets returned by searchTwitter
    tweet.df <- twListToDF(tweets_trump)
  7. Twitter no longer provides location data for each tweet, so to gather location information, you will have to use the location information provided by users in their user profiles.
    # Look up Twitter users' information by screen name
    users <- lookupUsers(tweet.df$screenName)
    # Create a data frame from the list of users
    users_df <- twListToDF(users)
    # Merge the two data frames and keep the result for the cleaning step
    total <- merge(tweet.df, users_df, by="screenName")
  8. Clean up messy data in Twitter data
    library(stringr)  # provides str_replace_all
    # Remove duplicates
    total <- unique(total)
    # Place NAs in empty cells in location column
    total$location <- replace(total$location, total$location=="", NA)
    # Remove rows with NA locations
    total <- total[complete.cases(total$location), ]
    # Clean up emojis
    total$text_clean <- str_replace_all(total$text, "[^[:graph:]]", " ")
    # Remove URLs from string (before punctuation is stripped, so "://" can still match)
    total$text_clean <- gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", total$text_clean)
    # The above expression explained:
    # (f|ht) match "f" or "ht"
    # tp match "tp"
    # (s?) optionally match "s" if it's there
    # :// match "://"
    # (.*) match every character (greedy) up to
    # [.] the last period that is
    # [a-z]+ followed by one or more lowercase letters
    # Remove punctuation
    total$text_clean <- gsub("[[:punct:]]", "", total$text_clean)
    # Remove control characters
    total$text_clean <- gsub("[[:cntrl:]]", "", total$text_clean)
    # Remove digits
    total$text_clean <- gsub("\\d+", "", total$text_clean)
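
For readers who work in Python, the cleaning steps in step 8 can be sketched with the standard library alone. This is an approximation, not a port of the R code: the function name `clean_tweet` and the sample text are made up, the emoji step uses an ASCII-only stand-in for `[^[:graph:]]`, the URL pattern is a narrower one that stops at the first whitespace, and the final whitespace collapse is an extra step the R code does not perform.

```python
import re
import string

# ASCII-only stand-in for R's [^[:graph:]]: anything outside the visible
# ASCII range (emoji, control characters, whitespace) becomes a space.
NON_GRAPH = re.compile(r"[^\x21-\x7e]")
# Narrower than the greedy R pattern: a URL ends at the first whitespace.
URL = re.compile(r"(?:f|ht)tps?://\S+")
PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")
DIGITS = re.compile(r"\d+")


def clean_tweet(text):
    """Strip emoji, URLs, punctuation, and digits from one tweet."""
    text = NON_GRAPH.sub(" ", text)
    text = URL.sub("", text)      # must run before punctuation is removed
    text = PUNCT.sub("", text)
    text = DIGITS.sub("", text)
    return " ".join(text.split())  # collapse the leftover runs of spaces


sample = "Rally at 7pm!! \U0001F389 details: https://example.com/a?b=1 #vote"
print(clean_tweet(sample))
```

The order matters: once punctuation has been stripped, the `://` in a URL is gone and the URL pattern can no longer match.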

New Dataset: Infogroup’s Historical Business File

Columbia University has access to Infogroup’s Historical Business File via WRDS.

Infogroup gathers geographic location-related business and residential data from various public data sources, such as local yellow pages, credit card billing data, etc. The Infogroup database at WRDS contains two packages: Historical Business and Historical Residential, both of which largely focus on US sectors.

The Historical Business data starts in 1997, while Historical Residential dates back to 2006. The Historical Business data is a calendar year-end snapshot of local business data. It contains business identification, location, industry, corporation hierarchy, employment, sales, and other fields.

For more information about this dataset, please visit this page

Figure 1



Using Python, R, MATLAB APIs for Datastream!

Did you know that, in addition to the Excel API, you can now also extract data from Datastream using Python or MATLAB?

For Python:

PyDatastream is a Python interface to the Thomson Dataworks Enterprise (DWE) SOAP API (not free), with some convenience functions for retrieving Datastream data specifically. This package requires valid credentials for this API.

It is a Python API for time-series data that abstracts the database used to store the data, providing a powerful and unified API. It provides an easy way to insert time-series datapoints and automatically downsample them into multiple levels of granularity, so that time-series data can be queried efficiently at various time scales.



Alternatives for other scientific computing languages:

Please note that only PhD students and faculty can request credentials. Please email if you have any questions.

New CBS Students Raffle Results!

Welcome to Columbia! Thank you all for attending the raffle hosted at the resource fair in Watson Library. We have our three winners here:

  • Amerder Chawlar – Winner of $10 Starbucks GC
  • Ruoyun Luo – Winner of $10 Starbucks GC
  • Pedro Barbara – Winner of $25 Starbucks GC

Congratulations again! Hope y’all have a wonderful semester!

[Metro SIG Meeting] Historical Corporate Annual Reports Digitization Project


This Wednesday, June 8, 2016, Watson Library’s Interim Head Jeremiah Trinidad-Christensen was invited to give a presentation at the Metro’s Economics and Business Librarians SIG meeting about the Historical Corporate Annual Reports Digitization Project in Watson Library.

Jeremiah provided an excellent summary of this widely used digitization project, which was conducted from 2007 to 2009 by a group of Watson Library staff led by our Collection Development Librarian Yasmin Saira. The audience was intrigued by the fascinating stories Jeremiah extracted from these reports, and everyone was amazed at the visualization he created at the end, using one of the historical NYC maps:


If you want to learn more about this historical corporate report collection, here are some relevant links:

If you would like to request a full copy of this presentation, or have questions regarding our collections, please reach out to us at