Category Archives: data

New Dataset: Annual Survey of Industries Detailed Unit Level Data with Common Factory-ID

Annual Survey of Industries (ASI) Detailed Unit Level Data with Common Factory-ID is now available to Columbia University students.  The Library has data from 1998 to 2014.

The Annual Survey of Industries (ASI) is the principal source of industrial statistics in India. The ASI extends to the entire country except the States of Arunachal Pradesh and Mizoram and the Union Territory of Lakshadweep. It covers all factories registered under Sections 2(m)(i) and 2(m)(ii) of the Factories Act, 1948, i.e. factories employing 10 or more workers using power and those employing 20 or more workers without using power. The survey also covers bidi and cigar manufacturing establishments registered under the Bidi and Cigar Workers (Conditions of Employment) Act, 1966, and electricity undertakings, with coverage as above.

ASI provides data on various vital aspects of the registered factories for use in the
estimation of National Income, studies of industrial structure and policy formulation.
Some of the important indicators generated based on ASI are number of factories,
employment, wages, invested capital, capital formation, input, output, depreciation and
value added on an annual basis.

With the common factory identifiers in the unit-level data, researchers can do longitudinal research,
tracking the same factory across survey years for as long as it is included in the ASI.
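
As a rough illustration of what a common factory ID makes possible, the sketch below groups unit-level records by an identifier and orders each factory's observations by year. The field names (`factory_id`, `year`, `workers`) and the sample values are hypothetical; they are not the actual ASI variable names or data.

```python
from collections import defaultdict

# Hypothetical unit-level records; real ASI files use their own variable names.
records = [
    {"factory_id": "F001", "year": 2000, "workers": 45},
    {"factory_id": "F002", "year": 1999, "workers": 12},
    {"factory_id": "F001", "year": 1999, "workers": 40},
    {"factory_id": "F001", "year": 2002, "workers": 52},
]


def build_panel(rows):
    """Group rows by factory_id and sort each factory's rows by year."""
    panel = defaultdict(list)
    for row in rows:
        panel[row["factory_id"]].append(row)
    for history in panel.values():
        history.sort(key=lambda r: r["year"])
    return dict(panel)


def years_observed(panel, factory_id):
    """Return the survey years in which a factory appears."""
    return [row["year"] for row in panel[factory_id]]


panel = build_panel(records)
print(years_observed(panel, "F001"))
```

Gaps such as the missing 2001 for `F001` can occur when a unit is not included in a given survey year, so any real analysis would need to handle an unbalanced panel.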

For more detailed information about this panel data structure, please contact

Getting Started with TwitteR

Recently, I used the twitteR package to scrape data from Twitter for sentiment analysis. Here is a step-by-step guide on how to get started.

  1. Install the twitteR package and load it in your R session
  2. Set up a Twitter account for API access
    • You need to have a Twitter account.
    • Go to and sign on with your Twitter account.
    • Once you have signed in, click the button that says “Create New App”.
    • Clicking “Create New App” takes you to the Create an Application screen. This page has three fields to fill in, a check box to tick and a button to click. The three fields are Name, Description and Website. The name of the application must be unique, so this may take a few tries. The description needs to be at least 10 characters long, and a website is required. If you do not have one you can use Now click the “Yes, I agree” box for the license agreement and click the “Create your Twitter application” button.
    • Once you have created the application you will be taken to the application page. Click the “Keys and Access Tokens” tab. From that page you need four things.
      1. Consumer Key (API Key)
      2. Consumer Secret (API Secret)
        Click the “Create my access token” button.
      3. Access Token
      4. Access Token Secret
  3. Now return to your R session and enter the following code using those four pieces of information
    library(twitteR)
    consumer_key <- "your_consumer_key"
    consumer_secret <- "your_consumer_secret"
    access_token <- "your_access_token"
    access_secret <- "your_access_secret"
    setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
  4. Set up authentication
    library(ROAuth)
    download.file(url="", destfile="C:\\Users\\CHUBING\\Desktop\\Final Project\\text_mining_and_web_scraping\\cacert.pem", method="auto")
    authenticate <- OAuthFactory$new(consumerKey=consumer_key,
    authURL="")
    setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
    save(authenticate, file="twitter authentication.Rdata")
  5. Scrape Twitter data – 60000 raw tweets
    # Latest tweets
    tweets_trump <- searchTwitter('Donald+Trump', lang="en", n=2000, resultType="recent")
    # Loop over tweets and extract text
    donald <- lapply(tweets_trump, function(t) t$getText())
  6. The raw Twitter data is rather messy. First, we need to convert it to a data frame to make it easier to use with ggplot.
    # Create data frame from the list of tweets returned by searchTwitter
    tweet.df <- twListToDF(tweets_trump)
  7. Twitter no longer provides location data for each tweet, so to gather location information, you will have to use the location information provided by users in their user profiles.
    # Look up Twitter users' information by screen name
    users <- lookupUsers(tweet.df$screenName)
    # Create data frame from the list of users
    users_df <- twListToDF(users)
    # Merge the two data frames together
    total <- merge(tweet.df, users_df, by="screenName")
  8. Clean up messy data in Twitter data
    # Remove duplicates
    total <- unique(total)
    # Place NAs in empty cells in location column
    total$location <- replace(total$location, total$location=="", NA)
    # Remove location rows with NAs
    total <- total[complete.cases(total$location), ]
    library(stringr)
    # Clean up emojis
    total$text_clean <- str_replace_all(total$text, "[^[:graph:]]", " ")
    # Remove URLs from string (do this before stripping punctuation, otherwise "://" and "." are already gone and the pattern cannot match)
    total$text_clean <- gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", total$text_clean)
    # The above expression explained:
    # (f|ht) match "f" or "ht"
    # tp match "tp"
    # (s?) optionally match "s" if it's there
    # :// match "://"
    # (.*) match every character (everything) up to
    # [.] the last period that is followed by
    # [a-z]+ one or more lowercase letters
    # Remove punctuation
    total$text_clean <- gsub("[[:punct:]]", "", total$text_clean)
    # Remove control characters
    total$text_clean <- gsub("[[:cntrl:]]", "", total$text_clean)
    # Remove digits
    total$text_clean <- gsub('\\d+', '', total$text_clean)
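
The URL pattern in step 8 depends on the `://` and `.` characters, so it can only match while the punctuation is still present in the text. The sketch below illustrates that ordering effect in Python rather than R. The `clean_tweet` helper and the sample tweet are made up for illustration, and `string.punctuation` only approximates R's `[[:punct:]]` class.

```python
import re
import string

# Same URL pattern as in the R code above
URL_PATTERN = re.compile(r"(f|ht)tp(s?)://(.*)[.][a-z]+")
# Translation table that deletes ASCII punctuation characters
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text, urls_first=True):
    """Strip URLs, punctuation and digits; urls_first controls the order."""
    if urls_first:
        text = URL_PATTERN.sub("", text)
        text = text.translate(PUNCT_TABLE)
    else:
        text = text.translate(PUNCT_TABLE)
        text = URL_PATTERN.sub("", text)  # no "://" left, so nothing matches
    text = re.sub(r"\d+", "", text)
    # Collapse the runs of spaces left behind by the removals
    return " ".join(text.split())


sample = "Rally at 7pm! Details: http://example.com/rally.html"
print(clean_tweet(sample, urls_first=True))
print(clean_tweet(sample, urls_first=False))
```

With URLs removed first the link disappears; with punctuation removed first the link survives as a run of letters. Note also that `(.*)` is greedy, so in a tweet with two links the pattern removes everything between them as well.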

New Dataset: Infogroup’s Historical Business File

Columbia University has access to Infogroup’s Historical Business File via WRDS.

Infogroup gathers geographic location-related business and residential data from various public data sources, such as local yellow pages, credit card billing data, etc. The Infogroup database at WRDS contains two packages: Historical Business and Historical Residential, both of which largely focus on US sectors.

The Historical Business data starts in 1997, while Historical Residential dates back to 2006. The Historical Business data is a calendar year-end snapshot of local business data. It contains business identification, location, industry, corporation hierarchy, employment, sales, and other fields.

For more information about this dataset, please visit this page

Using Python, R, MATLAB APIs for Datastream!

Did you know that, in addition to the Excel API, you can now also extract data from Datastream using Python or MATLAB?

For Python:

PyDatastream is a Python interface to the Thomson Dataworks Enterprise (DWE) SOAP API (non free), with some convenience functions for retrieving Datastream data specifically. This package requires valid credentials for this API.

It is a Python API for time-series data that abstracts away the database used to store the data, providing a powerful and unified API. It provides an easy way to insert time-series datapoints and automatically downsamples them into multiple levels of granularity for efficient querying of time-series data at various time scales.
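
To make the idea of downsampling concrete, here is a minimal standard-library sketch that aggregates daily datapoints into monthly and yearly averages. It does not use PyDatastream or any Datastream API; the `downsample` function and the sample values are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical daily observations: (date, value)
daily = [
    (date(2016, 1, 4), 10.0),
    (date(2016, 1, 5), 12.0),
    (date(2016, 2, 1), 20.0),
    (date(2016, 2, 2), 22.0),
    (date(2017, 1, 3), 30.0),
]


def downsample(points, granularity):
    """Average values per month or per year, returned as a sorted list."""
    if granularity == "month":
        key = lambda d: (d.year, d.month)
    elif granularity == "year":
        key = lambda d: (d.year,)
    else:
        raise ValueError("granularity must be 'month' or 'year'")
    buckets = defaultdict(list)
    for day, value in points:
        buckets[key(day)].append(value)
    return sorted((k, mean(v)) for k, v in buckets.items())


monthly = downsample(daily, "month")
yearly = downsample(daily, "year")
print(monthly)
print(yearly)
```

Storing the coarser series alongside the raw one is what lets a query over many years read a handful of yearly points instead of every daily value.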



Alternatives for other scientific computing languages:

Please note that only PhD students and faculty can request credentials. Please email if you have any questions.

New database: Preqin Infrastructure Online

Preqin’s Infrastructure Online is “the leading source of intelligence on the infrastructure industry. This constantly updated resource includes details for all aspects of the asset class, including infrastructure transactions, fund managers, strategic investors and trade buyers, net-to-investor fund performance, fundraising information, institutional investor profiles and more.” (Publisher Information)


New database: Lipper Hedge Fund Database (TASS)


Columbia University Libraries now have access to Lipper Hedge Fund Database (TASS) via Wharton Research Data Services.  Please note that Lipper Hedge Fund Database is accessible via Thomson Reuters underneath “Current Subscriptions” on the Wharton Research Data Services homepage.

Publisher information:

“Thomson Reuters Lipper Hedge Fund database is an indispensable resource for institutional asset managers, high net worth investors and consultants who monitor the global hedge fund industry.
The Lipper Hedge Fund global database has been a reliable source of timely, high-quality hedge fund data for over 20 years.

Coverage from 1990
• Quantitative performance data on over 7,500 actively reporting Hedge funds and Funds of Hedge Funds
• Performance data on over 11,000 graveyard funds that have liquidated or stopped reporting

• Essential fund profile data including fund strategy, inception date, fund domicile and much more
• Full historical monthly price and performance, dating to fund inception
• Loaded with notes on value-add background information, detailing Manager Biographies, Fund Structure, etc.
• Analyze strategy trends by grouping funds into Investment, Sector, Geographical Focus
• Such data can be used by academics for granular research and publication purposes.

Lipper (formerly TASS) data is most frequently used Hedge Fund database for research due to its considerable coverage of live and dead hedge funds, compared with other hedge fund databases available for academic and commercial researchers.”

New database: Capital Cube


Capital Cube provides comprehensive stock analysis on over 45,000 stocks and ETFs including scores and in-depth reports on fundamental analysis, likely corporate actions, dividend quality, and earnings quality.

It is a great tool for fundamental analysis on companies as Capital Cube converts raw statistics into meaningful interpretations by comparing them to key competitors, and providing historical context.


New databases: CapEx & Prowess from CMIE


CapEx provides information and data on investment projects that involve the setting up of new capacities in India.

  • New capacities to be created in India
  • Location of plant or facilities being created
  • Costs involved in creating new capacities
  • Name and contact details of promoters, co-promoters and other associates
  • Progress of implementation based on
    • scanning of company releases
    • regular phone calls to promoters & contractors
    • scanning of media coverage
  • Expected date of completion of projects

Prowess is a database of the financial performance of over 27,000 companies. It includes all companies traded on the National Stock Exchange and the Bombay Stock Exchange, thousands of unlisted public limited companies and hundreds of private limited companies. It also includes a number of important business entities that are not registered companies. Prowess contains time-series data from 1989-90. It is updated continuously.

– from vendor web site