Annual Survey of Industries (ASI) Detailed Unit Level Data with Common Factory-ID is now available to Columbia University students. The Library has data from 1998 to 2014.
The Annual Survey of Industries (ASI) is the principal source of industrial statistics in India. The ASI extends to the entire country except the States of Arunachal Pradesh and Mizoram and the Union Territory of Lakshadweep. It covers all factories registered under Sections 2(m)(i) and 2(m)(ii) of the Factories Act 1948, i.e. factories employing 10 or more workers and using power, and factories employing 20 or more workers without using power. The survey also covers bidi and cigar manufacturing establishments registered under the Bidi and Cigar Workers (Conditions of Employment) Act 1966, and electricity undertakings, with coverage as above.
ASI provides data on various vital aspects of the registered factories for use in the estimation of National Income, studies of industrial structure and policy formulation. Some of the important indicators generated from ASI are number of factories, employment, wages, invested capital, capital formation, input, output, depreciation and value added on an annual basis.
Because the unit-level data carry a common factory identifier, researchers can do longitudinal research, tracking the same factory across survey years for as long as it is included in the ASI.
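As a rough illustration of what a common identifier makes possible, the sketch below stacks three survey years, keeps the units observed in every year, and computes their year-over-year change in output. The data frame and its column names (factory_id, year, output) are made up for this example and are not the actual ASI variable names.

```r
# Toy stand-in for unit-level records; all names and values are hypothetical.
asi <- data.frame(
  factory_id = c("F001", "F002", "F001", "F003", "F001", "F002"),
  year       = c(2012, 2012, 2013, 2013, 2014, 2014),
  output     = c(100, 250, 110, 80, 125, 240),
  stringsAsFactors = FALSE
)

# Number of survey years in which each factory appears.
years_observed <- table(asi$factory_id)

# Keep only factories observed in every year of the toy panel.
n_years  <- length(unique(asi$year))
keep_ids <- names(years_observed)[years_observed == n_years]
balanced <- asi[asi$factory_id %in% keep_ids, ]

# Sort by factory and year, then compute the change in output within each factory.
balanced <- balanced[order(balanced$factory_id, balanced$year), ]
balanced$output_change <- ave(balanced$output, balanced$factory_id,
                              FUN = function(x) c(NA, diff(x)))
print(balanced)
```

With real data, the identifier and year columns would be replaced by whatever names the delivered files use.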
Once you have signed in, click the button that says “Create New App”.
This takes you to the Create an Application screen. The page has three fields (Name, Description and Website), a check box and a button. The name of the application must be unique, so this may take a few tries. The description needs to be at least 10 characters long, and the Website field needs a URL. If you do not have one, you can use https://bigcomputing.blogspot.com. Now tick the “Yes, I agree” box for the license agreement and click the “Create your Twitter application” button.
Once you have successfully created an application, you will be taken to the application page. Once there, click on the “Keys and Access Tokens” tab. From that page you are going to need four things.
Consumer Key (API Key)
Consumer Secret (API Secret)
Click the “Create my access token” button to generate the last two:
Access Token
Access Token Secret
Now re-open your R session and enter the following code using those four pieces of information:
library(twitteR)
consumer_key <- "your_consumer_key"
consumer_secret <- "your_consumer_secret"
access_token <- "your_access_token"
access_secret <- "your_access_secret"
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
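Typing the four values straight into a script makes them easy to leak when the script is shared. One possible alternative, sketched below, is to read them from environment variables. The variable names (TWITTER_CONSUMER_KEY and so on) and the helper read_twitter_keys are assumptions made for this example and are not part of twitteR.

```r
# Hypothetical helper: collect the four credentials from environment variables.
# The environment variable names are an assumption for this sketch.
read_twitter_keys <- function() {
  vars <- c(consumer_key    = "TWITTER_CONSUMER_KEY",
            consumer_secret = "TWITTER_CONSUMER_SECRET",
            access_token    = "TWITTER_ACCESS_TOKEN",
            access_secret   = "TWITTER_ACCESS_SECRET")
  keys <- Sys.getenv(vars, unset = "")
  names(keys) <- names(vars)
  missing_keys <- names(keys)[keys == ""]
  if (length(missing_keys) > 0) {
    stop("Missing credentials: ", paste(missing_keys, collapse = ", "))
  }
  as.list(keys)
}

# Placeholder values so the sketch runs; real keys would be set outside the script.
Sys.setenv(TWITTER_CONSUMER_KEY    = "your_consumer_key",
           TWITTER_CONSUMER_SECRET = "your_consumer_secret",
           TWITTER_ACCESS_TOKEN    = "your_access_token",
           TWITTER_ACCESS_SECRET   = "your_access_secret")
keys <- read_twitter_keys()

# The list can then be passed on, for example:
# setup_twitter_oauth(keys$consumer_key, keys$consumer_secret,
#                     keys$access_token, keys$access_secret)
```

The helper stops with an error naming whichever variables are unset, so a missing key is caught before the authentication call.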
Set up authentication
destfile="C:\\Users\\CHUBING\\Desktop\\Final Project\\text_mining_and_web_scraping\\cacert.pem", method="auto")
authenticate <- OAuthFactory$new(consumerKey=consumer_key,
authURL="https://api.twitter.com/oauth/authorize")
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
save(authenticate, file="twitter authentication.Rdata")
Scrape Twitter Data – 60000 raw tweets
# Latest tweets
tweets_trump <- searchTwitter("Donald+Trump", lang="en", n=2000, resultType="recent")
# Loop over tweets and extract text
donald = lapply(tweets_trump, function(t) t$getText())
The raw Twitter data is rather messy. First, we need to convert it to a data frame, which makes it easier to work with in ggplot.
# Create dataframe from the list of tweets (twListToDF needs the tweet objects, not the extracted text)
tweet.df <- twListToDF(tweets_trump)
Twitter no longer provides location data for each tweet, so to gather location information, you will have to use the location information provided by users in their user profiles.
# Look up twitter users information by screen names
users <- lookupUsers(tweet.df$screenName)
# Create dataframe from twitter users matrix
users_df <- twListToDF(users)
# Merge two data frames together
total <- merge(tweet.df, users_df, by="screenName")
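To see what the merge does, here is a small sketch with made-up data frames standing in for tweet.df and users_df (the names tweets_toy and users_toy and their contents are invented). By default merge keeps only the screen names found in both tables, so a tweet whose author could not be looked up is dropped; all.x = TRUE keeps it with an NA location instead.

```r
# Made-up stand-ins for the tweet and user data frames.
tweets_toy <- data.frame(
  screenName = c("alice", "bob", "carol"),
  text       = c("first tweet", "second tweet", "third tweet"),
  stringsAsFactors = FALSE
)
users_toy <- data.frame(
  screenName = c("alice", "bob"),
  location   = c("New York, NY", ""),
  stringsAsFactors = FALSE
)

# Inner join (the default): only screen names present in both data frames survive.
inner <- merge(tweets_toy, users_toy, by = "screenName")

# Left join: every tweet is kept; users that were not found get NA in location.
left <- merge(tweets_toy, users_toy, by = "screenName", all.x = TRUE)

print(inner)
print(left)
```

Which variant is appropriate depends on whether tweets without a matching profile should stay in the data set.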
Clean up messy data in Twitter data
# Remove duplicates
total <- unique(total)
# Place NAs in empty cells in location column
total$location <- replace(total$location, total$location=="", NA)
# Remove location rows with NAs
total <- total[complete.cases(total$location), ]
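# Clean up emojis and other non-printing characters (str_replace_all comes from the stringr package)
library(stringr)
total$text_clean <- str_replace_all(total$text, "[^[:graph:]]", " ")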
# Remove URLs from string (do this before stripping punctuation, otherwise "://" and "." are already gone)
total$text_clean <- gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", total$text_clean)
# The above expression explained:
# (f|ht) match "f" or "ht"
# tp match "tp"
# (s?) optionally match "s" if it's there
# :// match "://"
# (.*) match every character (greedily) up to the last
# [.] period that is followed by
# [a-z]+ one or more lowercase letters
# Remove punctuation
total$text_clean <- gsub("[[:punct:]]", "", total$text_clean)
# Remove control characters
total$text_clean <- gsub("[[:cntrl:]]", "", total$text_clean)
# Remove digits
total$text_clean <- gsub("\\d+", "", total$text_clean)
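The cleaning steps can also be wrapped in one function and tried on a sample string before running them on the full data. This is a minimal sketch in base R: the function name clean_tweet_text and the sample tweet are made up, and the URL pattern used here (everything from the scheme up to the next whitespace) is a simpler substitute for the expression above, not the same one.

```r
# Hypothetical helper that applies the cleaning steps in a safe order.
clean_tweet_text <- function(x) {
  # URLs first, while "://" and "." are still present in the text.
  x <- gsub("(f|ht)tps?://[^[:space:]]+", "", x)
  # Replace anything that is not a visible character with a space.
  x <- gsub("[^[:graph:]]", " ", x)
  # Strip punctuation and digits.
  x <- gsub("[[:punct:]]", "", x)
  x <- gsub("[[:digit:]]+", "", x)
  # Collapse runs of whitespace left behind by the removals.
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

sample_tweet <- "Breaking: 3 new polls!! https://t.co/abc123 #news"
clean_tweet_text(sample_tweet)
# returns "Breaking new polls news"
```

Removing URLs before punctuation matters: once the colon, slashes and periods are stripped, a URL pattern has nothing left to match.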
Columbia University has access to Infogroup’s Historical Business File via WRDS.
Infogroup gathers geographic location-related business and residential data from various public data sources, such as local yellow pages, credit card billing data, etc. The Infogroup database at WRDS contains two packages: Historical Business and Historical Residential, both of which largely focus on US sectors.
The Historical Business data starts in 1997, while Historical Residential dates back to 2006. The Historical Business data is a calendar year-end snapshot of local business data. It contains business identification, location, industry, corporation hierarchy, employment, sales, and other fields.
For more information about this dataset, please visit this page
Did you know that, in addition to the Excel API, you can now also extract data from Datastream using Python or MATLAB?
PyDatastream is a Python interface to the Thomson Dataworks Enterprise (DWE) SOAP API (non free), with some convenience functions for retrieving Datastream data specifically. This package requires valid credentials for this API.
It is a Python API for time-series data that abstracts the database used to store the data, providing a powerful and unified API. It provides an easy way to insert time-series datapoints and automatically downsamples them into multiple levels of granularity, so that the data can be queried efficiently at various time scales.
Preqin’s Infrastructure Online is “the leading source of intelligence on the infrastructure industry. This constantly updated resource includes details for all aspects of the asset class, including infrastructure transactions, fund managers, strategic investors and trade buyers, net-to-investor fund performance, fundraising information, institutional investor profiles and more.” (Publisher Information)
Columbia University Libraries now have access to Lipper Hedge Fund Database (TASS) via Wharton Research Data Services. Please note that Lipper Hedge Fund Database is accessible via Thomson Reuters underneath “Current Subscriptions” on the Wharton Research Data Services homepage.
“Thomson Reuters Lipper Hedge Fund database is an indispensable resource for institutional asset managers, high net worth investors and consultants who monitor the global hedge fund industry.
The Lipper Hedge Fund global database has been a reliable source of timely, high-quality hedge fund data for over 20 years. Coverage from 1990
• Quantitative performance data on over 7,500 actively reporting Hedge funds and Funds of Hedge Funds
• Performance data on over 11,000 graveyard funds that have liquidated or stopped reporting
• Essential fund profile data including fund strategy, inception date, fund domicile and much more
• Full historical monthly price and performance, dating to fund inception
• Loaded with notes on value-add background information, detailing Manager Biographies, Fund Structure, etc.
• Analyze strategy trends by grouping funds into Investment, Sector, Geographical Focus
• Such data can be used by academics for granular research and publication purposes.
Lipper (formerly TASS) data is most frequently used Hedge Fund database for research due to its considerable coverage of live and dead hedge funds, compared with other hedge fund databases available for academic and commercial researchers.”
Capital Cube provides comprehensive stock analysis on over 45,000 stocks and ETFs including scores and in-depth reports on fundamental analysis, likely corporate actions, dividend quality, and earnings quality.
It is a great tool for fundamental analysis on companies as Capital Cube converts raw statistics into meaningful interpretations by comparing them to key competitors, and providing historical context.
CapEx provides information and data on investment projects that involve the setting up of new capacities in India.
• New capacities to be created in India
• Location of plant or facilities being created
• Costs involved in creating new capacities
• Name and contact details of promoters, co-promoters and other associates
• Progress of implementation based on
  - scanning of company releases
  - regular phone calls to promoters & contractors
  - scanning of media coverage
• Expected date of completion of projects
Prowess is a database of the financial performance of over 27,000 companies. It includes all companies traded on the National Stock Exchange and the Bombay Stock Exchange, thousands of unlisted public limited companies and hundreds of private limited companies. It also includes a number of important business entities that are not registered companies. Prowess contains time-series data from 1989-90. It is updated continuously.