Memorial Day weekend

Watson Library of Business & Economics will close at 6pm today (Friday 5/26) and will remain closed for the Memorial Day weekend (Sat 5/27–Mon 5/29). Regular summer hours will resume at 8am on Tuesday 5/30. https://hours.library.columbia.edu/?library=business

Getting Started with twitteR

Recently, I used the twitteR package to scrape data from Twitter for sentiment analysis. Here is a step-by-step guide on how to get started.

  1. Download the twitteR package and make it available in your R session
  2. Set up a Twitter account for API access
    • You need to have a Twitter account.
    • Go to https://apps.twitter.com and sign in with your Twitter account.
    • Once you have signed in, simply click on the button that says “Create New App”.
    • Once you click the “Create New App” button you will be taken to the Create an Application screen. There are three fields, a checkbox, and a button you need to complete on this page. The three fields are Name, Description, and Website. The name of the application must be unique, so this may take a few tries. The description needs to be at least 10 characters long, and you must put in a website; if you do not have one, you can use https://bigcomputing.blogspot.com. Now check the “Yes, I agree” box for the license agreement and click “Create your Twitter application”.
    • Once you successfully create an application you will be taken to the application page. Once there, click on the “Keys and Access Tokens” tab. From that page you will need four things.
      1. Consumer Key (API Key)
      2. Consumer Secret (API Secret)
        Click the “Create my access token” button.
      3. Access Token
      4. Access Token Secret
  3. Now re-open your R session and enter the following code, using those four pieces of information
    library(twitteR)
    consumer_key <- "your_consumer_key"
    consumer_secret <- "your_consumer_secret"
    access_token <- "your_access_token"
    access_secret <- "your_access_secret"
    setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
  4. Set up authentication (the OAuthFactory step requires the ROAuth package)
    library(ROAuth)
    download.file(url="http://curl.haxx.se/ca/cacert.pem",
                  destfile="C:\\Users\\CHUBING\\Desktop\\Final Project\\text_mining_and_web_scraping\\cacert.pem",
                  method="auto")
    authenticate <- OAuthFactory$new(consumerKey=consumer_key,
                                     consumerSecret=consumer_secret,
                                     requestURL="https://api.twitter.com/oauth/request_token",
                                     accessURL="https://api.twitter.com/oauth/access_token",
                                     authURL="https://api.twitter.com/oauth/authorize")
    setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
    save(authenticate, file="twitter authentication.Rdata")
  5. Scrape Twitter data (60,000 raw tweets in total for this project; each call below retrieves up to 2,000 at a time)
    # Latest tweets
    tweets_trump <- searchTwitter('Donald+Trump', lang="en", n=2000, resultType="recent")
    # Loop over the tweets and extract the text (lapply is base R, so no extra package is needed)
    donald <- lapply(tweets_trump, function(t) t$getText())
  6. The raw Twitter result is rather messy. First, convert it to a data frame to make it easier to work with in ggplot.
    # Create a data frame from the list of status objects
    # (twListToDF() expects the status objects themselves, not the extracted text)
    tweet.df <- twListToDF(tweets_trump)
  7. Twitter no longer provides location data for each tweet, so to gather location information you will have to use the location field that users provide in their profiles.
    # Look up Twitter user information by screen name
    users <- lookupUsers(tweet.df$screenName)
    # Create a data frame from the list of user objects
    users_df <- twListToDF(users)
    # Merge the two data frames together (the result must be assigned;
    # later steps use it as `total`)
    total <- merge(tweet.df, users_df, by="screenName")
  8. Clean up the messy data in the merged data frame (str_replace_all() comes from the stringr package)
    library(stringr)
    # Remove duplicates
    total <- unique(total)
    # Place NAs in empty cells in the location column
    total$location <- replace(total$location, total$location=="", NA)
    # Remove rows with NA locations
    total <- total[complete.cases(total$location), ]
    # Replace emojis and other non-graphical characters with spaces
    total$text_clean <- str_replace_all(total$text, "[^[:graph:]]", " ")
    # Remove URLs from the text (do this before stripping punctuation,
    # or the "://" the pattern relies on will already be gone)
    total$text_clean <- gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", total$text_clean)
    # Remove punctuation
    total$text_clean <- gsub("[[:punct:]]", "", total$text_clean)
    # Remove control characters
    total$text_clean <- gsub("[[:cntrl:]]", "", total$text_clean)
    # Remove digits
    total$text_clean <- gsub("\\d+", "", total$text_clean)
    # The URL expression explained:
    # (f|ht)  match "f" or "ht"
    # tp      match "tp"
    # (s?)    optionally match "s" if it's there
    # ://     match "://"
    # (.*)    match everything up to
    # [.]     a literal period
    # [a-z]+  followed by one or more lowercase letters
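The join and filtering in steps 7 and 8 can be sanity-checked without any Twitter credentials. A minimal sketch with made-up data frames standing in for tweet.df and users_df (the screen names and locations below are invented for illustration):

```r
# Hypothetical stand-ins for the tweet and user data frames
tweet.df <- data.frame(screenName = c("alice", "bob", "alice"),
                       text = c("t1", "t2", "t3"),
                       stringsAsFactors = FALSE)
users_df <- data.frame(screenName = c("alice", "bob"),
                       location = c("NYC", ""),
                       stringsAsFactors = FALSE)

# Join on the shared screenName column, as in step 7
total <- merge(tweet.df, users_df, by = "screenName")

# Empty locations become NA and their rows are dropped, as in step 8
total$location <- replace(total$location, total$location == "", NA)
total <- total[complete.cases(total$location), ]
total
```

merge() performs an inner join on the `by` column by default, so any tweet whose author is not returned by lookupUsers() drops out at the merge, and any user with a blank profile location drops out at the NA filter.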
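The step 8 cleaning regexes can also be dry-run on a few made-up strings, with no API access needed. The clean_text() helper and sample tweets below are hypothetical, not part of the original post; it uses base R's gsub() throughout (equivalent here to the str_replace_all() call), and runs the URL pattern before punctuation removal, since stripping punctuation first would destroy the "://" the URL pattern needs:

```r
# Hypothetical helper applying the step 8 cleaning pipeline to a character vector
clean_text <- function(x) {
  x <- gsub("[^[:graph:]]", " ", x)                 # non-graphical chars -> spaces
  x <- gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)  # strip URLs first
  x <- gsub("[[:punct:]]", "", x)                   # then punctuation
  x <- gsub("[[:cntrl:]]", "", x)                   # control characters
  x <- gsub("\\d+", "", x)                          # digits
  x
}

# Made-up sample tweets
sample_tweets <- c("Check this out! https://example.com/page now",
                   "Vote 2016, really?!")
clean_text(sample_tweets)
```

Note that the URL pattern is approximate: it matches only up to the last period-plus-letters in the string, so trailing path fragments like "/page" can survive it and are only partly mopped up by the punctuation pass.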

Restricted access period

From 8am to 7pm, Monday through Friday, from April 24th through Thursday, May 11th, Watson Library is open to all current graduate students, undergraduates in economics, faculty, and staff.
At all other times during this period, i.e. after 7pm until close and all open hours on Saturday and Sunday, Watson Library is available to all University affiliates.

If you need materials or services during the 8am to 7pm period designated above and you are not in one of the groups listed, please ask a library staff member for help.

Study space is available in Butler Library, Lehman Library (3rd floor of the International Affairs Building), and the Science and Engineering Library (campus level of the NorthWest Corner Building).

Watson Library closed – Tuesday 3/14

Due to the winter storm expected to arrive tomorrow (Tuesday, March 14th), the Watson Library of Business & Economics will be closed.

Butler Library will be open as study space from 9 a.m. to 9 p.m. with a Public Safety presence only. All services, including stack access and circulation, are suspended. All other libraries are closed.

For more information regarding Columbia University campus closings, please visit…

preparedness.columbia.edu

For a schedule of Watson Library’s hours of operation, please visit…

https://hours.library.columbia.edu/?library=business


Extended Library hours for Mid-term Exams

Watson Library will remain open until Midnight from Sunday, March 5th, through Thursday, March 9th.  Here are our hours during the mid-term exam period…

Friday 3/03 : 8am-9pm

Saturday 3/04 : 10am-9pm

Sunday 3/05 : 10am-Midnight

Monday 3/06 : 8am-Midnight  [restricted access until 7pm]

Tuesday 3/07 : 8am-Midnight  [restricted access until 7pm]

Wednesday 3/08 : 8am-Midnight  [restricted access until 7pm]

Thursday 3/09 : 8am-Midnight  [restricted access until 7pm]

Friday 3/10 : 8am-5pm

Saturday 3/11 : CLOSED

Sunday 3/12 : CLOSED

Watson Library’s hours of operation for the rest of the Spring semester can be found here…

https://hours.library.columbia.edu/?library=business


Restricted Access Period for Mid-term Exams

From 8am to 7pm, Monday to Friday, February 20th through March 9th, Watson Library is open to all current graduate students, undergraduates in economics, faculty, and staff.

At all other times during this period, i.e. after 7pm until close and all open hours on Saturday and Sunday, Watson Library is available to all University affiliates.

If you need materials or services during the 8am to 7pm period designated above and you are not in one of the groups listed, please ask a library staff member for help.

Study space is available in Butler Library, Lehman Library (3rd floor of International Affairs Building), and the Science and Engineering Library (campus level of the NorthWest Corner Building).  http://library.columbia.edu/services/study-spaces.html

And check out the designated talk zones now available across the Columbia Libraries!  https://blogs.cul.columbia.edu/business/2015/09/01/talk-zones-come-to-columbia-libraries/

If you have any questions, please contact our staff at 212-854-7804 or business_circulation@library.columbia.edu.  Thanks.

ADI Hackathon – February 3rd-4th

The Application Development Initiative Group (ADI) will be hosting a Hackathon in the Watson Library of Business & Economics starting at 4pm this Friday, February 3rd, and ending at noon on Saturday, February 4th.  During this time, use of the main ground floor reading room and study rooms (G01-G17) will be restricted to Hackathon participants.  The rest of the Library (including the computer workstations, 1M & 2M study spaces, and mezzanine study rooms) will remain accessible to all eligible patrons as usual.

To view the hours of operation for Watson Library, please visit this page… https://hours.library.columbia.edu/?library=business

To contact the Watson Library staff, please call 212-854-7804 during our open hours or send an email to business_circulation@library.columbia.edu.  Thanks.

New Dataset: Infogroup’s Historical Business File

Columbia University has access to Infogroup's Historical Business File via WRDS.

Infogroup gathers location-related business and residential data from various public sources, such as local yellow pages and credit card billing data. The Infogroup database at WRDS contains two packages, Historical Business and Historical Residential, both of which focus largely on the US.

The Historical Business data starts in 1997, while Historical Residential dates back to 2006. The Historical Business data is a calendar year-end snapshot of local business data. It contains business identification, location, industry, corporation hierarchy, employment, sales, and other fields.

For more information about this dataset, please visit this page

Figure 1, Table 1, and Table 2 (images embedded in the original post)

Using the Python, R, and MATLAB APIs for Datastream!

Did you know that, in addition to the Excel API, you can now also extract data easily from Datastream using Python, R, or MATLAB?

For Python:

PyDatastream is a Python interface to the Thomson Dataworks Enterprise (DWE) SOAP API (which is not free), with some convenience functions for retrieving Datastream data specifically. The package requires valid credentials for this API.

It is a Python API for time-series data that abstracts away the underlying database, providing a powerful and unified interface. It offers an easy way to insert time-series data points and automatically downsample them into multiple levels of granularity for efficient querying at various time scales.


Alternatives for other scientific computing languages:

Please note that only PhD students and faculty can request credentials. Please email business@library.columbia.edu if you have any questions.