Getting Started with TwitteR

Recently, I used TwitteR package to scrap data from Twitter for sentiment data analysis. Here is a step by step guide on how to get started.

  1. Download the TwitteR package and make it available in your R session 
  2. Set up Twitter account for API access
    • You need to have a twitter account.
    • Go to https://apps.twitter.com and sign on with your twitter account.
    • Once you have signed in you should see the following screen, and simply click on the button that says “Create New App”.
    • Once you click on the “Create New App” button you will go to the Create an Application screen. There are three fields, a click box and a button you need to click on this page. The three fields are Name, Description and Website. The name of the application must be unique so this may take a few tries. The description needs to be at least 10 character long, and put in a website. If you do not have one you can use https://bigcomputing.blogspot.com. Now click the “Yes, I agree” box for the license agreement and click the “Create your Twitter application”.
    • Once you successfully create an application you will be taken to the application page. Once there click on the “key and access token” tab. From that page you are going to need four things.
      1. Consumer Key (API Key)
      2. Consumer Secret (API Secret)
        Click the “Create my access token” button.
      3. Access Token
      4. Access Token Secret
  3. Now re-open your R session and enter the following code using those four pieces of information
    consumer_key <- “your_consumer_key”
    consumer_secret <- “your_consumer_secret”
    access_token <- “your_access_token”
    access_secret <- “your_access_secret”
    setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret) 
  4. Set up authentication
    download.file(url=”http://curl.haxx.se/ca/cacert.pem”,
    destfile=”C:\\Users\\CHUBING\\Desktop\\Final Project\\text_mining_and_web_scraping\\cacert.pem”, method=”auto”)authenticate <- OAuthFactory$new(consumerKey=consumer_key,
    consumerSecret=consumer_secret,
    requestURL=”https://api.twitter.com/oauth/request_token”,
    accessURL=”https://api.twitter.com/oauth/access_token”,
    authURL=”https://api.twitter.com/oauth/authorize”)setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)save(authenticate, file=”twitter authentication.Rdata”)
  5. Scrap Twitter Data – 60000 raw tweets
    #latest tweets
    tweets_trump <- searchTwitter(‘Donald+Trump’, lang=”en”,n=2000,resultType=”recent”)# Loop over tweets and extract text
    library(plyr)
    donald = lapply(tweets_trump, function(t) t$getText())
  6. Once you have the twitter data file, it’s rather messy. First, we need to convert it to data frame to make it easier for ggplot.
    # Create dataframe from twitter result matrix
    tweet.df <- twListToDF(donald)
  7. Twitter no longer provides location data for each tweet, so to gather location information, you will have to use the location information provided by users in their user profiles.
    # Look up twitter users information by screen names
    users <- lookupUsers(tweet.df$screenName)
    # Create dataframe from twitter users matrix
    users_df <- twListToDF(users)
    # Merge two data frames together
    merge(tweet.df, users_df, by=”screenName”)
  8. Clean up messy data in Twitter data
    # Remove duplicates
    total <- unique(total)
    # Place NAs in empty cells in location column
    total$location <- replace(total$location, total$location==””,NA)
    # Remove location rows with NAs
    total <- total[complete.cases(total$location), ]
    # Clean up emojis
    total$text_clean <- str_replace_all(total$text,”[^[:graph:]]”, ” “)
    # Remove punctuation
    total$text_clean <- gsub(“[[:punct:]]”, “”, total$text_clean)
    # Remove control characters
    total$text_clean <- gsub(“[[:cntrl:]]”, “”, total$text_clean)
    # Remove digits
    total$text_clean <- gsub(‘\\d+’, ”, total$text_clean)
    # Remove URLs from string
    total$text_clean <- gsub(“(f|ht)tp(s?)://(.*)[.][a-z]+”, “”, total$text_clean)#The above expression explained:#? optional space
    #(f|ht) match “f” or “ht”
    #tp match “tp”
    #(s?) optionally match “s” if it’s there
    #(://) match “://”
    #(.*) match every character (everything) up to
    #[.|/] a period or a forward-slash
    #(.*) then everything after that