Introduction to R ‘plm’ package (1)

This blog is an introduction to using the ‘plm’ package for panel data analysis. Panel data are datasets that follow the same observations (respondents) and variables across different time units (such as years or months). In practice, researchers commonly face unbalanced panel datasets (for example, on the World Bank website, GDP data may be missing for different years in different countries).

This blog will not focus on statistical theory, so please read “Panel Data Econometrics in R: The plm Package” by Croissant and Millo for more theoretical details.

There are several built-in datasets in the ‘plm’ package that you can use after installing and loading the package.

install.packages("plm")

library(plm)

data("EmplUK", package = "plm")

data("Produc", package = "plm")

data("Grunfeld", package = "plm")

data("Wages", package = "plm")

In order to define panel data in R, you need both an observation ID and a time ID; then use the function pdata.frame().

For the dataset "EmplUK", we know that the observation ID is "firm" and the time ID is "year", so the panel dataset is defined as follows:

EmplUK_panel <- pdata.frame(EmplUK, index = c("firm", "year"))

For the dataset "Wages", both the observation ID and the time ID are missing, but we know it is a balanced panel (that is, no observation is missing in any time unit) containing 7 years of data on 595 heads of household, already sorted by individual and time. We can then define this panel by simply indicating the number of observations, 595, as follows:

Wages_panel <- pdata.frame(Wages, index = 595)

R Open Lab Fall 2018 – Text data in R

R is also a powerful tool for dealing with text data. This time we first clarified the concepts of character and string by practicing some tricky examples, along with basic operations on text data such as substrings, combining, and replacing. Then a txt file was introduced for attendees to play with, and the results interested them a lot.

Here is the link to our open lab’s GitHub repository: https://github.com/wbh0912/R-Open-Lab-Fall-2018

If you have further questions regarding topics covered in the material, please feel free to drop in during consultation hours or leave a comment.

Introduction to R ‘survey’ package (4)

In the previous three blogs, I introduced how to define survey data and compute descriptive statistics (here are the links for the R ‘survey’ package blogs (1), (2), and (3)). Today, I am going to introduce basic regression syntax in this package.

svyglm() # generalized linear regression using survey data

Let’s use the two-stage cluster sample "apiclus2" (introduced in blog (2)) as an example. Assume api00 is the dependent variable; ell, meals, and mobility are the independent variables; and the survey design is defined with the svydesign() function and named "dclus2". The syntax for this generalized linear model is as follows.

svyglm(api00 ~ ell + meals + mobility, design = dclus2)

# The default family fits a linear regression. If you aim for a non-linear model, for example a binomial logistic regression, the syntax can be modified as follows; here stype is the dependent variable.

svyglm(stype ~ ell + meals + mobility, design = dclus2, family=binomial)


The full manual for the ‘survey’ package is here. Please check that link for more functions, detailed descriptions, arguments, and examples. If you would like to discuss any further questions based on this blog, feel free to email data@library.columbia.edu.

R Open Lab Fall 2018 – More visualization

Today we will explore more advanced data visualization in R. First, we will review the basic graphical functions covered in the last open lab and learn how to use additional parameters to achieve different goals. Then, we will focus on the powerful package ggplot2.

Here is the link to our open lab’s GitHub repository: https://github.com/wbh0912/R-Open-Lab-Fall-2018

If you have further questions regarding topics covered in the material, please feel free to drop in during consultation hours or leave a comment.

Introduction to R ‘survey’ package (3)

After defining your survey dataset (please refer back to ‘survey’ package blog (1) & (2) ), you could use the functions below to describe your survey data and estimate population.

Let’s continue with the apiclus1 data. After the svydesign() function you have a designed survey dataset, dclus1, which we defined last week. The following syntax uses several variables from this dataset.

api00: continuous variable, integer

api99: continuous variable, integer

enroll: continuous variable, integer

sch.wide: categorical variable, recognized as a factor in R

stype: categorical variable, recognized as a factor in R

  • svymean()

svymean(~api00, dclus1) # calculate the survey mean of variable api00 in the defined survey dataset dclus1

  • svyby()

svyby(~api99, ~stype, dclus1, svymean) # calculate the survey mean of api99 by variable stype

  • svychisq()

svychisq(~sch.wide+stype, dclus1) # contingency table and chi-squared test between sch.wide and stype. The default (statistic="F") is the Rao-Scott second-order correction; other options for statistic include "Wald" and "lincom".

  • svyhist()

svyhist(~enroll, dclus1, main="Survey weighted", col="purple", ylim=c(0,1.3e-3)) # create a weighted histogram of enroll, titled "Survey weighted", colored purple, with the y-axis ranging from 0 to 0.0013

  • svyboxplot()

svyboxplot(enroll~stype,dclus1,all.outliers=TRUE) #create a boxplot for variable enroll, grouped by variable stype, and keep all the outliers

  • svyplot()

svyplot(api00~api99, design=dclus1, style="bubble") # create a scatter plot of api00 against api99 using bubbles as the plotting shape

R Open Lab Fall 2018 – Dataframe and basic visualization

This week we stepped into the most basic but important data structure, the data frame; several ways of constructing and importing data frames were introduced. In the meantime, we reviewed extracting data by index or condition through some practice exercises. Then we focused on how to get a general picture of a dataset, both numerically and graphically, at first glance.

Here is the link to our open lab’s GitHub repository: https://github.com/wbh0912/R-Open-Lab-Fall-2018

If you have further questions regarding topics covered in the material, please feel free to drop in during consultation hours or leave a comment.

Introduction to R package ‘survey’ (2)

Here are more types of survey data beyond the simple random sample we introduced before.

The ‘survey’ package contains several sample datasets from the California Academic Performance Index. After installing and loading the ‘survey’ package, you can import these samples with the command data(api). Five datasets will be loaded into R: apipop, apisrs, apistrat, apiclus1, and apiclus2.

For these datasets, the variable to identify survey strata is called “stype”, the variable for sampling weights is called “pw”, and the “fpc” variable contains the population size for the stratum. These terms are all related to survey design methodology. Usually you could find details in data documentation.

The apistrat data is a stratified independent sample. If we define survey data called "dstrat", the survey design syntax is as follows.

dstrat <- svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)

The apiclus1 data is a cluster random sample (without strata). If we define survey data called "dclus1", the survey design syntax is as follows.

dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)

The apiclus2 data is a two-stage cluster sample. If we define survey data called "dclus2", the survey design syntax is as follows.

dclus2 <- svydesign(id=~dnum+snum, fpc=~fpc1+fpc2, data=apiclus2)

The examples above are for your reference. In reality, your svydesign() syntax is customized to your data sample. svydesign() is the first and most important step in survey data analysis in R, and before taking it you must understand your survey design methodology well. If you have any further questions, feel free to reach out to Research Data Services (data@library.columbia.edu).

In the next two weeks, I will introduce how to use ‘survey’ package for descriptive statistics and regressions.

Python Open Lab, October 5

In this week’s Python Open Lab, we learned about lists, dictionaries, and strings.

A list can store multiple elements, and many useful list functions were introduced:

  • append(x) — append an element to the tail of a list
  • insert(i, x) — insert an element at position i of a list
  • count(x) — count the occurrences of x in the list
  • remove(x) — remove the first x in the list
  • sort() — sort a list in place
  • extend(b) — append another list b to the tail of the present list

Slices of lists and two-dimensional lists were also covered.
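A minimal sketch of these list operations (the example values are just for illustration):

```python
# Demonstrating the list methods introduced above.
nums = [3, 1, 2]
nums.append(4)         # append to the tail -> [3, 1, 2, 4]
nums.insert(0, 5)      # insert at position 0 -> [5, 3, 1, 2, 4]
nums.remove(3)         # remove the first 3 -> [5, 1, 2, 4]
nums.sort()            # sort in place -> [1, 2, 4, 5]
nums.extend([6, 7])    # append another list -> [1, 2, 4, 5, 6, 7]
print(nums.count(4))   # count occurrences of 4 -> 1
print(nums[1:4])       # slice -> [2, 4, 5]

grid = [[1, 2], [3, 4]]   # a two-dimensional list
print(grid[1][0])         # -> 3
```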


A dictionary is different from a list because it stores “key: value” pairs, and its keys must be unique. A dictionary is declared with “{}”. Storing is straightforward: dict[“key”] = value stores one pair in the dictionary. To fetch a value, use its key, as in print(dict[“key”]). Dictionaries are very flexible: any value can itself be a list or another dictionary, so complex operations can be built on top of them.
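A minimal sketch of storing and fetching pairs (the variable and key names are just for illustration):

```python
# Declare an empty dictionary with {} and store "key: value" pairs.
scores = {}
scores["alice"] = 90         # store one pair
scores["bob"] = [85, 92]     # a value can itself be a list
print(scores["alice"])       # fetch a value by its key -> 90
print(scores["bob"][1])      # nested access into the list value -> 92
print("carol" in scores)     # membership test on the keys -> False
```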


For strings, anything between double quotes or single quotes is a string. String concatenation is very simple, using the “+” operator. Fetching elements and slicing a string work the same as for a list.
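A short sketch of these string operations:

```python
# Concatenation, indexing, and slicing, as described above.
s = "Open" + " " + "Lab"   # concatenation with +
print(s)        # -> Open Lab
print(s[0])     # indexing, same as a list -> O
print(s[5:8])   # slicing, same as a list -> Lab
```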


Python has great built-in types: list, dictionary, and string. They are powerful and time-saving for programmers, and a good mastery of them improves coding efficiency and helps avoid mistakes.


Kang

Python Open Lab, September 28

This week is the first week of Python Open Lab in fall 2018. We talked about fundamental concepts about Python and basic types and operations in Python.

We first looked at the basic concept of programming languages and introduced Python and its usage. The installation of Python for Windows users was also covered, so students can use Python in their terminal to write basic commands.

We learned about variables in Python and their different types, including int, float, and bool. The function type() can be used to check the type of a variable. int and float support the basic operations of addition, subtraction, multiplication, and division; bool supports the operations ‘and’ and ‘or’.
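A short sketch of these types and operations (the values are chosen only for illustration):

```python
i = 7            # int
f = 3.5          # float
b = True         # bool
print(type(i))        # check a variable's type -> <class 'int'>
print(type(f))        # -> <class 'float'>
print(i + 2, i * 3)   # addition and multiplication -> 9 21
print(i / 2)          # division of two ints gives a float -> 3.5
print(b and False)    # bool operation 'and' -> False
print(b or False)     # bool operation 'or' -> True
```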

Finally, the list was introduced as a more advanced data structure that can handle things int, float, and bool cannot.


Basic types like int, float, and bool are very important to programmers because they are used very frequently. Understanding their concepts and operations gives students a good foundation for learning other, more advanced types.


The GitHub link for the Python Open Lab is https://github.com/kangsun666/PythonOpenLab. After installing Git, students can open a terminal or Git Bash, enter a new folder, and run “git clone https://github.com/kangsun666/PythonOpenLab”; all files will then be downloaded automatically. When we post new files, students can enter their git folder in the terminal and run “git pull” to get them.


Kang

R Open Lab Fall 2018 – Functions, environment, and apply

The topic of this week is functions, environments, and the apply family in R. We first cover how to define your own function in R, then bring in the concept of environments, since the two are closely related. Finally, we go over the apply family. Recall that we learned loops as one of the basic concepts at the very beginning; you can review them in the Starter Kit and the lab featuring More Fundamentals. Although a loop is conceptually simple and intuitive, it is inefficient in R, and the apply family comes in handy in this case.

Here is the link to our open lab’s GitHub repository: https://github.com/wbh0912/R-Open-Lab-Fall-2018/blob/master/function%2C%20environment%2C%20apply.R

If you have further questions regarding topics covered in the material, please feel free to drop in during consultation hours or leave a comment.