Category Archives: Statistics & Data

Introduction to R ‘plm’ package (3)

In the ‘plm’ package blog (2),  we’ve gotten regression outputs for both fixed and random effect models. One common question after getting regression output is to figure out which model should be chosen using Hausman test. The fixed effect output is names as “grun.fe” and the random effect output is names as “”. The function of Hausman test is phtest(). 

  • Hausman test


The null hypothesis of Hausman Test is that the preferred model is random effects, and the alternative hypothesis is that the model is fixed effects. If p-value is less than 0.05, reject null hypothesis. In this example, p-value is greater than 0.05, so we choose random effects model.

  • Test of individual and/or time effects

This is used only after running the pooled OLS model. To test the presence of individual and time effects in the Grunfeld example, we will firstly run a pooled OLS regression

olsmodel <- plm(inv ~ value + capital, data = Grunfeld, model = “pooling”)

#please notice that in this regression, we don’t define panel data but use the original dataset directly because it’s not a panel data analysis

plmtest(olsmodel, effect = “twoways”, type = “ghm”)

Or we could combine the two steps above together, and write syntax as following.

plmtest(inv ~ value + capital, data = Grunfeld, effect = “twoways”, type = “ghm”)

Introduction to R ‘plm’ package (2)

The first blog for “plm” package provides basic information about how to define panel data. This blog aims to introduce syntax for both fixed and random effects regression models.

The dataset “Grunfeld” is a balanced panel of 10 observational units (firms) from 1935 to 1954, and we are going to use this dataset to run both fixed and random effects models. You could go back to the first blog and know how to load this dataset.

Firstly, we define the panel data as “Grunfeld_panel”

Grunfeld_panel <- pdata.frame(Grunfeld, index = c(“firm”, “year”))

  • Fixed effects
    grun.fe <- plm(inv ~ value + capital, data = Grunfeld_panel, model = “within”)

  • Random effects <- plm(inv ~ value + capital, data = Grunfeld_panel, model = “random”)

The model argument here, could be “within” (for the fixed effects model), “random” (for the random effects model), “pooling” (for the pooled OLS model), “fd” (for the first-differences model) and “between” (for the between model).


Introduction to R ‘plm’ package (1)

This blog is an introduction to use ‘plm’ package for panel data analysis. Panel data means datasets with the same observations (respondents) and variables across different time units (such as year, month). And it’s common for researchers to have an unbalanced panel dataset in practice (for example, GDP data could be missing in different years for different countries if you check on the World Bank website).

This blog will not focus on statistical theories, so please read ‘’Panel Data Econometrics in R: The plm Package” written by Croissant and Millo for more theoretical details.

There are several built-in datasets in ‘plm’ package which you could use after installing and loading the package.



data(“EmplUK”, package = “plm”)

data(“Produc”, package = “plm”)

data(“Grunfeld”, package = “plm”)

data(“Wages”, package = “plm”)

In order to define panel data in R, you need both observation ID and time ID, then use the function pdata.frame().

For dataset “EmplUK”, we have known that observation ID is “firm”, and time ID is “year”, so the panel dataset is defined as following:

EmplUK_panel <- pdata.frame(EmplUK, index = c(“firm”, “year”))

For dataset “Wages”, both observation ID and time ID are missing, but we know it’s a well-balanced dataset (that means no missing observation in any time unit) including 7-year data of 595 heads of household which is already sorted in id and time. Then we could define this panel data by simply indicating the number of observations, 595, like following:  

Wages_panel <- pdata.frame(Wages, index = 595)

Introduction to R ‘survey’ package (4)

In the previous 3 blogs, I have introduced how to define survey data and do descriptive statistics (here are the links for R ‘survey’ package blog (1) (2) (3)). Today, I am going to introduce basic regression syntax in this package.

svyglm() # generalized linear regression using survey data

Let’s use the two-stage cluster sample (we have introduced in blog (2)) “apiclus2” as an example. Let’s assume api00 is the dependent variable, ell, meals and mobility are independent variables, survey data is defined using svydesign() function, named as “dclus2”. The syntax of this generalized linear model is written as following.

svyglm(api00 ~ ell + meals + mobility, design = dclus2)

# The default family is linear regression, if you aim for non-linear regression, for example binomial logistic regression,  the syntax could be modified as following. stype is the dependent variable in this model.

svyglm(stype ~ ell + meals + mobility, design = dclus2, family=binomial)


The full version of manual about package ‘survey’ is here. Please check more functions and detailed descriptions, arguments and examples in this link. If you would like to discuss any further questions based on this blog, feel free to email

Introduction to R ‘survey’ package (3)

After defining your survey dataset (please refer back to ‘survey’ package blog (1) & (2) ), you could use the functions below to describe your survey data and estimate population.

Let’s still use apiclus1 data. After svydesign() function, you have a designed survey dataset, dclus1, which we designed in the last week. In this dataset, there are several variables we are going to mention in the following syntax.

api00: continuous variable, integer

api99: continuous variable, integer

enroll: continuous variable, integer

sch.wide: categorical variable, which is also recognized as factor in R

stype: categorical variable, which is also recognized as factor in R

  • svymean()

svymean(~api00, dclus1) #calculate survey mean of variable api00 in defined survey dataset dclus1

  • svyby()

svyby(~api99, ~stype, dclus1, svymean)# calculate survey mean of variable api99 by variable stype

  • svychisq()

svychisq(~sch.wide+stype, dclus1) #contingency tables and chisquared tests between sch.wide and stype. The default (statistic=”F”) is the Rao-Scott second-order correction. And there are other options for “statistics”, such as “Wald”, “Lincom”.

  • svyhist()

svyhist(~enroll, dclus1, main=”Survey weighted”,col=”purple”,ylim=c(0,1.3e-3)) #create a weighted histogram graph for variable enroll, named as “Survey weighted”, colored as purple, range of y axis is from zero to 0.0013

  • svyboxplot()

svyboxplot(enroll~stype,dclus1,all.outliers=TRUE) #create a boxplot for variable enroll, grouped by variable stype, and keep all the outliers

  • svyplot()

svyplot(api00~api99, design=dclus1, style=”bubble”) # create a scatter plot graph for api00 and api99 using bubble as the scatter shape

Introduction to R package ‘survey’ (2)

Here are more types of survey data except the case (simple random sample) we introduced before.

The ‘survey’ package contains several sample datasets from the California Academic Performance Index. After installing and loading the ‘survey’ package, you could import these data samples using command: data(api). And you will see 5 datasets are loaded in R, including apipop, apisrs, apistrat, apiclus1, apiclus2.

For these datasets, the variable to identify survey strata is called “stype”, the variable for sampling weights is called “pw”, and the “fpc” variable contains the population size for the stratum. These terms are all related to survey design methodology. Usually you could find details in data documentation.

The apistrat data is a stratified independent sample. If we define survey data called “dstrat”, the survey design syntax is like following.

dstrat <- svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)

The apiclus1 data is a cluster random sample (without strata). If we define survey data called “dclus1”, the survey design syntax is like following.

dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)

The apiclus2 data is a two-stage cluster sample. If we define survey data called “dclus2”, the survey design syntax is like following.

dclus2<-svydesign(id=~dnum+snum, fpc=~fpc1+fpc2, data=apiclus2)

The examples given above are for your reference. In reality, your svydesign() syntax is customized based on your data sample. svydesign() is the first and important step in R for survey data analysis. Before doing this step, you must understand your survey data design methodology well. If you have any further questions, feel free to reach out to the Research Data Services (

In the next two weeks, I will introduce how to use ‘survey’ package for descriptive statistics and regressions.

Introduction to R package ‘survey’ (1)

If you are using R for survey data analysis, you might find the ‘survey’ package is useful for you.

I assume that you have already known how to read/import data in R, so this blog will skip the steps of data cleaning and loading. After importing survey data in R, here are some functions you must know for survey data analysis.

All the functions introduced in this blog are with prefix “svy”. The first step is to define your survey data. The command is svydesign(). This week’s blog will introduce how to design simple random sampling (SRS) data, and in the next week, I will post more information about how to design stratified and clustered survey data.

Simple Random Sampling: The sample subjects are selected by an equal random chance.

svydesign(ids = ~1, strata = NULL, fpc = rep(N, n), data = dat)

ids = ~1 means there is no clustering
strata = NULL means there was no stratification
fpc = rep(N, n) N is population size, n is sample size
data = dat dat is your survey dataset name

You could assign a new name to your survey data, so that you could use it in the following data analysis steps.

For example, you received 305 surveys randomly from all the 6291 residents in a neighborhood about their basic demographic and socio-economic characteristics (such as age, gender, race, household income) and attitudes of online shopping. You have a survey dataset named ‘shopping’, with n=305 and N=6291, and you are going to define this dataset as survey data called “shoppingsvy”.

Firstly, import ‘shopping’ dataset in R. Then define your data as a simple random sampling data named as “shoppingsvy” as following:
shoppingsvy <- svydesign(ids=~1, strata=NULL, fpc = rep(6291, 305), data=shopping)

Survey Documentation and Analysis (SDA)

Survey Documentation and Analysis (SDA) is a web based interface that allows access and analysis of data. The data can be accessed from IPUMS or from  the Inter-university Consortium for Political and Social Research (ICPSR).

SDA allows you to:

  • Browse the codebook describing a datasetsda
  • Calculate frequencies or crosstabulation (with charts)
  • Do comparison of means
  • Calculate a correlation matrix
  • Compare correlations
  • Perform multiple regression
  • Perform logit/probit regression
  • List values of individual cases
  • Recode variables (into public work area)
  • Compute a new variable
  • List/delete derived variables
  • Download a customized subset

SDA allows you to analyze data at a level appropriate to your level of experience. SDA can only analyze datasets that reside on an SDA server. If you would like to test drive SDA, or to see if SDA is useful for your research check out their General Information page.

R Open Labs – Basic Syntax

Hi all, last Wednesday we kicked off the first session of R Open Lab in the DSSC( based in Lehman Library). We started with basic syntax and briefly discussed how to explore the features of our datasets. We used data from Wal-Mart and we will continue exploring this dataset for the next few sessions. Beginners are welcome to join!


See you this Wednesday  10/12/2016 at 10:00 AM!

Python Open Labs Session – 2

In the second session of Python Open Labs, the focus was on conditional statements. Topics covered include : Conditional Operators, Conditional Statements, Boolean Expressions, Python’s obsession with indentation (and the idea of scope!), Two Way Decisions, Multi-way Decisions (using if-else, and elif).

The code on slides is in Python 2.7 – So if you use anything > 2.7, please remember to replace raw_input with input and add parenthesis  to your print statements!

The course material from all Python Open Labs Session is available on Lion Mail Drive link here.
PS: Google drive link requires Columbia UNI login, in case you don’t have one or you prefer using another email ID,  please request on the drive link to grant access and Kunal will do so.