This post is an introduction to using the ‘plm’ package for panel data analysis. Panel data are datasets in which the same observations (respondents) and variables are recorded across different time units (such as years or months). In practice, it is common for researchers to have an unbalanced panel dataset: for example, GDP data on the World Bank website may be missing in different years for different countries.
This post does not focus on statistical theory, so please read “Panel Data Econometrics in R: The plm Package” by Croissant and Millo for more theoretical details.
The ‘plm’ package includes several built-in datasets, which you can use after installing and loading the package.
install.packages("plm")
library(plm)
data("EmplUK", package = "plm")
data("Produc", package = "plm")
data("Grunfeld", package = "plm")
data("Wages", package = "plm")
To define panel data in R, you need both an observation ID and a time ID, which you then pass to the function pdata.frame().
For the dataset “EmplUK”, we know that the observation ID is “firm” and the time ID is “year”, so the panel dataset is defined as follows:
EmplUK_panel <- pdata.frame(EmplUK, index = c("firm", "year"))
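To check that the index was picked up as intended, you can inspect the structure of the panel. The following is a minimal sketch using pdim(), is.pbalanced() and index() from ‘plm’. It repeats the definition above so that it runs on its own, and the variable names dims and balanced are only illustrative.

```r
library(plm)
data("EmplUK", package = "plm")
EmplUK_panel <- pdata.frame(EmplUK, index = c("firm", "year"))

# Summarise the panel: number of firms, number of time periods, total rows
dims <- pdim(EmplUK_panel)
print(dims)

# TRUE if every firm is observed in every year, FALSE otherwise
balanced <- is.pbalanced(EmplUK_panel)
print(balanced)

# First rows of the index (firm and year) attached to the panel
head(index(EmplUK_panel))
```

If is.pbalanced() returns FALSE, the panel is unbalanced, which is the situation described at the start of this post.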
For the dataset “Wages”, both the observation ID and the time ID are missing. However, we know it is a balanced panel (no observation is missing in any time unit) containing 7 years of data for 595 heads of household, and that it is already sorted by individual and then by time. We can therefore define the panel simply by passing the number of individuals, 595, as the index:
Wages_panel <- pdata.frame(Wages, index = 595)
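Because this shortcut relies on the data being balanced and sorted, it can be worth confirming that the number of rows is an exact multiple of the number of individuals before defining the panel. The sketch below assumes the 595 individuals stated above; n_individuals and n_periods are illustrative names, not part of ‘plm’.

```r
library(plm)
data("Wages", package = "plm")

n_individuals <- 595

# In a balanced panel, rows = individuals x time periods,
# so this ratio must be a whole number
n_periods <- nrow(Wages) / n_individuals
stopifnot(n_periods == round(n_periods))

# Define the panel by the number of individuals only
Wages_panel <- pdata.frame(Wages, index = n_individuals)

# Inspect the resulting dimensions and the generated index columns
print(pdim(Wages_panel))
head(index(Wages_panel), 10)
```

The output of index() shows the ID and time columns that pdata.frame() generated, which is a quick way to see whether the rows were grouped as you expected.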