Introduction to R package ‘survey’ (2)

This post covers survey designs beyond the simple random sample introduced in the previous post.
The ‘survey’ package contains several sample datasets from the California Academic Performance Index. After installing and loading the ‘survey’ package, you can load these datasets with the command data(api). Five datasets are then available in R: apipop, apisrs, apistrat, apiclus1, and apiclus2.
In the sample datasets, the variable that identifies the survey strata is called “stype” (school type), the variable for sampling weights is called “pw”, and the “fpc” variable contains the population size used for the finite population correction (in apistrat, the number of schools in each stratum). These terms all come from survey design methodology. For your own data, you can usually find the corresponding variables in the data documentation.
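As a quick check that the datasets loaded, a sketch like the following could be run. It assumes the ‘survey’ package is already installed; the name sizes is only an illustrative helper.

```r
# install.packages("survey")   # run once if the package is not installed yet
library(survey)
data(api)   # loads the five API example data frames

# Number of rows in each of the five data frames
sizes <- sapply(
  list(apipop = apipop, apisrs = apisrs, apistrat = apistrat,
       apiclus1 = apiclus1, apiclus2 = apiclus2),
  nrow
)
print(sizes)
```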
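To see what these design variables look like in practice, one option is to list their distinct values in the stratified sample. This is only an exploratory sketch; design_vars is an illustrative name.

```r
library(survey)
data(api)

# Distinct combinations of stratum, sampling weight, and population size
design_vars <- unique(apistrat[, c("stype", "pw", "fpc")])
print(design_vars)

# Number of sampled schools in each stratum
print(table(apistrat$stype))
```

In a stratified sample like this one, each weight is expected to be roughly the stratum population size divided by the number of schools sampled from that stratum.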
The apistrat data is a stratified independent sample. To define a survey design object called “dstrat”, use the following syntax. Here id=~1 means there are no clusters, so each school is sampled independently within its stratum.
dstrat <- svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
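One way to sanity-check the resulting design object is to print it and to compare the sum of its sampling weights with the size of the full population data (apipop). The sketch below assumes the datasets were loaded with data(api); pop_estimate is an illustrative name.

```r
library(survey)
data(api)

dstrat <- svydesign(id=~1, strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)

# Printing the object shows the type of design that R has recorded
print(dstrat)

# summary() adds stratum sizes and the sampling probabilities
summary(dstrat)

# The weights should sum to roughly the number of schools in the population
pop_estimate <- sum(weights(dstrat))
print(pop_estimate)
print(nrow(apipop))
```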
The apiclus1 data is a cluster sample of school districts (without strata). The cluster identifier is the district number “dnum”. To define a survey design object called “dclus1”, use the following syntax.
dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
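To get a feel for the cluster structure, it may help to count how many districts (clusters) and schools the sample contains. This is a small sketch; n_clusters and n_schools are illustrative names.

```r
library(survey)
data(api)

dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)

# Each distinct district number is one sampled cluster
n_clusters <- length(unique(apiclus1$dnum))
n_schools  <- nrow(apiclus1)
cat("Sampled districts:", n_clusters, "\n")
cat("Schools in those districts:", n_schools, "\n")

# The printed design should report the same number of clusters
print(dclus1)
```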
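The apiclus2 data is a two-stage cluster sample: districts (“dnum”) are sampled first, and then schools (“snum”) are sampled within each selected district. The variables “fpc1” and “fpc2” give the population sizes at the two stages. No weights argument is needed here, because svydesign() can compute the sampling weights from the population sizes. To define a survey design object called “dclus2”, use the following syntax.
dclus2 <- svydesign(id=~dnum+snum, fpc=~fpc1+fpc2, data=apiclus2)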
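Even without a weights argument, the two-stage design object still carries sampling weights derived from the population sizes. A short sketch to confirm this, where w is an illustrative name:

```r
library(survey)
data(api)

dclus2 <- svydesign(id=~dnum+snum, fpc=~fpc1+fpc2, data=apiclus2)

# Extract the sampling weights that svydesign() derived from fpc1 and fpc2
w <- weights(dclus2)

# One weight per sampled school
print(length(w))
print(summary(w))
```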
The examples above are for reference only. In practice, your svydesign() call has to be tailored to your own data. Because every later analysis depends on it, svydesign() is the first and most important step in survey data analysis in R. Before this step, make sure you understand how your survey sample was designed. If you have any further questions, feel free to reach out to Research Data Services (data@library.columbia.edu).
In the next two weeks, I will show how to use the ‘survey’ package for descriptive statistics and regressions.