Python Open Lab November 9

This week we learned pandas, a package built on top of NumPy. Its core data structure is the DataFrame, which is very useful for working with tabular data. A DataFrame is a two-dimensional labeled table with rows and columns. It supports heterogeneous column types and missing data.

Pandas is great for loading files. It has separate functions for reading CSV, Excel, HTML and SQL sources. For example, to load a CSV file, use the function read_csv().

import pandas as pd

taxi_data = pd.read_csv('./files/green_tripdata_2018-02.csv')

After this, the file is loaded as a DataFrame. We can use the function head() to check the first five rows of the DataFrame. Its output also shows the column labels and the row index. Another useful function is describe(), which shows statistics such as count, mean, std, min, 25%, 50%, 75% and max for each numeric column.

To understand DataFrame, we should first have a good understanding of Series. A Series is very similar to a list. The main difference is that a Series has an index label for each element.
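As a small sketch of head() and describe(), the snippet below builds a tiny DataFrame in memory instead of reading the taxi file. The variable name `trips`, the column names and all values are invented for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the taxi data; columns and values are invented.
trips = pd.DataFrame({
    "distance": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "fare": [5.0, 7.5, 10.0, 12.5, 15.0, 17.5],
})

first_rows = trips.head()    # first five rows by default
summary = trips.describe()   # one row per statistic, one column per numeric column

print(first_rows)
print(summary)
```

Passing a number, as in trips.head(3), changes how many rows are shown.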

code:

series_data = pd.Series([0.25, 0.5, 0.75, 1.0])

print(series_data)


console:

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In the above example, the index for value 0.25 is 0, the index for value 0.5 is 1, and so on. That is what the default index of a Series looks like.

The index of a Series can also be made of strings.

code:

series_data = pd.Series([0.25, 0.5, 0.75, 1.0],index=['a', 'b', 'c', 'd'])

print(series_data)


console:

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

To access value 0.50, we can use series_data['b'].

A DataFrame is similar to a Series. The major difference is that it is two-dimensional: it has columns as well as rows.

 code:

 import numpy as np

 df = pd.DataFrame(np.random.rand(3, 2), columns=['foo', 'bar'], index=['a', 'b', 'c'])

 print(df)




 console:

             foo        bar

  a          0.77       0.52

  b          0.34       0.27

  c          0.69       0.51

If we want to get the second column, we can use df.bar or df['bar']. There are also two ways to access a specific cell. The first is the loc indexer, which accesses cells by row label and column label. The second is the iloc indexer, which accesses cells by row position and column position, both counted from 0. Note that both are used with square brackets, not parentheses. For example, to get the value 0.27 in the above DataFrame, we can use df.loc['b', 'bar'] or df.iloc[1, 1].

Other pandas features were also introduced, such as unique() for listing distinct values and fillna() for dealing with missing values. Overall, pandas is a very useful tool for handling and analyzing tabular data.
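The two access styles can be sketched as follows. This version uses the fixed values printed in the console output above rather than np.random.rand, so the result is reproducible. The names `by_label` and `by_position` are only illustrative.

```python
import pandas as pd

# Same layout as the example above, with fixed values instead of random ones.
df = pd.DataFrame(
    {"foo": [0.77, 0.34, 0.69], "bar": [0.52, 0.27, 0.51]},
    index=["a", "b", "c"],
)

second_column = df["bar"]        # same column as df.bar
by_label = df.loc["b", "bar"]    # row label 'b', column label 'bar'
by_position = df.iloc[1, 1]      # second row, second column (positions start at 0)

print(by_label, by_position)
```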
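A minimal sketch of unique() and fillna(), using an invented DataFrame (the `city` and `score` columns are hypothetical, not from the lab data):

```python
import numpy as np
import pandas as pd

# Invented example data with one missing value (NaN).
scores = pd.DataFrame({
    "city": ["NYC", "Boston", "NYC"],
    "score": [1.0, np.nan, 3.0],
})

cities = scores["city"].unique()       # distinct values, in order of first appearance
filled = scores["score"].fillna(0.0)   # new Series with missing values replaced by 0.0

print(cities)
print(filled)
```

fillna() returns a new object here, so the original `scores` still contains the missing value.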

Introduction to R ‘plm’ package (3)

In ‘plm’ package blog (2), we obtained regression output for both the fixed and random effects models. A common next question is which of the two models to choose, and the Hausman test is used to answer it. The fixed effects output is named “grun.fe” and the random effects output is named “grun.re”. The function for the Hausman test is phtest().

  • Hausman test

phtest(grun.fe, grun.re)

The null hypothesis of the Hausman test is that the random effects model is preferred, and the alternative hypothesis is that the fixed effects model is preferred. If the p-value is less than 0.05, we reject the null hypothesis and use fixed effects. In this example, the p-value is greater than 0.05, so we choose the random effects model.

  • Test of individual and/or time effects

This test is used after running the pooled OLS model. To test for the presence of individual and time effects in the Grunfeld example, we first run a pooled OLS regression.

olsmodel <- plm(inv ~ value + capital, data = Grunfeld, model = "pooling")

# Please note that this regression uses the original dataset directly instead of the panel data defined with pdata.frame(). plm() then takes the first two columns (firm and year) as the individual and time indexes.

plmtest(olsmodel, effect = "twoways", type = "ghm")

Alternatively, we could combine the two steps above and write the syntax as follows.

plmtest(inv ~ value + capital, data = Grunfeld, effect = "twoways", type = "ghm")

Python Open Lab November 2

This week we learned about file IO. IO means input and output, so the content is basically about reading and writing files.

Before doing any operation on a file, we need to open it. The command is open(filename, mode). ‘filename’ needs to include the path of the file. There are two ways to write the path: as an absolute path or as a relative path. An absolute path locates the file starting from the root of the file system, while a relative path locates it starting from the current working directory, which is usually the directory the script is run from. In a relative path, we use ‘.’ for the current directory and ‘..’ for its parent directory.

The mode describes how we open the file. Common modes are ‘r’, ‘w’ and ‘a’. ‘r’ means we open the file for reading only. ‘w’ means we open the file for writing. ‘a’ is similar to ‘w’. The difference between ‘a’ and ‘w’ is that ‘w’ erases the previous content of the file before writing, while ‘a’ appends to the end of the previous content.

An example of opening a file is:  afile = open('a.txt', 'r')    Here afile is the file object we get from opening the file a.txt in read mode.

After opening the file, we can begin to do operations on it. We learned reading first. To read a file, we must open it in ‘r’ mode. The function read() returns the whole content of the file as one string. The function readline() reads the file one line per call. The function readlines() reads all lines of the file and returns them as a list.

To write to a file, we use the function write(astring). Pass a string to write(), and the string will be written to the file. The file must have been opened in ‘w’ or ‘a’ mode.
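The three reading functions can be compared side by side in a short sketch. It writes a temporary file first so that it does not depend on an existing a.txt. The file name demo.txt is hypothetical, and the `with` statement (which closes the file automatically) was not part of the lab material.

```python
import os
import tempfile

# Create a throwaway file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w") as f:
    f.write("first\nsecond\nthird\n")

with open(path, "r") as f:
    whole = f.read()             # entire file as one string

with open(path, "r") as f:
    first_line = f.readline()    # one line, including its trailing newline
    second_line = f.readline()   # the next call continues where the last one stopped

with open(path, "r") as f:
    lines = f.readlines()        # list with one string per line

print(lines)
```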

An example of reading file and writing file:

     afile = open('a.txt', 'r')

     # read the entire content of a.txt into the variable content
     content = afile.read()

     print(content)

     afile.close()

     afile = open('a.txt', 'w')

     # erase all content in a.txt, and write "hello world" to it
     afile.write("hello world")

     afile.close()
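The difference between ‘w’ and ‘a’ can be checked with a small experiment. The sketch below uses a temporary file with the hypothetical name modes.txt, so that no real file is overwritten.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.txt")

f = open(path, "w")   # 'w' creates the file, or erases it if it already exists
f.write("one\n")
f.close()

f = open(path, "a")   # 'a' keeps the old content and writes at the end
f.write("two\n")
f.close()

f = open(path, "r")
after_append = f.read()
f.close()

f = open(path, "w")   # opening with 'w' again erases "one" and "two"
f.write("three\n")
f.close()

f = open(path, "r")
after_rewrite = f.read()
f.close()

print(repr(after_append), repr(after_rewrite))
```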

Python Open Lab October 26

This week we learned functions, which are very important for programmers. Functions are useful for procedural decomposition: they maximize code reuse and minimize redundancy.

Like a variable, a function must be defined before it is used.

def function(parameter1, parameter2…):

    do something

    return value

‘def’ is the keyword to show that we are defining a function. ‘function’ can be replaced by the function name.

An example :

def printHelloWorld():
    print("hello world")
printHelloWorld()

After defining a function, we call it whenever we want to use it. In the above example, the first two lines define a function called ‘printHelloWorld’, and the last line calls it by its name. The call is not indented, because it is not part of the function body.

Function parameters receive the values passed into the function when we call it. By using parameters, we can bring values from outside into the function.

 

The return value hands the result of the function back to the main program. In other words, the main program assigns a task to the function, the function executes the task, and after the execution the function gives the result back to the main program.

An example of using function parameters and return (get the bigger number from two sums):

def sum(x, y):
    # note: this name shadows Python's built-in sum()
    return float(x) + float(y)

num1 = sum(1.0, 2.5)  # num1 = 3.5
num2 = sum(2.4, 1.6)  # num2 = 4.0

# print the bigger of the two sums
if num1 > num2:
    print(num1)
else:
    print(num2)

Python Open Lab October 19

This week we mainly learned about conditional statements.

First we learned how to read user input from the console with the function input(). input() brings what the user types into our program as a string, so the user can define some values and the program can use them.

Then we looked at the definition of a conditional statement: when its condition is met, the code inside it is executed, otherwise that code is skipped and the program moves on to the next statement. The ‘if’ statement was introduced first. If the condition of the ‘if’ statement is satisfied, the code in the ‘if’ block is executed. An ‘else’ statement can appear below the ‘if’ statement. When the condition of the ‘if’ statement is not met, the ‘else’ block is executed instead. The ‘elif’ statement helps the program decide between several conditions. It is similar to ‘else’, but it carries a condition of its own, which is checked only if the conditions above it were not met. The structure is like this:

                        if <some condition>:

                               do A

                        elif <some condition>:

                               do B

                        else:

                               do C
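A runnable instance of this structure might look like the sketch below. The function name `describe_number` and the test values are made up for illustration. A fixed tuple of values is used instead of input() so that the example runs without a console.

```python
def describe_number(n):
    """Return a label for n using if / elif / else."""
    if n < 0:
        return "negative"    # runs only when the first condition is met
    elif n == 0:
        return "zero"        # checked only if the 'if' condition was not met
    else:
        return "positive"    # runs when none of the conditions above were met


labels = []
for value in (-3, 0, 8):
    labels.append(describe_number(value))

print(labels)
```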

We learned how to apply the ‘if-elif-else’ statement inside a loop to do more complicated tasks. Then we looked at the ‘continue’ and ‘break’ statements. ‘continue’ skips the rest of the code in the current iteration and moves on to the next iteration. ‘break’ jumps out of the current loop and ends it. Nested loops were introduced to give students a better understanding of ‘continue’ and ‘break’, since both statements affect only the innermost loop that contains them.
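The effect of ‘continue’ and ‘break’ in a nested loop can be traced with a short sketch. The loop bounds and the list name `visited` are arbitrary choices for illustration.

```python
visited = []

for row in range(3):
    for col in range(5):
        if col == 1:
            continue    # skip the rest of this inner iteration, go on to col 2
        if col == 3:
            break       # end the inner loop only; the outer loop moves to the next row
        visited.append((row, col))

print(visited)
# [(0, 0), (0, 2), (1, 0), (1, 2), (2, 0), (2, 2)]
```

For every row, column 1 is skipped and columns 3 and 4 are never reached, while the outer loop still visits all three rows.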

Python Open Lab October 12

This week we continued with strings, which are very important. Loops were also introduced, with examples such as looping over a list, a dictionary or a string.

We learned some useful string functions.

  • len(str): find the length of the string
  • str.find("ab"): search for a substring; returns the position of the first match, or -1 if there is none
  • str.rstrip(): remove whitespace from the end of the string
  • str.replace("red", "green"): replace every occurrence of one substring with another
  • str.split(","): split the string into a list at each separator
  • str.isdigit(): test whether the string consists only of digits
  • str.lower(), str.upper(): change the string to lowercase or uppercase
  • str.endswith("hello"): test whether the string ends with another string
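The functions in this list can be tried out on a sample string. The text and variable names below are invented for illustration.

```python
text = "red apple, red pear  "   # note the two trailing spaces

length = len(text)                           # number of characters, spaces included
position = text.find("apple")                # index of the first match
missing = text.find("kiwi")                  # -1 because "kiwi" does not occur
trimmed = text.rstrip()                      # trailing whitespace removed
replaced = trimmed.replace("red", "green")   # every "red" becomes "green"
parts = trimmed.split(",")                   # list of pieces around each comma

all_digits = "2018".isdigit()
not_all_digits = "20a8".isdigit()
lowered = "Hello".lower()
uppered = "Hello".upper()
ends = "say hello".endswith("hello")

print(length, position, missing)
print(replaced)
print(parts)
```

None of these functions changes `text` itself. Each one returns a new value, which is why the results are stored in new variables.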

We talked about type conversion, such as changing a value from int to string or from string to int.

Then we learned loops, which repeat steps or statements. We looked at the ‘while’ loop first. A ‘while’ loop can use an iteration variable to control the loop. There are generally two types of loop: finite and infinite. A finite loop stops once its termination condition is met. An infinite loop never stops because the termination condition is never met. We then learned the ‘for’ loop, which is very useful for iterating over a sequence. A ‘for’ loop over a list iterates over all elements of the list. A ‘for’ loop over a dictionary iterates over all of its keys. A ‘for’ loop over a string iterates over all of its characters.

With loops, we have a tool to scan data structures such as lists, dictionaries and strings without writing duplicate code.
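A brief sketch of these conversions, with made-up values. The isdigit() check is one simple way to avoid the error that int() raises for text that is not a number.

```python
count_text = "42"
count = int(count_text)    # string -> int
label = str(count + 1)     # int -> string
price = float("3.5")       # string -> float

raw = "abc"
if raw.isdigit():
    value = int(raw)
else:
    value = None           # int("abc") would raise a ValueError

print(count, label, price, value)
```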
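The loop forms described above can be sketched as follows. The data (`fruits`, `stock`) is invented for illustration.

```python
# 'while' loop controlled by an iteration variable
countdown = []
n = 3
while n > 0:
    countdown.append(n)
    n = n - 1          # without this update the condition never changes: an infinite loop

# 'for' loop over a list: one element per iteration
fruits = ["apple", "pear"]
shouted = []
for fruit in fruits:
    shouted.append(fruit.upper())

# 'for' loop over a dictionary: one key per iteration
stock = {"apple": 3, "pear": 5}
keys = []
for name in stock:
    keys.append(name)

# 'for' loop over a string: one character per iteration
letters = []
for ch in "abc":
    letters.append(ch)

print(countdown, shouted, keys, letters)
```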

Introduction to R ‘plm’ package (2)

The first blog on the “plm” package provides basic information about how to define panel data. This blog introduces the syntax for both fixed and random effects regression models.

The dataset “Grunfeld” is a balanced panel of 10 observational units (firms) from 1935 to 1954, and we are going to use this dataset to run both fixed and random effects models. Please go back to the first blog to see how to load this dataset.

First, we define the panel data as “Grunfeld_panel”.

Grunfeld_panel <- pdata.frame(Grunfeld, index = c("firm", "year"))

  • Fixed effects
    grun.fe <- plm(inv ~ value + capital, data = Grunfeld_panel, model = "within")
    summary(grun.fe)

  • Random effects
    grun.re <- plm(inv ~ value + capital, data = Grunfeld_panel, model = "random")
    summary(grun.re)

The model argument here can be "within" (for the fixed effects model), "random" (for the random effects model), "pooling" (for the pooled OLS model), "fd" (for the first-differences model) or "between" (for the between model).

 

Introduction to R ‘plm’ package (1)

This blog is an introduction to using the ‘plm’ package for panel data analysis. Panel data are datasets in which the same observations (respondents) and variables are recorded across different time units (such as years or months). In practice, it is common for researchers to have an unbalanced panel dataset (for example, GDP data on the World Bank website may be missing in different years for different countries).

This blog will not focus on statistical theory, so please read “Panel Data Econometrics in R: The plm Package” by Croissant and Millo for more theoretical details.

There are several built-in datasets in the ‘plm’ package which you can use after installing and loading the package.

install.packages("plm")

library(plm)

data("EmplUK", package = "plm")

data("Produc", package = "plm")

data("Grunfeld", package = "plm")

data("Wages", package = "plm")

In order to define panel data in R, you need both an observation ID and a time ID, which you then pass to the function pdata.frame().

For the dataset “EmplUK”, we know that the observation ID is “firm” and the time ID is “year”, so the panel dataset is defined as follows:

EmplUK_panel <- pdata.frame(EmplUK, index = c("firm", "year"))

For the dataset “Wages”, both the observation ID and the time ID are missing, but we know it is a balanced dataset (meaning no observation is missing in any time unit) containing 7 years of data on 595 heads of household, already sorted by ID and time. We can then define this panel data by simply indicating the number of observations, 595, as follows:

Wages_panel <- pdata.frame(Wages, index = 595)

R Open Lab Fall 2018 – Text data in R

R is also a powerful tool for working with text data. This time we first clarified the difference between a character and a string by practicing some tricky examples, and covered some basic operations on text data such as taking substrings, combining and replacing. Then a txt file was introduced for attendees to experiment with. They found the results very interesting.

Here is the link to our open lab’s GitHub repository: https://github.com/wbh0912/R-Open-Lab-Fall-2018

If you have further questions regarding topics covered in the material, please feel free to drop in during consultation hours or leave a comment.

Introduction to R ‘survey’ package (4)

In the previous 3 blogs, I introduced how to define survey data and compute descriptive statistics (here are the links for R ‘survey’ package blog (1) (2) (3)). Today, I am going to introduce the basic regression syntax in this package.

svyglm() # generalized linear regression using survey data

Let’s use the two-stage cluster sample “apiclus2” (introduced in blog (2)) as an example. Assume api00 is the dependent variable, and ell, meals and mobility are the independent variables. The survey design is defined using the svydesign() function and named “dclus2”. The syntax of this generalized linear model is written as follows.

svyglm(api00 ~ ell + meals + mobility, design = dclus2)

# The default family is gaussian, which gives linear regression. For another model, for example binomial logistic regression, the syntax could be modified as follows. stype is the dependent variable in this model.

svyglm(stype ~ ell + meals + mobility, design = dclus2, family=binomial)

 

The full manual for the ‘survey’ package is here. Please check this link for more functions, with detailed descriptions, arguments and examples. If you would like to discuss any further questions based on this blog, feel free to email data@library.columbia.edu.