Monthly Archives: November 2018

R Open Lab Fall 2018 – Randomness and linear regression

In today’s open lab, we covered two short topics. We first looked at how to generate random samples under given conditions, then worked through a simple example of linear regression in R. The goal of this lab is to help attendees understand how randomness works in R and how to apply a linear regression model to their own scenarios.

Here is the link to our open lab’s GitHub repository:

If you have further questions regarding topics covered in the material, please feel free to drop in during consultation hours or leave a comment.


R Open Lab Fall 2018 – R Shiny

If you still have no idea how amazing and powerful R can be, now is the time. In this open lab, I introduced my favourite part of R, Shiny, a package for building interactive web apps for data visualization, dashboards, map interaction and so on. I started by showing several fancy examples from the official Shiny gallery, then introduced ui.R and server.R separately and explained how they connect. Examples of input control widgets and outputs were given and practiced by attendees. We also discussed leaflet, a package for building maps in R. By the end of the lab, every attendee could build a simple Shiny app that they were happy with.

Here is the link to our open lab’s GitHub repository:

If you have further questions regarding topics covered in the material, please feel free to drop in during consultation hours or leave a comment.

Python Open Lab November 9

This week we learned pandas, a package built on top of NumPy. Its core data structure is the DataFrame, which is very useful for dealing with tabular data. A DataFrame is a two-dimensional table with labeled rows and columns. It supports heterogeneous column types and missing data, which is a great feature.

Pandas is great for loading files. It has separate functions for reading CSV, Excel, HTML and SQL sources. For example, to load a CSV file, just use the function read_csv().

import pandas as pd

taxi_data = pd.read_csv('./files/green_tripdata_2018-02.csv')

After this, the file is loaded as a DataFrame. We can use the function head() to check the first five rows of the DataFrame; it also shows the column names and the row index. Another useful function is describe(), which shows statistics such as count, mean, std, min, 25%, 50%, 75% and max for each numeric column.
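As a minimal sketch of head() and describe(), the snippet below builds a tiny table in memory instead of reading the taxi file. The column names and values are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the taxi data: a small table built in memory.
trips = pd.DataFrame({
    "passenger_count": [1, 2, 1, 3, 1, 2],
    "trip_distance": [0.5, 1.0, 1.5, 2.0, 2.5, 3.0],
})

# head() returns the first five rows, together with column names and row index.
first_rows = trips.head()

# describe() returns one row per statistic and one column per numeric column.
summary = trips.describe()

print(first_rows)
print(summary)
```

Both calls return new DataFrames, so the results can be stored, indexed or printed like any other table.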

To learn DataFrame, we should first have a good understanding of Series. A Series is very similar to a list. The main difference is that a Series has an index label for each element.


series_data = pd.Series([0.25, 0.5, 0.75, 1.0])



0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

For the above example, we can see that the index for value 0.25 is 0, the index for value 0.5 is 1, and so on. That’s what the index of a Series looks like.

The index of Series can also be characters.


series_data = pd.Series([0.25, 0.5, 0.75, 1.0],index=['a', 'b', 'c', 'd'])



a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

To access value 0.50, we can use series_data['b'].
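To make the contrast between a list and a Series concrete, here is a small sketch using the same four values. The variable names are chosen for illustration only.

```python
import pandas as pd

values = [0.25, 0.5, 0.75, 1.0]

# A plain list can only be addressed by integer position.
plain_list = values
by_position = plain_list[1]

# A Series carries an index label for every element.
labeled = pd.Series(values, index=["a", "b", "c", "d"])
by_label = labeled["b"]

# The labels themselves are available through the index attribute.
labels = list(labeled.index)

print(by_position, by_label, labels)
```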

A DataFrame is similar to a Series; the major difference is that it is two-dimensional, with a column index in addition to the row index.


import numpy as np

df = pd.DataFrame(np.random.rand(3, 2), columns=['foo', 'bar'], index=['a', 'b', 'c'])



             foo        bar

  a          0.77       0.52

  b          0.34       0.27

  c          0.69       0.51

If we want to get the second column, we can use df['bar']. There are two ways to access a specific cell. The first is the loc indexer, which accesses cells by row label and column label. The second is the iloc indexer, which accesses cells by integer row position and column position. For example, to get the value 0.27 in the above DataFrame, we can use df.loc['b', 'bar'] or df.iloc[1, 1].
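The sketch below repeats these lookups on a DataFrame with fixed values, so that the results are reproducible. The numbers are copied from the printed table above rather than generated randomly.

```python
import pandas as pd

# Same shape and labels as the example table, with fixed values.
df = pd.DataFrame(
    [[0.77, 0.52], [0.34, 0.27], [0.69, 0.51]],
    columns=["foo", "bar"],
    index=["a", "b", "c"],
)

# Select a whole column by its name; the result is a Series.
bar_column = df["bar"]

# loc uses labels: row 'b', column 'bar'.
by_label = df.loc["b", "bar"]

# iloc uses integer positions: second row, second column.
by_position = df.iloc[1, 1]

print(bar_column)
print(by_label, by_position)
```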

Other pandas features were also introduced, such as unique() and handling missing values with fillna(). All in all, pandas is a very useful tool for working with tabular data and doing analysis.
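A short sketch of unique() and fillna() follows. The table, its column names and the fill value 0.0 are assumptions made for illustration; the right fill value depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical table with a repeated category and two missing scores.
df = pd.DataFrame({
    "color": ["red", "green", "red", "blue"],
    "score": [1.0, np.nan, 3.0, np.nan],
})

# unique() returns the distinct values in order of first appearance.
distinct_colors = df["color"].unique()

# fillna() returns a new Series with missing values replaced;
# the original column is left unchanged.
filled = df["score"].fillna(0.0)

print(distinct_colors)
print(filled)
```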

Introduction to R ‘plm’ package (3)

In the ‘plm’ package blog (2), we got regression outputs for both fixed and random effects models. A common question after getting the regression output is which model should be chosen, which the Hausman test helps answer. The fixed effects output is named “grun.fe” and the random effects output is named “”. The function for the Hausman test is phtest().

  • Hausman test


The null hypothesis of the Hausman test is that the preferred model is random effects, and the alternative hypothesis is that the preferred model is fixed effects. If the p-value is less than 0.05, we reject the null hypothesis and choose the fixed effects model. In this example, the p-value is greater than 0.05, so we choose the random effects model.

  • Test of individual and/or time effects

This test is used only after running the pooled OLS model. To test for the presence of individual and time effects in the Grunfeld example, we first run a pooled OLS regression:

olsmodel <- plm(inv ~ value + capital, data = Grunfeld, model = "pooling")

# Note that in this regression we do not define panel data but use the original dataset directly, because it is not a panel data analysis

plmtest(olsmodel, effect = "twoways", type = "ghm")

Or we could combine the two steps above and write the syntax as follows:

plmtest(inv ~ value + capital, data = Grunfeld, effect = "twoways", type = "ghm")

R Open Lab Fall 2018 – Data manipulation

Today we covered the topic of data manipulation. We first reviewed the basic ways to subset data frames, such as logical expressions and the subset function. Then, we looked at ways to combine, merge, and split data frames. Finally, we covered the usage of the plyr package.

Here is the link to our open lab’s GitHub repository:

If you have further questions regarding topics covered in the material, please feel free to drop in during consultation hours or leave a comment.

Python Open Lab November 2

This week we learned about file I/O. I/O means input and output, so the content is basically about reading and writing files.

Before doing any operation on a file, we need to open it. The command is open(filename, mode). ‘filename’ needs to include the path of the file. There are two ways to give the path: an absolute path and a relative path. An absolute path locates the file starting from the root of the file system, while a relative path locates it starting from the current working directory, which is usually the directory the script is run from. In a relative path, ‘.’ denotes the current directory and ‘..’ denotes its parent directory.

The mode describes how we open the file. Common modes are ‘r’, ‘w’ and ‘a’. ‘r’ means we open the file for reading only. ‘w’ means we open the file for writing. ‘a’ is similar to ‘w’. The difference between them is that ‘w’ erases the previous content of the file before writing, while ‘a’ appends to the end of the previous content.
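The difference between ‘w’ and ‘a’ can be sketched as follows. The file name demo.txt and the temporary directory are assumptions, used so that the example does not touch any real file; the with statement is used here because it closes the file automatically.

```python
import os
import tempfile

# Work in a throwaway directory so no existing file is overwritten.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# 'w' creates the file and writes to it.
with open(path, "w") as f:
    f.write("first\n")

# 'w' again erases the previous content before writing.
with open(path, "w") as f:
    f.write("second\n")

# 'a' keeps the previous content and appends at the end.
with open(path, "a") as f:
    f.write("third\n")

# 'r' opens the file for reading only.
with open(path, "r") as f:
    final_text = f.read()

print(final_text)
```

After these steps the file contains only the second and third lines, because the second ‘w’ open discarded the first one.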

An example of opening a file: afile = open('a.txt', 'r'). Here afile is the file object we get from opening the file a.txt in read mode.

After opening the file, we can begin to operate on it. We learned reading first. To read, we must open the file in ‘r’ mode. The function read() returns the whole content of the file. The function readline() reads the file one line at a time. The function readlines() reads all lines of the file and returns them as a list.
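The three reading functions can be compared side by side. This is a sketch on a temporary file with made-up content; the file name lines.txt is an assumption for illustration.

```python
import os
import tempfile

# Create a small three-line file to read from.
path = os.path.join(tempfile.mkdtemp(), "lines.txt")
with open(path, "w") as f:
    f.write("line one\nline two\nline three\n")

# read() returns the whole file as a single string.
with open(path, "r") as f:
    whole = f.read()

# readline() returns one line per call, including its newline character.
with open(path, "r") as f:
    first = f.readline()
    second = f.readline()

# readlines() returns a list with one string per line.
with open(path, "r") as f:
    all_lines = f.readlines()

print(repr(whole))
print(repr(first), repr(second))
print(all_lines)
```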

To write to a file, we use the function write(astring): pass a string to write(), and the string will be written to the file. The file must be opened in ‘w’ or ‘a’ mode.

An example of reading file and writing file:

afile = open('a.txt', 'r')
# read all content of file a.txt into content
content = afile.read()
afile.close()

afile = open('a.txt', 'w')
# erase all content in a.txt, and write "hello world" to it
afile.write("hello world")
afile.close()


Python Open Lab October 26

This week we learned functions, which are very important for programmers. Functions are useful for procedural decomposition, maximizing code reuse and minimizing redundancy.

Like a variable, a function must be defined before it is used.

def function(parameter1, parameter2…):

    do something

    return value

‘def’ is the keyword to show that we are defining a function. ‘function’ can be replaced by the function name.

An example :

def printHelloWorld():
    print("hello world")
printHelloWorld()


After defining a function, we call it when we want to use it. In the above example, we define a function called ‘printHelloWorld’ on lines 1 and 2. On line 3, we call it by its name.

Function parameters are values passed into the function when we call it. By using parameters, we can bring values from outside into the function.


The return value passes the result of the function back to the main program. The main program assigns a task to the function, the function executes the task, and after the execution the function gives the result back to the main program.

An example of using function parameters and a return value (getting the bigger number from two sums):

def sum(x, y):
    return float(x) + float(y)

num1 = sum(1.0, 2.5)  # num1 = 3.5
num2 = sum(2.4, 1.6)  # num2 = 4.0
if num1 > num2:
    bigger = num1
else:
    bigger = num2
print(bigger)




Python Open Lab October 19

This week we mainly learned about condition statements.

First we learned how to read user input from the console with the function input(). input() brings user input into our program, so the user can define some values and the program can use them.

Then we looked at the definition of a conditional statement: when the condition is met, the code inside it is executed; otherwise it is skipped and the next statement is executed. The ‘if’ statement was introduced first. If the condition of the ‘if’ statement is satisfied, the code in the ‘if’ block is executed. An ‘else’ statement can appear below the ‘if’ statement. When the condition of the ‘if’ statement is not met, the ‘else’ branch is executed instead. The ‘elif’ statement helps the program decide between several conditions. It is similar to ‘else’, but we can write a condition in the ‘elif’ statement. The structure is like this:

                        if <some condition>:
                               do A
                        elif <some other condition>:
                               do B
                        else:
                               do C
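As a runnable instance of this structure, here is a small sketch. The function name describe_number and the three labels are made up for illustration.

```python
def describe_number(n):
    """Return a label for n using an if-elif-else chain."""
    if n < 0:
        return "negative"
    elif n == 0:
        return "zero"
    else:
        # Reached only when both conditions above are false.
        return "positive"


labels = [describe_number(n) for n in (-2, 0, 7)]
print(labels)
```

Only one branch runs for each call: the conditions are checked from top to bottom and the first one that holds wins.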

We learned how to use the ‘if-elif-else’ statement inside a loop to do more complicated tasks. Then we looked at the ‘continue’ and ‘break’ statements. ‘continue’ skips the rest of the code in the current iteration of the loop. ‘break’ jumps out of the current loop and ends it. Nested loops were introduced to give students a better understanding of ‘continue’ and ‘break’.
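The effect of ‘continue’ and ‘break’ in a nested loop can be sketched like this. The loop bounds and the conditions are arbitrary choices for illustration.

```python
visited = []

for row in range(3):
    for col in range(5):
        if col == 1:
            # Skip the rest of this inner iteration and move on to col 2.
            continue
        if col == 3:
            # End the inner loop only; the outer loop continues with the next row.
            break
        visited.append((row, col))

print(visited)
```

Each row records columns 0 and 2: column 1 is skipped by ‘continue’, and columns 3 and 4 are never reached because ‘break’ ends the inner loop.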

Python Open Lab October 12

This week we continued learning about strings, which are very important. Loops were also introduced, with examples of looping over a list, a dictionary and a string.

We learned some useful string functions.

  • len(str): find the length of the string
  • str.find("ab"): search for a substring and return its position, or -1 if it is not found
  • str.rstrip(): remove trailing whitespace
  • str.replace("red", "green"): replace every occurrence of one substring with another
  • str.split(","): split the string into a list at each separator
  • str.isdigit(): check whether the string consists only of digits
  • str.lower(), str.upper(): change the string to lowercase or uppercase
  • str.endswith("hello"): test whether the string ends with another string
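The functions in the list above can be tried on one sample string. The sample text is made up, and the variable is called text rather than str to avoid hiding Python's built-in str type.

```python
text = "  red,green,blue  "

length = len(text)                     # counts every character, spaces included
position = text.find("green")          # index of the first match
missing = text.find("purple")          # -1 when there is no match
trimmed = text.rstrip()                # trailing whitespace removed only
clean = text.strip()                   # whitespace removed on both sides
swapped = clean.replace("red", "green")
parts = clean.split(",")

all_digits = "2018".isdigit()
not_digits = "20a8".isdigit()
lowered = "Hello".lower()
uppered = "Hello".upper()
ends = "say hello".endswith("hello")

print(length, position, missing, parts)
```

None of these calls changes the original string; each one returns a new value.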

We talked about type conversion, such as converting a variable from int to string or from string to int.
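A minimal sketch of these conversions, with made-up values, shows why the type matters: the same + operator adds numbers but concatenates strings.

```python
count = int("42")        # string to int
price = float("3.5")     # string to float
label = str(42)          # int to string

total = count + 1        # numeric addition
joined = label + "1"     # string concatenation

# Converting text that is not a number raises ValueError.
try:
    int("forty-two")
    conversion_failed = False
except ValueError:
    conversion_failed = True

print(total, joined, conversion_failed)
```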

Then we learned loops, which repeat steps or statements. We looked at the ‘while’ loop first. A ‘while’ loop can use an iteration variable to control the loop. There are generally two types of loop: finite and infinite. A finite loop stops once its termination condition is satisfied. An infinite loop never stops because its termination condition is never met. We then learned the ‘for’ loop, which is very useful for iterating over a sequence. A ‘for’ loop over a list iterates over all elements of the list. A ‘for’ loop over a dictionary iterates over all of its keys. A ‘for’ loop over a string iterates over all of its characters.

With loops, we have a tool to scan data structures such as lists, dictionaries and strings without writing duplicate code.
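The loop forms described above can be sketched together. The sample list, dictionary and string are made up for illustration.

```python
# A finite 'while' loop: n is the iteration variable, and the loop
# stops as soon as n > 0 is no longer true.
countdown = []
n = 3
while n > 0:
    countdown.append(n)
    n -= 1

# 'for' over a list visits every element.
doubled = []
for number in [1, 2, 3]:
    doubled.append(number * 2)

# 'for' over a dictionary visits its keys; values are looked up by key.
prices = {"apple": 3, "pear": 5}
keys = []
total = 0
for fruit in prices:
    keys.append(fruit)
    total += prices[fruit]

# 'for' over a string visits every character.
letters = []
for ch in "abc":
    letters.append(ch)

print(countdown, doubled, keys, total, letters)
```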

Introduction to R ‘plm’ package (2)

The first blog on the “plm” package provides basic information about how to define panel data. This blog introduces the syntax for both fixed and random effects regression models.

The dataset “Grunfeld” is a balanced panel of 10 observational units (firms) from 1935 to 1954, and we are going to use it to run both fixed and random effects models. You can go back to the first blog to see how to load this dataset.

First, we define the panel data as “Grunfeld_panel”:

Grunfeld_panel <- pdata.frame(Grunfeld, index = c("firm", "year"))

  • Fixed effects
    grun.fe <- plm(inv ~ value + capital, data = Grunfeld_panel, model = "within")

  • Random effects <- plm(inv ~ value + capital, data = Grunfeld_panel, model = "random")

The model argument here can be “within” (for the fixed effects model), “random” (for the random effects model), “pooling” (for the pooled OLS model), “fd” (for the first-differences model) or “between” (for the between model).