Python Open Lab November 9 – Research Data Services Blog

This week we learned pandas, a package built on top of NumPy. Its core data structure is the DataFrame, which is very useful for working with tabular data. A DataFrame is a two-dimensional table with labeled rows and columns. Each column can hold a different type, and missing data is supported, which is a great feature.
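To make the point about mixed types and missing data concrete, here is a small sketch. The table is made up for illustration (the column names and values are not from the lab data); it mixes strings, integers and floats, with one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration
trips = pd.DataFrame({
    'vendor': ['A', 'B', 'A'],      # strings
    'passengers': [1, 2, 1],        # integers
    'fare': [12.5, np.nan, 7.0],    # floats with one missing value
})

print(trips.dtypes)        # each column keeps its own type
print(trips.isna().sum())  # number of missing values per column
```

The missing fare is stored as NaN, so the rest of the column can still be used for calculations.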

Pandas is great for loading files. It has separate functions for reading CSV, Excel, HTML and SQL sources. For example, to load a CSV file, just use the function read_csv().

import pandas as pd

taxi_data = pd.read_csv('./files/green_tripdata_2018-02.csv')

After this, the file is loaded as a DataFrame. We can use the function head() to check its first five rows; the output also shows the column labels and the row index. Another useful function is describe(), which shows summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for each numeric column.
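If you do not have the taxi file at hand, the same steps can be tried on a tiny in-memory CSV. This is only a sketch: the column names and numbers below are invented, and io.StringIO stands in for a real file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file
csv_text = """trip_distance,fare_amount
1.0,6.0
2.5,11.0
0.8,5.5
3.2,14.0
1.9,9.0
4.1,17.5
"""

# read_csv accepts file-like objects as well as paths
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())      # first five rows by default
print(sample.head(2))     # or pass the number of rows you want
print(sample.describe())  # summary statistics for each numeric column
```

The sample has six rows, so head() leaves the last one out.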

To learn DataFrames, we should have a good understanding of Series first. A Series is similar to a Python list. The main difference is that every element in a Series has an explicit index label.

code:

series_data = pd.Series([0.25, 0.5, 0.75, 1.0])

print(series_data)


console:

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In the above example, the index for value 0.25 is 0, the index for value 0.5 is 1, and so on. This is the default index of a Series.

The index of a Series can also be made of strings.

code:

series_data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

print(series_data)


console:

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

To access the value 0.50, we can use series_data['b'].
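A Series with string labels can still be read by position. The sketch below reuses the Series from above and shows label-based and position-based access side by side.

```python
import pandas as pd

series_data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

print(series_data['b'])         # by label
print(series_data.loc['b'])     # same lookup, explicitly label-based
print(series_data.iloc[1])      # by integer position
print(series_data['b':'c'])     # label slices include both endpoints
print(list(series_data.index))  # the labels themselves
```

Note that slicing by label keeps the end label, unlike slicing a Python list by position.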

A DataFrame is similar to a Series. The major difference is that it is two-dimensional: it has columns as well as rows.

 code:

 import numpy as np

 df = pd.DataFrame(np.random.rand(3, 2), columns=['foo', 'bar'], index=['a', 'b', 'c'])

 print(df)




 console:

       foo   bar
  a   0.77  0.52
  b   0.34  0.27
  c   0.69  0.51

If we want the second column, we can use df.bar or df['bar']. There are also two ways to access a specific cell. The first is the loc indexer, which selects by row label and column label. The second is the iloc indexer, which selects by integer row position and column position. Both use square brackets rather than parentheses. For example, to get the value 0.27 in the above DataFrame, we can use df.loc['b', 'bar'] or df.iloc[1, 1].
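Because np.random.rand gives different numbers on every run, the sketch below builds the DataFrame from the fixed values printed above, so the lookups can be checked by eye.

```python
import pandas as pd

# Fixed values copied from the printed table, so the result is reproducible
df = pd.DataFrame(
    [[0.77, 0.52], [0.34, 0.27], [0.69, 0.51]],
    columns=['foo', 'bar'],
    index=['a', 'b', 'c'],
)

print(df['bar'])           # a column, returned as a Series
print(df.loc['b', 'bar'])  # cell by row label and column label
print(df.iloc[1, 1])       # same cell by row position and column position
print(df.loc['b'])         # a whole row, returned as a Series
```

Both cell lookups return the same value, 0.27.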

We also covered other pandas features, such as unique() for listing distinct values and fillna() for filling in missing values. All in all, pandas is a very useful tool for working with and analyzing tabular data.
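As a quick sketch of unique() and fillna(), here is a made-up table (the column names and values are invented for illustration). Filling missing tips with 0.0 is just one possible choice; the right fill value depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration
payments = pd.DataFrame({
    'payment_type': ['card', 'cash', 'card', 'cash'],
    'tip': [2.0, np.nan, 1.5, np.nan],
})

# Distinct values, in order of first appearance
print(payments['payment_type'].unique())

# fillna returns a new object, so assign the result back to keep it
payments['tip'] = payments['tip'].fillna(0.0)
print(payments)
```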
