Research Data Services, jointly supported by the Libraries and CUIT, provides support and consulting for research data needs at Columbia University. It helps with many aspects of the research data lifecycle, including research data management, finding data, recommendations for cleaning and understanding data, and mapping and data visualization.
As we move towards the end of the Spring semester, having covered most of the basics of Python, recent sessions have focused on introducing Python modules requested by attendees.
Last week we had a second session on web scraping with BeautifulSoup; I have updated the practice code in the Session-17 folder of the Google Drive link mentioned below.
This week, on April 7, 2017, I introduced the Python csv module for reading and writing data in CSV files. It is a very easy module to use, primarily for reading CSV data, and requires the user to understand only a few details. The relevant sample code and a practice CSV file can be found at the Google Drive link below, under the Session-18 folder.
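As a minimal sketch of what the csv module looks like in practice (the filename and the sample rows here are illustrative, not from the session materials):

```python
import csv

# Write a small practice table to a CSV file (the filename is illustrative)
rows = [["name", "age"], ["Ada", "36"], ["Alan", "41"]]
with open("sample.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(rows)

# Read it back: csv.reader yields each row as a list of strings
with open("sample.csv", newline="") as f:
    for row in csv.reader(f):
        print(row)

# csv.DictReader instead maps each row to a dict keyed by the header row,
# which is often more convenient than indexing by position
with open("sample.csv", newline="") as f:
    for record in csv.DictReader(f):
        print(record["name"], record["age"])
```

Passing `newline=""` when opening the file is the documented way to let the csv module handle line endings itself.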
For those who have been following this blog series, apologies for the late post with updates about the Python Open Labs.
Last week we covered some basics of web scraping with Python, but before I start let me make a customary disclaimer.
So, getting along with the updates. In a nutshell, web scraping can be described as a way of extracting useful, relevant information from web pages, i.e. HTML pages. It can be abstracted into the following steps:
1. Download the web page content (use the urllib or requests module in Python).
2. View the page source in a web browser to examine the HTML structure of the page and locate the information of interest for your task at hand.
3. Figure out the HTML structure, such as the class, id, or tag, that will help your Python script locate the information.
4. Use the BeautifulSoup Python module to parse the page and navigate as close as possible to the relevant information in the HTML structure, then extract it with string methods.
Steps 2-4 go hand in hand, i.e. each one helps you build upon the others. For example, the more you understand about the HTML structure of the page, the more specific the inputs you can pass to BeautifulSoup methods to extract the information.
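The steps above can be sketched as follows. To keep the example runnable offline, a small hardcoded HTML snippet (entirely made up for illustration) stands in for a downloaded page; for a live site you would fetch the page with the requests module instead. This assumes the beautifulsoup4 package is installed:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for a downloaded page; for a real site, step 1 would be
# something like: html = requests.get(url).text
html = """
<html><body>
  <div class="article">
    <h2 class="title">First post</h2>
    <p class="summary">An introduction to web scraping.</p>
  </div>
  <div class="article">
    <h2 class="title">Second post</h2>
    <p class="summary">Parsing HTML with BeautifulSoup.</p>
  </div>
</body></html>
"""

# Steps 2-3: viewing the page source shows each item lives in a
# <div class="article"> with an <h2 class="title"> inside it
soup = BeautifulSoup(html, "html.parser")

# Step 4: navigate to the relevant tags and extract the text
titles = [h2.get_text().strip()
          for h2 in soup.find_all("h2", class_="title")]
print(titles)  # -> ['First post', 'Second post']
```

The `class_` keyword (with a trailing underscore, since `class` is a Python keyword) is how BeautifulSoup filters tags by CSS class.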
For the previous session, I have uploaded the sample Python files with commented code to the Google Drive link mentioned below, under the Session-16 folder. Make sure you work through them. Doubts, queries, and feedback are always welcome 🙂
During the first 20-30 minutes of yesterday’s open lab, we talked about how to merge datasets and filter data using base R and the dplyr package. The rest of the open lab was free discussion between participants and instructors.
Thank you to all who showed up!
You are welcome to explore the materials I used for the open lab:
In the 15th session of Python Open Labs this week, we looked at some miscellaneous topics and revised basic concepts of file reading and string handling from previous sessions. We also briefly looked at format strings / format specifiers for string construction in Python. The relevant slides are available in the Session-15 folder at the Google Drive link mentioned below.
Today we introduced the readr package, which is used for reading rectangular text data such as CSV, TSV, and other delimited files. It is designed to flexibly parse many types of data found in the wild, while still failing cleanly when data unexpectedly changes.
We covered the functionality of the package and the difference between this package and base R.
Next week we will talk about the apply family of functions.
See you next Wednesday from 10 am – 12 pm at DSSC (Lehman Social Science Library Room 215)!