A gentle introduction to Pandas with Python

I made this presentation for a PyZürich meetup. It serves as a brief introduction to loading and processing data with Pandas, the open source Data Analysis Library for Python. You can get a copy of the IPython notebook and the open data used here: http://bit.ly/pyzurich-04092014

Also check out Felix Zumstein’s notebook and presentation on xlwings, which followed this demo as a real-world example of Python data analytics with Excel integration.

The best way to use this tutorial is interactively, editing the examples and writing your own. For this you’ll need to set up your environment, and thankfully this is easy whether you are on a Mac, Windows or Linux machine. Please leave a note on Twitter if you get stuck or have feedback.

Setting up

I find that the easiest way to get started with powerful data analytics in Python is to install Anaconda from Continuum Analytics. Just download the installer and run it:

$ bash Anaconda-2.0.1-Mac-x86.sh
$ ipython notebook

(..or equivalent on Linux / double click the executable on Windows)

If you want to have virtualenv-like control over your installation and use up less hard drive space*, use Miniconda:

$ conda create -n my_pandas python=2 pandas
... Proceed ([y]/n)? y ...
$ source activate my_pandas
(my_pandas)$ conda install ipython pyzmq tornado jinja2 matplotlib
(my_pandas)$ ipython notebook

* If you have multiple projects, you also do not need to create a separate environment for each of them – sharing one environment keeps disk usage down.

If you’re on Windows, a popular free alternative is Python(x,y).

First chirps

OK, you’ve got your data analytics toolkit deployed. You’ve started ipython notebook and should now see a web page open up called IPy: Notebook. Click on the New Notebook button, or navigate to a copy of this notebook that you downloaded, and you’ll be set to try this out for yourself. Another popular way of exploring Python is ReInteract.

Let’s start off by making some pretty graphics with the Matplotlib library. If you have ever used R, MATLAB or Mathematica, this will feel familiar. The package tries to make data plotting as painless as possible, and comes with no strings attached. If you want more control over your graphics, check out the list of Plotting libraries on python.org. We will now hook Matplotlib into our notebook, then draw a chart from some open data.

%matplotlib inline
Note: you can also use the GUI backends here, e.g. %matplotlib qt, if you prefer them and your system supports them. Also note that this reliance on external backends is exactly why some people choose not to use Matplotlib, as it makes your code less portable. Pick your battles.

Now we import into our namespace:

import matplotlib.pyplot as plt
import numpy as np
Here is a six-second beat pattern of the Mountain Bluebird (Sialia currucoides, which resembles the Twitter logo), extracted using Audacity from a recording obtained on xeno-canto (Creative Commons license BY-NC-SA). I’m sure it is possible to do this in Python as well (see the sketch below), but it’s hard to replace a good tool once it’s already in your belt. The resulting rhythm is a series of numbers that can be copy-pasted straight into a Python list. Converting that into a NumPy array gives our bird song superpowers. Or at least makes it a lot more efficient to process later on:
birdsong_data = [0.080000, 0.444000, 0.448000, 0.449000, 0.667000, 0.670000, 0.932000, 0.938000, 1.384000, 2.338000, 2.428000, 2.434000, 3.979000, 4.043000, 4.094000, 4.100000, 4.174000, 4.301000, 4.303000, 4.311000, 4.421000, 4.631000, 4.833000, 4.961000, 4.965000, 5.094000, 5.097000, 5.178000, 5.183000, 5.473000, 5.478000, 5.766000, 5.767000, 5.772000, 5.841000, 5.843000]
birdsong = np.array(birdsong_data)
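As an aside, here is a minimal sketch of how one could do that extraction in Python, assuming a WAV file read with SciPy; the threshold and minimum-gap values are illustrative guesses, not tuned for this recording:

from scipy.io import wavfile

def detect_chirps(path, threshold=0.3, min_gap=0.05):
    # wavfile.read returns the sample rate and the raw samples
    rate, samples = wavfile.read(path)
    if samples.ndim > 1:
        samples = samples.mean(axis=1)   # mix stereo down to mono
    envelope = np.abs(samples.astype(float))
    envelope /= envelope.max()           # normalise amplitude to 0..1
    times = np.where(envelope > threshold)[0] / float(rate)
    # keep the first crossing of each burst, at least min_gap seconds apart
    onsets = [times[0]]
    for t in times[1:]:
        if t - onsets[-1] > min_gap:
            onsets.append(t)
    return onsets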
We can plot this sequence in time with the handy scatter graph:
plt.scatter(birdsong, [1]*len(birdsong), s=50, alpha=0.5)
plt.title('A scatter chirp-o-gram')
(Figure: a scatter chirp-o-gram)

Or in linear space (read: line graph) using the ubiquitous plot function.

x = np.linspace(0, 6, len(birdsong))
plt.plot(x, birdsong)
plt.title('A simple chirp-o-gram');
(Figure: a simple chirp-o-gram)

And that’s just the little snow-heap at the tip of the iceberg of chart types you can produce with Python. Then again, if you’re planning to publish your data on the Web, a better option might be something like D3.js or Raphael.js for crispness and interactivity. It still makes sense to explore your data in a powerful environment first, making sure your data is clean and your approach to visualizing and explaining it rock solid.

Let’s go for another example.

Velo flows

As you may know, Stadt Zürich has a portal that puts together data from all kinds of departments and makes it more accessible to the general public. Let’s use this to learn something about this lovely city. Do you like riding your bike through town – or maybe you would like to know just where all those cyclists are rushing off to in the morning?

You can get readings of traffic measured as bikes pass through special lanes in the neat collection Daten der permanenten Velozählstellen. This data has already been explored by my esteemed colleague Dr. Ralph Straumann using R in a series of blog posts: Teil 1, Teil 2, Teil 3. We’re not going to go into as much detail here, but let’s see how Python with Pandas compares for processing these 30 MB of open data in CSV format.

Start by importing Pandas as we did with Matplotlib:

import pandas as pd
We can now read in the CSV file we downloaded – you could just as well grab it directly from the Web:
bike_data = pd.read_csv('/opt/localdev/SciPy/data/TAZ_velozaehlung_stundenwerte.csv')
Now let’s look at a few rows to make sure we loaded what we expected:
# Equivalent to bike_data[-5:]
# For the top 5, use bike_data.head(5)
bike_data.tail(5)
Out[9]:
        Zählstelle (Code)  Jahr  Monat (Code)  Tag  Stunde  Datum      Wochentag (Code)  Ausfall  Gezählte Velofahrten
655623  VZS_ZOLL_E         2014  6             30   19      30.6.2014  Mo                0        57
655624  VZS_ZOLL_E         2014  6             30   20      30.6.2014  Mo                0        48
655625  VZS_ZOLL_E         2014  6             30   21      30.6.2014  Mo                0        39
655626  VZS_ZOLL_E         2014  6             30   22      30.6.2014  Mo                0        24
655627  VZS_ZOLL_E         2014  6             30   23      30.6.2014  Mo                0        10
That was easy! Remember loading CSV by hand? What Pandas just did for us was not only read in the entire file, but build an indexed, columnar data structure called a DataFrame, which can be accessed by column name and will serve us well in the later analysis. You can tweak a lot of things in this command, such as specifying an alternate index column or helping the date parser – see read_csv().
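For instance, a quick sketch of what such tweaks might look like for this file (the day-first date format is my reading of the Datum column above, and the path is assumed to be local):

bike_data = pd.read_csv('TAZ_velozaehlung_stundenwerte.csv',
                        parse_dates=['Datum'],  # parse the date column...
                        dayfirst=True,          # ...which is day-first: 30.6.2014
                        index_col='Datum')      # and use it as the index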

There are some handy functions to get a feeling for the data before digging in:

bike_data.describe()
Out[10]:
       Jahr           Monat (Code)   Tag            Stunde         Ausfall        Gezählte Velofahrten
count  655628.000000  655628.000000  655628.000000  655628.000000  655628.000000  655628.000000
mean   2012.197685    6.384383       15.784086      11.501101      0.023098       17.724211
std    1.239074       3.453618       8.803354       6.921837       0.150216       29.829774
min    2009.000000    1.000000       1.000000       0.000000       0.000000       0.000000
25%    2011.000000    3.000000       8.000000       6.000000       0.000000       1.000000
50%    2012.000000    6.000000       16.000000      12.000000      0.000000       8.000000
75%    2013.000000    9.000000       23.000000      18.000000      0.000000       22.000000
max    2014.000000    12.000000      31.000000      23.000000      1.000000       714.000000
At a glance, we can see that we have a nicely distributed data set, with no apparent gaps – except for anything explicitly labelled as “Ausfall”; using nulls instead of 0’s would be better, something Ralph already noted in his blog.
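If you wanted to follow that advice, a one-line sketch would be to blank out the flagged rows (Pandas will upcast the column to float so it can hold the NaNs):

# treat hours flagged as 'Ausfall' (outage) as missing, not as zero traffic
bike_data.loc[bike_data['Ausfall'] == 1, 'Gezählte Velofahrten'] = np.nan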
We can also quickly check other basics, like a possible skew in the dataset, with Pandas statistics functions:
bike_data['Stunde'].median()
Output:
12.0

bike_data['Zählstelle (Code)'].unique()
Output:
array(['VZS_ANDR_A', 'VZS_ANDR_E', 'VZS_BERT_A', 'VZS_BERT_E',
       'VZS_BINZ_A', 'VZS_BINZ_E', 'VZS_BUCH_A', 'VZS_BUCH_E',
       'VZS_HOFW_A', 'VZS_HOFW_E', 'VZS_LANG_A', 'VZS_LANG_E',
       'VZS_LIMM_A', 'VZS_LIMM_E', 'VZS_LUXG_A', 'VZS_LUXG_E',
       'VZS_MILI_A', 'VZS_MILI_E', 'VZS_MUEH_A', 'VZS_MUEH_E',
       'VZS_MYTH_A', 'VZS_MYTH_E', 'VZS_SAUM_A', 'VZS_SAUM_E',
       'VZS_SCHE_A', 'VZS_SCHE_E', 'VZS_SCHU_A', 'VZS_SCHU_E',
       'VZS_SIHL_A', 'VZS_SIHL_E', 'VZS_TALS_A', 'VZS_TALS_E',
       'VZS_TOED_A', 'VZS_TOED_E', 'VZS_ZOLL_A', 'VZS_ZOLL_E'], dtype=object)
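Series also come with dedicated statistics methods; for instance, to put a number on the skew that describe() hinted at:

bike_data['Gezählte Velofahrten'].skew()          # > 0 means a long right tail
bike_data['Gezählte Velofahrten'].quantile(0.95)  # most hourly counts fall below this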

Pandas interacts well with Matplotlib, as we shall see next:

pd.options.display.mpl_style = 'default' # the 'look nicer' option
plt.scatter(bike_data['Stunde'], bike_data['Gezählte Velofahrten'])
(Figure: a raw scatter of 'Gezählte Velofahrten' against 'Stunde')
Scatterplots are a brute-force approach to plotting data. We had a brief conversation about this during the meetup, concerning the relative merits and performance of Python and R. If you care about this kind of thing, consider this article, and try Julia. For a faster scatter, also see ggplot. For a prettier scatter, go with seaborn.
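One middle ground that stays within Matplotlib is to bin the points rather than overplot hundreds of thousands of dots; a sketch:

# hexagonal binning summarises point density instead of drawing every point
plt.hexbin(bike_data['Stunde'], bike_data['Gezählte Velofahrten'],
           gridsize=24, cmap='Blues')
plt.colorbar()
plt.title('Counts per hour, binned')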

Here is an actual summary of the counts – the hourly totals – using the plot() function on the DataFrame itself:

bike_data.groupby(['Stunde'])['Gezählte Velofahrten'].sum().plot(kind='bar')
(Figure: bar chart of total counts per hour)
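The same groupby-then-plot pattern generalises to other columns; for instance, a sketch of the average count per weekday (note that the weekday codes will sort alphabetically, not chronologically):

bike_data.groupby('Wochentag (Code)')['Gezählte Velofahrten'].mean().plot(kind='bar')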
And here we can see a histogram of the counts at each recording station, using the hist() function:
fig = bike_data['Gezählte Velofahrten'].hist(by=bike_data['Zählstelle (Code)'], grid=False, figsize=(26,26))

(Figure: a grid of per-station count histograms)

Last words

Hope this has given you a good taste of what Pandas, together with NumPy and Matplotlib, can do for your data. It’s an outstanding alternative to expensive proprietary tools, and while perhaps less “intimate” than manipulating numbers in arrays by hand, it’s built on a foundation of statistical science that lets us work with data at an undoubtedly faster pace.

To learn more about data exploration in Python, have a look at these links for additional references:

Thanks for checking this out! Please send questions and improvements, and let’s chat at a future meetup.
The works on this blog are licensed under a Creative Commons Attribution 4.0 International License.