A gentle introduction to Pandas with Python

I made this presentation for a PyZürich meetup. It serves as a brief introduction to loading and processing data with Pandas, the open source Data Analysis Library for Python. You can get a copy of the IPython notebook and the open data used here: http://bit.ly/pyzurich-04092014

Also check out Felix Zumstein’s notebook and presentation on xlwings, which followed this demo as a real-world example of Python data analytics with Excel integration.

The best way to use this tutorial is interactively, editing the examples and writing your own. For this you’ll need to set up your environment, and thankfully this is easy whether you are on a Mac, Windows or Linux machine. Please leave a note on Twitter if you get stuck or have feedback.

Setting up

I find that the easiest way to get started with powerful data analytics in Python is to install Anaconda from Continuum Analytics. Just download the installer and run it:

$ bash Anaconda-2.0.1-Mac-x86.sh
$ ipython notebook

(..or equivalent on Linux / double click the executable on Windows)

If you want to have virtualenv-like control over your installation and use up less hard drive space*, use Miniconda:

$ conda create -n my_pandas python=2 pandas
... Proceed ([y]/n)? y ...
$ source activate my_pandas
(my_pandas)$ conda install ipython pyzmq tornado jinja2 matplotlib
(my_pandas)$ ipython notebook

* If you have multiple projects, you also do not need to create a separate environment for each of them – sharing one environment keeps disk usage down.

If you’re on Windows, a popular free alternative is Python(x,y).

First chirps

OK, you’ve got your data analytics toolkit deployed. You’ve started ipython notebook and should now see a web page open up called IPy: Notebook. Click on the New Notebook button, or navigate to a copy of this notebook that you downloaded, and you’ll be set to try this out for yourself. Another popular way of exploring Python is ReInteract.

Let’s start off by making some pretty graphics with the Matplotlib library. If you have ever used R, MATLAB or Mathematica, this will feel familiar. The package tries to make data plotting as painless as possible, and comes with no strings attached. If you want more control over your graphics, check out the list of Plotting libraries on python.org. We will now hook Matplotlib into our notebook, then draw a chart from some open data.

%matplotlib inline
Note: you can also use the GUI backends here, e.g. %matplotlib qt, if you prefer them and your system supports them. Also note that this reliance on external backends is exactly why some people choose not to use Matplotlib, as it makes your code less portable. Pick your battles.

Now we import into our namespace:

import matplotlib.pyplot as plt
import numpy as np
Here is a six-second beat pattern of the Mountain Bluebird (Sialia currucoides, which resembles the Twitter logo), extracted using Audacity from a recording obtained on xeno-canto (Creative Commons license BY-NC-SA). I’m sure it is possible to do this in Python as well (see the sketch below), but it’s hard to replace a good tool once it’s already in your belt. The resulting rhythm is a series of numbers that can be copy-pasted straight into a Python list. Converting that into a NumPy array gives our bird song superpowers. Or at least makes it a lot more efficient to process later on:
birdsong_data = [0.080000, 0.444000, 0.448000, 0.449000, 0.667000, 0.670000, 0.932000, 0.938000, 1.384000, 2.338000, 2.428000, 2.434000, 3.979000, 4.043000, 4.094000, 4.100000, 4.174000, 4.301000, 4.303000, 4.311000, 4.421000, 4.631000, 4.833000, 4.961000, 4.965000, 5.094000, 5.097000, 5.178000, 5.183000, 5.473000, 5.478000, 5.766000, 5.767000, 5.772000, 5.841000, 5.843000]
birdsong = np.array(birdsong_data)
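As an aside, here is a minimal sketch of how one could do that extraction in Python, assuming a WAV file read with SciPy; the threshold and minimum-gap values are illustrative guesses, not tuned for this recording:

from scipy.io import wavfile

def detect_chirps(path, threshold=0.3, min_gap=0.05):
    # wavfile.read returns the sample rate and the raw samples
    rate, samples = wavfile.read(path)
    if samples.ndim > 1:
        samples = samples.mean(axis=1)   # mix stereo down to mono
    envelope = np.abs(samples.astype(float))
    envelope /= envelope.max()           # normalise amplitude to 0..1
    times = np.where(envelope > threshold)[0] / float(rate)
    # keep the first crossing of each burst, at least min_gap seconds apart
    onsets = [times[0]]
    for t in times[1:]:
        if t - onsets[-1] > min_gap:
            onsets.append(t)
    return onsets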
We can plot this sequence in time with the handy scatter graph:
plt.scatter(birdsong, [1]*len(birdsong), s=50, alpha=0.5)
plt.title('A scatter chirp-o-gram')
(Figure: a scatter chirp-o-gram)

Or in linear space (read: line graph) using the ubiquitous plot function.

x = np.linspace(0, 6, len(birdsong))
plt.plot(x, birdsong)
plt.title('A simple chirp-o-gram');
(Figure: a simple chirp-o-gram)

And that’s just the little snow-heap at the tip of the iceberg of chart types you can produce with Python. Then again, if you’re planning to publish your data on the Web, a better option might be something like D3.js or Raphael.js for crispness and interactivity. It still makes sense to explore your data in a powerful environment first, making sure your data is clean and your approach to visualizing and explaining it rock solid.

Let’s go for another example.

Velo flows

As you may know, Stadt Zürich has a portal that puts together data from all kinds of departments and makes it more accessible to the general public. Let’s use this to learn something about this lovely city. Do you like riding your bike through town – or maybe you would like to know just where all those cyclists are rushing off to in the morning?

You can get readings of traffic measured as bikes pass through special lanes in the neat collection Daten der permanenten Velozählstellen. This data has already been explored by my esteemed colleague Dr. Ralph Straumann using R in a series of blog posts: Teil 1, Teil 2, Teil 3. We’re not going to go into as much detail here, but let’s see how Python with Pandas compares for processing these 30 MB of open data in CSV format.

Start by importing Pandas as we did with Matplotlib:

import pandas as pd
We can now read in the CSV file we downloaded – you could just as well grab it directly from the Web:
bike_data = pd.read_csv('/opt/localdev/SciPy/data/TAZ_velozaehlung_stundenwerte.csv')
Now let’s look at a few rows to make sure we loaded what we expected:
# Equivalent to bike_data[-5:]
# For the top 5, use bike_data.head(5)
bike_data.tail(5)
Out[9]:
        Zählstelle (Code)  Jahr  Monat (Code)  Tag  Stunde  Datum      Wochentag (Code)  Ausfall  Gezählte Velofahrten
655623  VZS_ZOLL_E         2014  6             30   19      30.6.2014  Mo                0        57
655624  VZS_ZOLL_E         2014  6             30   20      30.6.2014  Mo                0        48
655625  VZS_ZOLL_E         2014  6             30   21      30.6.2014  Mo                0        39
655626  VZS_ZOLL_E         2014  6             30   22      30.6.2014  Mo                0        24
655627  VZS_ZOLL_E         2014  6             30   23      30.6.2014  Mo                0        10
That was easy! Remember loading CSV by hand? What Pandas just did for us was not only read in the entire file, but build an indexed, columnar data structure called a DataFrame, which can be accessed by column name and will serve us well in the later analysis. You can tweak a lot of things in this command, such as specifying an alternate index column or helping the date parser – see read_csv().
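For instance, a quick sketch of what such tweaks might look like for this file (the day-first date format is my reading of the Datum column above, and the path is assumed to be local):

bike_data = pd.read_csv('TAZ_velozaehlung_stundenwerte.csv',
                        parse_dates=['Datum'],  # parse the date column...
                        dayfirst=True,          # ...which is day-first: 30.6.2014
                        index_col='Datum')      # and use it as the index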

There are some handy functions to get a feeling for the data before digging in:

bike_data.describe()
Out[10]:
       Jahr           Monat (Code)   Tag            Stunde         Ausfall        Gezählte Velofahrten
count  655628.000000  655628.000000  655628.000000  655628.000000  655628.000000  655628.000000
mean   2012.197685    6.384383       15.784086      11.501101      0.023098       17.724211
std    1.239074       3.453618       8.803354       6.921837       0.150216       29.829774
min    2009.000000    1.000000       1.000000       0.000000       0.000000       0.000000
25%    2011.000000    3.000000       8.000000       6.000000       0.000000       1.000000
50%    2012.000000    6.000000       16.000000      12.000000      0.000000       8.000000
75%    2013.000000    9.000000       23.000000      18.000000      0.000000       22.000000
max    2014.000000    12.000000      31.000000      23.000000      1.000000       714.000000
At a glance, we can see that we have a nicely distributed data set, with no apparent gaps – except for anything explicitly labelled as “Ausfall”; using nulls instead of 0’s would be better, something Ralph already noted in his blog.
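If you wanted to follow that advice, a one-line sketch would be to blank out the flagged rows (Pandas will upcast the column to float so it can hold the NaNs):

# treat hours flagged as 'Ausfall' (outage) as missing, not as zero traffic
bike_data.loc[bike_data['Ausfall'] == 1, 'Gezählte Velofahrten'] = np.nan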
We can also quickly check other basics, like a possible skew in the dataset, with Pandas statistics functions:
bike_data['Stunde'].median()
Output:
12.0

bike_data['Zählstelle (Code)'].unique()
Output:
array(['VZS_ANDR_A', 'VZS_ANDR_E', 'VZS_BERT_A', 'VZS_BERT_E',
       'VZS_BINZ_A', 'VZS_BINZ_E', 'VZS_BUCH_A', 'VZS_BUCH_E',
       'VZS_HOFW_A', 'VZS_HOFW_E', 'VZS_LANG_A', 'VZS_LANG_E',
       'VZS_LIMM_A', 'VZS_LIMM_E', 'VZS_LUXG_A', 'VZS_LUXG_E',
       'VZS_MILI_A', 'VZS_MILI_E', 'VZS_MUEH_A', 'VZS_MUEH_E',
       'VZS_MYTH_A', 'VZS_MYTH_E', 'VZS_SAUM_A', 'VZS_SAUM_E',
       'VZS_SCHE_A', 'VZS_SCHE_E', 'VZS_SCHU_A', 'VZS_SCHU_E',
       'VZS_SIHL_A', 'VZS_SIHL_E', 'VZS_TALS_A', 'VZS_TALS_E',
       'VZS_TOED_A', 'VZS_TOED_E', 'VZS_ZOLL_A', 'VZS_ZOLL_E'], dtype=object)
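Series also come with dedicated statistics methods; for instance, to put a number on the skew that describe() hinted at:

bike_data['Gezählte Velofahrten'].skew()          # > 0 means a long right tail
bike_data['Gezählte Velofahrten'].quantile(0.95)  # most hourly counts fall below this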

Pandas interacts well with Matplotlib, as we shall see next:

pd.options.display.mpl_style = 'default' # the 'look nicer' option
plt.scatter(bike_data['Stunde'], bike_data['Gezählte Velofahrten'])
(Figure: a raw scatter of 'Gezählte Velofahrten' against 'Stunde')
Scatterplots are a brute-force approach to plotting data. We had a brief conversation about this during the meetup, concerning the relative merits and performance of Python and R. If you care about this kind of thing, consider this article, and try Julia. For a faster scatter, also see ggplot. For a prettier scatter, go with seaborn.
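One middle ground that stays within Matplotlib is to bin the points rather than overplot hundreds of thousands of dots; a sketch:

# hexagonal binning summarises point density instead of drawing every point
plt.hexbin(bike_data['Stunde'], bike_data['Gezählte Velofahrten'],
           gridsize=24, cmap='Blues')
plt.colorbar()
plt.title('Counts per hour, binned')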

Here is an actual summary of the counts – the hourly totals – using the plot() function on the DataFrame itself:

bike_data.groupby(['Stunde'])['Gezählte Velofahrten'].sum().plot(kind='bar')
(Figure: bar chart of total counts per hour)
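The same groupby-then-plot pattern generalises to other columns; for instance, a sketch of the average count per weekday (note that the weekday codes will sort alphabetically, not chronologically):

bike_data.groupby('Wochentag (Code)')['Gezählte Velofahrten'].mean().plot(kind='bar')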
And here we can see a histogram of the counts at each recording station, using the hist() function:
fig = bike_data['Gezählte Velofahrten'].hist(by=bike_data['Zählstelle (Code)'], grid=False, figsize=(26,26))

(Figure: a grid of per-station count histograms)

Last words

Hope this has given you a good taste of what Pandas, together with NumPy and Matplotlib, can do for your data. It’s an outstanding alternative to expensive proprietary tools, and while perhaps less “intimate” than manipulating numbers in arrays by hand, it’s built on a foundation of statistical science that lets us work with data at an undoubtedly faster pace.

To learn more about data exploration in Python, have a look at these links for additional references:

Thanks for checking this out! Please send questions and improvements, and let’s chat at a future meetup.
The works on this blog are licensed under a Creative Commons Attribution 4.0 International License.