# -*- coding: utf-8 -*-
"""01 Tutorial.ipynb

Automatically generated by Colaboratory.

Original file is located at
https://colab.research.google.com/github/adaapp/dav-introductionToPandas/blob/master/01%20Tutorial.ipynb

# Introduction to Data Analysis with Pandas

- [Getting the data into Python](#Getting-the-data-into-Python)
- using `read_csv` and dealing with missing data
- [Accessing columns](#Accessing-the-columns)
- using dot notation and square brackets
- setting the index
- using `loc`
- [Sorting and filtering](#Sorting-and-filtering)
- the `sort_values` function
- how to get documentation
- default arguments
- passing a Boolean to `loc[]`
- compound filters
- [Summary statistics](#Summary-statistics)
- not so useful for this data set but good to know
- [Investigating relationships](#Investigating-relationships)
- drawing scatter plots in `pandas`
- drawing better scatter plots in `seaborn`
- getting the correlation coefficient
- [Time series](#Time-Series)
- plotting simple time series
- applying a calculation and creating new columns
"""

# Commented out IPython magic to ensure Python compatibility.
# We tend to abbreviate the pandas library as pd
import pandas as pd
# Stop pandas from abbreviating tables to fit in the notebook
pd.options.display.max_columns = 1000
pd.options.display.max_rows = 1000
# Display graphs in the notebook
# %matplotlib inline

"""## Getting the data into Python

The `pandas` library stores data in what it calls a *dataframe*, which is really just a smart table.

We use the `read_csv` function to read in data from a csv file. In this case it's data about London Boroughs.

Don't forget to run each cell when you get to it with either `ctrl`+`enter` or `shift`+`enter`
"""

# read in our csv file, and automatically change missing values (a dot in the csv) into NaN
#boroughs = pd.read_csv('boroughs.csv', na_values = ['.',' '], thousands=',')
boroughs = pd.read_csv('https://raw.githubusercontent.com/adaapp/dav-introductionToPandas/master/boroughs.csv', na_values = ['.',' '], thousands=',')
# Use the head function to see the first few rows
#boroughs.head(5)
# Check what type pandas inferred for each column
boroughs.dtypes
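Before moving on, it can be worth counting how many values `read_csv` turned into `NaN`. A minimal sketch on a made-up frame (toy data, so it runs without the csv file):

```python
import pandas as pd

# A tiny stand-in for the boroughs data, with one missing value
toy = pd.DataFrame({"Borough": ["A", "B", "C"], "Happy": [7.2, None, 6.9]})
# isna() marks each NaN; sum() counts them per column
missing = toy.isna().sum()
print(missing["Happy"])  # 1
```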

"""### Q1

> What do you think `NaN` stands for?

## Accessing the columns

A single column of the data is accessible using Python dot notation
"""

boroughs.Anxiety

"""Or we can use square brackets, a bit like with a Python list or dictionary."""

boroughs['Population']

"""### Q2

> Try out both ways of accessing columns.
>
> This isn't as helpful as it could be. Why not?

Square brackets are more flexible. We can give them a list of headings.
"""

# note the nested brackets
boroughs[['Borough','Population','Happy']]

"""This is better. But it would be nice if we didn't have to keep including the `Borough` column. So let's make that our *index*."""

boroughs = boroughs.set_index(boroughs.Borough)
boroughs.head(5)

"""### Q3

> What changed?

Now, when we ask for a column, we'll get the borough for free
"""

boroughs[['Age','WorkAge']]

"""Now we can also use the `loc` function (which uses square brackets, too) to *filter* the data and *locate* the index Haringey."""

boroughs.loc['Haringey']

"""### Q4

> Pick another borough to retrieve the data for. Compare it to Haringey.
"""

boroughs.loc[['Haringey','Hackney']]

"""## Sorting and filtering

Let's find out which boroughs have the highest population.

`pandas` dataframes have a `sort_values` function.

### Q5

Remember: in a Jupyter notebook, you can put the cursor inside a function's brackets and hit `shift`+`tab` to bring up documentation for that function.

> Make the sort_values function below work, to put the boroughs in order of population
>
> Now put them in *descending* order
>
> Which borough has the largest population?
"""

# *** broken ***
boroughs.sort_values()
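If you're stuck, here is how `sort_values` behaves on a small made-up DataFrame (toy data, so it doesn't give the boroughs answer away): the first argument is the column to sort by, and `ascending=False` reverses the order.

```python
import pandas as pd

toy = pd.DataFrame({"name": ["x", "y", "z"], "score": [3, 1, 2]})
# ascending=False puts the biggest scores first
result = toy.sort_values("score", ascending=False)
print(result["name"].tolist())  # ['x', 'z', 'y']
```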

"""What if we wanted to only include **Inner London** boroughs?"""

boroughs.loc[boroughs["InnerOuter"]=='Inner London']

"""So we can pass a Boolean into those square brackets to *filter* the data. `pandas` square brackets are clearly a bit more powerful than regular Python square brackets.

### Q6

> Filter the data to show only Outer London boroughs
>
> Apply `sort_values` to give the Outer London boroughs in descending order of population
"""

boroughs.loc[boroughs["InnerOuter"]=="Outer London"].sort_values("Population", ascending=False)[["Population","Area","Age"]]

"""If you want to combine two Booleans into one filter you'll need to put each comparison into its own parentheses, because `&` and `|` bind more tightly than comparisons like `==` in Python. For example,"""

boroughs.loc[(boroughs.InnerOuter=="Inner London") | (boroughs.InnerOuter=="Outer London")]

"""It might be useful to come back to this table of *just* the individual boroughs, so let's assign that to a variable `justBoroughs`"""

justBoroughs = boroughs.loc[(boroughs.InnerOuter=="Inner London") | (boroughs.InnerOuter=="Outer London")]
justBoroughs.head()
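The same parenthesised style works with `&` for "and". A sketch on toy data (the population threshold is made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "InnerOuter": ["Inner London", "Outer London", "Inner London"],
    "Population": [300000, 200000, 150000],
})
# Both conditions must hold: & is the element-wise "and"
big_inner = toy.loc[(toy.InnerOuter == "Inner London") & (toy.Population > 200000)]
print(len(big_inner))  # 1
```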

"""### Note

There is a subtle catch here that is worth thinking about when you're trying to do more advanced stuff with `pandas`.

`boroughs[]` and `boroughs.loc[]` can appear to do the same thing, but they don't. In general it is better to use `loc`.

See [this article](https://www.dataquest.io/blog/settingwithcopywarning/) later if you want more details.
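One place the difference bites is assigning to a filtered slice. A minimal sketch (toy data, not the boroughs):

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
# Chained indexing like toy[toy.a > 1]["b"] = 0 may modify a copy
# (pandas warns with a SettingWithCopyWarning); loc selects rows and
# columns in one step, so the assignment sticks
toy.loc[toy.a > 1, "b"] = 0
print(toy["b"].tolist())  # [4, 0, 0]
```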

## Summary statistics

The dataframe has built in functions for statistical measures like `mean`, `std`, `quantile` but you need to be careful whether using them makes sense.
"""

# you can give loc a row label and a column label
boroughs.loc['London','Age']

justBoroughs['Age'].mean()
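`mean` is just one of the built-in summaries; `std`, `quantile` and `describe` work the same way. A quick sketch on made-up ages:

```python
import pandas as pd

ages = pd.Series([30.0, 35.0, 40.0])
print(ages.mean())         # 35.0
print(ages.quantile(0.5))  # 35.0 (the median)
print(ages.std())          # 5.0 (sample standard deviation, ddof=1)
```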


"""### Q7

> Why is the mean of the average ages not the same as the London average age?

So use the Inner London, Outer London and London averages from the main table rather than applying `mean` to a column.

## Investigating relationships

We would expect there to be an obvious relationship between unemployment rates and employment rates
"""

justBoroughs.plot.scatter("Employ", "Unemploy");

"""Let's quantify that by asking for the correlation coefficient"""

justBoroughs.Employ.corr(justBoroughs.Unemploy)

"""### Q8

> How would you interpret this?
>
> What *correlation coefficient* is it using?
>
> Why isn't it a perfect correlation?
>
> Look for correlation between some other pairs of variables. Use a scatter plot first, then get the correlation coefficient

The `seaborn` library has some nice options for scatter plots, so let's import that and then see an example.
"""

# pyplot is the grandparent of all python plotting packages
import matplotlib.pyplot as plt
# seaborn is based on pyplot but makes it easier to use
import seaborn as sns
# seaborn is traditionally abbreviated sns (reputedly after the West Wing character Samuel Norman Seaborn)

"""Now an example,"""

# by default seaborn plots come out a bit small, so make ours 8in by 8in
plt.figure(figsize=(8,8))
# sns.scatterplot has options for controlling colour and dot size so we can use four variables on one graph
sns.scatterplot(data=justBoroughs.loc[justBoroughs.Borough != "City of London"],
                x="Employ",
                y="Medianpay",
                hue="Conservative",
                palette="RdBu")
#plt.axvline(justBoroughs.Employ.mean(), linestyle="--", alpha=0.6)
#plt.axhline(justBoroughs.Unemploy.mean(), linestyle="--", alpha=0.6)
plt.title("My beautiful scatter plot")
# where to put the legend
plt.legend(loc='upper right');

boroughs["PopThousands"] = boroughs["Population"]/1000

boroughs["AvgHouseholdSize"] = boroughs["Population"]/boroughs["Households"]
boroughs.sort_values("AvgHouseholdSize", ascending=False)["AvgHouseholdSize"]

# numeric_only skips text columns like Borough when computing correlations
justBoroughs.corr(numeric_only=True)

sns.lmplot(data=boroughs,
           x="Pay",
           y="Happy");

"""# Time Series

The other `csv` files all contain time series. Let's look at how recycling has changed over recent years.
"""

recycling = pd.read_csv('recycling.csv')
recycling

pd.to_datetime(recycling.Year,format="%Y")

"""This time we'll make `Year` the index"""

recycling = recycling.set_index("Year")

"""Now we can draw a time series graph"""

recycling.Barnet.plot(c="red")
recycling["Barking and Dagenham"].plot(c="green");

"""It would be helpful to be able to show that Barking and Dagenham has improved by more *as a proportion* of their starting point than Barnet has.

We can make a new column, call it `BarnetIndexed` say, and fill it with the values scaled to 1 at 2004. And the same for Barking and Dagenham.
"""

recycling["BarnetIndexed"] = recycling.Barnet/recycling.Barnet[2004]
recycling["Barking and DagenhamIndexed"] = recycling["Barking and Dagenham"]/recycling["Barking and Dagenham"][2004]

"""Things to note about the above

* you can make a new column just by saying `recycling["New column name"]=`
* you can divide every number in a column by the value in 2004 by just doing `recycling.Barnet/recycling.Barnet[2004]`
"""
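That division works because `pandas` *broadcasts* the single 2004 value across the whole column. A toy sketch (made-up recycling rates):

```python
import pandas as pd

recycled = pd.Series([10.0, 12.0, 15.0], index=[2004, 2005, 2006])
# Dividing the whole column by its 2004 value rescales it to start at 1
indexed = recycled / recycled[2004]
print(indexed.tolist())  # [1.0, 1.2, 1.5]
```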

recycling.BarnetIndexed.plot(c="red")
recycling["Barking and DagenhamIndexed"].plot(c="blue");
# note the `c` for colour

"""In fact, let's go ahead and do that for all the boroughs. We can use a `for` loop over all the columns (remember that in this dataframe it's the boroughs that are columns and the years are rows.)"""

# loop over a copy of the column list, since we're adding new columns as we go
for column in list(recycling.columns):
    recycling["{}Indexed".format(column)] = recycling[column]/recycling[column][2004]
recycling.head()

recycling["Newham"].plot(c="green")
recycling["NewhamIndexed"].plot(c="blue")
recycling["Barnet"].plot(c="orange")
recycling["BarnetIndexed"].plot(c="red")
plt.title("Recycling in Newham and Barnet");

"""There was a small fudge in here. If you check `recycling.dtypes` you'll see that `Year` was an `int64` (an integer), which worked okay for us this time, but in future we'll want to explicitly turn it into a `datetime` object instead, so `pandas` knows we're dealing with time. We'll do that with [`to_datetime`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html).

Documentation for [`pandas` is here](http://pandas.pydata.org/pandas-docs/stable/).

We've installed several visualisation libraries that you might find useful

* [`pyplot`](https://matplotlib.org/)
* [`seaborn`](https://seaborn.pydata.org/)
* [`bokeh`](https://bokeh.pydata.org/)
* [`chartify`](https://labs.spotify.com/2018/11/15/introducing-chartify-easier-chart-creation-in-python-for-data-scientists/)
* [`geopandas`](http://geopandas.org/)
"""
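As the note above suggests, turning an integer `Year` into a proper datetime index might look like this (made-up years for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Year": [2004, 2005], "Rate": [10.0, 12.0]})
# Parse the integers as years, then use the timestamps as the index
df["Year"] = pd.to_datetime(df["Year"], format="%Y")
df = df.set_index("Year")
print(df.index[0].year)  # 2004
```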