# -*- coding: utf-8 -*-
"""01 Tutorial.ipynb

Automatically generated by Colaboratory.

Original file is located at
https://colab.research.google.com/github/adaapp/dav-introductionToPandas/blob/master/01%20Tutorial.ipynb

# Introduction to Data Analysis with Pandas

- [Getting the data into Python](#Getting-the-data-into-Python)
- using `read_csv` and dealing with missing data
- [Accessing columns](#Accessing-the-columns)
- using dot notation and square brackets
- setting the index
- using `loc`
- [Sorting and filtering](#Sorting-and-filtering)
- the `sort_values` function
- how to get documentation
- default arguments
- passing a Boolean to `loc[]`
- compound filters
- [Summary statistics](#Summary-statistics)
- not so useful for this data set but good to know
- [Investigating relationships](#Investigating-relationships)
- drawing scatter plots in `pandas`
- drawing better scatter plots in `seaborn`
- getting the correlation coefficient
- [Time series](#Time-Series)
- plotting simple time series
- applying a calculation and creating new columns
"""

# Commented out IPython magic to ensure Python compatibility.
# We tend to abbreviate the pandas library as pd
import pandas as pd
# Stop pandas from abbreviating tables to fit in the notebook
pd.options.display.max_columns = 1000
pd.options.display.max_rows = 1000
# Display graphs in the notebook
# %matplotlib inline

"""## Getting the data into Python

The `pandas` library stores data in what it calls a *dataframe*, which is really just a smart table.

We use the `read_csv` function to read in data from a csv file. In this case it's data about London Boroughs.

Don't forget to run each cell when you get to it with either `ctrl`+`enter` or `shift`+`enter`
"""

# read in our csv file, and automatically change missing values (a dot in the csv) into NaN
#boroughs = pd.read_csv('boroughs.csv', na_values = ['.',' '], thousands=',')
boroughs = pd.read_csv('https://raw.githubusercontent.com/adaapp/dav-introductionToPandas/master/boroughs.csv', na_values = ['.',' '], thousands=',')
# Use the head function to see the first few rows
#boroughs.head(5)
# Check what type pandas inferred for each column
boroughs.dtypes
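Before moving on, it can be worth counting how many values `read_csv` turned into `NaN`. A minimal sketch on a made-up frame (toy data, so it runs without the csv file):

```python
import pandas as pd

# A tiny stand-in for the boroughs data, with one missing value
toy = pd.DataFrame({"Borough": ["A", "B", "C"], "Happy": [7.2, None, 6.9]})
# isna() marks each NaN; sum() counts them per column
missing = toy.isna().sum()
print(missing["Happy"])  # 1
```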

"""### Q1

> What do you think `NaN` stands for?

## Accessing the columns

A single column of the data is accessible using Python dot notation
"""

boroughs.Anxiety

"""Or we can use square brackets, a bit like with a Python list or dictionary."""

boroughs['Population']

"""### Q2

> Try out both ways of accessing columns.
>
> This isn't as helpful as it could be. Why not?

Square brackets are more flexible. We can give them a list of headings.
"""

# note the nested brackets
boroughs[['Borough','Population','Happy']]

"""This is better. But it would be nice if we didn't have to keep including the `Borough` column. So let's make that our *index*."""

boroughs = boroughs.set_index(boroughs.Borough)
boroughs.head(5)

"""### Q3

> What changed?

Now, when we ask for a column, we'll get the borough for free
"""

boroughs[['Age','WorkAge']]

"""Now we can also use the `loc` function (which uses square brackets, too) to *filter* the data and *locate* the index Haringey."""

boroughs.loc['Haringey']

"""### Q4

> Pick another borough to retrieve the data for. Compare it to Haringey.
"""

boroughs.loc[['Haringey','Hackney']]

"""## Sorting and filtering

Let's find out which boroughs have the highest population.

`pandas` dataframes have a `sort_values` function.

### Q5

Remember: in a Jupyter notebook, you can put the cursor inside a function's brackets and hit `shift`+`tab` to bring up documentation for that function.

> Make the sort_values function below work, to put the boroughs in order of population
>
> Now put them in *descending* order
>
> Which borough has the largest population?
"""

# *** broken ***
boroughs.sort_values()
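If you're stuck, here is how `sort_values` behaves on a small made-up DataFrame (toy data, so it doesn't give the boroughs answer away): the first argument is the column to sort by, and `ascending=False` reverses the order.

```python
import pandas as pd

toy = pd.DataFrame({"name": ["x", "y", "z"], "score": [3, 1, 2]})
# ascending=False puts the biggest scores first
result = toy.sort_values("score", ascending=False)
print(result["name"].tolist())  # ['x', 'z', 'y']
```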

"""What if we wanted to only include **Inner London** boroughs?"""

boroughs.loc[boroughs["InnerOuter"]=='Inner London']

"""So we can pass a Boolean into those square brackets to *filter* the data. `pandas` square brackets are clearly a bit more powerful than regular Python square brackets.

### Q6

> Filter the data to show only Outer London boroughs
>
> Apply `sort_values` to give the Outer London boroughs in descending order of population
"""

boroughs.loc[boroughs["InnerOuter"]=="Outer London"].sort_values("Population", ascending=False)[["Population","Area","Age"]]

"""If you want to combine two Booleans into one filter you'll need to put each comparison into its own parentheses, because `&` and `|` bind more tightly than comparisons like `==` in Python. For example,"""

boroughs.loc[(boroughs.InnerOuter=="Inner London") | (boroughs.InnerOuter=="Outer London")]

"""It might be useful to come back to this table of *just* the individual boroughs, so let's assign that to a variable `justBoroughs`"""

justBoroughs = boroughs.loc[(boroughs.InnerOuter=="Inner London") | (boroughs.InnerOuter=="Outer London")]
justBoroughs.head()
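The same parenthesised style works with `&` for "and". A sketch on toy data (the population threshold is made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "InnerOuter": ["Inner London", "Outer London", "Inner London"],
    "Population": [300000, 200000, 150000],
})
# Both conditions must hold: & is the element-wise "and"
big_inner = toy.loc[(toy.InnerOuter == "Inner London") & (toy.Population > 200000)]
print(len(big_inner))  # 1
```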

"""### Note

There is a subtle catch here that is worth thinking about when you're trying to do more advanced stuff with `pandas`.

`boroughs[]` and `boroughs.loc[]` can appear to do the same thing, but they don't. In general it is better to use `loc`.

See [this article](https://www.dataquest.io/blog/settingwithcopywarning/) later if you want more details.
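One place the difference bites is assigning to a filtered slice. A minimal sketch (toy data, not the boroughs):

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
# Chained indexing like toy[toy.a > 1]["b"] = 0 may modify a copy
# (pandas warns with a SettingWithCopyWarning); loc selects rows and
# columns in one step, so the assignment sticks
toy.loc[toy.a > 1, "b"] = 0
print(toy["b"].tolist())  # [4, 0, 0]
```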

## Summary statistics

The dataframe has built in functions for statistical measures like `mean`, `std`, `quantile` but you need to be careful whether using them makes sense.
"""

# you can give loc a row label and a column label
boroughs.loc['London','Age']

justBoroughs['Age'].mean()
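`mean` is just one of the built-in summaries; `std`, `quantile` and `describe` work the same way. A quick sketch on made-up ages:

```python
import pandas as pd

ages = pd.Series([30.0, 35.0, 40.0])
print(ages.mean())         # 35.0
print(ages.quantile(0.5))  # 35.0 (the median)
print(ages.std())          # 5.0 (sample standard deviation, ddof=1)
```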


"""### Q7

> Why is the mean of the average ages not the same as the London average age?

So use the Inner London, Outer London and London averages from the main table rather than applying `mean` to a column.

## Investigating relationships

We would expect there to be an obvious relationship between unemployment rates and employment rates
"""

justBoroughs.plot.scatter("Employ", "Unemploy");

"""Let's quantify that by asking for the correlation coefficient"""

justBoroughs.Employ.corr(justBoroughs.Unemploy)

"""### Q8

> How would you interpret this?
>
> What *correlation coefficient* is it using?
>
> Why isn't it a perfect correlation?
>
> Look for correlation between some other pairs of variables. Use a scatter plot first, then get the correlation coefficient

The `seaborn` library has some nice options for scatter plots, so let's import that and then see an example.
"""

# pyplot is the grandparent of all python plotting packages
import matplotlib.pyplot as plt
# seaborn is based on pyplot but makes it easier to use
import seaborn as sns
# seaborn is traditionally abbreviated sns (reputedly after the West Wing character Samuel Norman Seaborn)

"""Now an example,"""

# by default seaborn plots come out a bit small, so make ours 8in by 8in
plt.figure(figsize=(8,8))
# sns.scatterplot has options for controlling colour and dot size so we can use four variables on one graph
sns.scatterplot(data=justBoroughs.loc[justBoroughs.Borough != "City of London"],
                x="Employ",
                y="Medianpay",
                hue="Conservative",
                palette="RdBu")
#plt.axvline(justBoroughs.Employ.mean(), linestyle="--", alpha=0.6)
#plt.axhline(justBoroughs.Unemploy.mean(), linestyle="--", alpha=0.6)
plt.title("My beautiful scatter plot")
# where to put the legend
plt.legend(loc='upper right');

boroughs["PopThousands"] = boroughs["Population"]/1000

boroughs["AvgHouseholdSize"] = boroughs["Population"]/boroughs["Households"]
boroughs.sort_values("AvgHouseholdSize", ascending=False)["AvgHouseholdSize"]

# numeric_only skips text columns like Borough when computing correlations
justBoroughs.corr(numeric_only=True)

sns.lmplot(data=boroughs,
           x="Pay",
           y="Happy");

"""# Time Series

The other `csv` files all contain time series. Let's look at how recycling has changed over recent years.
"""

recycling = pd.read_csv('recycling.csv')
recycling

pd.to_datetime(recycling.Year,format="%Y")

"""This time we'll make `Year` the index"""

recycling = recycling.set_index("Year")

"""Now we can draw a time series graph"""

recycling.Barnet.plot(c="red")
recycling["Barking and Dagenham"].plot(c="green");

"""It would be helpful to be able to show that Barking and Dagenham has improved by more *as a proportion* of their starting point than Barnet has.

We can make a new column, call it `BarnetIndexed` say, and fill it with the values scaled to 1 at 2004. And the same for Barking and Dagenham.
"""

recycling["BarnetIndexed"] = recycling.Barnet/recycling.Barnet[2004]
recycling["Barking and DagenhamIndexed"] = recycling["Barking and Dagenham"]/recycling["Barking and Dagenham"][2004]

"""Things to note about the above

* you can make a new column just by saying `recycling["New column name"]=`
* you can divide every number in a column by the value in 2004 by just doing `recycling.Barnet/recycling.Barnet[2004]`
"""
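That division works because `pandas` *broadcasts* the single 2004 value across the whole column. A toy sketch (made-up recycling rates):

```python
import pandas as pd

recycled = pd.Series([10.0, 12.0, 15.0], index=[2004, 2005, 2006])
# Dividing the whole column by its 2004 value rescales it to start at 1
indexed = recycled / recycled[2004]
print(indexed.tolist())  # [1.0, 1.2, 1.5]
```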

recycling.BarnetIndexed.plot(c="red")
recycling["Barking and DagenhamIndexed"].plot(c="blue");
# note the `c` for colour

"""In fact, let's go ahead and do that for all the boroughs. We can use a `for` loop over all the columns (remember that in this dataframe it's the boroughs that are columns and the years are rows.)"""

# loop over a copy of the column list, since we're adding new columns as we go
for column in list(recycling.columns):
    recycling["{}Indexed".format(column)] = recycling[column]/recycling[column][2004]
recycling.head()

recycling["Newham"].plot(c="green")
recycling["NewhamIndexed"].plot(c="blue")
recycling["Barnet"].plot(c="orange")
recycling["BarnetIndexed"].plot(c="red")
plt.title("Recycling in Newham and Barnet");

"""There was a small fudge in here. If you check `recycling.dtypes` you'll see that `Year` was an `int64` (an integer), which worked okay for us this time, but in future we'll want to explicitly turn it into a `datetime` object instead, so `pandas` knows we're dealing with time. We'll do that with [`to_datetime`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html).

Documentation for [`pandas` is here](http://pandas.pydata.org/pandas-docs/stable/).

We've installed several visualisation libraries that you might find useful

* [`pyplot`](https://matplotlib.org/)
* [`seaborn`](https://seaborn.pydata.org/)
* [`bokeh`](https://bokeh.pydata.org/)
* [`chartify`](https://labs.spotify.com/2018/11/15/introducing-chartify-easier-chart-creation-in-python-for-data-scientists/)
* [`geopandas`](http://geopandas.org/)
"""
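As the note above suggests, turning an integer `Year` into a proper datetime index might look like this (made-up years for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Year": [2004, 2005], "Rate": [10.0, 12.0]})
# Parse the integers as years, then use the timestamps as the index
df["Year"] = pd.to_datetime(df["Year"], format="%Y")
df = df.set_index("Year")
print(df.index[0].year)  # 2004
```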