diff --git a/01 Tutorial.ipynb b/01 Tutorial.ipynb deleted file mode 100644 index ed222a7..0000000 --- a/01 Tutorial.ipynb +++ /dev/null @@ -1,1003 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "# Introduction to Data Analysis with Pandas" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "source": [ - "- [Getting the data into Python](#Getting-the-data-into-Python)\n", - " - using `read_csv` and dealing with missing data\n", - "- [Accessing columns](#Accessing-the-columns)\n", - " - using dot notation and square brackets\n", - " - setting the index\n", - " - using `loc`\n", - "- [Sorting and filtering](#Sorting-and-filtering)\n", - " - the `sort_values` function\n", - " - how to get documentation\n", - " - default arguments\n", - " - passing a Boolean to `loc[]`\n", - " - compound filters\n", - "- [Summary statistics](#Summary-statistics)\n", - " - not so useful for this data set but good to know\n", - "- [Investigating relationships](#Investigating-relationships)\n", - " - drawing scatter plots in `pandas`\n", - " - drawing better scatter plots in `seaborn`\n", - " - getting the correlation coefficient\n", - "- [Time series](#Time-Series)\n", - " - plotting simple time series\n", - " - applying a calculation and creating new columns" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "outputs": [], - "source": [ - "# We tend to abbreviate the pandas library as pd\n", - "import pandas as pd\n", - "# Stop pandas from abbreviating tables to fit in the notebook\n", - "pd.options.display.max_columns = 1000\n", - "pd.options.display.max_rows = 1000\n", - "# Display graphs in the notebook\n", - "%matplotlib inline" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "## Getting 
the data into Python\n", - "\n", - "The `pandas` library stores data in what it calls a *dataframe*, which is really just a smart table.\n", - "\n", - "We use the `read_csv` function to read in data from a csv file. In this case it's data about London Boroughs.\n", - "\n", - "Don't forget to run each cell when you get to it with either `ctrl`+`enter` or `shift`+`enter`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "outputs": [], - "source": [ - "# read in our csv file, and automatically change missing values (a dot in the csv) into NaN\n", - "#boroughs = pd.read_csv('boroughs.csv', na_values = ['.',' '], thousands=',')\n", - "boroughs = pd.read_csv('https://raw.githubusercontent.com/adaapp/dav-introductionToPandas/master/boroughs.csv', na_values = ['.',' '], thousands=',')\n", - "# Use the head function to see the first few rows\n", - "#boroughs.head(5)\n", - "boroughs.dtypes" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "### Q1\n", - "\n", - "> What do you think `NaN` stands for?" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "## Accessing the columns" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "source": [ - "A single column of the data is accessible using Python dot notation" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "boroughs.Anxiety" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "Or we can use square brackets, a bit like with a Python list or dictionary." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "boroughs['Population']" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "### Q2\n", - "\n", - "> Try out both ways of accessing columns.\n", - ">\n", - "> This isn't as helpful as it could be. Why not?" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "Square brackets are more flexible. We can give them a list of headings." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "# note the nested brackets\n", - "boroughs[['Borough','Population','Happy']]" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "This is better. But it would be nice if we didn't have to keep including the `Borough` column. So let's make that our *index*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "boroughs = boroughs.set_index(boroughs.Borough)\n", - "boroughs.head(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "### Q3 \n", - "\n", - "> What changed?" 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "Now, when we ask for column, we'll get the borough for free" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "boroughs[['Age','WorkAge']]" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "Now we can also use the `loc` function (which uses square brackets, too) to *filter* the data and *locate* the index Haringey." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "boroughs.loc['Haringey']" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "### Q4\n", - "\n", - "> Pick another borough to retreive the data for. Compare it to Haringey." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "outputs": [], - "source": [ - "boroughs.loc[['Haringey','Hackney']]" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "## Sorting and filtering" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "source": [ - "Let's find out which boroughs have the highest population.\n", - "\n", - "`pandas` dataframes have a `sort_values` function." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "### Q5\n", - "\n", - "Remember in a jupyter notebook, you can put the cursor in the function brackets and hit `shift`+`tab` to bring up documentation for that function.\n", - "\n", - "> Make the sort_values function below work, to put the boroughs in order of population\n", - ">\n", - "> Now put them in *descending* order\n", - ">\n", - "> Which borough has the largest population?" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "# *** broken ***\n", - "boroughs.sort_values()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "What if we wanted to only include **innerLondon** boroughs?" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "outputs": [], - "source": [ - "boroughs.loc[boroughs[\"InnerOuter\"]=='Inner London']" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "source": [ - "So we can pass a Boolean into those square brackets to *filter* the data. `pandas` square brackets are clearly a bit more powerful than regular Python square brackets." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "### Q6\n", - "\n", - "> Filter the data to show only Outer London boroughs\n", - ">\n", - "> Apply `sort_values` to give the Outer London boroughs in descending order of population" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "boroughs.loc[boroughs[\"InnerOuter\"]==\"Outer London\"].sort_values(\"Population\")[[\"Area\",\"Age\"]]" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "If you want to combine two Booleans into one filter you'll need to put both into parentheses *for reasons*. For example," - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "boroughs.loc[(boroughs.InnerOuter==\"Inner London\") | (boroughs.InnerOuter==\"Outer London\")]" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "It might be useful to come back to this table of *just* the individual boroughs, so let's assign that to a variable `justBoroughs`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "justBoroughs = boroughs.loc[(boroughs.InnerOuter==\"Inner London\") | (boroughs.InnerOuter==\"Outer London\")]\n", - "justBoroughs.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "notes" - } - }, - "source": [ - "### Note\n", - "\n", - "There is a subtle catch here that is worth thinking about when you're trying to do more advanced stuff with `pandas`.\n", - "\n", - "`boroughs[]` and `boroughs.loc[]` can appear to do the same thing, but they don't. 
In general it is better to use `loc`.\n", - "\n", - "See [this article](https://www.dataquest.io/blog/settingwithcopywarning/) later if you want more details. " - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "## Summary statistics\n", - "\n", - "The dataframe has built in functions for statistical measures like `mean`, `std`, `quantile` but you need to be careful whether using them makes sense." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "# you can give loc a row label and a column label\n", - "boroughs.loc['London','Age']" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "justBoroughs['Age'].mean()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "# you can give loc a row label and a column label\n", - "boroughs.loc['London','Age']" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "### Q7\n", - "\n", - "> Why is the mean of the average ages not the same as the London average age?" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "source": [ - "So use the Inner London, Outer London and London averages from the main table rather than applying `mean` to a column." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "## Investigating relationships" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "source": [ - "We would expect there to be an obvious relationship between unemployment rates and employment rates" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "outputs": [], - "source": [ - "justBoroughs.plot.scatter(\"Employ\", \"Unemploy\");" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "Let's quantify that by asking for the correlation coefficient" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "justBoroughs.Employ.corr(justBoroughs.Unemploy)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "### Q8\n", - "\n", - "> How would you interpret this?\n", - ">\n", - "> What *correlation coefficient* is it using?\n", - ">\n", - "> Why isn't it a perfect correlation?\n", - ">\n", - "> Look for correlation between some other pairs of variables. Use a scatter plot first, then get the correlation coefficient" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "The `seaborn` library has some nice options for scatter plots, so let's import that and then see an example." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "# pyplot is the grandparent of all python plotting packages\n", - "import matplotlib.pyplot as plt\n", - "# seaborn is based on pyplot but makes it easier to use\n", - "import seaborn as sns\n", - "# I don't know why we abbreviate seaborn as sns" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "Now an example," - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "# by default seaborn plots come out a bit small, so make ours 8in by 8in\n", - "plt.figure(figsize=(8,8))\n", - "# sns.scatterplot has options for controlling colour and dot size so we can use four variables on one graph\n", - "sns.scatterplot(data=justBoroughs.loc[justBoroughs.Borough != \"City of London\"],\n", - " x=\"Employ\",\n", - " y=\"Medianpay\",\n", - " hue=\"Conservative\",\n", - " palette=\"RdBu\")\n", - "#plt.axvline(justBoroughs.Employ.mean(), linestyle=\"--\", alpha=0.6)\n", - "#plt.axhline(justBoroughs.Unemploy.mean(), linestyle=\"--\", alpha=0.6)\n", - "plt.title(\"My beautiful scatter plot\")\n", - "# where to put the legend\n", - "plt.legend(loc='upper right');" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "boroughs[\"PopThousands\"] = boroughs[\"Population\"]/1000" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "boroughs[\"AvgHouseholdSize\"] = boroughs[\"Population\"]/boroughs[\"Households\"]\n", - "boroughs.sort_values(\"AvgHouseholdSize\", ascending=False)[\"AvgHouseholdSize\"]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "justBoroughs.corr()" - 
] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sns.lmplot(data=boroughs,\n", - " x=\"Pay\",\n", - " y=\"Happy\");" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "# Time Series\n", - "\n", - "The other `csv` files all contain time series. Let's look at how recycling has changed over recent years." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "recycling = pd.read_csv('recycling.csv')\n", - "recycling" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pd.to_datetime(recycling.Year,format=\"%Y\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "This time we'll make `Year` the index" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "recycling = recycling.set_index(\"Year\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "Now we can draw a time series graph" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "recycling.Barnet.plot(c=\"red\")\n", - "recycling[\"Barking and Dagenham\"].plot(c=\"green\");" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "It would be helpful to be able to show that Barking and Dagenham has improved by more *as a proportion* of their starting point than Barnet has.\n", - "\n", - "We can make a new a column, call it BarnetIndexed say, and fill it with the percentages scaled to 1 at 2004. 
And the same for Barking and Dagenham." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "recycling[\"BarnetIndexed\"] = recycling.Barnet/recycling.Barnet[2004]\n", - "recycling[\"Barking and DagenhamIndexed\"] =recycling[\"Barking and Dagenham\"]/recycling[\"Barking and Dagenham\"][2004]" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "source": [ - "Things to note about the above\n", - "\n", - "* you can make a new column just by saying `recycling[\"New column name\"]=`\n", - "* you can divide every number in a column by the value in 2004 by just doing `recycling.Barnet/recycling.Barnet[2004]`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true, - "slideshow": { - "slide_type": "subslide" - } - }, - "outputs": [], - "source": [ - "recycling.BarnetIndexed.plot(c=\"red\")\n", - "recycling[\"Barking and DagenhamIndexed\"].plot(c=\"blue\");\n", - "# note the `c` for colour" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "In fact, let's go ahead and do that for all the boroughs. 
We can use a `for` loop over all the columns (remember that in this dataframe it's the boroughs that are columns and the years are rows.)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "for column in recycling.columns:\n", - " recycling[\"{}Indexed\".format(column)] = recycling[column]/recycling[column][2004]\n", - "recycling.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true, - "slideshow": { - "slide_type": "subslide" - } - }, - "outputs": [], - "source": [ - "recycling[\"Newham\"].plot(c=\"green\")\n", - "recycling[\"NewhamIndexed\"].plot(c=\"blue\")\n", - "recycling[\"Barnet\"].plot(c=\"orange\")\n", - "recycling[\"BarnetIndexed\"].plot(c=\"red\")\n", - "plt.title(\"Recycling in Newham and Barnet\");" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "notes" - } - }, - "source": [ - "There was a small fudge in here. If you check `recycling.dtypes` you'll see that `Year` was an `int64` (an integer) which worked okay for us this time, but in future we'll want to explicit turn it into a `datetime` object instead, so `pandas` knows we're dealing with time. We'll do that with [`to_datetime`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "Documentation for [`pandas` is here](http://pandas.pydata.org/pandas-docs/stable/).\n", - "\n", - "We've installed several visualisation libraries that you might find useful\n", - "\n", - "* [`pyplot`](https://matplotlib.org/)\n", - "* [`seaborn`](https://seaborn.pydata.org/)\n", - "* [`bokeh`](https://bokeh.pydata.org/)\n", - "* [`chartify`](https://labs.spotify.com/2018/11/15/introducing-chartify-easier-chart-creation-in-python-for-data-scientists/)\n", - "* [`geopandas`](http://geopandas.org/)" - ] - } - ], - "metadata": { - "celltoolbar": "Slideshow", - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.6" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/01_tutorial.py b/01_tutorial.py new file mode 100644 index 0000000..7024fcc --- /dev/null +++ b/01_tutorial.py @@ -0,0 +1,299 @@ +# -*- coding: utf-8 -*- +"""01 Tutorial.ipynb + +Automatically generated by Colaboratory. 
+ +Original file is located at + https://colab.research.google.com/github/adaapp/dav-introductionToPandas/blob/master/01%20Tutorial.ipynb + +# Introduction to Data Analysis with Pandas + +- [Getting the data into Python](#Getting-the-data-into-Python) + - using `read_csv` and dealing with missing data +- [Accessing columns](#Accessing-the-columns) + - using dot notation and square brackets + - setting the index + - using `loc` +- [Sorting and filtering](#Sorting-and-filtering) + - the `sort_values` function + - how to get documentation + - default arguments + - passing a Boolean to `loc[]` + - compound filters +- [Summary statistics](#Summary-statistics) + - not so useful for this data set but good to know +- [Investigating relationships](#Investigating-relationships) + - drawing scatter plots in `pandas` + - drawing better scatter plots in `seaborn` + - getting the correlation coefficient +- [Time series](#Time-Series) + - plotting simple time series + - applying a calculation and creating new columns +""" + +# Commented out IPython magic to ensure Python compatibility. +# We tend to abbreviate the pandas library as pd +import pandas as pd +# Stop pandas from abbreviating tables to fit in the notebook +pd.options.display.max_columns = 1000 +pd.options.display.max_rows = 1000 +# Display graphs in the notebook +# %matplotlib inline + +"""## Getting the data into Python + +The `pandas` library stores data in what it calls a *dataframe*, which is really just a smart table. + +We use the `read_csv` function to read in data from a csv file. In this case it's data about London Boroughs. 
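As a hedged sketch of what those `na_values` and `thousands` arguments do, here is a made-up two-row CSV (invented numbers, not the real boroughs file) parsed the same way:

```python
import io

import pandas as pd

# A tiny stand-in for boroughs.csv: '.' marks a missing value and the
# Population column uses comma thousands separators, as in the real file.
csv_text = 'Borough,Population,Happy\nCamden,"229,700",7.2\nBrent,"328,300",.\n'

df = pd.read_csv(io.StringIO(csv_text), na_values=['.', ' '], thousands=',')

print(df.dtypes)              # Population parses as an integer column, Happy as float
print(df.Happy.isna().sum())  # the '.' has become NaN
```

Without `thousands=','` the populations would stay as strings, and without `na_values` the `.` would force the whole `Happy` column to be read as text.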
+ +Don't forget to run each cell when you get to it with either `ctrl`+`enter` or `shift`+`enter` +""" + +# read in our csv file, and automatically change missing values (a dot in the csv) into NaN +#boroughs = pd.read_csv('boroughs.csv', na_values = ['.',' '], thousands=',') +boroughs = pd.read_csv('https://raw.githubusercontent.com/adaapp/dav-introductionToPandas/master/boroughs.csv', na_values = ['.',' '], thousands=',') +# Use the head function to see the first few rows +#boroughs.head(5) +boroughs.dtypes + +"""### Q1 + +> What do you think `NaN` stands for? + +## Accessing the columns + +A single column of the data is accessible using Python dot notation +""" + +boroughs.Anxiety + +"""Or we can use square brackets, a bit like with a Python list or dictionary.""" + +boroughs['Population'] + +"""### Q2 + +> Try out both ways of accessing columns. +> +> This isn't as helpful as it could be. Why not? + +Square brackets are more flexible. We can give them a list of headings. +""" + +# note the nested brackets +boroughs[['Borough','Population','Happy']] + +"""This is better. But it would be nice if we didn't have to keep including the `Borough` column. So let's make that our *index*""" + +boroughs = boroughs.set_index(boroughs.Borough) +boroughs.head(5) + +"""### Q3 + +> What changed? + +Now, when we ask for a column, we'll get the borough for free +""" + +boroughs[['Age','WorkAge']] + +"""Now we can also use the `loc` function (which uses square brackets, too) to *filter* the data and *locate* the index Haringey.""" + +boroughs.loc['Haringey'] + +"""### Q4 + +> Pick another borough to retrieve the data for. Compare it to Haringey. +""" + +boroughs.loc[['Haringey','Hackney']] + +"""## Sorting and filtering + +Let's find out which boroughs have the highest population. + +`pandas` dataframes have a `sort_values` function.
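For reference, a minimal sketch of how `sort_values` is usually called, on a toy frame with invented borough names and populations:

```python
import pandas as pd

toy = pd.DataFrame(
    {"Population": [150_000, 390_000, 280_000]},
    index=["Alpha", "Beta", "Gamma"],  # invented names, not real boroughs
)

# sort_values needs to be told which column to sort by;
# ascending=True is the default, so flip it for largest-first
smallest_first = toy.sort_values("Population")
largest_first = toy.sort_values("Population", ascending=False)

print(largest_first.index[0])  # the most populous invented borough comes first
```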
+ +### Q5 + +Remember that in a Jupyter notebook, you can put the cursor in the function brackets and hit `shift`+`tab` to bring up documentation for that function. + +> Make the sort_values function below work, to put the boroughs in order of population +> +> Now put them in *descending* order +> +> Which borough has the largest population? +""" + +# *** broken *** +boroughs.sort_values() + +"""What if we wanted to include only **Inner London** boroughs?""" + +boroughs.loc[boroughs["InnerOuter"]=='Inner London'] + +"""So we can pass a Boolean into those square brackets to *filter* the data. `pandas` square brackets are clearly a bit more powerful than regular Python square brackets. + +### Q6 + +> Filter the data to show only Outer London boroughs +> +> Apply `sort_values` to give the Outer London boroughs in descending order of population +""" + +boroughs.loc[boroughs["InnerOuter"]=="Outer London"].sort_values("Population", ascending=False)[["Area","Age"]] + +"""If you want to combine two Booleans into one filter you'll need to put each comparison in its own parentheses, because `|` binds more tightly than comparisons such as `==`. For example,""" + +boroughs.loc[(boroughs.InnerOuter=="Inner London") | (boroughs.InnerOuter=="Outer London")] + +"""It might be useful to come back to this table of *just* the individual boroughs, so let's assign that to a variable `justBoroughs`""" + +justBoroughs = boroughs.loc[(boroughs.InnerOuter=="Inner London") | (boroughs.InnerOuter=="Outer London")] +justBoroughs.head() + +"""### Note + +There is a subtle catch here that is worth thinking about when you're trying to do more advanced stuff with `pandas`. + +`boroughs[]` and `boroughs.loc[]` can appear to do the same thing, but they don't. In general it is better to use `loc`. + +See [this article](https://www.dataquest.io/blog/settingwithcopywarning/) later if you want more details. + +## Summary statistics + +The dataframe has built-in functions for statistical measures like `mean`, `std` and `quantile`, but you need to be careful about whether using them makes sense.
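The filter-then-summarise pattern above can be sketched on a toy frame (all names and values invented) to show why dropping the summary rows matters before taking a mean:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "InnerOuter": ["Inner London", "Outer London", "Inner London", "Greater London"],
        "Age": [34.0, 38.5, 36.0, 36.5],
    },
    index=["A", "B", "C", "London"],  # 'London' plays the role of a summary row
)

# Each comparison produces a Boolean Series; when combined with |, every
# comparison needs its own parentheses because | binds more tightly than ==
just_rows = toy.loc[(toy.InnerOuter == "Inner London") | (toy.InnerOuter == "Outer London")]

print(len(just_rows))        # the 'London' summary row has been filtered out
print(just_rows.Age.mean())  # mean over the three remaining rows only
```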
+""" + +# you can give loc a row label and a column label +boroughs.loc['London','Age'] + +justBoroughs['Age'].mean() + +# you can give loc a row label and a column label +boroughs.loc['London','Age'] + +"""### Q7 + +> Why is the mean of the average ages not the same as the London average age? + +So use the Inner London, Outer London and London averages from the main table rather than applying `mean` to a column. + +## Investigating relationships + +We would expect there to be an obvious relationship between unemployment rates and employment rates +""" + +justBoroughs.plot.scatter("Employ", "Unemploy"); + +"""Let's quantify that by asking for the correlation coefficient""" + +justBoroughs.Employ.corr(justBoroughs.Unemploy) + +"""### Q8 + +> How would you interpret this? +> +> What *correlation coefficient* is it using? +> +> Why isn't it a perfect correlation? +> +> Look for correlation between some other pairs of variables. Use a scatter plot first, then get the correlation coefficient + +The `seaborn` library has some nice options for scatter plots, so let's import that and then see an example. 
+""" + +# pyplot is the grandparent of all python plotting packages +import matplotlib.pyplot as plt +# seaborn is based on pyplot but makes it easier to use +import seaborn as sns +# I don't know why we abbreviate seaborn as sns + +"""Now an example,""" + +# by default seaborn plots come out a bit small, so make ours 8in by 8in +plt.figure(figsize=(8,8)) +# sns.scatterplot has options for controlling colour and dot size so we can use four variables on one graph +sns.scatterplot(data=justBoroughs.loc[justBoroughs.Borough != "City of London"], + x="Employ", + y="Medianpay", + hue="Conservative", + palette="RdBu") +#plt.axvline(justBoroughs.Employ.mean(), linestyle="--", alpha=0.6) +#plt.axhline(justBoroughs.Unemploy.mean(), linestyle="--", alpha=0.6) +plt.title("My beautiful scatter plot") +# where to put the legend +plt.legend(loc='upper right'); + +boroughs["PopThousands"] = boroughs["Population"]/1000 + +boroughs["AvgHouseholdSize"] = boroughs["Population"]/boroughs["Households"] +boroughs.sort_values("AvgHouseholdSize", ascending=False)["AvgHouseholdSize"] + +justBoroughs.corr() + +sns.lmplot(data=boroughs, + x="Pay", + y="Happy"); + +"""# Time Series + +The other `csv` files all contain time series. Let's look at how recycling has changed over recent years. +""" + +recycling = pd.read_csv('recycling.csv') +recycling + +pd.to_datetime(recycling.Year,format="%Y") + +"""This time we'll make `Year` the index""" + +recycling = recycling.set_index("Year") + +"""Now we can draw a time series graph""" + +recycling.Barnet.plot(c="red") +recycling["Barking and Dagenham"].plot(c="green"); + +"""It would be helpful to be able to show that Barking and Dagenham has improved by more *as a proportion* of their starting point than Barnet has. + +We can make a new a column, call it BarnetIndexed say, and fill it with the percentages scaled to 1 at 2004. And the same for Barking and Dagenham. 
+""" + +recycling["BarnetIndexed"] = recycling.Barnet/recycling.Barnet[2004] +recycling["Barking and DagenhamIndexed"] =recycling["Barking and Dagenham"]/recycling["Barking and Dagenham"][2004] + +"""Things to note about the above + +* you can make a new column just by saying `recycling["New column name"]=` +* you can divide every number in a column by the value in 2004 by just doing `recycling.Barnet/recycling.Barnet[2004]` +""" + +recycling.BarnetIndexed.plot(c="red") +recycling["Barking and DagenhamIndexed"].plot(c="blue"); +# note the `c` for colour + +"""In fact, let's go ahead and do that for all the boroughs. We can use a `for` loop over all the columns (remember that in this dataframe it's the boroughs that are columns and the years are rows.)""" + +for column in recycling.columns: + recycling["{}Indexed".format(column)] = recycling[column]/recycling[column][2004] +recycling.head() + +recycling["Newham"].plot(c="green") +recycling["NewhamIndexed"].plot(c="blue") +recycling["Barnet"].plot(c="orange") +recycling["BarnetIndexed"].plot(c="red") +plt.title("Recycling in Newham and Barnet"); + +"""There was a small fudge in here. If you check `recycling.dtypes` you'll see that `Year` was an `int64` (an integer) which worked okay for us this time, but in future we'll want to explicit turn it into a `datetime` object instead, so `pandas` knows we're dealing with time. We'll do that with [`to_datetime`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html). + +Documentation for [`pandas` is here](http://pandas.pydata.org/pandas-docs/stable/). 
+ +We've installed several visualisation libraries that you might find useful + +* [`pyplot`](https://matplotlib.org/) +* [`seaborn`](https://seaborn.pydata.org/) +* [`bokeh`](https://bokeh.pydata.org/) +* [`chartify`](https://labs.spotify.com/2018/11/15/introducing-chartify-easier-chart-creation-in-python-for-data-scientists/) +* [`geopandas`](http://geopandas.org/) +""" \ No newline at end of file diff --git a/02 Blackbirds.ipynb b/02 Blackbirds.ipynb deleted file mode 100644 index 911ee52..0000000 --- a/02 Blackbirds.ipynb +++ /dev/null @@ -1,719 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "# 02 Blackbirds\n", - "\n", - "- [Practice of day 01 techniques](#Practice)\n", - "- [Setting a column to datetime format](#Setting-a-column-to-datetime-format)\n", - "- [Introducing groupby](#Introducing-groupby)\n", - "- [Distribution plots with `distplot`](#Distribution-plots)\n", - " - Includes, using `subplots` to get two graphs in one figure\n", - "- [Hypothesis testing](#Hypothesis-testing)\n", - " - Using `scipy.stats` to run a t-test\n", - "- [Box plots](#Boxplots)\n", - "- [Ordinal data](#Ordinal-data)\n", - " - The age categories happened to make sense in alphabetical order. 
What if they didn't?\n", - "- [Time series](#Time-series)\n", - " - Now we've grouped by year we can aggregate by mean to plot a time series" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "## Practice\n", - "\n", - "![Blackbird](https://www.rspb.org.uk/globalassets/images/birds-and-wildlife/bird-species-illustrations/blackbird_male_1200x675.jpg?preset=landscape_mobile)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "notes" - } - }, - "outputs": [], - "source": [ - "import pandas as pd\n", - "import matplotlib.pyplot as plt\n", - "import seaborn as sns\n", - "%matplotlib inline" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "1. Import the `blackbirds.csv` data into a `pandas` dataframe.\n", - "1. How many rows are there in your dataframe? (Try `len()`)\n", - "1. Is there a sensible index in the dataframe?\n", - "1. What do each of the columns represent? What do you think the age values mean?\n", - "1. Find the mean and standard deviation (`std`) of the wing span and weight columns.\n", - "1. Use the documentation to check *which* standard deviation you're getting.\n", - "1. Use the `quantile` function to find the median and the IQR too.\n", - "1. Is there a relationship between wing span and weight? Visualise it and measure it.\n", - "1. Use the `hue`, `size`, `style` and `markers` of the `seaborn` [scatterplot function](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) to distinguish between the different kinds of blackbird in your plot.\n", - "1. Find the mean and standard deviation weight and wing span of adult female and male blackbirds separately.\n", - "1. What other questions could you ask of this data set?" 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "### Q1 and Q2" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": false, - "slideshow": { - "slide_type": "subslide" - } - }, - "outputs": [], - "source": [ - "#blackbirds = pd.read_csv(\"blackbirds.csv\")\n", - "blackbirds = pd.read_csv(\"https://raw.githubusercontent.com/adaapp/dav-introductionToPandas/master/blackbirds.csv\")\n", - "len(blackbirds)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "source": [ - "### Q3\n", - "\n", - "There isn't anything unique in the columns to index by." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "source": [ - "### Q4\n", - "\n", - "The values in the age column are Juvenile, First year, Adult and Unknown" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "### Q5" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "blackbirds.Weight.std()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "source": [ - "### Q6\n", - "\n", - "The default `ddof` argument is 1, which means the denominator will be $n-1$, so this is sample standard deviation by default." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "### Q7" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "print(\"Weight: The median is {}, with IQR {}\".format(blackbirds.Weight.quantile(0.5),\n", - " blackbirds.Weight.quantile(0.75)-blackbirds.Weight.quantile(0.25)))\n", - "print(\"Wing: The median is {}, with IQR {}\".format(blackbirds.Wing.quantile(0.5),\n", - " blackbirds.Wing.quantile(0.75)-blackbirds.Wing.quantile(0.25)))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "But actually," - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": false, - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "blackbirds.describe()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "### Setting a column to datetime format\n", - "\n", - "The `Year` column shouldn't really work like that. If you check `blackbirds.dtypes` you'll see why." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true, - "slideshow": { - "slide_type": "fragment" - } - }, - "outputs": [], - "source": [ - "blackbirds.Year = pd.to_datetime(blackbirds.Year,format=\"%Y\")\n", - "blackbirds.dtypes" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "source": [ - "Check `blackbirds.dtypes` again." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "### Q8" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "blackbirds.plot.scatter(\"Wing\",\"Weight\")\n", - "blackbirds.corr()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "### Q9" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "hide_input": false, - "scrolled": false, - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "plt.figure(figsize=(12,6))\n", - "\n", - "sns.scatterplot(data=blackbirds,x=\"Wing\",y=\"Weight\",hue=\"Age\", style=\"Sex\", palette=\"hot\", alpha=0.6);" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "### Q10" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "outputs": [], - "source": [ - "blackbirds.loc[(blackbirds.Sex=='M')&(blackbirds.Age=='A')].describe()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "outputs": [], - "source": [ - "blackbirds.loc[(blackbirds.Sex=='F')&(blackbirds.Age=='A')].describe()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "### Introducing groupby\n", - "\n", - "But this feels like good opportunity to see the `groupby` function:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "blackbirds.groupby([\"Sex\",\"Age\"])" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "fragment" 
- } - }, - "source": [ - "By itself, `groupby` doesn't do much except make a groupby object. Just like with a pivot table, we need to tell it what to *aggregate* by..." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "outputs": [], - "source": [ - "blackbirds.groupby([\"Age\",\"Sex\"]).mean()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "## Distribution plots\n", - "\n", - "`seaborn` has a `distplot` function the combines a histogram with an estimate of the continuous distribution shape" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true, - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "# What error message do you get without the dropna?\n", - "sns.distplot(blackbirds.Weight.dropna());" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": false, - "slideshow": { - "slide_type": "subslide" - } - }, - "outputs": [], - "source": [ - "# fig is the whole figure, axs is a list of two sets of axes\n", - "fig,axs = plt.subplots(1,2)\n", - "fig.suptitle(\"Distribution of weight and wing span\")\n", - "# I don't care about the numbers on the y-axis\n", - "axs[0].get_yaxis().set_visible(False)\n", - "axs[1].get_yaxis().set_visible(False)\n", - "# Pass the axes to seaborn to tell it where to plot each graph\n", - "sns.distplot(blackbirds.Weight.dropna(), bins=10, ax=axs[0])\n", - "sns.distplot(blackbirds.Wing.dropna(), bins=10, ax=axs[1]);" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "Use `distplot` to compare the distribution of weight and the wing span for female and male blackbirds" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - 
"outputs": [], - "source": [ - "fig, axs = plt.subplots(1,2)\n", - "fig.suptitle(\"Weight and wingspan distribution by sex\")\n", - "\n", - "axs[0].get_yaxis().set_visible(False)\n", - "sns.distplot(blackbirds[blackbirds.Sex=='M'].Wing.dropna(),color=\"goldenrod\", ax=axs[0], label='M', bins=10)\n", - "sns.distplot(blackbirds[blackbirds.Sex=='F'].Wing.dropna(),color=\"rebeccapurple\", ax=axs[0], label='F', bins=10)\n", - "\n", - "axs[1].get_yaxis().set_visible(False)\n", - "sns.distplot(blackbirds[blackbirds.Sex=='M'].Weight.dropna(),color=\"goldenrod\", ax=axs[1], label='M')\n", - "sns.distplot(blackbirds[blackbirds.Sex=='F'].Weight.dropna(),color=\"rebeccapurple\", ax=axs[1], label='F')\n", - "axs[0].legend(loc=\"lower left\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "source": [ - "What does this suggest?" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "## Hypothesis testing\n", - "\n", - "It looks like the mean wing span for female blackbirds is different from the mean for males. How should we test that?" 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "source": [ - "The `scipy` package has a function for doing t-tests\n", - "\n", - "https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ttest_ind.html" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "from scipy import stats" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "outputs": [], - "source": [ - "stats.ttest_ind(blackbirds.loc[blackbirds.Sex == 'M',\"Weight\"].dropna(),\n", - " blackbirds.loc[blackbirds.Sex == 'F',\"Weight\"].dropna(),\n", - " equal_var=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "source": [ - "What can we conclude? Was this a one or a two-tailed test? Does it matter?" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "## Boxplots\n", - "\n", - "We can use grouped boxplots to see how weight and wing span change with age" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "outputs": [], - "source": [ - "# Make a figure with two subplots with a shared y-axis\n", - "fig, axs = plt.subplots(1,2, sharey=True)\n", - "# axs is a list so we can get the first subplot with ax[0]\n", - "sns.boxplot(x=\"Wing\",y=\"Age\",data=blackbirds, ax=axs[0], whis=3)\n", - "# and the second with ax[1]\n", - "sns.boxplot(x=\"Weight\",y=\"Age\",data=blackbirds, ax=axs[1], whis=2);" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "Investigate the optional arguments for boxplots. What definition of outlier is used?" 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "## Ordinal data\n", - "\n", - "It so happened that A, F, J and U worked quite well because they're in alphabetical order. But it would be better to tell `pandas` what order we really mean them to come in." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "outputs": [], - "source": [ - "blackbirds.Age = pd.Categorical(blackbirds.Age, categories=[\"U\",\"J\",\"F\",\"A\"])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true, - "slideshow": { - "slide_type": "subslide" - } - }, - "outputs": [], - "source": [ - "fig, axs = plt.subplots(2,1,sharex=True)\n", - "sns.boxplot(x=\"Wing\",y=\"Age\",data=blackbirds, ax=axs[0])\n", - "sns.boxplot(x=\"Weight\",y=\"Age\",data=blackbirds, ax=axs[1]);" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "## Time series\n", - "\n", - "Let's look at how weight and wing span have varied over time" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "outputs": [], - "source": [ - "# A groupby by itself doesn't do very much\n", - "blackbirds.groupby(by=\"Year\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "outputs": [], - "source": [ - "blackbirds.groupby(by=\"Year\").mean()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": false, - "slideshow": { - "slide_type": "subslide" - } - }, - "outputs": [], - "source": [ - "blackbirds.groupby(by=\"Year\").mean().plot();" - ] - } - ], - "metadata": { - "celltoolbar": "Slideshow", - "kernelspec": { - "display_name": "Python 3", - "language": "python", - 
"name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.6" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/02_blackbirds.py b/02_blackbirds.py new file mode 100644 index 0000000..1194905 --- /dev/null +++ b/02_blackbirds.py @@ -0,0 +1,208 @@ +# -*- coding: utf-8 -*- +"""02 Blackbirds.ipynb + +Automatically generated by Colaboratory. + +Original file is located at + https://colab.research.google.com/github/adaapp/dav-introductionToPandas/blob/master/02%20Blackbirds.ipynb + +# 02 Blackbirds + +- [Practice of day 01 techniques](#Practice) +- [Setting a column to datetime format](#Setting-a-column-to-datetime-format) +- [Introducing groupby](#Introducing-groupby) +- [Distribution plots with `distplot`](#Distribution-plots) + - Includes, using `subplots` to get two graphs in one figure +- [Hypothesis testing](#Hypothesis-testing) + - Using `scipy.stats` to run a t-test +- [Box plots](#Boxplots) +- [Ordinal data](#Ordinal-data) + - The age categories happened to make sense in alphabetical order. What if they didn't? +- [Time series](#Time-series) + - Now we've grouped by year we can aggregate by mean to plot a time series + +## Practice + +![Blackbird](https://www.rspb.org.uk/globalassets/images/birds-and-wildlife/bird-species-illustrations/blackbird_male_1200x675.jpg?preset=landscape_mobile) +""" + +# Commented out IPython magic to ensure Python compatibility. +import pandas as pd +import matplotlib.pyplot as plt +import seaborn as sns +# %matplotlib inline + +"""1. Import the `blackbirds.csv` data into a `pandas` dataframe. +1. How many rows are there in your dataframe? (Try `len()`) +1. Is there a sensible index in the dataframe? +1. What do each of the columns represent? What do you think the age values mean? +1. 
Find the mean and standard deviation (`std`) of the wing span and weight columns.
+1. Use the documentation to check *which* standard deviation you're getting.
+1. Use the `quantile` function to find the median and the IQR too.
+1. Is there a relationship between wing span and weight? Visualise it and measure it.
+1. Use the `hue`, `size`, `style` and `markers` of the `seaborn` [scatterplot function](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) to distinguish between the different kinds of blackbird in your plot.
+1. Find the mean and standard deviation of the weight and wing span of adult female and male blackbirds separately.
+1. What other questions could you ask of this data set?
+
+### Q1 and Q2
+"""
+
+#blackbirds = pd.read_csv("blackbirds.csv")
+blackbirds = pd.read_csv("https://raw.githubusercontent.com/adaapp/dav-introductionToPandas/master/blackbirds.csv")
+len(blackbirds)
+
+"""### Q3
+
+There isn't anything unique in the columns to index by.
+
+### Q4
+
+The values in the age column are Juvenile, First year, Adult and Unknown.
+
+### Q5
+"""
+
+blackbirds.Weight.std()
+
+"""### Q6
+
+The default `ddof` argument is 1, which means the denominator will be $n-1$, so this is the sample standard deviation by default.
+
+### Q7
+"""
+
+print("Weight: The median is {}, with IQR {}".format(blackbirds.Weight.quantile(0.5),
+                                                     blackbirds.Weight.quantile(0.75)-blackbirds.Weight.quantile(0.25)))
+print("Wing: The median is {}, with IQR {}".format(blackbirds.Wing.quantile(0.5),
+                                                   blackbirds.Wing.quantile(0.75)-blackbirds.Wing.quantile(0.25)))
+
+"""But actually,"""
+
+blackbirds.describe()
+
+"""### Setting a column to datetime format
+
+The `Year` column shouldn't really work like that. If you check `blackbirds.dtypes` you'll see why.
+"""
+
+blackbirds.Year = pd.to_datetime(blackbirds.Year, format="%Y")
+blackbirds.dtypes
+
+"""Check `blackbirds.dtypes` again.
+
+### Q8
+"""
+
+blackbirds.plot.scatter("Wing","Weight")
+blackbirds.corr()
+
+"""### Q9"""
+
+plt.figure(figsize=(12,6))
+
+sns.scatterplot(data=blackbirds,x="Wing",y="Weight",hue="Age", style="Sex", palette="hot", alpha=0.6);
+
+"""### Q10"""
+
+blackbirds.loc[(blackbirds.Sex=='M')&(blackbirds.Age=='A')].describe()
+
+blackbirds.loc[(blackbirds.Sex=='F')&(blackbirds.Age=='A')].describe()
+
+"""### Introducing groupby
+
+But this feels like a good opportunity to see the `groupby` function:
+"""
+
+blackbirds.groupby(["Sex","Age"])
+
+"""By itself, `groupby` doesn't do much except make a groupby object. Just like with a pivot table, we need to tell it what to *aggregate* by..."""
+
+blackbirds.groupby(["Age","Sex"]).mean()
+
+"""## Distribution plots
+
+`seaborn` has a `distplot` function that combines a histogram with an estimate of the continuous distribution shape.
+"""
+
+# What error message do you get without the dropna?
+sns.distplot(blackbirds.Weight.dropna());
+
+# fig is the whole figure, axs is a list of two sets of axes
+fig,axs = plt.subplots(1,2)
+fig.suptitle("Distribution of weight and wing span")
+# I don't care about the numbers on the y-axis
+axs[0].get_yaxis().set_visible(False)
+axs[1].get_yaxis().set_visible(False)
+# Pass the axes to seaborn to tell it where to plot each graph
+sns.distplot(blackbirds.Weight.dropna(), bins=10, ax=axs[0])
+sns.distplot(blackbirds.Wing.dropna(), bins=10, ax=axs[1]);
+
+"""Use `distplot` to compare the distributions of weight and wing span for female and male blackbirds"""
+
+fig, axs = plt.subplots(1,2)
+fig.suptitle("Weight and wingspan distribution by sex")
+
+axs[0].get_yaxis().set_visible(False)
+sns.distplot(blackbirds[blackbirds.Sex=='M'].Wing.dropna(),color="goldenrod", ax=axs[0], label='M', bins=10)
+sns.distplot(blackbirds[blackbirds.Sex=='F'].Wing.dropna(),color="rebeccapurple", ax=axs[0], label='F', bins=10)
+
+axs[1].get_yaxis().set_visible(False)
+sns.distplot(blackbirds[blackbirds.Sex=='M'].Weight.dropna(),color="goldenrod", ax=axs[1], label='M') +sns.distplot(blackbirds[blackbirds.Sex=='F'].Weight.dropna(),color="rebeccapurple", ax=axs[1], label='F') +axs[0].legend(loc="lower left") + +"""What does this suggest? + +## Hypothesis testing + +It looks like the mean wing span for female blackbirds is different from the mean for males. How should we test that? + +The `scipy` package has a function for doing t-tests + +https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ttest_ind.html +""" + +from scipy import stats + +stats.ttest_ind(blackbirds.loc[blackbirds.Sex == 'M',"Weight"].dropna(), + blackbirds.loc[blackbirds.Sex == 'F',"Weight"].dropna(), + equal_var=False) + +"""What can we conclude? Was this a one or a two-tailed test? Does it matter? + +## Boxplots + +We can use grouped boxplots to see how weight and wing span change with age +""" + +# Make a figure with two subplots with a shared y-axis +fig, axs = plt.subplots(1,2, sharey=True) +# axs is a list so we can get the first subplot with ax[0] +sns.boxplot(x="Wing",y="Age",data=blackbirds, ax=axs[0], whis=3) +# and the second with ax[1] +sns.boxplot(x="Weight",y="Age",data=blackbirds, ax=axs[1], whis=2); + +"""Investigate the optional arguments for boxplots. What definition of outlier is used? + +## Ordinal data + +It so happened that A, F, J and U worked quite well because they're in alphabetical order. But it would be better to tell `pandas` what order we really mean them to come in. 
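As a minimal sketch of what declaring categories buys us (hypothetical age codes, not the real `Age` column): once the category order is given, sorting follows that order rather than the alphabet, and adding the optional `ordered=True` also makes comparisons like `>=` meaningful.

```python
import pandas as pd

ages = pd.Series(["A", "J", "U", "F", "A"])  # hypothetical age codes

# Declare the order explicitly: U (unknown) < J (juvenile) < F (first year) < A (adult)
ages = pd.Series(pd.Categorical(ages, categories=["U", "J", "F", "A"], ordered=True))

# Sorting now respects the declared order, not alphabetical order.
print(ages.sort_values().tolist())  # ['U', 'J', 'F', 'A', 'A']

# ordered=True also allows comparisons against a category.
print((ages >= "F").tolist())  # [True, False, False, True, True]
```

The same ordering is what makes the boxplot rows below come out in a sensible age sequence instead of alphabetically.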
+""" + +blackbirds.Age = pd.Categorical(blackbirds.Age, categories=["U","J","F","A"]) + +fig, axs = plt.subplots(2,1,sharex=True) +sns.boxplot(x="Wing",y="Age",data=blackbirds, ax=axs[0]) +sns.boxplot(x="Weight",y="Age",data=blackbirds, ax=axs[1]); + +"""## Time series + +Let's look at how weight and wing span have varied over time +""" + +# A groupby by itself doesn't do very much +blackbirds.groupby(by="Year") + +blackbirds.groupby(by="Year").mean() + +blackbirds.groupby(by="Year").mean().plot(); \ No newline at end of file diff --git a/03 Titanic.ipynb b/03 Titanic.ipynb deleted file mode 100644 index 287c64a..0000000 --- a/03 Titanic.ipynb +++ /dev/null @@ -1,155 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 03 Titanic\n", - "\n", - "![Titanic](https://pmcvariety.files.wordpress.com/2017/04/titanic.jpg?w=1000&h=563&crop=1)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import matplotlib.pyplot as plt\n", - "import pandas as pd\n", - "import seaborn as sns\n", - "%matplotlib inline" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "## Tab separated values" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "titanic = pd.read_csv(\"https://raw.githubusercontent.com/adaapp/dav-introductionToPandas/master/titanic.txt\", sep='\\t')\n", - "titanic.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "## Setting ordinal categories" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "titanic.Class = pd.Categorical(titanic.Class, categories=[\"Crew\",\"3\",\"2\",\"1\"])" - ] - }, - { - 
"cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "## Seaborn's catplot" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": false, - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "sns.catplot(data=titanic, x=\"Sex\", hue=\"Class\", kind='count', col=\"Survived\");" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from scipy import stats" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "## A non-parametric test" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "stats.mannwhitneyu(titanic.loc[titanic.Survived == \"Alive\"].Paid.dropna(),\n", - " titanic.loc[titanic.Survived == \"Dead\"].Paid.dropna(), alternative='greater')" - ] - } - ], - "metadata": { - "celltoolbar": "Slideshow", - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.6" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/03_titanic.py b/03_titanic.py new file mode 100644 index 0000000..1c6f356 --- /dev/null +++ b/03_titanic.py @@ -0,0 +1,38 @@ +# -*- coding: utf-8 -*- +"""03 Titanic.ipynb + +Automatically generated by Colaboratory. 
+ +Original file is located at + https://colab.research.google.com/github/adaapp/dav-introductionToPandas/blob/master/03%20Titanic.ipynb + +# 03 Titanic + +![Titanic](https://pmcvariety.files.wordpress.com/2017/04/titanic.jpg?w=1000&h=563&crop=1) +""" + +# Commented out IPython magic to ensure Python compatibility. +import matplotlib.pyplot as plt +import pandas as pd +import seaborn as sns +# %matplotlib inline + +"""## Tab separated values""" + +titanic = pd.read_csv("https://raw.githubusercontent.com/adaapp/dav-introductionToPandas/master/titanic.txt", sep='\t') +titanic.head() + +"""## Setting ordinal categories""" + +titanic.Class = pd.Categorical(titanic.Class, categories=["Crew","3","2","1"]) + +"""## Seaborn's catplot""" + +sns.catplot(data=titanic, x="Sex", hue="Class", kind='count', col="Survived"); + +from scipy import stats + +"""## A non-parametric test""" + +stats.mannwhitneyu(titanic.loc[titanic.Survived == "Alive"].Paid.dropna(), + titanic.loc[titanic.Survived == "Dead"].Paid.dropna(), alternative='greater') \ No newline at end of file diff --git a/03b Toytanic.ipynb b/03b Toytanic.ipynb deleted file mode 100644 index 99ec6af..0000000 --- a/03b Toytanic.ipynb +++ /dev/null @@ -1,142 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "# 03b Toytanic\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "source": [ - "The usual imports" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "import matplotlib.pyplot as plt\n", - "import numpy as np\n", - "import pandas as pd\n", - "import seaborn as sns\n", - "%matplotlib inline" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "Generate n random ages, normally distributed around 20 with 
standard deviation 3, and round them to the nearest integer:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "n = 5\n", - "ages = np.random.normal(loc=20,scale=3,size=n).round()\n", - "ages" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "Generate a list of \"Alive\" or \"Dead\", the probability being p:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "p = 0.5\n", - "surv = [\"Dead\" if np.random.random() > p else \"Alive\" for _ in range(n)]\n", - "surv" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "Make a `pandas` dataframe by giving a dictionary of column headings and data:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "toytanic = pd.DataFrame(data={'Age': ages,'Survived':surv})\n", - "toytanic" - ] - } - ], - "metadata": { - "celltoolbar": "Slideshow", - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.6" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/03b_toytanic.py b/03b_toytanic.py new file mode 100644 index 0000000..1881039 --- /dev/null +++ b/03b_toytanic.py @@ -0,0 +1,36 @@ +# -*- coding: utf-8 -*- +"""03b Toytanic.ipynb + +Automatically generated by Colaboratory. 
+ +Original file is located at + https://colab.research.google.com/github/adaapp/dav-introductionToPandas/blob/master/03b%20Toytanic.ipynb + +# 03b Toytanic + +The usual imports +""" + +# Commented out IPython magic to ensure Python compatibility. +import matplotlib.pyplot as plt +import numpy as np +import pandas as pd +import seaborn as sns +# %matplotlib inline + +"""Generate n random ages, normally distributed around 20 with standard deviation 3, and round them to the nearest integer:""" + +n = 5 +ages = np.random.normal(loc=20,scale=3,size=n).round() +ages + +"""Generate a list of "Alive" or "Dead", the probability being p:""" + +p = 0.5 +surv = ["Dead" if np.random.random() > p else "Alive" for _ in range(n)] +surv + +"""Make a `pandas` dataframe by giving a dictionary of column headings and data:""" + +toytanic = pd.DataFrame(data={'Age': ages,'Survived':surv}) +toytanic \ No newline at end of file diff --git a/04 Baby names.ipynb b/04 Baby names.ipynb deleted file mode 100644 index d034fe9..0000000 --- a/04 Baby names.ipynb +++ /dev/null @@ -1,466 +0,0 @@ -{ - "nbformat": 4, - "nbformat_minor": 0, - "metadata": { - "celltoolbar": "Slideshow", - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.6" - }, - "colab": { - "name": "04 Baby names.ipynb", - "provenance": [] - } - }, - "cells": [ - { - "cell_type": "code", - "metadata": { - "id": "WHMFChKaYKk0", - "colab_type": "code", - "colab": {} - }, - "source": [ - "import pandas as pd" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "yMMIOe7uYKk6", - "colab_type": "text" - }, - "source": [ - "## Cleaning a messy csv" - ] - }, - { - "cell_type": "code", - "metadata": { 
- "id": "9YqoK-hsYKk7", - "colab_type": "code", - "colab": {} - }, - "source": [ - "#g17 = pd.read_csv('2017girlsnames.csv', thousands=',')\n", - "g17 = pd.read_csv('https://raw.githubusercontent.com/adaapp/dav-introductionToPandas/master/2017girlsnames.csv', thousands=',')\n", - "g17 = g17.drop(columns=\"Unnamed: 2\").dropna()\n", - "g17.Count = g17.Count.astype(int)\n", - "g17.head()\n" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "e8G9kiDYYKk_", - "colab_type": "text" - }, - "source": [ - "## Setting an index and sorting on it" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "LuDfNrSQYKlA", - "colab_type": "code", - "colab": {} - }, - "source": [ - "g17 = g17.set_index(g17.Name).drop(columns=[\"Name\"]).sort_index()\n", - "g17" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "7rTCLOFCYKlD", - "colab_type": "text" - }, - "source": [ - "## Making a new, constant, column" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "3K7obDbpYKlD", - "colab_type": "code", - "colab": {} - }, - "source": [ - "g17[\"Gender\"]=\"Girl\"\n", - "g17" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "KhCcfXc4YKlI", - "colab_type": "text" - }, - "source": [ - "## Deal with the thousands comma" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "BdHQbSrGYKlK", - "colab_type": "code", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 175 - }, - "outputId": "e763a0df-c9dc-4259-b013-0d5126b765a9" - }, - "source": [ - "#b17 = pd.read_csv(\"2017boysnames.csv\", thousands=',').drop(columns=[\"Unnamed: 2\"]).dropna()\n", - "b17 = pd.read_csv(\"https://raw.githubusercontent.com/adaapp/dav-introductionToPandas/master/2017boysnames.csv\", thousands=',').drop(columns=[\"Unnamed: 2\"]).dropna()\n", - "b17" - ], - "execution_count": 1, - "outputs": [ - { - "output_type": "error", - 
"ename": "NameError", - "evalue": "ignored", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mb17\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_csv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"https://raw.githubusercontent.com/adaapp/dav-introductionToPandas/master/2017boysnames.csv\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mthousands\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m','\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdrop\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"Unnamed: 2\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdropna\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0mb17\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;31mNameError\u001b[0m: name 'pd' is not defined" - ] - } - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "f4Ati_sUYKlO", - "colab_type": "text" - }, - "source": [ - "## Converting a column to integer" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "hBFLosSiYKlP", - "colab_type": "code", - "colab": {} - }, - "source": [ - "b17.Count = b17.Count.astype(int)\n", - "b17" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "7Lol62afYKlT", - "colab_type": "code", - "colab": {} - }, - "source": [ - "b17 = b17.set_index(b17.Name).drop(columns=[\"Name\"]).sort_index()\n", - "b17" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "HkHadQ9eYKlX", - "colab_type": "code", - "colab": {} - }, - 
"source": [ - "b17[\"Gender\"] = \"Boy\"\n", - "b17" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "EPszu7BdYKlc", - "colab_type": "text" - }, - "source": [ - "## Combining data frames - concatenation" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "1bKpfxb0YKle", - "colab_type": "code", - "colab": {} - }, - "source": [ - "pd.concat([g17,b17])" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "ssqg5qK1YKlj", - "colab_type": "code", - "colab": {} - }, - "source": [ - "names17 = pd.concat([g17,b17])" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "QKc10WGzYKlm", - "colab_type": "code", - "colab": {} - }, - "source": [ - "names17 = names17.sort_index()\n", - "names17" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "fzDho9-FYKlp", - "colab_type": "code", - "colab": {} - }, - "source": [ - "#g16 = pd.read_csv(\"2016girlsnames.csv\", thousands=',')\n", - "g16 = pd.read_csv(\"https://raw.githubusercontent.com/adaapp/dav-introductionToPandas/master/2016girlsnames.csv\", thousands=',')\n", - "g16.head()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "D7HRmEFWYKlt", - "colab_type": "code", - "colab": {} - }, - "source": [ - "\n", - "g16 = g16.dropna()\n", - "g16.head()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "tKJuiVNOYKlw", - "colab_type": "code", - "colab": {} - }, - "source": [ - "g16.Count = g16.Count.astype(int)\n", - "g16" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "scrolled": false, - "id": "QKXh7DwJYKlz", - "colab_type": "code", - "colab": {} - }, - "source": [ - "g16 = g16.dropna()\n", - "g16.Count = g16.Count.astype(int)\n", - "g16" - ], - "execution_count": 0, - 
"outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "muxFPtmSYKl1", - "colab_type": "code", - "colab": {} - }, - "source": [ - "#g15 = pd.read_csv(\"2015girlsnames.csv\", thousands=',')\n", - "g15 = pd.read_csv(\"https://raw.githubusercontent.com/adaapp/dav-introductionToPandas/master/2015girlsnames.csv\", thousands=',')\n", - "g15" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "D62XD6AkYKl5", - "colab_type": "text" - }, - "source": [ - "## Drop a single row of data" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "vzmErueZYKl6", - "colab_type": "code", - "colab": {} - }, - "source": [ - "g15.drop(7477)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "3eY-VNrpYKl-", - "colab_type": "code", - "colab": {} - }, - "source": [ - "g15 = g15.drop(columns=[\"Unnamed: 2\"])\n", - "g15 = g15.dropna()\n", - "g15.Count = g15.Count.astype(int)\n", - "g15" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "KPJHf1XFYKmG", - "colab_type": "text" - }, - "source": [ - "## Join two dataframes on a common column\n", - "\n", - "What is the difference between an *inner* join and an *outer* join?" 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "9R5YlcZfYKmH", - "colab_type": "code", - "colab": {} - }, - "source": [ - "pd.merge(g15,g16,how='inner',on='Name',suffixes=[\"15\",\"16\"])" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "ErwGjsIRYKmO", - "colab_type": "code", - "colab": {} - }, - "source": [ - "pd.merge(g15,g16,how='outer',on='Name',suffixes=[\"15\",\"16\"])" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "fMndcwfxYKmV", - "colab_type": "code", - "colab": {} - }, - "source": [ - "girls1516 = pd.merge(g15,g16,how='inner',on='Name',suffixes=[\"15\",\"16\"])\n", - "girls1516" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "J0crlPvFYKmX", - "colab_type": "code", - "colab": {} - }, - "source": [ - "girls1516 = girls1516.set_index(\"Name\").sort_index()\n", - "girls1516" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "UtPRTjx6YKma", - "colab_type": "text" - }, - "source": [ - "## The filter function with like" - ] - }, - { - "cell_type": "code", - "metadata": { - "scrolled": false, - "id": "Jr3BFy_fYKmc", - "colab_type": "code", - "colab": {} - }, - "source": [ - "pd.options.display.max_rows = 10000\n", - "girls1516.filter(like='-', axis=0)" - ], - "execution_count": 0, - "outputs": [] - } - ] -} \ No newline at end of file diff --git a/04_baby_names.py b/04_baby_names.py new file mode 100644 index 0000000..4b52db5 --- /dev/null +++ b/04_baby_names.py @@ -0,0 +1,101 @@ +# -*- coding: utf-8 -*- +"""04 Baby names.ipynb + +Automatically generated by Colaboratory. 
+ +Original file is located at + https://colab.research.google.com/github/adaapp/dav-introductionToPandas/blob/master/04%20Baby%20names.ipynb +""" + +import pandas as pd + +"""## Cleaning a messy csv""" + +#g17 = pd.read_csv('2017girlsnames.csv', thousands=',') +g17 = pd.read_csv('https://raw.githubusercontent.com/adaapp/dav-introductionToPandas/master/2017girlsnames.csv', thousands=',') +g17 = g17.drop(columns="Unnamed: 2").dropna() +g17.Count = g17.Count.astype(int) +g17.head() + +"""## Setting an index and sorting on it""" + +g17 = g17.set_index(g17.Name).drop(columns=["Name"]).sort_index() +g17 + +"""## Making a new, constant, column""" + +g17["Gender"]="Girl" +g17 + +"""## Deal with the thousands comma""" + +#b17 = pd.read_csv("2017boysnames.csv", thousands=',').drop(columns=["Unnamed: 2"]).dropna() +b17 = pd.read_csv("https://raw.githubusercontent.com/adaapp/dav-introductionToPandas/master/2017boysnames.csv", thousands=',').drop(columns=["Unnamed: 2"]).dropna() +b17 + +"""## Converting a column to integer""" + +b17.Count = b17.Count.astype(int) +b17 + +b17 = b17.set_index(b17.Name).drop(columns=["Name"]).sort_index() +b17 + +b17["Gender"] = "Boy" +b17 + +"""## Combining data frames - concatenation""" + +pd.concat([g17,b17]) + +names17 = pd.concat([g17,b17]) + +names17 = names17.sort_index() +names17 + +#g16 = pd.read_csv("2016girlsnames.csv", thousands=',') +g16 = pd.read_csv("https://raw.githubusercontent.com/adaapp/dav-introductionToPandas/master/2016girlsnames.csv", thousands=',') +g16.head() + +g16 = g16.dropna() +g16.head() + +g16.Count = g16.Count.astype(int) +g16 + +g16 = g16.dropna() +g16.Count = g16.Count.astype(int) +g16 + +#g15 = pd.read_csv("2015girlsnames.csv", thousands=',') +g15 = pd.read_csv("https://raw.githubusercontent.com/adaapp/dav-introductionToPandas/master/2015girlsnames.csv", thousands=',') +g15 + +"""## Drop a single row of data""" + +g15.drop(7477) + +g15 = g15.drop(columns=["Unnamed: 2"]) +g15 = g15.dropna() +g15.Count = 
g15.Count.astype(int) +g15 + +"""## Join two dataframes on a common column + +What is the difference between an *inner* join and an *outer* join? +""" + +pd.merge(g15,g16,how='inner',on='Name',suffixes=["15","16"]) + +pd.merge(g15,g16,how='outer',on='Name',suffixes=["15","16"]) + +girls1516 = pd.merge(g15,g16,how='inner',on='Name',suffixes=["15","16"]) +girls1516 + +girls1516 = girls1516.set_index("Name").sort_index() +girls1516 + +"""## The filter function with like""" + +pd.options.display.max_rows = 10000 +girls1516.filter(like='-', axis=0) \ No newline at end of file diff --git a/04a Exam results.ipynb b/04a Exam results.ipynb deleted file mode 100644 index ca4c017..0000000 --- a/04a Exam results.ipynb +++ /dev/null @@ -1,875 +0,0 @@ -{ - "nbformat": 4, - "nbformat_minor": 0, - "metadata": { - "celltoolbar": "Slideshow", - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.6" - }, - "colab": { - "name": "04a Exam results.ipynb", - "provenance": [] - } - }, - "cells": [ - { - "cell_type": "code", - "metadata": { - "id": "xDHfAp43am1p", - "colab_type": "code", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 180 - }, - "outputId": "ea8d2c4c-8e71-44aa-e48c-d703f932082b" - }, - "source": [ - "!pip install names" - ], - "execution_count": 2, - "outputs": [ - { - "output_type": "stream", - "text": [ - "Collecting names\n", - "\u001b[?25l Downloading https://files.pythonhosted.org/packages/44/4e/f9cb7ef2df0250f4ba3334fbdabaa94f9c88097089763d8e85ada8092f84/names-0.3.0.tar.gz (789kB)\n", - "\u001b[K |████████████████████████████████| 798kB 9.5MB/s \n", - "\u001b[?25hBuilding wheels for collected packages: names\n", - " Building wheel for names (setup.py) ... 
\u001b[?25l\u001b[?25hdone\n", - " Created wheel for names: filename=names-0.3.0-cp36-none-any.whl size=803688 sha256=2a18d9e11ad025a8cbca453413c364018f40f2fbef8f96f14c8fce3883e71d3b\n", - " Stored in directory: /root/.cache/pip/wheels/f9/a5/e1/be3e0aaa6fa285575078fa2aafd9959b45bdbc8de8a6803aeb\n", - "Successfully built names\n", - "Installing collected packages: names\n", - "Successfully installed names-0.3.0\n" - ], - "name": "stdout" - } - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "V6ZiSHNAagTJ", - "colab_type": "code", - "colab": {} - }, - "source": [ - "import numpy as np\n", - "import pandas as pd\n", - "import names" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "9U6hVwNuagTO", - "colab_type": "text" - }, - "source": [ - "# Fake exam results" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "2VH2c87qagTP", - "colab_type": "text" - }, - "source": [ - "Generate some random names" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "PPXcGChIagTR", - "colab_type": "code", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 33 - }, - "outputId": "55b7feb0-de1f-43a7-c271-2cd9d7a2d542" - }, - "source": [ - "names.get_full_name()" - ], - "execution_count": 4, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - "'Mary Oakden'" - ] - }, - "metadata": { - "tags": [] - }, - "execution_count": 4 - } - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "EBqEHLN5agTV", - "colab_type": "code", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 98 - }, - "outputId": "4904bd76-ed74-4039-c495-ef99acbc2e71" - }, - "source": [ - "[names.get_full_name() for _ in range(5)]" - ], - "execution_count": 5, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - "['Jana Bryan',\n", - " 'Jerry Davis',\n", - " 'Sonja Shearer',\n", - " 'Ashley Collins',\n", - " 'Marion Paxton']" - ] - }, - 
"metadata": { - "tags": [] - }, - "execution_count": 5 - } - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "KmdBFDsOagTY", - "colab_type": "text" - }, - "source": [ - "Make some (convincing) random numbers" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "NDKF0u8OagTY", - "colab_type": "code", - "colab": {} - }, - "source": [ - "np.random.normal(loc=50, scale=10)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "iL7D_9NtagTb", - "colab_type": "code", - "colab": {} - }, - "source": [ - "np.random.normal(loc=50, scale=10)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "D1Gbl2mnagTe", - "colab_type": "code", - "colab": {} - }, - "source": [ - "[np.random.normal(loc=50, scale=10) for _ in range(5)]" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "IUTj1K3IagTg", - "colab_type": "code", - "colab": {} - }, - "source": [ - "[int(np.random.normal(loc=50, scale=10)) for _ in range(5)]" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "72gsHLspagTk", - "colab_type": "text" - }, - "source": [ - "Build a list of names, some maths exam results and some physics exam results" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "OdaZ2oYTagTl", - "colab_type": "code", - "colab": {} - }, - "source": [ - "n = 20\n", - "student = [names.get_full_name() for _ in range(n)]\n", - "maths = [int(np.random.normal(loc=50, scale=10)) for _ in range(n)]\n", - "physics = [m - int(np.random.normal(loc=5, scale=10)) for m in maths]\n", - "\n", - "results = pd.DataFrame(data={\n", - " \"Student\": student,\n", - " \"Maths\": maths,\n", - " \"Physics\": physics\n", - "})\n", - "results" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NZtpCWT4agTq", - "colab_type": "text" - }, - "source": [ - 
"## Indexing with loc, iloc" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "HDGOy6kVagTr", - "colab_type": "code", - "colab": {} - }, - "source": [ - "results.loc[5]" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "_UTya9ZAagTv", - "colab_type": "code", - "colab": {} - }, - "source": [ - "results.loc[5, \"Physics\"]" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "DewQHlfragTz", - "colab_type": "code", - "colab": {} - }, - "source": [ - "results.iloc[0,0]" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "KyDMLQ8pagT2", - "colab_type": "code", - "colab": {} - }, - "source": [ - "results.iloc[5, \"Maths\"]" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "sE-rS0xmagT4", - "colab_type": "code", - "colab": {} - }, - "source": [ - "results.iloc[0:2,0:2]" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "CLPdtNYdagT6", - "colab_type": "code", - "colab": {} - }, - "source": [ - "results.iloc[3:8]" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "Y56QVCuMagT8", - "colab_type": "code", - "colab": {} - }, - "source": [ - "results.iloc[3:8,0:2]" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "XAuwGS_VagT-", - "colab_type": "text" - }, - "source": [ - "Set the obvious index" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "I4Q4Js-HagUA", - "colab_type": "code", - "colab": {} - }, - "source": [ - "resultsV2 = results.set_index(results.Student)\n", - "resultsV2" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "tXpFCVs3agUD", - "colab_type": "text" - }, - "source": [ - "Drop the old column if you like" - ] - }, - { - 
"cell_type": "code", - "metadata": { - "id": "mXIy1mLiagUD", - "colab_type": "code", - "colab": {} - }, - "source": [ - "resultsV2.drop(columns=\"Student\")" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "scrolled": false, - "id": "IyfZ4rJmagUG", - "colab_type": "code", - "colab": {} - }, - "source": [ - "resultsV2" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "SsYbdjKLagUK", - "colab_type": "code", - "colab": {} - }, - "source": [ - "resultsV3 = resultsV2.drop(columns=\"Student\")\n", - "resultsV3" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "F48CR-ZsagUR", - "colab_type": "text" - }, - "source": [ - "Locating by index" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "OAweMEpzagUT", - "colab_type": "code", - "colab": {} - }, - "source": [ - "resultsV3.loc[\"Alphonse Bolivar\"]" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "jKSoJKkvagUX", - "colab_type": "code", - "colab": {} - }, - "source": [ - "resultsV3.loc[\"Alphonse Bolivar\":\"Will Ross\"].sort_values(by=\"Maths\")[\"Physics\"]" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "y_kewhWFagUa", - "colab_type": "text" - }, - "source": [ - "## The filter function with regex" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "WyXrUNEOagUb", - "colab_type": "code", - "colab": {} - }, - "source": [ - "resultsV3.filter(like='t')" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "emGksH4xagUe", - "colab_type": "code", - "colab": {} - }, - "source": [ - "resultsV3.filter(regex='C.r', axis=0)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "eyJmMpyYagUg", - "colab_type": "text" - }, - "source": [ - "## The query 
function" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "OCbo2kIbagUh", - "colab_type": "code", - "colab": {} - }, - "source": [ - "resultsV3.query(\"Student < 'D'\")" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "c7959fO1agUk", - "colab_type": "code", - "colab": {} - }, - "source": [ - "resultsV3.query(\"Maths > Physics\")" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "XvOXtZyragUn", - "colab_type": "code", - "colab": {} - }, - "source": [ - "resultsV3.index.str.split(\" \", expand=True)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "Y-QIrNAwagUq", - "colab_type": "code", - "colab": {} - }, - "source": [ - "resultsV4 = resultsV3.set_index(resultsV3.index.str.split(\" \", expand=True))\n", - "resultsV4" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "K0tg3s0sagUs", - "colab_type": "code", - "colab": {} - }, - "source": [ - "resultsV4.sort_index(level=1)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "1hkJWoDtagUw", - "colab_type": "code", - "colab": {} - }, - "source": [ - "resultsV4.sort_index(level=1)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "hKNUli96agUz", - "colab_type": "code", - "colab": {} - }, - "source": [ - "resultsV5 = resultsV4.sort_index(level=1)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "pQd2J50GagU3", - "colab_type": "code", - "colab": {} - }, - "source": [ - "resultsV5" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "t4SsDf0fagU5", - "colab_type": "code", - "colab": {} - }, - "source": [ - "resultsV6 = resultsV5.swaplevel()\n", - "resultsV6" - ], - "execution_count": 0, - "outputs": [] - }, - 
{ - "cell_type": "code", - "metadata": { - "id": "SA-ie6pzagU8", - "colab_type": "code", - "colab": {} - }, - "source": [ - "resultsV6.loc[\"P\":\"W\"]" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "pxBUgjADagU9", - "colab_type": "code", - "colab": {} - }, - "source": [ - "resultsV6.sort_values(by=\"Maths\", ascending=False)[:3]" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "35vuSZlUagU_", - "colab_type": "text" - }, - "source": [ - "## The numpy where function" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "NLoB_MmqagVB", - "colab_type": "code", - "colab": {} - }, - "source": [ - "resultsV6[\"Best\"] = np.where(resultsV6.Maths > resultsV6.Physics, \"Maths\", \"Physics\")\n", - "resultsV6" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "QpFmrck1agVH", - "colab_type": "code", - "colab": {} - }, - "source": [ - "resultsV6[\"Combined\"] = resultsV6[\"Maths\"]+resultsV6[\"Physics\"]\n", - "resultsV6.sort_values(by=\"Combined\", ascending=False)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "TU_QxtcVagVK", - "colab_type": "text" - }, - "source": [ - "## Categorising data with cut" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "UOgNnaHTagVM", - "colab_type": "code", - "colab": {} - }, - "source": [ - "pd.cut(resultsV6.Combined, bins=[0,30,50,70,95,1000], labels=[\"Fail\",\"Third\",\"2:2\",\"2:1\",\"First\"])" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "llUpRtG2agVQ", - "colab_type": "code", - "colab": {} - }, - "source": [ - "resultsV6[\"Class\"] = pd.cut(resultsV6.Combined, bins=[0,30,50,70,95,1000], labels=[\"Fail\",\"Third\",\"2:2\",\"2:1\",\"First\"])" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": 
"5krKk_NtagVR", - "colab_type": "code", - "colab": {} - }, - "source": [ - "resultsV6" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "Uv0MeFQjagVW", - "colab_type": "code", - "colab": {} - }, - "source": [ - "import seaborn as sns" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "-Klj-x1fagVZ", - "colab_type": "code", - "colab": {} - }, - "source": [ - "resultsV6.Class.cat.categories" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "faypV0NfagVh", - "colab_type": "code", - "colab": {} - }, - "source": [ - "import seaborn as sns" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "bqP7Gq1yagVk", - "colab_type": "code", - "colab": {} - }, - "source": [ - "sns.scatterplot(data=resultsV6, x=\"Maths\", y=\"Physics\", hue=\"Class\", style=\"Best\");" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "QUoOUrIKagVm", - "colab_type": "text" - }, - "source": [ - "## Adding labels in pyplot/seaborn" - ] - }, - { - "cell_type": "code", - "metadata": { - "scrolled": true, - "id": "uE8EWb6magVo", - "colab_type": "code", - "colab": {} - }, - "source": [ - "ax = sns.scatterplot(data=resultsV6, x=\"Maths\", y=\"Physics\", hue=\"Class\", style=\"Best\")\n", - "for surname in resultsV6.index:\n", - " ax.annotate(surname[1], (resultsV6.loc[surname, \"Maths\"], resultsV6.loc[surname, \"Physics\"]))" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "scrolled": true, - "id": "y24vrhkgagVs", - "colab_type": "code", - "colab": {} - }, - "source": [ - "ax = sns.scatterplot(data=resultsV6, x=\"Maths\", y=\"Physics\", hue=\"Class\", style=\"Best\")\n", - "for surname in resultsV6.index:\n", - " ax.text(resultsV6.loc[surname, \"Maths\"], resultsV6.loc[surname, \"Physics\"], surname[1])" 
- ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "scrolled": true, - "id": "grxXGsB-agVx", - "colab_type": "code", - "colab": {} - }, - "source": [ - "ax = sns.scatterplot(data=resultsV6, x=\"Maths\", y=\"Physics\", hue=\"Class\", style=\"Best\")\n", - "for fullname in resultsV6.index:\n", - " ax.text(resultsV6.loc[fullname, \"Maths\"], resultsV6.loc[fullname, \"Physics\"], fullname[1], rotation=30, va=\"center\")" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "WHK04Ld1agVz", - "colab_type": "code", - "colab": {} - }, - "source": [ - "fg = sns.catplot(data=resultsV6, x=\"Class\", kind=\"count\")\n", - "ax = fg.ax\n", - "for x,bar in enumerate(ax.patches):\n", - " if(bar.get_height()>0):\n", - " ax.annotate(int(bar.get_height()),(x,bar.get_height()+0.35))" - ], - "execution_count": 0, - "outputs": [] - } - ] -} \ No newline at end of file diff --git a/04a_exam_results.py b/04a_exam_results.py new file mode 100644 index 0000000..09a5d2d --- /dev/null +++ b/04a_exam_results.py @@ -0,0 +1,159 @@ +# -*- coding: utf-8 -*- +"""04a Exam results.ipynb + +Automatically generated by Colaboratory. 
+ +Original file is located at + https://colab.research.google.com/github/adaapp/dav-introductionToPandas/blob/master/04a%20Exam%20results.ipynb +""" + +!pip install names + +import numpy as np +import pandas as pd +import names + +"""# Fake exam results + +Generate some random names +""" + +names.get_full_name() + +[names.get_full_name() for _ in range(5)] + +"""Make some (convincing) random numbers""" + +np.random.normal(loc=50, scale=10) + +np.random.normal(loc=50, scale=10) + +[np.random.normal(loc=50, scale=10) for _ in range(5)] + +[int(np.random.normal(loc=50, scale=10)) for _ in range(5)] + +"""Build a list of names, some maths exam results and some physics exam results""" + +n = 20 +student = [names.get_full_name() for _ in range(n)] +maths = [int(np.random.normal(loc=50, scale=10)) for _ in range(n)] +physics = [m - int(np.random.normal(loc=5, scale=10)) for m in maths] + +results = pd.DataFrame(data={ + "Student": student, + "Maths": maths, + "Physics": physics +}) +results + +"""## Indexing with loc, iloc""" + +results.loc[5] + +results.loc[5, "Physics"] + +results.iloc[0,0] + +results.iloc[5, "Maths"] + +results.iloc[0:2,0:2] + +results.iloc[3:8] + +results.iloc[3:8,0:2] + +"""Set the obvious index""" + +resultsV2 = results.set_index(results.Student) +resultsV2 + +"""Drop the old column if you like""" + +resultsV2.drop(columns="Student") + +resultsV2 + +resultsV3 = resultsV2.drop(columns="Student") +resultsV3 + +"""Locating by index""" + +resultsV3.loc["Alphonse Bolivar"] + +resultsV3.loc["Alphonse Bolivar":"Will Ross"].sort_values(by="Maths")["Physics"] + +"""## The filter function with regex""" + +resultsV3.filter(like='t') + +resultsV3.filter(regex='C.r', axis=0) + +"""## The query function""" + +resultsV3.query("Student < 'D'") + +resultsV3.query("Maths > Physics") + +resultsV3.index.str.split(" ", expand=True) + +resultsV4 = resultsV3.set_index(resultsV3.index.str.split(" ", expand=True)) +resultsV4 + +resultsV4.sort_index(level=1) + 
+resultsV4.sort_index(level=1) + +resultsV5 = resultsV4.sort_index(level=1) + +resultsV5 + +resultsV6 = resultsV5.swaplevel() +resultsV6 + +resultsV6.loc["P":"W"] + +resultsV6.sort_values(by="Maths", ascending=False)[:3] + +"""## The numpy where function""" + +resultsV6["Best"] = np.where(resultsV6.Maths > resultsV6.Physics, "Maths", "Physics") +resultsV6 + +resultsV6["Combined"] = resultsV6["Maths"]+resultsV6["Physics"] +resultsV6.sort_values(by="Combined", ascending=False) + +"""## Categorising data with cut""" + +pd.cut(resultsV6.Combined, bins=[0,30,50,70,95,1000], labels=["Fail","Third","2:2","2:1","First"]) + +resultsV6["Class"] = pd.cut(resultsV6.Combined, bins=[0,30,50,70,95,1000], labels=["Fail","Third","2:2","2:1","First"]) + +resultsV6 + +import seaborn as sns + +resultsV6.Class.cat.categories + +import seaborn as sns + +sns.scatterplot(data=resultsV6, x="Maths", y="Physics", hue="Class", style="Best"); + +"""## Adding labels in pyplot/seaborn""" + +ax = sns.scatterplot(data=resultsV6, x="Maths", y="Physics", hue="Class", style="Best") +for surname in resultsV6.index: + ax.annotate(surname[1], (resultsV6.loc[surname, "Maths"], resultsV6.loc[surname, "Physics"])) + +ax = sns.scatterplot(data=resultsV6, x="Maths", y="Physics", hue="Class", style="Best") +for surname in resultsV6.index: + ax.text(resultsV6.loc[surname, "Maths"], resultsV6.loc[surname, "Physics"], surname[1]) + +ax = sns.scatterplot(data=resultsV6, x="Maths", y="Physics", hue="Class", style="Best") +for fullname in resultsV6.index: + ax.text(resultsV6.loc[fullname, "Maths"], resultsV6.loc[fullname, "Physics"], fullname[1], rotation=30, va="center") + +fg = sns.catplot(data=resultsV6, x="Class", kind="count") +ax = fg.ax +for x,bar in enumerate(ax.patches): + if(bar.get_height()>0): + ax.annotate(int(bar.get_height()),(x,bar.get_height()+0.35)) \ No newline at end of file diff --git a/05 Hypothesis Tests.ipynb b/05 Hypothesis Tests.ipynb deleted file mode 100644 index a4dadab..0000000 --- 
a/05 Hypothesis Tests.ipynb +++ /dev/null @@ -1,682 +0,0 @@ -{ - "nbformat": 4, - "nbformat_minor": 0, - "metadata": { - "celltoolbar": "Slideshow", - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.6" - }, - "colab": { - "name": "05 Hypothesis Tests.ipynb", - "provenance": [] - } - }, - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "azgP28ebdpQI", - "colab_type": "text" - }, - "source": [ - "# Hypothesis testing with `scipy` `stats`\n", - "\n", - "- [Chi squared](#Chi-squared)\n", - "- [t test](#t-test)\n", - "- Binomial\n", - " - [Distribution](#Binomial)\n", - " - [Hypothesis test](#A-binomial-hypothesis-test)\n", - " - [Using the critical region](#Critical-region)\n", - " - [Using the `binom_test` function](#Alternatively)\n", - " - [Interactive](#Using-interact)" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "3w4QxLzvdpQL", - "colab_type": "code", - "colab": {} - }, - "source": [ - "import matplotlib.pyplot as plt\n", - "import numpy as np\n", - "import pandas as pd\n", - "import seaborn as sns\n", - "from scipy import stats\n", - "import warnings\n", - "warnings.simplefilter(action='ignore', category=FutureWarning)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "bdXSTxrAdpQO", - "colab_type": "text" - }, - "source": [ - "## Chi squared" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "QX74pV2PdpQP", - "colab_type": "code", - "colab": {} - }, - "source": [ - "titanic = pd.read_csv('https://raw.githubusercontent.com/adaapp/dav-introductionToPandas/master/titanic.txt', sep='\\t')" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - 
"metadata": { - "id": "pu-gL9AIdpQS", - "colab_type": "text" - }, - "source": [ - "The Chi square test compares the observed coincidence of two categorical variables with what the expected coincidence would be if they were independent.\n", - "\n", - "Going back to the Titanic data, we can see the observed coincidence of `Sex` and `Survived` in a *contingency table* (what `pandas` calls a `crosstab`):" - ] - }, - { - "cell_type": "code", - "metadata": { - "scrolled": true, - "id": "YU1GCQTYdpQS", - "colab_type": "code", - "colab": {} - }, - "source": [ - "pd.crosstab(titanic.Sex, titanic.Survived, margins=True)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "4QNdfxhZdpQV", - "colab_type": "text" - }, - "source": [ - "We can pass that table to the `contingency.expected_freq` function from `scipy.stats` to see what numbers we'd expected if the two variables were independent:" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "8BuuBHNedpQW", - "colab_type": "code", - "colab": {} - }, - "source": [ - "stats.contingency.expected_freq(pd.crosstab(titanic.Sex, titanic.Survived))\n" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "LvdwV6rkdpQa", - "colab_type": "text" - }, - "source": [ - "So it certainly looks like there's something going on. 
We can pass that crosstab into `chi2_contingency` to carry out the hypothesis test with:\n", - "\n", - "$H_0$: The variables `Sex` and `Survived` are independent\n", - "\n", - "$H_1$: There is an association between `Sex` and `Survived`" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "uNyPXUEcdpQa", - "colab_type": "code", - "colab": {} - }, - "source": [ - "titanic_chi2 = stats.chi2_contingency(pd.crosstab(titanic.Sex, titanic.Survived))\n", - "titanic_chi2" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "de_HyvCMdpQe", - "colab_type": "text" - }, - "source": [ - "The first item is the chi square statistic, the second is the p value, and the third is the expected contingency table if the null hypothesis were true.\n", - "\n", - "The high chi square statistic and very low p value strongly suggest that these two variables are *not* independent." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "fzVyft4idpQf", - "colab_type": "text" - }, - "source": [ - "## t test" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "_zIUW4k5dpQf", - "colab_type": "code", - "colab": {} - }, - "source": [ - "from ipywidgets import interact, IntSlider, FloatSlider" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "AbQmXz7TdpQi", - "colab_type": "code", - "colab": {} - }, - "source": [ - "def update(n=30,meanA=50,stdA=1,meanB=50,stdB=1,alpha=0.05):\n", - " # generate two sets of normally distributed data\n", - " groupA = np.random.normal(meanA, stdA, n)\n", - " groupB = np.random.normal(meanB, stdB, n)\n", - " # plot them\n", - " sns.distplot(groupA)\n", - " sns.distplot(groupB)\n", - " # apply an independent t-test\n", - " ttest_result = stats.ttest_ind(groupA,groupB, equal_var=False)\n", - " s = '''\n", - " meanA = {}\n", - " meanB = {}\n", - " H0: meanA = meanB\n", - " H1: meanA <> meanB\n", - " t = {}\n", - " ''' \n", - " if 
(ttest_result.pvalue) <= alpha:\n", - " s+= '''\n", - " p = {} <= {}\n", - " Reject H0 at the {} significance level\n", - " '''\n", - " else:\n", - " s+= '''\n", - " p = {} > {}\n", - " Fail to reject H0 at the {} significance level\n", - " '''\n", - " print(s.format(groupA.mean().round(2),groupB.mean().round(2),ttest_result.statistic, ttest_result.pvalue, alpha, alpha))\n", - "interact(update,\n", - " n=IntSlider(value=30,min=3,max=100,step=1,continuous_update=False),\n", - " meanA=IntSlider(value=50,min=10,max=100,step=1,continuous_update=False),\n", - " stdA=IntSlider(value=1,min=1,max=10,step=1,continuous_update=False),\n", - " meanB=IntSlider(value=50,min=10,max=100,step=1,continuous_update=False),\n", - " stdB=IntSlider(value=1,min=1,max=10,step=1,continuous_update=False),\n", - " alpha=FloatSlider(value=0.05,min=0.01,max=0.1,step=0.01,continuous_update=False)\n", - " );" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "SO4HwR3NdpQl", - "colab_type": "text" - }, - "source": [ - "## Binomial" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "otbJXvBXdpQn", - "colab_type": "text" - }, - "source": [ - "* Use the scipy.stats [binomial](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom.html#scipy.stats.binom) function to generate binomial distributions. Plot them. Make an `interact` to vary the parameters\n", - "* The [binom_test](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.binom_test.html) to write a binomial hypothesis test procedure." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "5oslLFQLdpQp", - "colab_type": "text" - }, - "source": [ - "Generate one random number from a binomial distribution" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "sYpGkZbadpQq", - "colab_type": "code", - "colab": {} - }, - "source": [ - "stats.binom.rvs(n=100,p=1/6)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "kKtDDiLtdpQt", - "colab_type": "text" - }, - "source": [ - "Recall that this represents the number of successes from $n$ independent trials, each with probability of success $p$" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "teC2uuK-dpQu", - "colab_type": "text" - }, - "source": [ - "Let's now get a list of these to plot" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "7BTeQGaadpQw", - "colab_type": "code", - "colab": {} - }, - "source": [ - "binomial = stats.binom.rvs(n=100,p=1/6, size=1000)\n", - "sns.distplot(binomial, bins=20);" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "9bTxU8hudpQy", - "colab_type": "text" - }, - "source": [ - "In theory we could get from 0 to 100 success so let's fix the axes:" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "KvQXcGpWdpQz", - "colab_type": "code", - "colab": {} - }, - "source": [ - "#binomial = stats.binom.rvs(n=100,p=1/6, size=1000)\n", - "binomial = np.random.binomial(100,1/6,size=1000)\n", - "ax = sns.distplot(binomial, bins=10)\n", - "ax.set_xlim(0,100);" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "S43_jDPLdpQ2", - "colab_type": "text" - }, - "source": [ - "This shows how skewed the distribution is. We can get results over 40 successes but very rarely." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ocU2qjRUdpQ3", - "colab_type": "text" - }, - "source": [ - "### A binomial hypothesis test\n", - "There are two ways to approach a binomial hypothesis test:\n", - "\n", - "- find the probability of the observed outcome (or an even less likely outcome) and see if that is less than your chosen alpha\n", - "- identify a *critical region* of outcomes that represent your alpha and then see if the observed outcome is within that region" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "gzlPy0LZdpQ4", - "colab_type": "text" - }, - "source": [ - "### Critical regions\n", - "\n", - "Let's say that you think Dodgy Bob's dice is biased in favour of sixes. You're going to ask him to roll his dice a hundred times. How can we decide in advance what sort of outcome would convince us to reject the null (and safe) hypothesis and assert that the dice is biased?" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "vltRJmkfdpQ5", - "colab_type": "text" - }, - "source": [ - "Let $p$ be the probability of throwing a six on an unbiased dice.\n", - "\n", - "$H_0:p=\\frac{1}{6}\\\\H_1:p>\\frac{1}{6}$\n", - "\n", - "Shall we say $\\alpha=0.05$ is our significance level? In other words, we'll decide the critical region by considering outcomes that have a less than 5% if the dice is *not* dodgy." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Jt6tQ80kdpQ6", - "colab_type": "text" - }, - "source": [ - "The `stats.binom.cdf(k,n,p)` function will give us the cumulative probability of getting up to and including `k` successes from `n` trials with probability `p`.\n", - "\n", - "So for example the probability of getting 10 or fewer sixes from 100 rolls of an unbiased dice is:" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "fIFrpcandpQ6", - "colab_type": "code", - "colab": {} - }, - "source": [ - "stats.binom.cdf(90,100,1/6)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Sw5Ps3VOdpQ8", - "colab_type": "text" - }, - "source": [ - "So if we'd been worried about this dice being biased *against* sixes, that result would have been in our critical region!" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "D7SJWkZkdpQ-", - "colab_type": "text" - }, - "source": [ - "Look how the probability of getting k or fewer successes increases as k goes from 0 to 100:" - ] - }, - { - "cell_type": "code", - "metadata": { - "scrolled": true, - "id": "lv7hdVcidpQ-", - "colab_type": "code", - "colab": {} - }, - "source": [ - "ks = np.arange(101)\n", - "bcdf = [stats.binom.cdf(k,100,1/6) for k in ks]\n", - "sns.lineplot(ks, bcdf);" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "BV4gAn7CdpRB", - "colab_type": "text" - }, - "source": [ - "To identify our critical region, we need to know when k crosses into the 95% region. 
In other words, there's a 95% chance that we get fewer than k successes, or a 5% chance that we get k or more.\n", - "\n", - "`stats.binom` has a function for that too:" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "GiBWGDA0dpRC", - "colab_type": "code", - "colab": {} - }, - "source": [ - "stats.binom.ppf(0.95,100,1/6)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "vsckgI4OdpRG", - "colab_type": "text" - }, - "source": [ - "The critical region starts at 23 then:" - ] - }, - { - "cell_type": "code", - "metadata": { - "scrolled": true, - "id": "04B13albdpRI", - "colab_type": "code", - "colab": {} - }, - "source": [ - "ks = np.arange(101)\n", - "bcdf = [stats.binom.cdf(k,100,1/6) for k in ks]\n", - "ax = sns.lineplot(ks, bcdf)\n", - "ax.axvspan(23,100,color=\"red\",alpha=0.1);" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "2fgJJ5X_dpRP", - "colab_type": "text" - }, - "source": [ - "Just to check:" - ] - }, - { - "cell_type": "code", - "metadata": { - "scrolled": true, - "id": "xUYvuopxdpRQ", - "colab_type": "code", - "colab": {} - }, - "source": [ - "1-stats.binom.cdf(23,100,1/6)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "l9VL6sRWdpRX", - "colab_type": "text" - }, - "source": [ - "So 23 sixes is in our critical region.\n", - "\n", - "Now we can get Dodgy Bob to roll the dice 100 times and if he rolls 23 or more sixes we can reject the null hypothesis at the 5% significance level.\n", - "\n", - "We could then repeat this with as many suspected dodgy dice owners as we like. But note that for every twenty accusations we make, we'd expect one of them to be a false accusation! 
" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "djk2Hc6xdpRZ", - "colab_type": "text" - }, - "source": [ - "### Alternatively\n", - "\n", - "Alternatively, suppose we've already seen Dodgy Bob roll his dice 100 times, and he just got 25 sixes. We can carry out the hypothesis test in one line like this:" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "1fzQjhCPdpRb", - "colab_type": "code", - "colab": {} - }, - "source": [ - "stats.binom_test(25, 100, 1/6)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "-7i3EYfedpRf", - "colab_type": "text" - }, - "source": [ - "This p value is less than our threshold, so we reject the null hypothesis as before." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "RWaEkgYtdpRg", - "colab_type": "text" - }, - "source": [ - "If he'd rolled 21 sixes:" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "bqhct__BdpRi", - "colab_type": "code", - "colab": {} - }, - "source": [ - "stats.binom_test(21, 100, 1/6)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "6RIlBpcGdpRk", - "colab_type": "text" - }, - "source": [ - "This is perhaps an unusually high number of sixes, but the p value is not below our threshold, so we fail to reject the null hypothesis, and leave Dodgy Bob with his dice." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "E3iATbGMdpRm", - "colab_type": "text" - }, - "source": [ - "### Using interact" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "pHcRq_zhdpRm", - "colab_type": "code", - "colab": {} - }, - "source": [ - "from ipywidgets import interact, IntSlider, FloatSlider, Dropdown" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "_4olo54EdpRp", - "colab_type": "code", - "colab": {} - }, - "source": [ - "\n", - "def update(n,p,alpha,tails):\n", - " ks = np.arange(n+1)\n", - " bcdf = [stats.binom.cdf(k,n,p) for k in ks]\n", - " ax = sns.lineplot(ks, bcdf)\n", - " if tails == \"both\":\n", - " a = alpha/2\n", - " else:\n", - " a = alpha\n", - " if tails == \"both\" or tails == \"left\":\n", - " left_crit = int(stats.binom.ppf(a,n,p))\n", - " ax.axvspan(0,left_crit,color=\"red\",alpha=0.1)\n", - " ax.annotate(left_crit,(left_crit,0.5))\n", - " if tails == \"both\" or tails == \"right\":\n", - " right_crit = int(stats.binom.ppf(1-a,n,p))\n", - " ax.axvspan(right_crit,n,color=\"red\",alpha=0.1)\n", - " ax.annotate(right_crit,(right_crit,0.5))\n", - " # The alpha here is opacity not significance level!\n", - "\n", - "interact(update,\n", - " n = IntSlider(value=100, min=10, max=200, continuous_update=False),\n", - " p = FloatSlider(value=1/6, min=0.01, max=0.99, step=0.01, continuous_update=False),\n", - " alpha = Dropdown(options=[0.05,0.01,0.005,0.001]),\n", - " tails = Dropdown(options=[\"left\",\"right\",\"both\"]));" - ], - "execution_count": 0, - "outputs": [] - } - ] -} \ No newline at end of file diff --git a/05_hypothesis_tests.py b/05_hypothesis_tests.py new file mode 100644 index 0000000..2eec278 --- /dev/null +++ b/05_hypothesis_tests.py @@ -0,0 +1,225 @@ +# -*- coding: utf-8 -*- +"""05 Hypothesis Tests.ipynb + +Automatically generated by Colaboratory. 
+ + +Original file is located at + https://colab.research.google.com/github/adaapp/dav-introductionToPandas/blob/master/05%20Hypothesis%20Tests.ipynb + +# Hypothesis testing with `scipy` `stats` + +- [Chi squared](#Chi-squared) +- [t test](#t-test) +- Binomial + - [Distribution](#Binomial) + - [Hypothesis test](#A-binomial-hypothesis-test) + - [Using the critical region](#Critical-region) + - [Using the `binom_test` function](#Alternatively) + - [Interactive](#Using-interact) +""" + +import matplotlib.pyplot as plt +import numpy as np +import pandas as pd +import seaborn as sns +from scipy import stats +import warnings +warnings.simplefilter(action='ignore', category=FutureWarning) + +"""## Chi squared""" + +titanic = pd.read_csv('https://raw.githubusercontent.com/adaapp/dav-introductionToPandas/master/titanic.txt', sep='\t') + +"""The Chi square test compares the observed coincidence of two categorical variables with what the expected coincidence would be if they were independent. + +Going back to the Titanic data, we can see the observed coincidence of `Sex` and `Survived` in a *contingency table* (what `pandas` calls a `crosstab`): +""" + +pd.crosstab(titanic.Sex, titanic.Survived, margins=True) + +"""We can pass that table to the `contingency.expected_freq` function from `scipy.stats` to see what numbers we'd expect if the two variables were independent:""" + +stats.contingency.expected_freq(pd.crosstab(titanic.Sex, titanic.Survived)) + +"""So it certainly looks like there's something going on. We can pass that crosstab into `chi2_contingency` to carry out the hypothesis test with: + +$H_0$: The variables `Sex` and `Survived` are independent + +$H_1$: There is an association between `Sex` and `Survived` +""" + +titanic_chi2 = stats.chi2_contingency(pd.crosstab(titanic.Sex, titanic.Survived)) +titanic_chi2 + +"""The first item is the chi square statistic, the second is the p value, the third is the degrees of freedom, and the fourth is the expected contingency table if the null hypothesis were true. 
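The arithmetic behind `expected_freq` and the chi square statistic is worth seeing once in plain Python. The sketch below uses a made-up 2×2 table, and omits the Yates continuity correction that `chi2_contingency` applies to 2×2 tables by default, so it illustrates the formula rather than reproducing `scipy`'s exact numbers:

```python
# A hypothetical 2x2 contingency table (rows: one variable, columns: the other)
observed = [[10, 20],
            [30, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Under independence: expected count = row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi square statistic: sum over all cells of (observed - expected)^2 / expected
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

print(expected)        # [[12.0, 18.0], [28.0, 42.0]]
print(round(chi2, 4))  # 0.7937
```

Comparing a statistic like this against the chi square distribution (with the appropriate degrees of freedom) is exactly what `chi2_contingency` automates.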
+ +The high chi square statistic and very low p value strongly suggest that these two variables are *not* independent. + +## t test +""" + +from ipywidgets import interact, IntSlider, FloatSlider + +def update(n=30,meanA=50,stdA=1,meanB=50,stdB=1,alpha=0.05): + # generate two sets of normally distributed data + groupA = np.random.normal(meanA, stdA, n) + groupB = np.random.normal(meanB, stdB, n) + # plot them + sns.distplot(groupA) + sns.distplot(groupB) + # apply an independent (Welch's) t-test + ttest_result = stats.ttest_ind(groupA,groupB, equal_var=False) + s = ''' + meanA = {} + meanB = {} + H0: meanA = meanB + H1: meanA != meanB + t = {} + ''' + if (ttest_result.pvalue) <= alpha: + s+= ''' + p = {} <= {} + Reject H0 at the {} significance level + ''' + else: + s+= ''' + p = {} > {} + Fail to reject H0 at the {} significance level + ''' + print(s.format(groupA.mean().round(2),groupB.mean().round(2),ttest_result.statistic, ttest_result.pvalue, alpha, alpha)) +interact(update, + n=IntSlider(value=30,min=3,max=100,step=1,continuous_update=False), + meanA=IntSlider(value=50,min=10,max=100,step=1,continuous_update=False), + stdA=IntSlider(value=1,min=1,max=10,step=1,continuous_update=False), + meanB=IntSlider(value=50,min=10,max=100,step=1,continuous_update=False), + stdB=IntSlider(value=1,min=1,max=10,step=1,continuous_update=False), + alpha=FloatSlider(value=0.05,min=0.01,max=0.1,step=0.01,continuous_update=False) + ); + +"""## Binomial + +* Use the scipy.stats [binomial](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom.html#scipy.stats.binom) function to generate binomial distributions. Plot them. Make an `interact` to vary the parameters +* Use the [binom_test](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.binom_test.html) function to write a binomial hypothesis test procedure. 
+ +Generate one random number from a binomial distribution +""" + +stats.binom.rvs(n=100,p=1/6) + +"""Recall that this represents the number of successes from $n$ independent trials, each with probability of success $p$ + +Let's now get a list of these to plot +""" + +binomial = stats.binom.rvs(n=100,p=1/6, size=1000) +sns.distplot(binomial, bins=20); + +"""In theory we could get anywhere from 0 to 100 successes, so let's fix the axes:""" + +#binomial = stats.binom.rvs(n=100,p=1/6, size=1000) +binomial = np.random.binomial(100,1/6,size=1000) +ax = sns.distplot(binomial, bins=10) +ax.set_xlim(0,100); + +"""This shows how skewed the distribution is. We can get results over 40 successes but very rarely. + +### A binomial hypothesis test
There are two ways to approach a binomial hypothesis test: + +- find the probability of the observed outcome (or an even less likely outcome) and see if that is less than your chosen alpha +- identify a *critical region* of outcomes that represents your alpha and then see if the observed outcome is within that region + +### Critical regions + +Let's say that you think Dodgy Bob's dice is biased in favour of sixes. You're going to ask him to roll his dice a hundred times. How can we decide in advance what sort of outcome would convince us to reject the null (and safe) hypothesis and assert that the dice is biased? + +Let $p$ be the probability of throwing a six on an unbiased dice. + +$H_0:p=\frac{1}{6}\\H_1:p>\frac{1}{6}$ + +Shall we say $\alpha=0.05$ is our significance level? In other words, we'll decide the critical region by considering outcomes that have a less than 5% chance of occurring if the dice is *not* dodgy. + +The `stats.binom.cdf(k,n,p)` function will give us the cumulative probability of getting up to and including `k` successes from `n` trials with probability `p`. 
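For intuition, that cumulative probability is just a sum of binomial point probabilities, and can be sketched with nothing but the standard library (a sketch for understanding, not a replacement for the `scipy` call):

```python
from math import comb

def binom_pmf(k, n, p):
    # P(exactly k successes in n trials): C(n,k) * p^k * (1-p)^(n-k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(k, n, p):
    # P(k or fewer successes): sum the point probabilities from 0 up to k
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

# Probability of 10 or fewer sixes in 100 rolls of a fair die
print(binom_cdf(10, 100, 1/6))
```

(`math.comb` needs Python 3.8+; for large `n`, `scipy`'s implementation is both faster and numerically more careful.)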
+ +So for example the probability of getting 10 or fewer sixes from 100 rolls of an unbiased dice is: +""" + +stats.binom.cdf(10,100,1/6) + +"""So if we'd been worried about this dice being biased *against* sixes, that result would have been in our critical region! + +Look how the probability of getting k or fewer successes increases as k goes from 0 to 100: +""" + +ks = np.arange(101) +bcdf = [stats.binom.cdf(k,100,1/6) for k in ks] +sns.lineplot(ks, bcdf); + +"""To identify our critical region, we need to know where k crosses into the 95% region. In other words, we want the smallest k for which there's at least a 95% chance of getting k or fewer successes. + +`stats.binom` has a function for that too: +""" + +stats.binom.ppf(0.95,100,1/6) + +"""`ppf` returns 23, the smallest k whose cumulative probability reaches 0.95, so the critical region starts at 24:""" + +ks = np.arange(101) +bcdf = [stats.binom.cdf(k,100,1/6) for k in ks] +ax = sns.lineplot(ks, bcdf) +ax.axvspan(24,100,color="red",alpha=0.1); + +"""Just to check that the critical region has probability below 0.05:""" + +1-stats.binom.cdf(23,100,1/6) + +"""So 24 or more sixes lands in our critical region. + +Now we can get Dodgy Bob to roll the dice 100 times and if he rolls 24 or more sixes we can reject the null hypothesis at the 5% significance level. + +We could then repeat this with as many suspected dodgy dice owners as we like. But note that for roughly every twenty accusations we make, we'd expect one of them to be a false accusation! + +### Alternatively + +Suppose instead we've already seen Dodgy Bob roll his dice 100 times, and he just got 25 sixes. We can carry out the one-sided hypothesis test in one line like this: +""" + +stats.binom_test(25, 100, 1/6, alternative='greater') + +"""This p value is less than our threshold, so we reject the null hypothesis as before. + +If he'd rolled 21 sixes: +""" + +stats.binom_test(21, 100, 1/6, alternative='greater') + +"""This is perhaps an unusually high number of sixes, but the p value is not below our threshold, so we fail to reject the null hypothesis, and leave Dodgy Bob with his dice. 
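The warning about false accusations can be checked by simulation: run the same kind of one-sided exact binomial test on many owners of perfectly fair dice and count how often we wrongly cry foul. A standard-library sketch (the tail sum here stands in for a one-sided `binom_test`, an assumption rather than a call into `scipy`):

```python
import random
from math import comb

def upper_tail_p(k, n, p):
    # Exact one-sided p-value: P(k or more successes in n trials)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

random.seed(1)  # reproducible runs
trials, false_accusations = 2000, 0
for _ in range(trials):
    # An honest owner rolls a fair die 100 times
    sixes = sum(random.randint(1, 6) == 6 for _ in range(100))
    # Accuse them whenever the one-sided p-value dips below alpha = 0.05
    if upper_tail_p(sixes, 100, 1/6) <= 0.05:
        false_accusations += 1

# Expect a bit under 5% (the binomial is discrete, so the achievable
# rejection probability sits below alpha rather than exactly at it)
print(false_accusations / trials)
```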
+ +### Using interact +""" + +from ipywidgets import interact, IntSlider, FloatSlider, Dropdown + +def update(n,p,alpha,tails): + ks = np.arange(n+1) + bcdf = [stats.binom.cdf(k,n,p) for k in ks] + ax = sns.lineplot(ks, bcdf) + if tails == "both": + a = alpha/2 + else: + a = alpha + if tails == "both" or tails == "left": + left_crit = int(stats.binom.ppf(a,n,p)) + ax.axvspan(0,left_crit,color="red",alpha=0.1) + ax.annotate(left_crit,(left_crit,0.5)) + if tails == "both" or tails == "right": + right_crit = int(stats.binom.ppf(1-a,n,p)) + ax.axvspan(right_crit,n,color="red",alpha=0.1) + ax.annotate(right_crit,(right_crit,0.5)) + # The alpha here is opacity not significance level! + +interact(update, + n = IntSlider(value=100, min=10, max=200, continuous_update=False), + p = FloatSlider(value=1/6, min=0.01, max=0.99, step=0.01, continuous_update=False), + alpha = Dropdown(options=[0.05,0.01,0.005,0.001]), + tails = Dropdown(options=["left","right","both"])); \ No newline at end of file diff --git a/05a numpy and interact.ipynb b/05a numpy and interact.ipynb deleted file mode 100644 index 71cb2cb..0000000 --- a/05a numpy and interact.ipynb +++ /dev/null @@ -1,552 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Ordinary python lists" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "[3,4,5,\"Bob\"]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "mylist = [2,5,7,\"Alice\"]\n", - "mylist" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "mylist[3]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "mylist[4]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Building lists\n", - "\n", - "With a `for` loop:" - ] - }, - { - "cell_type": "code", 
- "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "evennumbers = []\n", - "for i in range(200):\n", - " if i%2 == 0:\n", - " evennumbers.append(i)\n", - "len(evennumbers)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "With a `while` loop" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "evennumbers = [0]\n", - "while len(evennumbers) < 100:\n", - " evennumbers.append(evennumbers[-1]+2)\n", - "len(evennumbers)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Another `for` loop" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "evennumbers = []\n", - "for i in range(100):\n", - " evennumbers.append(2*i)\n", - "\n", - " \n", - "# We can slice lists\n", - "evennumbers[3:7]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "With a *list comprehension*:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "evennumbers = [2*i for i in range(100)]\n", - "evennumbers[-4:-1]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "With a different list comprehension:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "evennumbers = [i for i in range(200) if i%2 == 0]\n", - "evennumbers[50:-48]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sum(evennumbers) " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Nested clauses" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "v = 0\n", - "for e in evennumbers:\n", - " #print(\"This will happen every time\")\n", - " if e > 50:\n", - " #print(\"This will happen fifty times\")\n", - " if e%7 == 
0:\n", - " #print(\"This will happen only for even multiples of 7 bigger than 50\")\n", - " v += 1\n", - "v" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## About numpy" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "A `numpy` `array` is just a dressed-up list:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "evennumbers = np.arange(0,200,2)\n", - "evennumbers" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "With some extra functionality:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "evennumbers.mean()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "evennumbers[evennumbers < 10]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### and pandas" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "`pandas` just dresses `numpy` in some more functionality:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "a = np.array([1,2,3])\n", - "b = np.array([\"bob\",\"gene\",\"tina\"])\n", - "df = pd.DataFrame(data = {\n", - " 'number': a,\n", - " 'name': b\n", - "})\n", - "df" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Back to numpy" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's generate a normally distributed population:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - 
"source": [ - "population = np.random.normal(loc=50, scale=3, size=10000)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "`loc` is $\\mu$, `scale` is $\\sigma$, `size` is $N$" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Take a sample from this population:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sample = np.random.choice(population, size=100, replace=False)\n", - "\n", - "print(\"Population mean = {}\".format(population.mean().round(2)))\n", - "print(\"Sample mean = {}\".format(sample.mean().round(2)))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import seaborn as sns\n", - "import matplotlib.pyplot as plt\n", - "import warnings\n", - "warnings.simplefilter(action='ignore', category=FutureWarning)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "fig, axs = plt.subplots()\n", - "sns.distplot(population, ax = axs)\n", - "axs.axvline(population.mean())\n", - "sns.distplot(sample, ax = axs)\n", - "axs.axvline(sample.mean())" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### The interact widget" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from ipywidgets import interact" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def update(n):\n", - " print(n)\n", - "\n", - "interact(update,n=(1,100))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The continuous updating is annoying so..." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from ipywidgets import IntSlider" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def update(n):\n", - " print(n)\n", - "\n", - "interact(update,n=IntSlider(min=1, max=100, step=1, continuous_update=False))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "\n", - "def update(n):\n", - " fig, axs = plt.subplots()\n", - " sns.distplot(population, ax = axs)\n", - " axs.set_ylim(0,0.2)\n", - " axs.set_xlim(30,70)\n", - " axs.axvline(population.mean())\n", - " sample = np.random.choice(population, size=n, replace=False)\n", - " sns.distplot(sample, ax = axs)\n", - " axs.axvline(sample.mean())\n", - "\n", - "interact(update,n=IntSlider(value=10, min=2, max=1000, step=1, continuous_update=False))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "So the sample mean $\\bar{x}$ is a good (unbiased) estimator for the population mean $\\mu$.\n", - "\n", - "The same is **not** true of the standard deviation." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "def update(n):\n", - " # make a set of axes\n", - " fig, axs = plt.subplots()\n", - " # set the axes\n", - " axs.set_ylim(0,0.2)\n", - " axs.set_xlim(30,70)\n", - " # plot the population\n", - " sns.distplot(population, ax = axs)\n", - " # find the mean and sd for the population\n", - " mu = population.mean()\n", - " sigma = population.std()\n", - " # color one sd from the mean\n", - " axs.axvspan(mu-sigma, mu+sigma, facecolor=\"lightsteelblue\", alpha=0.4)\n", - " # draw a sample\n", - " sample = np.random.choice(population, size=n, replace=False)\n", - " # plot the sample\n", - " sns.distplot(sample, ax = axs)\n", - " # find the sample mean and the 'wrong' sd\n", - " xbar = sample.mean()\n", - " s = sample.std()\n", - " axs.axvspan(xbar-s,xbar+s, facecolor=\"wheat\", alpha=0.2)\n", - " print(\"Standard deviation of sample with /n = {}\".format(s))\n", - " print(\"Population standard deviation = {}\".format(sigma))\n", - "\n", - "\n", - "interact(update,n=IntSlider(value=10, min=2, max=1000, step=1, continuous_update=False))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**This is not quite as compelling a visualisation as we wanted.**\n", - "\n", - "We want to show that using $\\dfrac{\\sum(x-\\bar{x})^2}{n}$ tends to *underestimate* the standard deviation of the population, which is why we use $n-1$ instead.\n", - "\n", - "But of course, at these sample sizes, the difference between dividing by $n$ and dividing by $n-1$ is not going to be visible on a graph." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Instead, we could draw a lot of samples and plot the distribution of the uncorrected and corrected standard deviations compared to the true value." 
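We can also skip the plotting entirely and just compare averages, using only the standard library. A sketch with made-up population parameters ($\mu=50$, $\sigma=5$, samples of size 5): the average uncorrected standard deviation lands well below the true value, while the corrected one lands closer.

```python
import random
from math import sqrt

random.seed(0)
TRUE_SD = 5

def std(xs, ddof=0):
    # Standard deviation, dividing by (n - ddof), mirroring numpy's ddof
    m = sum(xs) / len(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - ddof))

# Draw many small samples from a normal population with sd = TRUE_SD
samples = [[random.gauss(50, TRUE_SD) for _ in range(5)] for _ in range(4000)]

uncorrected = sum(std(s) for s in samples) / len(samples)
corrected = sum(std(s, ddof=1) for s in samples) / len(samples)

# uncorrected sits clearly below 5; corrected is closer (though even the
# corrected sd is slightly biased low -- only the variance is unbiased)
print(uncorrected, corrected)
```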
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "hide_input": false - }, - "outputs": [], - "source": [ - "# Our population\n", - "population = np.random.normal(loc=50, scale=5, size=10000)\n", - "# Make 1000 samples, each of size 5\n", - "samples = [np.random.choice(population, size=5, replace=False) for _ in range(1000)]\n", - "# Make an array of the means of each sample\n", - "sample_means = np.array([sample.mean() for sample in samples])\n", - "# And their (uncorrected) standard deviations\n", - "sample_stds =np.array([sample.std() for sample in samples])\n", - "# Amd their (corrected) standar deviations\n", - "sample_stds_corrected = np.array([sample.std(ddof=1) for sample in samples])\n", - "# We want three sets of axes\n", - "fig, axs = plt.subplots(1,3)\n", - "fig.set_figwidth(18)\n", - "# plot the distribution of means around the true mean\n", - "axs[0].set_title(\"Mean\")\n", - "axs[0].get_yaxis().set_visible(False)\n", - "sns.distplot(sample_means, ax=axs[0], color=\"goldenrod\")\n", - "axs[0].axvline(population.mean(), color=\"steelblue\")\n", - "axs[0].axvline(sample_means.mean(), color=\"orange\")\n", - "# and the distribution of uncorrected standard deviations around the true standard deviation\n", - "axs[1].set_title(\"Uncorrected standard deviation\")\n", - "axs[1].get_yaxis().set_visible(False)\n", - "sns.distplot(sample_stds, ax=axs[1], color=\"goldenrod\")\n", - "axs[1].axvline(population.std(), color=\"steelblue\")\n", - "axs[1].axvline(sample_stds.mean(), color=\"orange\")\n", - "# and the distribution of corrected standard deviations around the true standard deviation\n", - "axs[2].set_title(\"Corrected standard deviation\")\n", - "axs[2].get_yaxis().set_visible(False)\n", - "sns.distplot(sample_stds_corrected, ax=axs[2], color=\"goldenrod\")\n", - "axs[2].axvline(population.std(), color=\"steelblue\")\n", - "axs[2].axvline(sample_stds_corrected.mean(), color=\"orange\")" - ] - } - ], - "metadata": { - 
 "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.6"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
diff --git a/05a_numpy_and_interact.py b/05a_numpy_and_interact.py
new file mode 100644
index 0000000..4b2ac11
--- /dev/null
+++ b/05a_numpy_and_interact.py
@@ -0,0 +1,231 @@
+# -*- coding: utf-8 -*-
+"""05a numpy and interact.ipynb
+
+Automatically generated by Colaboratory.
+
+Original file is located at
+    https://colab.research.google.com/github/adaapp/dav-introductionToPandas/blob/master/05a%20numpy%20and%20interact.ipynb
+
+## Ordinary python lists
+"""
+
+[3,4,5,"Bob"]
+
+mylist = [2,5,7,"Alice"]
+mylist
+
+mylist[3]
+
+mylist[4]
+
+"""### Building lists
+
+With a `for` loop:
+"""
+
+evennumbers = []
+for i in range(200):
+    if i%2 == 0:
+        evennumbers.append(i)
+len(evennumbers)
+
+"""With a `while` loop"""
+
+evennumbers = [0]
+while len(evennumbers) < 100:
+    evennumbers.append(evennumbers[-1]+2)
+len(evennumbers)
+
+"""Another `for` loop"""
+
+evennumbers = []
+for i in range(100):
+    evennumbers.append(2*i)
+
+
+# We can slice lists
+evennumbers[3:7]
+
+"""With a *list comprehension*:"""
+
+evennumbers = [2*i for i in range(100)]
+evennumbers[-4:-1]
+
+"""With a different list comprehension:"""
+
+evennumbers = [i for i in range(200) if i%2 == 0]
+evennumbers[50:-48]
+
+sum(evennumbers)
+
+"""### Nested clauses"""
+
+v = 0
+for e in evennumbers:
+    #print("This will happen every time")
+    if e > 50:
+        #print("This will happen fifty times")
+        if e%7 == 0:
+            #print("This will happen only for even multiples of 7 bigger than 50")
+            v += 1
+v
+
+"""## About numpy"""
+
+import numpy as np
+
+"""A `numpy` `array` is just a dressed-up list:"""
+
+evennumbers = np.arange(0,200,2)
+evennumbers
+
+"""With some extra functionality:"""
+
+evennumbers.mean()
+
+evennumbers[evennumbers < 10]
+
+"""### and pandas"""
+
+import pandas as pd
+
+"""`pandas` just dresses `numpy` in some more functionality:"""
+
+a = np.array([1,2,3])
+b = np.array(["bob","gene","tina"])
+df = pd.DataFrame(data = {
+    'number': a,
+    'name': b
+})
+df
+
+"""## Back to numpy
+
+Let's generate a normally distributed population:
+"""
+
+population = np.random.normal(loc=50, scale=3, size=10000)
+
+"""`loc` is $\mu$, `scale` is $\sigma$, `size` is $N$
+
+Take a sample from this population:
+"""
+
+sample = np.random.choice(population, size=100, replace=False)
+
+print("Population mean = {}".format(population.mean().round(2)))
+print("Sample mean = {}".format(sample.mean().round(2)))
+
+import seaborn as sns
+import matplotlib.pyplot as plt
+import warnings
+warnings.simplefilter(action='ignore', category=FutureWarning)
+
+fig, axs = plt.subplots()
+sns.distplot(population, ax = axs)
+axs.axvline(population.mean())
+sns.distplot(sample, ax = axs)
+axs.axvline(sample.mean())
+
+"""### The interact widget"""
+
+from ipywidgets import interact
+
+def update(n):
+    print(n)
+
+interact(update,n=(1,100))
+
+"""The continuous updating is annoying so..."""
+
+from ipywidgets import IntSlider
+
+def update(n):
+    print(n)
+
+interact(update,n=IntSlider(min=1, max=100, step=1, continuous_update=False))
+
+def update(n):
+    fig, axs = plt.subplots()
+    sns.distplot(population, ax = axs)
+    axs.set_ylim(0,0.2)
+    axs.set_xlim(30,70)
+    axs.axvline(population.mean())
+    sample = np.random.choice(population, size=n, replace=False)
+    sns.distplot(sample, ax = axs)
+    axs.axvline(sample.mean())
+
+interact(update,n=IntSlider(value=10, min=2, max=1000, step=1, continuous_update=False))
+
+"""So the sample mean $\bar{x}$ is a good (unbiased) estimator for the population mean $\mu$.
+
+The same is **not** true of the standard deviation.
+"""
+
+def update(n):
+    # make a set of axes
+    fig, axs = plt.subplots()
+    # set the axes
+    axs.set_ylim(0,0.2)
+    axs.set_xlim(30,70)
+    # plot the population
+    sns.distplot(population, ax = axs)
+    # find the mean and sd for the population
+    mu = population.mean()
+    sigma = population.std()
+    # color one sd from the mean
+    axs.axvspan(mu-sigma, mu+sigma, facecolor="lightsteelblue", alpha=0.4)
+    # draw a sample
+    sample = np.random.choice(population, size=n, replace=False)
+    # plot the sample
+    sns.distplot(sample, ax = axs)
+    # find the sample mean and the 'wrong' sd
+    xbar = sample.mean()
+    s = sample.std()
+    axs.axvspan(xbar-s, xbar+s, facecolor="wheat", alpha=0.2)
+    print("Standard deviation of sample with n = {}: {}".format(n, s))
+    print("Population standard deviation = {}".format(sigma))
+
+
+interact(update,n=IntSlider(value=10, min=2, max=1000, step=1, continuous_update=False))
+
+"""**This is not quite as compelling a visualisation as we wanted.**
+
+We want to show that using $\dfrac{\sum(x-\bar{x})^2}{n}$ tends to *underestimate* the standard deviation of the population, which is why we use $n-1$ instead.
+
+But of course, at these sample sizes, the difference between dividing by $n$ and dividing by $n-1$ is not going to be visible on a graph.
+
+Instead, we could draw a lot of samples and plot the distribution of the uncorrected and corrected standard deviations compared to the true value.
+"""
+
+# Our population
+population = np.random.normal(loc=50, scale=5, size=10000)
+# Make 1000 samples, each of size 5
+samples = [np.random.choice(population, size=5, replace=False) for _ in range(1000)]
+# Make an array of the means of each sample
+sample_means = np.array([sample.mean() for sample in samples])
+# And their (uncorrected) standard deviations
+sample_stds = np.array([sample.std() for sample in samples])
+# And their (corrected) standard deviations
+sample_stds_corrected = np.array([sample.std(ddof=1) for sample in samples])
+# We want three sets of axes
+fig, axs = plt.subplots(1,3)
+fig.set_figwidth(18)
+# plot the distribution of means around the true mean
+axs[0].set_title("Mean")
+axs[0].get_yaxis().set_visible(False)
+sns.distplot(sample_means, ax=axs[0], color="goldenrod")
+axs[0].axvline(population.mean(), color="steelblue")
+axs[0].axvline(sample_means.mean(), color="orange")
+# and the distribution of uncorrected standard deviations around the true standard deviation
+axs[1].set_title("Uncorrected standard deviation")
+axs[1].get_yaxis().set_visible(False)
+sns.distplot(sample_stds, ax=axs[1], color="goldenrod")
+axs[1].axvline(population.std(), color="steelblue")
+axs[1].axvline(sample_stds.mean(), color="orange")
+# and the distribution of corrected standard deviations around the true standard deviation
+axs[2].set_title("Corrected standard deviation")
+axs[2].get_yaxis().set_visible(False)
+sns.distplot(sample_stds_corrected, ax=axs[2], color="goldenrod")
+axs[2].axvline(population.std(), color="steelblue")
+axs[2].axvline(sample_stds_corrected.mean(), color="orange")
\ No newline at end of file
diff --git a/06 Chartify Tutorial.ipynb b/06 Chartify Tutorial.ipynb
deleted file mode 100644
index 0b9d13d..0000000
--- a/06 Chartify Tutorial.ipynb
+++ /dev/null
@@ -1,1292 +0,0 @@
-{
- "nbformat": 4,
- "nbformat_minor": 0,
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
-
"language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.6" - }, - "toc": { - "nav_menu": {}, - "number_sections": true, - "sideBar": true, - "skip_h1_title": false, - "toc_cell": true, - "toc_position": { - "height": "608px", - "left": "0px", - "right": "1176px", - "top": "111px", - "width": "212px" - }, - "toc_section_display": "block", - "toc_window_display": true - }, - "colab": { - "name": "06 Chartify Tutorial.ipynb", - "provenance": [] - } - }, - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "AuErQX3mZ0ai", - "colab_type": "text" - }, - "source": [ - "Very slightly adapted from the developers' original tutorial [here](https://github.com/spotify/chartify/blob/master/examples/Chartify%20Tutorial.ipynb)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "toc": true, - "id": "iEyaxCVBZ0aj", - "colab_type": "text" - }, - "source": [ - "
Table of Contents
\n", - "" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "8eoe2rMcaAdr", - "colab_type": "code", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 996 - }, - "outputId": "5eb75d4f-6f05-4fd8-f30d-5004af57b8a6" - }, - "source": [ - "!pip install chartify" - ], - "execution_count": 3, - "outputs": [ - { - "output_type": "stream", - "text": [ - "Collecting chartify\n", - "\u001b[?25l Downloading https://files.pythonhosted.org/packages/de/e7/2b4ffc35795210241f669633433b2075b118a53dbffce731a37fd9ebdb13/chartify-2.6.1-py2.py3-none-any.whl (48kB)\n", - "\u001b[K |████████████████████████████████| 51kB 3.6MB/s \n", - "\u001b[?25hRequirement already satisfied: bokeh<2.0.0,>=0.12.16 in /usr/local/lib/python3.6/dist-packages (from chartify) (1.0.4)\n", - "Requirement already satisfied: pandas<1.0.0,>=0.21.0 in /usr/local/lib/python3.6/dist-packages (from chartify) (0.24.2)\n", - "Requirement already satisfied: scipy<2.0.0,>=1.0.0 in /usr/local/lib/python3.6/dist-packages (from chartify) (1.3.1)\n", - "Collecting colour<1.0.0,>=0.1.5 (from chartify)\n", - " Downloading https://files.pythonhosted.org/packages/74/46/e81907704ab203206769dee1385dc77e1407576ff8f50a0681d0a6b541be/colour-0.1.5-py2.py3-none-any.whl\n", - "Requirement already satisfied: Pillow>=4.3.0 in /usr/local/lib/python3.6/dist-packages (from chartify) (4.3.0)\n", - "Collecting selenium<=3.8.0,>=3.7.0 (from chartify)\n", - "\u001b[?25l Downloading https://files.pythonhosted.org/packages/2c/10/5ed4ece1869781c4420de7983fcb2f1bf6522a5d6f6bd0b634ce057f4984/selenium-3.8.0-py2.py3-none-any.whl (941kB)\n", - "\u001b[K |████████████████████████████████| 942kB 14.1MB/s \n", - "\u001b[?25hRequirement already satisfied: jupyter<2.0.0,>=1.0.0 in /usr/local/lib/python3.6/dist-packages (from chartify) (1.0.0)\n", - "Requirement already satisfied: six>=1.5.2 in /usr/local/lib/python3.6/dist-packages (from bokeh<2.0.0,>=0.12.16->chartify) (1.12.0)\n", - "Requirement already satisfied: packaging>=16.8 
in /usr/local/lib/python3.6/dist-packages (from bokeh<2.0.0,>=0.12.16->chartify) (19.1)\n", - "Requirement already satisfied: numpy>=1.7.1 in /usr/local/lib/python3.6/dist-packages (from bokeh<2.0.0,>=0.12.16->chartify) (1.16.5)\n", - "Requirement already satisfied: Jinja2>=2.7 in /usr/local/lib/python3.6/dist-packages (from bokeh<2.0.0,>=0.12.16->chartify) (2.10.1)\n", - "Requirement already satisfied: tornado>=4.3 in /usr/local/lib/python3.6/dist-packages (from bokeh<2.0.0,>=0.12.16->chartify) (4.5.3)\n", - "Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.6/dist-packages (from bokeh<2.0.0,>=0.12.16->chartify) (2.5.3)\n", - "Requirement already satisfied: PyYAML>=3.10 in /usr/local/lib/python3.6/dist-packages (from bokeh<2.0.0,>=0.12.16->chartify) (3.13)\n", - "Requirement already satisfied: pytz>=2011k in /usr/local/lib/python3.6/dist-packages (from pandas<1.0.0,>=0.21.0->chartify) (2018.9)\n", - "Requirement already satisfied: olefile in /usr/local/lib/python3.6/dist-packages (from Pillow>=4.3.0->chartify) (0.46)\n", - "Requirement already satisfied: ipykernel in /usr/local/lib/python3.6/dist-packages (from jupyter<2.0.0,>=1.0.0->chartify) (4.6.1)\n", - "Requirement already satisfied: qtconsole in /usr/local/lib/python3.6/dist-packages (from jupyter<2.0.0,>=1.0.0->chartify) (4.5.5)\n", - "Requirement already satisfied: ipywidgets in /usr/local/lib/python3.6/dist-packages (from jupyter<2.0.0,>=1.0.0->chartify) (7.5.1)\n", - "Requirement already satisfied: notebook in /usr/local/lib/python3.6/dist-packages (from jupyter<2.0.0,>=1.0.0->chartify) (5.2.2)\n", - "Requirement already satisfied: nbconvert in /usr/local/lib/python3.6/dist-packages (from jupyter<2.0.0,>=1.0.0->chartify) (5.6.0)\n", - "Requirement already satisfied: jupyter-console in /usr/local/lib/python3.6/dist-packages (from jupyter<2.0.0,>=1.0.0->chartify) (5.2.0)\n", - "Requirement already satisfied: pyparsing>=2.0.2 in /usr/local/lib/python3.6/dist-packages (from 
packaging>=16.8->bokeh<2.0.0,>=0.12.16->chartify) (2.4.2)\n", - "Requirement already satisfied: attrs in /usr/local/lib/python3.6/dist-packages (from packaging>=16.8->bokeh<2.0.0,>=0.12.16->chartify) (19.1.0)\n", - "Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.6/dist-packages (from Jinja2>=2.7->bokeh<2.0.0,>=0.12.16->chartify) (1.1.1)\n", - "Requirement already satisfied: traitlets>=4.1.0 in /usr/local/lib/python3.6/dist-packages (from ipykernel->jupyter<2.0.0,>=1.0.0->chartify) (4.3.2)\n", - "Requirement already satisfied: ipython>=4.0.0 in /usr/local/lib/python3.6/dist-packages (from ipykernel->jupyter<2.0.0,>=1.0.0->chartify) (5.5.0)\n", - "Requirement already satisfied: jupyter-client in /usr/local/lib/python3.6/dist-packages (from ipykernel->jupyter<2.0.0,>=1.0.0->chartify) (5.3.1)\n", - "Requirement already satisfied: jupyter-core in /usr/local/lib/python3.6/dist-packages (from qtconsole->jupyter<2.0.0,>=1.0.0->chartify) (4.5.0)\n", - "Requirement already satisfied: pygments in /usr/local/lib/python3.6/dist-packages (from qtconsole->jupyter<2.0.0,>=1.0.0->chartify) (2.1.3)\n", - "Requirement already satisfied: ipython-genutils in /usr/local/lib/python3.6/dist-packages (from qtconsole->jupyter<2.0.0,>=1.0.0->chartify) (0.2.0)\n", - "Requirement already satisfied: widgetsnbextension~=3.5.0 in /usr/local/lib/python3.6/dist-packages (from ipywidgets->jupyter<2.0.0,>=1.0.0->chartify) (3.5.1)\n", - "Requirement already satisfied: nbformat>=4.2.0 in /usr/local/lib/python3.6/dist-packages (from ipywidgets->jupyter<2.0.0,>=1.0.0->chartify) (4.4.0)\n", - "Requirement already satisfied: terminado>=0.3.3; sys_platform != \"win32\" in /usr/local/lib/python3.6/dist-packages (from notebook->jupyter<2.0.0,>=1.0.0->chartify) (0.8.2)\n", - "Requirement already satisfied: mistune<2,>=0.8.1 in /usr/local/lib/python3.6/dist-packages (from nbconvert->jupyter<2.0.0,>=1.0.0->chartify) (0.8.4)\n", - "Requirement already satisfied: entrypoints>=0.2.2 in 
/usr/local/lib/python3.6/dist-packages (from nbconvert->jupyter<2.0.0,>=1.0.0->chartify) (0.3)\n", - "Requirement already satisfied: bleach in /usr/local/lib/python3.6/dist-packages (from nbconvert->jupyter<2.0.0,>=1.0.0->chartify) (3.1.0)\n", - "Requirement already satisfied: pandocfilters>=1.4.1 in /usr/local/lib/python3.6/dist-packages (from nbconvert->jupyter<2.0.0,>=1.0.0->chartify) (1.4.2)\n", - "Requirement already satisfied: defusedxml in /usr/local/lib/python3.6/dist-packages (from nbconvert->jupyter<2.0.0,>=1.0.0->chartify) (0.6.0)\n", - "Requirement already satisfied: testpath in /usr/local/lib/python3.6/dist-packages (from nbconvert->jupyter<2.0.0,>=1.0.0->chartify) (0.4.2)\n", - "Requirement already satisfied: prompt-toolkit<2.0.0,>=1.0.0 in /usr/local/lib/python3.6/dist-packages (from jupyter-console->jupyter<2.0.0,>=1.0.0->chartify) (1.0.16)\n", - "Requirement already satisfied: decorator in /usr/local/lib/python3.6/dist-packages (from traitlets>=4.1.0->ipykernel->jupyter<2.0.0,>=1.0.0->chartify) (4.4.0)\n", - "Requirement already satisfied: setuptools>=18.5 in /usr/local/lib/python3.6/dist-packages (from ipython>=4.0.0->ipykernel->jupyter<2.0.0,>=1.0.0->chartify) (41.2.0)\n", - "Requirement already satisfied: pickleshare in /usr/local/lib/python3.6/dist-packages (from ipython>=4.0.0->ipykernel->jupyter<2.0.0,>=1.0.0->chartify) (0.7.5)\n", - "Requirement already satisfied: pexpect; sys_platform != \"win32\" in /usr/local/lib/python3.6/dist-packages (from ipython>=4.0.0->ipykernel->jupyter<2.0.0,>=1.0.0->chartify) (4.7.0)\n", - "Requirement already satisfied: simplegeneric>0.8 in /usr/local/lib/python3.6/dist-packages (from ipython>=4.0.0->ipykernel->jupyter<2.0.0,>=1.0.0->chartify) (0.8.1)\n", - "Requirement already satisfied: pyzmq>=13 in /usr/local/lib/python3.6/dist-packages (from jupyter-client->ipykernel->jupyter<2.0.0,>=1.0.0->chartify) (17.0.0)\n", - "Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in 
/usr/local/lib/python3.6/dist-packages (from nbformat>=4.2.0->ipywidgets->jupyter<2.0.0,>=1.0.0->chartify) (2.6.0)\n", - "Requirement already satisfied: ptyprocess; os_name != \"nt\" in /usr/local/lib/python3.6/dist-packages (from terminado>=0.3.3; sys_platform != \"win32\"->notebook->jupyter<2.0.0,>=1.0.0->chartify) (0.6.0)\n", - "Requirement already satisfied: webencodings in /usr/local/lib/python3.6/dist-packages (from bleach->nbconvert->jupyter<2.0.0,>=1.0.0->chartify) (0.5.1)\n", - "Requirement already satisfied: wcwidth in /usr/local/lib/python3.6/dist-packages (from prompt-toolkit<2.0.0,>=1.0.0->jupyter-console->jupyter<2.0.0,>=1.0.0->chartify) (0.1.7)\n", - "Installing collected packages: colour, selenium, chartify\n", - "Successfully installed chartify-2.6.1 colour-0.1.5 selenium-3.8.0\n" - ], - "name": "stdout" - } - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "jtydwkm6Z0al", - "colab_type": "code", - "colab": {} - }, - "source": [ - "# Copyright (c) 2017-2018 Spotify AB\n", - "#\n", - "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", - "# you may not use this file except in compliance with the License.\n", - "# You may obtain a copy of the License at\n", - "#\n", - "# http://www.apache.org/licenses/LICENSE-2.0\n", - "#\n", - "# Unless required by applicable law or agreed to in writing, software\n", - "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", - "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", - "# See the License for the specific language governing permissions and\n", - "# limitations under the License.\n", - "import chartify\n", - "import pandas as pd\n", - "\n", - "# needed to make the examples work in the notebook\n", - "chartify.examples._OUTPUT_FORMAT = 'html'" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "0XdK0ysRZ0ao", - "colab_type": "text" - }, - "source": [ - "# Chart object\n", - "- Run 
the cell below to instantiate a chart and assign to to a variable" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "zxt3QTs9Z0ap", - "colab_type": "code", - "colab": {} - }, - "source": [ - "ch = chartify.Chart()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "bFQd_zg0Z0ar", - "colab_type": "text" - }, - "source": [ - "- Use .show() to render the chart." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "J-yGpVRCZ0as", - "colab_type": "code", - "colab": {} - }, - "source": [ - "ch.show()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "GjJo3-5fZ0aw", - "colab_type": "text" - }, - "source": [ - "- Note that the chart is blank at this point.\n", - "- The default labels provide directions for how to override their values." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "4b2_OacrZ0ax", - "colab_type": "text" - }, - "source": [ - "# Adding chart labels\n", - "- __Your turn__: Add labels to the following chart. Look at the default values for instruction.\n", - "- Title\n", - "- Subtitle\n", - "- Source\n", - "- X label\n", - "- Y label" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "YkUPFpsBZ0ay", - "colab_type": "code", - "colab": {} - }, - "source": [ - "ch = chartify.Chart()\n", - "# Add code here to overwrite the labels\n", - "\n", - "\n", - "ch.show()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "OCy1p7OrZ0a1", - "colab_type": "text" - }, - "source": [ - "# Getting help\n", - "- From within a jupyter notebook you can see the available attributes of the chart object by pressing \"tab\"\n", - "- Select the space just after the \".\" character below and hit tab." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "lx1swBlOZ0a2", - "colab_type": "code", - "colab": {} - }, - "source": [ - "ch = chartify.Chart()\n", - "ch." 
- ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "FiDd1vPHZ0a5", - "colab_type": "text" - }, - "source": [ - "- You can also use \"?\" to pull up documentation for objects and methods.\n", - "- Run the cell below to pull up the chartify.Chart documentation" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "aeKCw4heZ0a6", - "colab_type": "code", - "colab": {} - }, - "source": [ - "chartify.Chart?" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pB7tvhcIZ0a-", - "colab_type": "text" - }, - "source": [ - "- This can also be accomplished by pressing \"shift + tab\".\n", - "- Press \"shift + tab\" twice to see the expanded documentation.\n", - "- Try it with the next cell." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "Hc_xVdgXZ0bB", - "colab_type": "code", - "colab": {} - }, - "source": [ - "chartify.Chart" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "q9XCzpQRZ0bH", - "colab_type": "text" - }, - "source": [ - "# Callouts\n", - "- The chart object has a callout object (ch.callout) that contains methods for adding callouts to the chart.\n", - "- Callouts can be used to add text, lines, or shaded areas to annotate parts of your chart.\n", - "- __Your Turn:__ Fill in the code below to add a text callout that says \"hi\" at coordinate (10, 10)\n", - "- Look up the documentation for ch.callout.text if you need help" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "_G45PeU8Z0bI", - "colab_type": "code", - "colab": {} - }, - "source": [ - "ch = chartify.Chart()\n", - "#ch.callout.text()\n", - " \n", - "ch.show()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "1nRH0lcMZ0bL", - "colab_type": "text" - }, - "source": [ - "- Use tab below to see what callouts are available." 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "C0gy1LszZ0bM", - "colab_type": "code", - "colab": {} - }, - "source": [ - "ch.callout." - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "SKrADg1CZ0bO", - "colab_type": "text" - }, - "source": [ - "# Axes\n", - "- The axes object contains methods for setting or getting axis properties." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "_Q_9HsyFZ0bP", - "colab_type": "text" - }, - "source": [ - "- __Your turn__: modify the chart below so the xaxis range goes from 0 to 100" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "sxv8owYdZ0bQ", - "colab_type": "code", - "colab": {} - }, - "source": [ - "ch = chartify.Chart()\n", - "ch.callout.text('hi', 10, 10)\n", - "# Add code here to modify the xrange to (0, 100)\n", - "ch.show()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "KijpsLI_Z0bS", - "colab_type": "text" - }, - "source": [ - "# Method chaining\n", - "- Chart methods can be chained by wrapping the statments in parentheses. 
See the example below:" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "9H9zgtrYZ0bT", - "colab_type": "code", - "colab": {} - }, - "source": [ - "(chartify.Chart(blank_labels=True)\n", - " .callout.text('hi', 10, 10)\n", - " .axes.set_xaxis_range(0, 100)\n", - " .show()\n", - ")" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Ub_B2iuGZ0bX", - "colab_type": "text" - }, - "source": [ - "# Plotting" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "PJr3jo4jZ0bY", - "colab_type": "text" - }, - "source": [ - "## Input data format\n", - "Chartify expects the input data to be:\n", - "- Tidy (Each variable has its own column, each row corresponds to an observation)\n", - "- In the columns of a Pandas DataFrame.\n", - "\n", - "Below we'll explore some examples of valid and invalid input data" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "qVHhvARvZ0bY", - "colab_type": "text" - }, - "source": [ - "- Run this cell to generate an example dataset" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "FWj-dorZZ0ba", - "colab_type": "code", - "colab": {} - }, - "source": [ - "data = chartify.examples.example_data()\n", - "data.head()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "IJuO_A6JZ0bd", - "colab_type": "text" - }, - "source": [ - "## Pivoted data: INVALID\n", - "- Pivoted data is not Tidy (note the `country` dimension has an observation in each column)" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "0ZDERY1IZ0be", - "colab_type": "code", - "colab": {} - }, - "source": [ - "pivoted_data = pd.pivot_table(data, columns='country', values='quantity', index='fruit', aggfunc='sum')\n", - "pivoted_data" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "H90ZjlQGZ0bk", - "colab_type": "text" - }, - "source": [ - "### Melting pivoted data: 
VALID\n", - "- You can use pandas.melt to convert pivoted data into the tidy data format.\n", - "- The output of SQL queries with `groupby` produces output in tidy format." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "KjnJ2nCGZ0bl", - "colab_type": "code", - "colab": {} - }, - "source": [ - "value_columns = pivoted_data.columns\n", - "\n", - "melted_data = pd.melt(pivoted_data.reset_index(), # Need to reset the index to put \"fruit\" into a column.\n", - " id_vars='fruit',\n", - " value_vars=value_columns)\n", - "melted_data.head()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "YrKCx0PvZ0br", - "colab_type": "text" - }, - "source": [ - "## Pandas series: INVALID\n", - "- Data in a pandas Series must be converted to a DataFrame for use with Chartify." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "M-s6mT1nZ0bs", - "colab_type": "code", - "colab": {} - }, - "source": [ - "data.groupby(['country'])['quantity'].sum()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "EdhA0beAZ0b0", - "colab_type": "text" - }, - "source": [ - "## Pandas index: INVALID\n", - "- The output below is a pandas DataFrame, but the country dimension is in the Index." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "hIu2pIRFZ0b3", - "colab_type": "code", - "colab": {} - }, - "source": [ - "data.groupby(['country'])[['quantity']].sum()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Rqf-6PfFZ0b6", - "colab_type": "text" - }, - "source": [ - "## Pandas DataFrame: VALID\n", - "- The code below produces a valid pandas DataFrame for use with Chartify.\n", - "- Notice how the country dimension is now in a column." 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "o6zoe2k2Z0b7", - "colab_type": "code", - "colab": {} - }, - "source": [ - "chart_data = data.groupby(['country'])['quantity'].sum().reset_index()\n", - "chart_data" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "GIQnB5H0Z0b-", - "colab_type": "text" - }, - "source": [ - "# Axis types\n", - "- Specify the x_axis_type and y_axis_type parameters when instantiating the chart object.\n", - "- Both are set to `linear` by default.\n", - "- Look at the chart object documentation to see the list of available options for x_axis_type and y_axis_type" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "DAGhP9U0Z0b-", - "colab_type": "code", - "colab": {} - }, - "source": [ - "chartify.Chart?" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "7Ys9oOiTZ0cA", - "colab_type": "text" - }, - "source": [ - "- __The Chart axis types influence the plots that are available__\n", - "- Look at how the plot methods change based on the axis types:" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "PX4KmoSqZ0cA", - "colab_type": "code", - "colab": {} - }, - "source": [ - "ch = chartify.Chart(x_axis_type='datetime',\n", - " y_axis_type='linear')" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "OOjqYX-gZ0cC", - "colab_type": "text" - }, - "source": [ - "When you've executed the cell above, the tab complete below will change" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "bAmGL2NBZ0cD", - "colab_type": "code", - "colab": {} - }, - "source": [ - "ch.plot." 
- ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "prY5XAJyZ0cG", - "colab_type": "code", - "colab": {} - }, - "source": [ - "ch = chartify.Chart(x_axis_type='categorical',\n", - " y_axis_type='linear')" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "RmbxIAmYZ0cI", - "colab_type": "text" - }, - "source": [ - "And again" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "7HfHeBrwZ0cI", - "colab_type": "code", - "colab": {} - }, - "source": [ - "ch.plot." - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Vl5Lrb-sZ0cL", - "colab_type": "text" - }, - "source": [ - "- __Your turn__: Create a chart with 'density' y and 'linear' x axis types. What type of plots are available?" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "dkSZn7ClZ0cM", - "colab_type": "code", - "colab": {} - }, - "source": [ - "ch = chartify.Chart(# Your code goes here)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "HGLOAzRtZ0cO", - "colab_type": "text" - }, - "source": [ - "# Vertical Bar plot\n", - "- __Your turn__: Create a bar plot based on the dataframe below." 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "E1vGCngyZ0cP", - "colab_type": "code", - "colab": {} - }, - "source": [ - "bar_data = (data.groupby('country')[['quantity']].sum()\n", - " .reset_index()\n", - " )\n", - "bar_data" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "AmDNwpxOZ0cR", - "colab_type": "code", - "colab": {} - }, - "source": [ - "# Implement the bar plot here.\n", - "# Set the appropriate x_axis_type otherwise the bar method won't be available.\n", - "# Look at the bar documentation to figure out how to pass in the parameters.\n", - "# If you get stuck move on to the next section for hints.\n", - "ch = chartify.Chart(# Your code goes here)\n" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "tbfcSistZ0cU", - "colab_type": "text" - }, - "source": [ - "# Examples\n", - "- Chartify includes many examples. They're a good starting point if you're trying to create a chart that you're unfamiliar with." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "P5X9vjteZ0cV", - "colab_type": "code", - "colab": {} - }, - "source": [ - "chartify.examples." - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "FE44ouJ9Z0cX", - "colab_type": "text" - }, - "source": [ - "- Run the appropriate method to see examples and the corresponding code that generates the example." - ] - }, - { - "cell_type": "code", - "metadata": { - "scrolled": false, - "id": "7wVsbTK6Z0cY", - "colab_type": "code", - "colab": {} - }, - "source": [ - "chartify.examples.plot_bar()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "lCt5g-GxZ0cb", - "colab_type": "text" - }, - "source": [ - "# Bar plot - Horizontal vs. Vertical\n", - "- Copy your bar plot here, but make it horizantal instead of vertical. Look to the example above if you get stuck." 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "iDbNwP3lZ0cc", - "colab_type": "code", - "colab": {} - }, - "source": [ - "" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NJKLG50xZ0ce", - "colab_type": "text" - }, - "source": [ - "# Grouped bar plot\n", - "- __Your Turn__: Create a grouped bar plot with the data below." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "0pKwSpKoZ0cf", - "colab_type": "code", - "colab": {} - }, - "source": [ - "grouped_bar_data = (data.groupby(['country', 'fruit'])[['quantity']].sum()\n", - " .reset_index()\n", - " )\n", - "grouped_bar_data" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "k_K7bMi1Z0ch", - "colab_type": "code", - "colab": {} - }, - "source": [ - "# Implement the grouped bar plot here.\n", - "# Look at the example for help if you get stuck.\n" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "CuVwfV7sZ0cj", - "colab_type": "text" - }, - "source": [ - "# Color palette types\n", - "- Chartify includes 4 different color palette types: `categorical`, `accent`, `sequential`, `diverging`.\n", - "- Note the differences in the examples below" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "slOfGhTgZ0ck", - "colab_type": "code", - "colab": {} - }, - "source": [ - "\n", - "chartify.examples.style_color_palette_categorical()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "lpkqhtWHZ0cp", - "colab_type": "code", - "colab": {} - }, - "source": [ - "chartify.examples.style_color_palette_accent()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "YU61XMPyZ0cu", - "colab_type": "code", - "colab": {} - }, - "source": [ - "chartify.examples.style_color_palette_diverging()" - ], - "execution_count": 0, - "outputs": [] - }, - { - 
"cell_type": "code", - "metadata": { - "id": "EnQHgFh2Z0cx", - "colab_type": "code", - "colab": {} - }, - "source": [ - "chartify.examples.style_color_palette_sequential()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "8lBvY5oxZ0cy", - "colab_type": "text" - }, - "source": [ - "# Color palettes\n", - "- Chartify includes a set of pre-defined color palettes:" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "LfrgTVJOZ0cz", - "colab_type": "code", - "colab": {} - }, - "source": [ - "chartify.color_palettes" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "qHlj7L0MZ0c1", - "colab_type": "text" - }, - "source": [ - "- Use .show() to see the colors associated with each:" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "RaULTM8lZ0c1", - "colab_type": "code", - "colab": {} - }, - "source": [ - "chartify.color_palettes.show()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "6BXyWMKhZ0c3", - "colab_type": "text" - }, - "source": [ - "- Assign the color palettes with `.set_color_palette`" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "AGM7KG4CZ0c3", - "colab_type": "code", - "colab": {} - }, - "source": [ - "ch = chartify.Chart(x_axis_type='categorical',\n", - " blank_labels=True)\n", - "ch.style.set_color_palette('categorical', 'Dark2')\n", - "ch.plot.bar(data_frame=grouped_bar_data,\n", - " categorical_columns=['fruit', 'country'],\n", - " numeric_column='quantity',\n", - " color_column='fruit')\n", - "ch.show()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "OhiFUPJzZ0c5", - "colab_type": "text" - }, - "source": [ - "- Color palette objects include methods for manipulation. 
See the examples below:" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "JSX-jSUBZ0c7", - "colab_type": "code", - "colab": {} - }, - "source": [ - "dark2 = chartify.color_palettes['Dark2']\n", - "dark2.show()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "fZFlQepdZ0c9", - "colab_type": "text" - }, - "source": [ - "- Sort" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "-Vo7hImIZ0c-", - "colab_type": "code", - "colab": {} - }, - "source": [ - "sorted_dark2 = dark2.sort_by_hue()\n", - "sorted_dark2.show()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "gH4OgUxaZ0dE", - "colab_type": "text" - }, - "source": [ - "- Expand" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "LlO9L0GWZ0dF", - "colab_type": "code", - "colab": {} - }, - "source": [ - "dark2.expand_palette(20).show()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "anRR5bevZ0dI", - "colab_type": "text" - }, - "source": [ - "- Shift" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "8pt7BvrrZ0dJ", - "colab_type": "code", - "colab": {} - }, - "source": [ - "shifted_dark2 = dark2.shift_palette('white', percent=20)\n", - "shifted_dark2.show()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "hsuTErTSZ0dM", - "colab_type": "text" - }, - "source": [ - "- Assign the shifted color palette to a chart:" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "RPATijafZ0dN", - "colab_type": "code", - "colab": {} - }, - "source": [ - "ch = chartify.Chart(x_axis_type='categorical',\n", - " blank_labels=True)\n", - "ch.style.set_color_palette('categorical', shifted_dark2)\n", - "ch.plot.bar(data_frame=grouped_bar_data,\n", - " categorical_columns=['fruit', 'country'],\n", - " numeric_column='quantity',\n", - " color_column='fruit')\n", - 
"ch.show()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "5zx2uv5CZ0dP", - "colab_type": "text" - }, - "source": [ - "# Layouts\n", - "- Chartify layouts are tailored toward use in slides.\n", - "- Notice how the output changes for each of the slide layout options below:" - ] - }, - { - "cell_type": "code", - "metadata": { - "scrolled": false, - "id": "9-E55feQZ0dR", - "colab_type": "code", - "colab": {} - }, - "source": [ - "layout_options = ['slide_100%', 'slide_75%', 'slide_50%', 'slide_25%']\n", - "for option in layout_options:\n", - " ch = chartify.Chart(layout=option, blank_labels=True, x_axis_type='categorical')\n", - " ch.set_title('Layout: {}'.format(option))\n", - " ch.plot.bar(data_frame=grouped_bar_data,\n", - " categorical_columns=['fruit', 'country'],\n", - " numeric_column='quantity',\n", - " color_column='fruit')\n", - "\n", - " ch.show()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "p7EdK32aZ0dS", - "colab_type": "text" - }, - "source": [ - "# Advanced usage with Bokeh\n", - "- Chartify is built on top of another visualization package called [Bokeh](http://bokeh.pydata.org/en/latest/)\n", - "- The example below shows how you can access the Bokeh [figure](https://bokeh.pydata.org/en/latest/docs/reference/plotting.html#bokeh.plotting.figure.Figure) from a Chartify chart object." 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "gcvE5FqZZ0dS", - "colab_type": "code", - "colab": {} - }, - "source": [ - "ch = chartify.Chart(blank_labels=True, x_axis_type='categorical')\n", - "ch.plot.bar(data_frame=grouped_bar_data,\n", - " categorical_columns=['fruit', 'country'],\n", - " numeric_column='quantity',\n", - " color_column='fruit')\n", - "ch.figure" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Cz_c8Em0Z0dU", - "colab_type": "text" - }, - "source": [ - "- The following example shows how you can modify attributes not exposed in Chartify by accessing the Bokeh figure. See [Bokeh](http://bokeh.pydata.org/en/latest/) documentation for more details." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "Dvp7SRTZZ0dV", - "colab_type": "code", - "colab": {} - }, - "source": [ - "ch.figure.xaxis.axis_label_text_font_size = '30pt'\n", - "ch.figure.xaxis.axis_label_text_color = 'red'\n", - "ch.figure.height = 400\n", - "ch.axes.set_xaxis_label('A large xaxis label')\n", - "ch.show()" - ], - "execution_count": 0, - "outputs": [] - } - ] -} \ No newline at end of file diff --git a/06_chartify_tutorial.py b/06_chartify_tutorial.py new file mode 100644 index 0000000..f7f1770 --- /dev/null +++ b/06_chartify_tutorial.py @@ -0,0 +1,347 @@ +# -*- coding: utf-8 -*- +"""06 Chartify Tutorial.ipynb + +Automatically generated by Colaboratory. + +Original file is located at + https://colab.research.google.com/github/adaapp/dav-introductionToPandas/blob/master/06%20Chartify%20Tutorial.ipynb + +Very slightly adapted from the developers' original tutorial [here](https://github.com/spotify/chartify/blob/master/examples/Chartify%20Tutorial.ipynb) + +


+ +""" + +!pip install chartify + +# Copyright (c) 2017-2018 Spotify AB +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import chartify +import pandas as pd + +# needed to make the examples work in the notebook +chartify.examples._OUTPUT_FORMAT = 'html' + +"""# Chart object +- Run the cell below to instantiate a chart and assign it to a variable +""" + +ch = chartify.Chart() + +"""- Use .show() to render the chart.""" + +ch.show() + +"""- Note that the chart is blank at this point. +- The default labels provide directions for how to override their values. + +# Adding chart labels +- __Your turn__: Add labels to the following chart. Look at the default values for instruction. +- Title +- Subtitle +- Source +- X label +- Y label +""" + +ch = chartify.Chart() +# Add code here to overwrite the labels + + +ch.show() + +"""# Getting help +- From within a jupyter notebook you can see the available attributes of the chart object by pressing "tab" +- Select the space just after the "." character below and hit tab. +""" + +ch = chartify.Chart() +ch. + +"""- You can also use "?" to pull up documentation for objects and methods. +- Run the cell below to pull up the chartify.Chart documentation +""" + +chartify.Chart? + +"""- This can also be accomplished by pressing "shift + tab". +- Press "shift + tab" twice to see the expanded documentation. +- Try it with the next cell. 
+""" + +chartify.Chart + +"""# Callouts +- The chart object has a callout object (ch.callout) that contains methods for adding callouts to the chart. +- Callouts can be used to add text, lines, or shaded areas to annotate parts of your chart. +- __Your Turn:__ Fill in the code below to add a text callout that says "hi" at coordinate (10, 10) +- Look up the documentation for ch.callout.text if you need help +""" + +ch = chartify.Chart() +#ch.callout.text() + +ch.show() + +"""- Use tab below to see what callouts are available.""" + +ch.callout. + +"""# Axes +- The axes object contains methods for setting or getting axis properties. + +- __Your turn__: modify the chart below so the xaxis range goes from 0 to 100 +""" + +ch = chartify.Chart() +ch.callout.text('hi', 10, 10) +# Add code here to modify the xrange to (0, 100) +ch.show() + +"""# Method chaining +- Chart methods can be chained by wrapping the statements in parentheses. See the example below: +""" + +(chartify.Chart(blank_labels=True) + .callout.text('hi', 10, 10) + .axes.set_xaxis_range(0, 100) + .show() +) + +"""# Plotting + +## Input data format +Chartify expects the input data to be: +- Tidy (Each variable has its own column, each row corresponds to an observation) +- In the columns of a Pandas DataFrame. + +Below we'll explore some examples of valid and invalid input data. + +- Run this cell to generate an example dataset +""" + +data = chartify.examples.example_data() +data.head() + +"""## Pivoted data: INVALID +- Pivoted data is not Tidy (note the `country` dimension has an observation in each column) +""" + +pivoted_data = pd.pivot_table(data, columns='country', values='quantity', index='fruit', aggfunc='sum') +pivoted_data + +"""### Melting pivoted data: VALID +- You can use pandas.melt to convert pivoted data into the tidy data format. +- SQL queries with a `GROUP BY` clause already produce output in tidy format. 
+""" + +value_columns = pivoted_data.columns + +melted_data = pd.melt(pivoted_data.reset_index(), # Need to reset the index to put "fruit" into a column. + id_vars='fruit', + value_vars=value_columns) +melted_data.head() + +"""## Pandas series: INVALID +- Data in a pandas Series must be converted to a DataFrame for use with Chartify. +""" + +data.groupby(['country'])['quantity'].sum() + +"""## Pandas index: INVALID +- The output below is a pandas DataFrame, but the country dimension is in the Index. +""" + +data.groupby(['country'])[['quantity']].sum() + +"""## Pandas DataFrame: VALID +- The code below produces a valid pandas DataFrame for use with Chartify. +- Notice how the country dimension is now in a column. +""" + +chart_data = data.groupby(['country'])['quantity'].sum().reset_index() +chart_data + +"""# Axis types +- Specify the x_axis_type and y_axis_type parameters when instantiating the chart object. +- Both are set to `linear` by default. +- Look at the chart object documentation to see the list of available options for x_axis_type and y_axis_type +""" + +chartify.Chart? + +"""- __The Chart axis types influence the plots that are available__ +- Look at how the plot methods change based on the axis types: +""" + +ch = chartify.Chart(x_axis_type='datetime', + y_axis_type='linear') + +"""When you've executed the cell above, the tab complete below will change""" + +ch.plot. + +ch = chartify.Chart(x_axis_type='categorical', + y_axis_type='linear') + +"""And again""" + +ch.plot. + +"""- __Your turn__: Create a chart with 'density' y and 'linear' x axis types. What type of plots are available?""" + +ch = chartify.Chart(# Your code goes here) + +"""# Vertical Bar plot +- __Your turn__: Create a bar plot based on the dataframe below. +""" + +bar_data = (data.groupby('country')[['quantity']].sum() + .reset_index() + ) +bar_data + +# Implement the bar plot here. +# Set the appropriate x_axis_type otherwise the bar method won't be available. 
+# Look at the bar documentation to figure out how to pass in the parameters. +# If you get stuck, move on to the next section for hints. +ch = chartify.Chart(# Your code goes here) + +"""# Examples +- Chartify includes many examples. They're a good starting point if you're trying to create a chart that you're unfamiliar with. +""" + +chartify.examples. + +"""- Run the appropriate method to see examples and the corresponding code that generates the example.""" + +chartify.examples.plot_bar() + +"""# Bar plot - Horizontal vs. Vertical +- Copy your bar plot here, but make it horizontal instead of vertical. Look to the example above if you get stuck. +""" + + + +"""# Grouped bar plot +- __Your Turn__: Create a grouped bar plot with the data below. +""" + +grouped_bar_data = (data.groupby(['country', 'fruit'])[['quantity']].sum() + .reset_index() + ) +grouped_bar_data + +# Implement the grouped bar plot here. +# Look at the example for help if you get stuck. + +"""# Color palette types +- Chartify includes 4 different color palette types: `categorical`, `accent`, `sequential`, `diverging`. +- Note the differences in the examples below +""" + +chartify.examples.style_color_palette_categorical() + +chartify.examples.style_color_palette_accent() + +chartify.examples.style_color_palette_diverging() + +chartify.examples.style_color_palette_sequential() + +"""# Color palettes +- Chartify includes a set of pre-defined color palettes: +""" + +chartify.color_palettes + +"""- Use .show() to see the colors associated with each:""" + +chartify.color_palettes.show() + +"""- Assign the color palettes with `.set_color_palette`""" + +ch = chartify.Chart(x_axis_type='categorical', + blank_labels=True) +ch.style.set_color_palette('categorical', 'Dark2') +ch.plot.bar(data_frame=grouped_bar_data, + categorical_columns=['fruit', 'country'], + numeric_column='quantity', + color_column='fruit') +ch.show() + +"""- Color palette objects include methods for manipulation. 
See the examples below:""" + +dark2 = chartify.color_palettes['Dark2'] +dark2.show() + +"""- Sort""" + +sorted_dark2 = dark2.sort_by_hue() +sorted_dark2.show() + +"""- Expand""" + +dark2.expand_palette(20).show() + +"""- Shift""" + +shifted_dark2 = dark2.shift_palette('white', percent=20) +shifted_dark2.show() + +"""- Assign the shifted color palette to a chart:""" + +ch = chartify.Chart(x_axis_type='categorical', + blank_labels=True) +ch.style.set_color_palette('categorical', shifted_dark2) +ch.plot.bar(data_frame=grouped_bar_data, + categorical_columns=['fruit', 'country'], + numeric_column='quantity', + color_column='fruit') +ch.show() + +"""# Layouts +- Chartify layouts are tailored toward use in slides. +- Notice how the output changes for each of the slide layout options below: +""" + +layout_options = ['slide_100%', 'slide_75%', 'slide_50%', 'slide_25%'] +for option in layout_options: + ch = chartify.Chart(layout=option, blank_labels=True, x_axis_type='categorical') + ch.set_title('Layout: {}'.format(option)) + ch.plot.bar(data_frame=grouped_bar_data, + categorical_columns=['fruit', 'country'], + numeric_column='quantity', + color_column='fruit') + + ch.show() + +"""# Advanced usage with Bokeh +- Chartify is built on top of another visualization package called [Bokeh](http://bokeh.pydata.org/en/latest/) +- The example below shows how you can access the Bokeh [figure](https://bokeh.pydata.org/en/latest/docs/reference/plotting.html#bokeh.plotting.figure.Figure) from a Chartify chart object. +""" + +ch = chartify.Chart(blank_labels=True, x_axis_type='categorical') +ch.plot.bar(data_frame=grouped_bar_data, + categorical_columns=['fruit', 'country'], + numeric_column='quantity', + color_column='fruit') +ch.figure + +"""- The following example shows how you can modify attributes not exposed in Chartify by accessing the Bokeh figure. 
See [Bokeh](http://bokeh.pydata.org/en/latest/) documentation for more details.""" + +ch.figure.xaxis.axis_label_text_font_size = '30pt' +ch.figure.xaxis.axis_label_text_color = 'red' +ch.figure.height = 400 +ch.axes.set_xaxis_label('A large xaxis label') +ch.show() \ No newline at end of file diff --git a/06a Overview of visualisation libraries.ipynb b/06a Overview of visualisation libraries.ipynb deleted file mode 100644 index 44a8026..0000000 --- a/06a Overview of visualisation libraries.ipynb +++ /dev/null @@ -1,158 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "hide_input": false, - "slideshow": { - "slide_type": "slide" - } - }, - "outputs": [], - "source": [ - "import matplotlib.pyplot as plt\n", - "fig, ax = plt.subplots()\n", - "fig.suptitle(\"Family tree of python plotting packages\")\n", - "ax.set_xlim(0,300)\n", - "ax.set_ylim(0,100)\n", - "ax.axison = False\n", - "ax.annotate(\"matplotlib.pyplot\",(0,50))\n", - "ax.annotate(\"pandas.plot\",(80,50),(155,80),arrowprops=dict(arrowstyle='<-'))\n", - "ax.annotate(\"seaborn\",(80,50),(155,50),arrowprops=dict(arrowstyle='<-'))\n", - "ax.annotate(\"bokeh\",(80,50),(155,20),arrowprops=dict(arrowstyle='<-'))\n", - "ax.annotate(\"holoviews\",(185,20),(255,50),arrowprops=dict(arrowstyle='<-'))\n", - "ax.annotate(\"chartify\",(185,20),(255,20),arrowprops=dict(arrowstyle='<-'))\n", - "ax.annotate(\"plotly\",(25,75))\n", - "ax.annotate(\"dash\",(50,75),(95,90),arrowprops=dict(arrowstyle='<-'));" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "- [matplotlib](https://matplotlib.org/)\n", - " - The oldest and most developed library\n", - " - Highly flexible\n", - " - Steep learning curve\n", - " - Two distinct styles of use which can be confusing because different examples might use different approaches\n", - " - Can be used to draw pretty much anything not just graphs" - ] - }, - { - "cell_type": 
"markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "- [pandas plotting](https://pandas.pydata.org/pandas-docs/stable/visualization.html)\n", - " - Built on matplotlib.pyplot\n", - " - Plot directly from a dataframe\n", - " - Okay defaults but often need to access underlying pyplot anyway" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "- [seaborn](https://seaborn.pydata.org/)\n", - " - Also built on matplotlib.pyplot but better aesthetics\n", - " - Easy to create [composite plots with multiple factors](https://seaborn.pydata.org/examples/index.html#example-gallery)\n", - " - Automatically builds in things like regression lines with confidence intervals\n", - " - Static, though does work well with jupyter's interactive widgets" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "- [bokeh](https://bokeh.pydata.org/en/latest/)\n", - " - Again, built on matplotlib.pyplot\n", - " - Easy to create very interactive plots\n", - " - Works best as a live web app\n", - " - Doesn't always play well in jupyter notebooks" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "- [holoviews](http://holoviews.org/)\n", - " - Built on bokeh for interaction, but can also work directly with pyplot for static plots\n", - " - Describe the semantics of the data and let holoviews decide what to do with it" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "- [chartify](https://github.com/spotify/chartify)\n", - " - Spotify's take on bokeh\n", - " - Intuitive workflow\n", - " - Good choice of defaults\n", - " - Consistent naming of parts" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "- 
[plotly](https://github.com/plotly/plotly.py/blob/master/README.md)\n", - " - Not built on pyplot\n", - " - Hands off the plotting to a javascript library\n", - "- [dash](https://plot.ly/)\n", - " - Built on plotly\n", - " - Good for building data dashboards\n", - " - Has to run as a web app" - ] - } - ], - "metadata": { - "celltoolbar": "Slideshow", - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.6" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/06a_overview_of_visualisation_libraries.py b/06a_overview_of_visualisation_libraries.py new file mode 100644 index 0000000..7c82ccc --- /dev/null +++ b/06a_overview_of_visualisation_libraries.py @@ -0,0 +1,66 @@ +# -*- coding: utf-8 -*- +"""06a Overview of visualisation libraries.ipynb + +Automatically generated by Colaboratory. 
+ +Original file is located at + https://colab.research.google.com/github/adaapp/dav-introductionToPandas/blob/master/06a%20Overview%20of%20visualisation%20libraries.ipynb +""" + +import matplotlib.pyplot as plt +fig, ax = plt.subplots() +fig.suptitle("Family tree of python plotting packages") +ax.set_xlim(0,300) +ax.set_ylim(0,100) +ax.axison = False +ax.annotate("matplotlib.pyplot",(0,50)) +ax.annotate("pandas.plot",(80,50),(155,80),arrowprops=dict(arrowstyle='<-')) +ax.annotate("seaborn",(80,50),(155,50),arrowprops=dict(arrowstyle='<-')) +ax.annotate("bokeh",(155,20)) +ax.annotate("holoviews",(185,20),(255,50),arrowprops=dict(arrowstyle='<-')) +ax.annotate("chartify",(185,20),(255,20),arrowprops=dict(arrowstyle='<-')) +ax.annotate("plotly",(25,75)) +ax.annotate("dash",(50,75),(95,90),arrowprops=dict(arrowstyle='<-')); + +"""- [matplotlib](https://matplotlib.org/) + - The oldest and most developed library + - Highly flexible + - Steep learning curve + - Two distinct styles of use which can be confusing because different examples might use different approaches + - Can be used to draw pretty much anything, not just graphs + +- [pandas plotting](https://pandas.pydata.org/pandas-docs/stable/visualization.html) + - Built on matplotlib.pyplot + - Plot directly from a dataframe + - Okay defaults but often need to access underlying pyplot anyway + +- [seaborn](https://seaborn.pydata.org/) + - Also built on matplotlib.pyplot but better aesthetics + - Easy to create [composite plots with multiple factors](https://seaborn.pydata.org/examples/index.html#example-gallery) + - Automatically builds in things like regression lines with confidence intervals + - Static, though does work well with jupyter's interactive widgets + +- [bokeh](https://bokeh.pydata.org/en/latest/) + - Not built on matplotlib; renders interactive plots in the browser via its own BokehJS library + - Easy to create very interactive plots + - Works best as a live web app + - Doesn't always play well in jupyter notebooks + +- 
[holoviews](http://holoviews.org/) + - Built on bokeh for interaction, but can also work directly with pyplot for static plots + - Describe the semantics of the data and let holoviews decide what to do with it + +- [chartify](https://github.com/spotify/chartify) + - Spotify's take on bokeh + - Intuitive workflow + - Good choice of defaults + - Consistent naming of parts + +- [plotly](https://github.com/plotly/plotly.py/blob/master/README.md) + - Not built on pyplot + - Hands off the plotting to a javascript library +- [dash](https://plot.ly/) + - Built on plotly + - Good for building data dashboards + - Has to run as a web app +""" \ No newline at end of file diff --git a/06b Holoviews.ipynb b/06b Holoviews.ipynb deleted file mode 100644 index 141d0a2..0000000 --- a/06b Holoviews.ipynb +++ /dev/null @@ -1,223 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "outputs": [], - "source": [ - "import pandas as pd\n", - "import holoviews as hv\n", - "hv.extension('bokeh')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "outputs": [], - "source": [ - "dataframe = pd.read_csv(\"Anthropometric data.csv\", thousands=',')\n", - "len(dataframe)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "We're going to take some of this `pandas.DataFrame` and turn it into a `holoviews.Table`.\n", - "\n", - "Holoviews will want to know which columns we are thinking of as *key dimensions* and which are *value dimensions*.\n", - "\n", - "Key and value dimensions are what we might otherwise call:\n", - "* independent and dependent variables\n", - "* explanatory and response variables\n", - "* predictor and response variables\n", - "* $X$ and $Y$" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": 
"subslide" - } - }, - "source": [ - "Suppose we want to investigate `Weight_N` and `Height_mm` as functions of both `Age_months` and `Gender`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "table = hv.Table(dataframe, vdims=[\"Weight_N\",\"Height_mm\"], kdims=[\"Age_months\",\"Gender\"])\n", - "table" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "Now we can get an interactive scatter plot in one line:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": false, - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "hv.HoloMap(table.to.scatter(kdims=[\"Height_mm\"],vdims=[\"Weight_N\"]))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "It's important to understand what's going on behind the scenes. The HoloMap function here is actually pregenerating a separate scatter plot for every possible pair of values of `Age_months` and `Gender`, which is why it takes a while to return." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "Perhaps comparative boxplots would be more useful:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": false, - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "hv.HoloMap(table.to.box(vdims=[\"Height_mm\"],kdims=[\"Gender\"],groupby=\"Age_months\"))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dataframe[\"Age_years\"]=dataframe[\"Age_months\"]//12" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "dataframe[\"Age_years\"]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true, - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "table = hv.Table(dataframe, vdims=[\"Weight_N\",\"Height_mm\"], kdims=[\"Age_years\",\"Gender\"])\n", - "table" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": false, - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "hv.HoloMap(table.to.scatter(kdims=[\"Height_mm\"],vdims=[\"Weight_N\"]))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": false, - "slideshow": { - "slide_type": "-" - } - }, - "outputs": [], - "source": [ - "hv.HoloMap(table.to.box(vdims=[\"Height_mm\"],kdims=[\"Gender\"],groupby=\"Age_years\"))" - ] - } - ], - "metadata": { - "celltoolbar": "Slideshow", - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": 
"3.6.6" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/06b_holoviews.py b/06b_holoviews.py new file mode 100644 index 0000000..6809dde --- /dev/null +++ b/06b_holoviews.py @@ -0,0 +1,53 @@ +# -*- coding: utf-8 -*- +"""06b Holoviews.ipynb + +Automatically generated by Colaboratory. + +Original file is located at + https://colab.research.google.com/github/adaapp/dav-introductionToPandas/blob/master/06b%20Holoviews.ipynb +""" + +import pandas as pd +import holoviews as hv +hv.extension('bokeh') + +dataframe = pd.read_csv("Anthropometric data.csv", thousands=',') +len(dataframe) + +"""We're going to take some of this `pandas.DataFrame` and turn it into a `holoviews.Table`. + +Holoviews will want to know which columns we are thinking of as *key dimensions* and which are *value dimensions*. + +Key and value dimensions are what we might otherwise call: +* independent and dependent variables +* explanatory and response variables +* predictor and response variables +* $X$ and $Y$ + +Suppose we want to investigate `Weight_N` and `Height_mm` as functions of both `Age_months` and `Gender`. +""" + +table = hv.Table(dataframe, vdims=["Weight_N","Height_mm"], kdims=["Age_months","Gender"]) +table + +"""Now we can get an interactive scatter plot in one line:""" + +hv.HoloMap(table.to.scatter(kdims=["Height_mm"],vdims=["Weight_N"])) + +"""It's important to understand what's going on behind the scenes. The HoloMap function here is actually pregenerating a separate scatter plot for every possible pair of values of `Age_months` and `Gender`, which is why it takes a while to return. 
+ +Perhaps comparative boxplots would be more useful: +""" + +hv.HoloMap(table.to.box(vdims=["Height_mm"],kdims=["Gender"],groupby="Age_months")) + +dataframe["Age_years"]=dataframe["Age_months"]//12 + +dataframe["Age_years"] + +table = hv.Table(dataframe, vdims=["Weight_N","Height_mm"], kdims=["Age_years","Gender"]) +table + +hv.HoloMap(table.to.scatter(kdims=["Height_mm"],vdims=["Weight_N"])) + +hv.HoloMap(table.to.box(vdims=["Height_mm"],kdims=["Gender"],groupby="Age_years")) \ No newline at end of file diff --git a/07 Groundhog Day.ipynb b/07 Groundhog Day.ipynb deleted file mode 100644 index 1ebf09f..0000000 --- a/07 Groundhog Day.ipynb +++ /dev/null @@ -1,197 +0,0 @@ -{ - "nbformat": 4, - "nbformat_minor": 0, - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.6" - }, - "colab": { - "name": "07 Groundhog Day.ipynb", - "provenance": [] - } - }, - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "4x_19PZVeEdK", - "colab_type": "text" - }, - "source": [ - "# Can Punxsutawney Phil really predict the weather?\n", - "\n", - "> Punxsutawney Phil is the name of a groundhog in Punxsutawney, Pennsylvania. On February 2 (Groundhog Day) each year, the borough of Punxsutawney celebrates the legendary groundhog with a festive atmosphere of music and food. During the ceremony, which begins well before the winter sunrise, Phil emerges from his temporary home on Gobbler's Knob, located in a rural area about 2 miles (3 km) southeast of town. According to the tradition, if Phil sees his shadow and returns to his hole, he has predicted six more weeks of winter-like weather. 
If Phil does not see his shadow, he has predicted an \"early spring.\" The date of Phil's prognostication is known as Groundhog Day in the United States and Canada, and has been celebrated since 1887. Punxsutawney Phil became an international celebrity thanks to the 1993 movie Groundhog Day." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "xnuk--yceEdL", - "colab_type": "text" - }, - "source": [ - "Historical data regarding Groundhog Day is available in the file `groundhog.csv`\n", - "\n", - "The average temperature is recorded for February and March in\n", - "\n", - "- Pennsylvania (which is in the north-eastern USA)\n", - "- The North-East more widely\n", - "- The Mid-Western US\n", - "- The whole country\n", - "\n", - "To start\n", - "\n", - "- Import the data\n", - "- Check the datatypes\n", - "- Make the year column a datetime object\n", - "- Set the year as the index\n", - "\n", - "Some initial exploration\n", - "\n", - "- How frequently does Phil see his full shadow?\n", - "- How do temperatures in the different regions compare?\n", - "- Has / how has average temperature changed over time?\n", - "\n", - "The main question\n", - "\n", - "- Is there a significant difference between average temperatures when Phil has or has not seen his shadow?\n", - " - In Pennsylvania? 
More widely?\n", - " \n", - "You might find the following functions useful\n", - "\n", - "- `df.rename({'one': 'foo', 'two': 'bar'}, axis='columns')`" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "psHWd4uieEdN", - "colab_type": "code", - "colab": {} - }, - "source": [ - "import pandas as pd" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "QPr-fJLjeEdR", - "colab_type": "code", - "colab": {} - }, - "source": [ - "from scipy import stats" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "4cHnWh0YeEdU", - "colab_type": "code", - "colab": {} - }, - "source": [ - "phil = pd.read_csv(\"https://raw.githubusercontent.com/adaapp/dav-introductionToPandas/master/groundhog.csv\")" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "Rw9hGuFbeEdX", - "colab_type": "code", - "colab": {} - }, - "source": [ - "phil = phil.dropna()[:-1]" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "3Z_3s7TYeEdZ", - "colab_type": "code", - "colab": {} - }, - "source": [ - "phil = phil.rename({\"Punxsutawney Phil\":\"Shadow\"}, axis='columns')" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "pM97slEOeEdc", - "colab_type": "code", - "colab": {} - }, - "source": [ - "phil.query('Shadow == \"Full Shadow\"').mean()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "8uZciUvfeEde", - "colab_type": "code", - "colab": {} - }, - "source": [ - "phil.query('Shadow == \"No Shadow\"').mean()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "Uee3CKOOeEdh", - "colab_type": "code", - "colab": {} - }, - "source": [ - "phil.query('Shadow == \"No Shadow\"').mean() - phil.query('Shadow == \"Full Shadow\"').mean()" - ], - "execution_count": 
0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "iRjFHbCheEdj", - "colab_type": "code", - "colab": {} - }, - "source": [ - "stats.ttest_ind(phil.query('Shadow == \"No Shadow\"')[\"February Average Temperature\"],phil.query('Shadow == \"Full Shadow\"')[\"February Average Temperature\"], equal_var=False)" - ], - "execution_count": 0, - "outputs": [] - } - ] -} \ No newline at end of file diff --git a/07_groundhog_day.py b/07_groundhog_day.py new file mode 100644 index 0000000..72e100a --- /dev/null +++ b/07_groundhog_day.py @@ -0,0 +1,61 @@ +# -*- coding: utf-8 -*- +"""07 Groundhog Day.ipynb + +Automatically generated by Colaboratory. + +Original file is located at + https://colab.research.google.com/github/adaapp/dav-introductionToPandas/blob/master/07%20Groundhog%20Day.ipynb + +# Can Punxsutawney Phil really predict the weather? + +> Punxsutawney Phil is the name of a groundhog in Punxsutawney, Pennsylvania. On February 2 (Groundhog Day) each year, the borough of Punxsutawney celebrates the legendary groundhog with a festive atmosphere of music and food. During the ceremony, which begins well before the winter sunrise, Phil emerges from his temporary home on Gobbler's Knob, located in a rural area about 2 miles (3 km) southeast of town. According to the tradition, if Phil sees his shadow and returns to his hole, he has predicted six more weeks of winter-like weather. If Phil does not see his shadow, he has predicted an "early spring." The date of Phil's prognostication is known as Groundhog Day in the United States and Canada, and has been celebrated since 1887. Punxsutawney Phil became an international celebrity thanks to the 1993 movie Groundhog Day. 
+ +Historical data regarding Groundhog Day is available in the file `groundhog.csv` + +The average temperature is recorded for February and March in + +- Pennsylvania (which is in the north-eastern USA) +- The North-East more widely +- The Mid-Western US +- The whole country + +To start + +- Import the data +- Check the datatypes +- Make the year column a datetime object +- Set the year as the index + +Some initial exploration + +- How frequently does Phil see his full shadow? +- How do temperatures in the different regions compare? +- Has / how has average temperature changed over time? + +The main question + +- Is there a significant difference between average temperatures when Phil has or has not seen his shadow? + - In Pennsylvania? More widely? + +You might find the following functions useful + +- `df.rename({'one': 'foo', 'two': 'bar'}, axis='columns')` +""" + +import pandas as pd + +from scipy import stats + +phil = pd.read_csv("https://raw.githubusercontent.com/adaapp/dav-introductionToPandas/master/groundhog.csv") + +phil = phil.dropna()[:-1] + +phil = phil.rename({"Punxsutawney Phil":"Shadow"}, axis='columns') + +phil.query('Shadow == "Full Shadow"').mean() + +phil.query('Shadow == "No Shadow"').mean() + +phil.query('Shadow == "No Shadow"').mean() - phil.query('Shadow == "Full Shadow"').mean() + +stats.ttest_ind(phil.query('Shadow == "No Shadow"')["February Average Temperature"],phil.query('Shadow == "Full Shadow"')["February Average Temperature"], equal_var=False) \ No newline at end of file diff --git a/09 The World.ipynb b/09 The World.ipynb deleted file mode 100644 index 53d1d6b..0000000 --- a/09 The World.ipynb +++ /dev/null @@ -1,453 +0,0 @@ -{ - "nbformat": 4, - "nbformat_minor": 0, - "metadata": { - "celltoolbar": "Slideshow", - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": 
"text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.6" - }, - "colab": { - "name": "09 The World.ipynb", - "provenance": [] - } - }, - "cells": [ - { - "cell_type": "code", - "metadata": { - "id": "4vX-4D4ceZqX", - "colab_type": "code", - "colab": {} - }, - "source": [ - "import pandas as pd\n", - "import matplotlib.pyplot as plt\n", - "import seaborn as sns" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "wmLJWFjSeZqd", - "colab_type": "code", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 292 - }, - "outputId": "2eaf65e3-9ba7-469b-928f-efbc44425cfb" - }, - "source": [ - "world = pd.read_csv(\"https://raw.githubusercontent.com/adaapp/dav-introductionToPandas/master/world.csv\")\n", - "world.head()" - ], - "execution_count": 2, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
nocountrysubregionpopulationbirthratedeathratemedian_agelife_expectancylabor_forceunemploymentgdpphysician_densityhealth_expendituretotal_arealand_areawater_arealand_bordersdependency_status
01AlgeriaAfrica (Saharan)3954216623.674.3127.576.5911780000.012.414500.01.215.223817412381741.00YesNone
12EgyptAfrica (Saharan)8848739622.904.7725.373.7031960000.013.111800.02.835.01001450995450.06000YesNone
23LibyaAfrica (Saharan)641177618.033.5828.076.261153000.030.014600.01.903.917595401759540.00YesNone
34MoroccoAfrica (Saharan)3332269918.204.8128.576.7112230000.09.98200.00.626.4446550446300.0250YesNone
45TunisiaAfrica (Saharan)1103722516.645.9831.975.894038000.015.411400.01.227.0163610155360.08250YesNone
\n", - "
" - ], - "text/plain": [ - " no country subregion ... water_area land_borders dependency_status\n", - "0 1 Algeria Africa (Saharan) ... 0 Yes None\n", - "1 2 Egypt Africa (Saharan) ... 6000 Yes None\n", - "2 3 Libya Africa (Saharan) ... 0 Yes None\n", - "3 4 Morocco Africa (Saharan) ... 250 Yes None\n", - "4 5 Tunisia Africa (Saharan) ... 8250 Yes None\n", - "\n", - "[5 rows x 18 columns]" - ] - }, - "metadata": { - "tags": [] - }, - "execution_count": 2 - } - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "4ymRk7gPeZqg", - "colab_type": "code", - "colab": {} - }, - "source": [ - "world.set_index(\"country\").drop(columns=\"no\")" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "S0Yi9J3KeZqi", - "colab_type": "code", - "colab": {} - }, - "source": [ - "world = world.set_index(\"country\").drop(columns=\"no\")" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "D9IBRxMUeZql", - "colab_type": "text" - }, - "source": [ - "## Seaborn relplot\n", - "\n", - "This is a *facet* plot. Here we tell it to group the data by subregions in three columns." - ] - }, - { - "cell_type": "code", - "metadata": { - "scrolled": false, - "id": "B8edna4HeZqm", - "colab_type": "code", - "colab": {} - }, - "source": [ - "sns.relplot(data=world, x=\"physician_density\", y=\"birthrate\", col=\"subregion\", col_wrap=3)" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ji7irK4veZqp", - "colab_type": "text" - }, - "source": [ - "## Selecting numerical columns\n", - "\n", - "We can use `select_dtypes` to get only columns containing certain kinds of data, in this case floats and integers - ie numeric." 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "qEuoDAeMeZqq", - "colab_type": "code", - "colab": {} - }, - "source": [ - "world.select_dtypes(include=['float64','int64'])" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Jvla-ECreZqs", - "colab_type": "text" - }, - "source": [ - "## Seaborn pairplot\n", - "\n", - "We can give that numeric data to seaborn's `pairplot` function to look for relationships between any pair of variables, and as a bonus see the distribution of each variable in a histogram." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "l5qu1uUyeZqt", - "colab_type": "code", - "colab": {} - }, - "source": [ - "sns.pairplot(world.select_dtypes(include=['float64','int64']).dropna())" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "PZ-kjN4MeZqw", - "colab_type": "text" - }, - "source": [ - "## Introduction to geopandas" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "QjgWqgN3eZqx", - "colab_type": "code", - "colab": {} - }, - "source": [ - "import geopandas" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "DshmZeyMeZq1", - "colab_type": "text" - }, - "source": [ - "`geopandas` has some built-in maps. They're really just dataframes with some drawing information attached." 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "EtIAe20HeZq1", - "colab_type": "code", - "colab": {} - }, - "source": [ - "gp_world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))\n", - "gp_world" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "3XqBwcgdeZq5", - "colab_type": "text" - }, - "source": [ - "### How to get your index back into a column" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "JQZ5KU5-eZq6", - "colab_type": "code", - "colab": {} - }, - "source": [ - "world = world.reset_index()" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "nx-Xa42OeZq9", - "colab_type": "text" - }, - "source": [ - "## Merging on a common column with different names" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "xxdIIDZ0eZq9", - "colab_type": "code", - "colab": {} - }, - "source": [ - "gp_merged = gp_world.merge(world,right_on=\"country\", left_on=\"name\")\n", - "gp_merged" - ], - "execution_count": 0, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "eUBvb2Z1eZrA", - "colab_type": "text" - }, - "source": [ - "## Drawing a map\n", - "\n", - "(or choropleth diagram)" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "MN2QBFokeZrC", - "colab_type": "code", - "colab": {} - }, - "source": [ - "gp_merged.dropna().plot(column=\"gdp\")\n" - ], - "execution_count": 0, - "outputs": [] - } - ] -} \ No newline at end of file diff --git a/09_the_world.py b/09_the_world.py new file mode 100644 index 0000000..4b46673 --- /dev/null +++ b/09_the_world.py @@ -0,0 +1,65 @@ +# -*- coding: utf-8 -*- +"""09 The World.ipynb + +Automatically generated by Colaboratory. 
+ +Original file is located at + https://colab.research.google.com/github/adaapp/dav-introductionToPandas/blob/master/09%20The%20World.ipynb +""" + +import pandas as pd +import matplotlib.pyplot as plt +import seaborn as sns + +world = pd.read_csv("https://raw.githubusercontent.com/adaapp/dav-introductionToPandas/master/world.csv") +world.head() + +world.set_index("country").drop(columns="no") + +world = world.set_index("country").drop(columns="no") + +"""## Seaborn relplot + +This is a *facet* plot. Here we tell it to group the data by subregions in three columns. +""" + +sns.relplot(data=world, x="physician_density", y="birthrate", col="subregion", col_wrap=3) + +"""## Selecting numerical columns + +We can use `select_dtypes` to get only columns containing certain kinds of data, in this case floats and integers - ie numeric. +""" + +world.select_dtypes(include=['float64','int64']) + +"""## Seaborn pairplot + +We can give that numeric data to seaborn's `pairplot` function to look for relationships between any pair of variables, and as a bonus see the distribution of each variable in a histogram. +""" + +sns.pairplot(world.select_dtypes(include=['float64','int64']).dropna()) + +"""## Introduction to geopandas""" + +import geopandas + +"""`geopandas` has some built-in maps. 
They're really just dataframes with some drawing information attached.""" + +gp_world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres')) +gp_world + +"""### How to get your index back into a column""" + +world = world.reset_index() + +"""## Merging on a common column with different names""" + +gp_merged = gp_world.merge(world,right_on="country", left_on="name") +gp_merged + +"""## Drawing a map + +(or choropleth diagram) +""" + +gp_merged.dropna().plot(column="gdp") \ No newline at end of file diff --git a/10 Contents.ipynb b/10 Contents.ipynb deleted file mode 100644 index 1a846b9..0000000 --- a/10 Contents.ipynb +++ /dev/null @@ -1,108 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Data Analysis with Python I\n", - "\n", - "## Import, clean\n", - "\n", - "- [Read csv](./01%20Tutorial.ipynb#Getting-the-data-into-Python)\n", - "- [Datetime format](./02%20Blackbirds.ipynb#Setting-a-column-to-datetime-format)\n", - "- [Ordering ordinal data](./02%20Blackbirds.ipynb#Ordinal-data)\n", - "- [Ordinal categories](./03%20Titanic.ipynb#Setting-ordinal-categories)\n", - "- [Read tab separated values](./03%20Titanic.ipynb#Tab-separated-values)\n", - "- [Deal with the thousands comma](./04%20Baby%20names.ipynb#Deal-with-the-thousands-comma)\n", - "- [Convert a column to integer](./04%20Baby%20names.ipynb#Converting-a-column-to-integer)\n", - "- [Drop a single row of data](./04%20Baby%20names.ipynb#Drop-a-single-row-of-)\n", - "\n", - "\n", - "## Index, sort, search, filter and pivot\n", - "\n", - "- [Setting an index](./01%20Tutorial.ipynb#Q2)\n", - "- [Sorting by the index](./04%20Baby%20names.ipynb#Setting-an-index-and-sorting-on-it)\n", - "- [Accessing columns](./01%20Tutorial.ipynb#Accessing-the-columns)\n", - "- [Basic sorting and filtering](./01%20Tutorial.ipynb#Sorting-and-filtering)\n", - "- [Groupby](./02%20Blackbirds.ipynb#Introducing-groupby)\n", - "- [Filter with 
'like'](./04%20Baby%20names.ipynb#The-filter-function-with-like)\n", - "- [loc and iloc](./04a%20Exam%20results.ipynb#Indexing-with-loc,-iloc)\n", - "- [Filter with 'regex'](./04a%20Exam%20results.ipynb#The-filter-function-with-regex)\n", - "- [The query function](./04a%20Exam%20results.ipynb#The-query-function)\n", - "- [The (numpy) where function](./04a%20Exam%20results.ipynb#The-numpy-where-function)\n", - "- [Categorising data with cut](./04a%20Exam%20results.ipynb#Categorising-data-with-cut)\n", - "- [Crosstabulation](./05%20Hypothesis%20Tests.ipynb#Chi-squared)\n", - "- [Select by data type](./09%20The%20World.ipynb#Selecting-numerical-columns)\n", - "- [Get the index back as a column](./09%20The%20World.ipynb#How-to-get-your-index-back-into-a-column)\n", - "\n", - "## Concatenate and join\n", - "\n", - "- [Make new columns from existing columns](./01%20Tutorial.ipynb#Time-Series)\n", - "- [Make a new, constant, column](./04%20Baby%20names.ipynb#Making-a-new,-constant,-column)\n", - "- [Combining by concatenation](./04%20Baby%20names.ipynb#Combining-data-frames---concatenation)\n", - "- [Combining by joining on a common column](./04%20Baby%20names.ipynb#Join-two-dataframes-on-a-common-column)\n", - "- [When the shared column has a different name](./09%20The%20World.ipynb#Merging-on-a-common-column-with-different-names)\n", - "\n", - "## Measure and summarise\n", - "\n", - "- [Summary statistics](./01%20Tutorial.ipynb#Summary-statistics)\n", - "- [Correlation](./01%20Tutorial.ipynb#Investigating-relationships)\n", - "- [Describe subsets](./02%20Blackbirds.ipynb#Q10)\n", - "\n", - "## Visualise\n", - "\n", - "- [Overview of visualisation libraries](./06a%20Overview%20of%20visualisation%20libraries.ipynb)\n", - "- [Scatter plots with pandas and seaborn](./01%20Tutorial.ipynb#Investigating-relationships)\n", - "- [Time series](./01%20Tutorial.ipynb#Time-Series)\n", - "- [Also time series](./02%20Blackbirds.ipynb#Time-series)\n", - "- [Scatter plot with seaborn 
with markers and colours](./02%20Blackbirds.ipynb#Q9)\n", - "- [Seaborn distribution plot](./02%20Blackbirds.ipynb#Distribution-plots)\n", - "- [Grouped box plots](./02%20Blackbirds.ipynb#Boxplots)\n", - "- [Seaborn catplot for grouped count plots](./03%20Titanic.ipynb#Seaborn's-catplot)\n", - "- [Adding data labels in pyplot/seaborn](./04a%20Exam%20results.ipynb#Adding-labels-in-pyplot/seaborn)\n", - "- [Interactivity in notebooks](./05a%20numpy%20and%20interact.ipynb#The-interact-widget)\n", - "- [Chartify from Spotify](./06%20Chartify%20Tutorial.ipynb)\n", - "- [Holoviews](./06b%20Holoviews.ipynb)\n", - "- [Grouped plots in seaborn - relplot](./09%20The%20World.ipynb#Seaborn-relplot)\n", - "- [Multiple scatter plots - pairplot](./09%20The%20World.ipynb#Seaborn-pairplot)\n", - "- [Maps with geopandas](./09%20The%20World.ipynb#Introduction-to-geopandas)\n", - "\n", - "\n", - "## Test\n", - "\n", - "- [t-tests](./02%20Blackbirds.ipynb#Hypothesis-testing)\n", - "- [An example of a non-parametric test on a skewed distribution](./03%20Titanic.ipynb#A-non-parametric-test)\n", - "- [Chi squared test](./05%20Hypothesis%20Tests.ipynb#Chi-squared)\n", - "- [An interactive t-test](./05%20Hypothesis%20Tests.ipynb#t-test)\n", - "- [Binomial hypothesis testing](./05%20Hypothesis%20Tests.ipynb#Binomial)\n", - "- [Groundhog Day case study](./07%20Groundhog%20Day.ipynb)\n", - "\n", - "## Simulate\n", - "\n", - "- [Generating some simple data to understand a problem better](./03b%20Toytanic.ipynb#03b-Toytanic)\n", - "- [Simulate normally distributed data](./04a%20Exam%20results.ipynb#Fake-exam-results)\n" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.6" - } - 
}, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/10_contents.py b/10_contents.py new file mode 100644 index 0000000..16aa810 --- /dev/null +++ b/10_contents.py @@ -0,0 +1,86 @@ +# -*- coding: utf-8 -*- +"""10 Contents.ipynb + +Automatically generated by Colaboratory. + +Original file is located at + https://colab.research.google.com/github/adaapp/dav-introductionToPandas/blob/master/10%20Contents.ipynb + +# Data Analysis with Python I + +## Import, clean + +- [Read csv](./01%20Tutorial.ipynb#Getting-the-data-into-Python) +- [Datetime format](./02%20Blackbirds.ipynb#Setting-a-column-to-datetime-format) +- [Ordering ordinal data](./02%20Blackbirds.ipynb#Ordinal-data) +- [Ordinal categories](./03%20Titanic.ipynb#Setting-ordinal-categories) +- [Read tab separated values](./03%20Titanic.ipynb#Tab-separated-values) +- [Deal with the thousands comma](./04%20Baby%20names.ipynb#Deal-with-the-thousands-comma) +- [Convert a column to integer](./04%20Baby%20names.ipynb#Converting-a-column-to-integer) +- [Drop a single row of data](./04%20Baby%20names.ipynb#Drop-a-single-row-of-) + + +## Index, sort, search, filter and pivot + +- [Setting an index](./01%20Tutorial.ipynb#Q2) +- [Sorting by the index](./04%20Baby%20names.ipynb#Setting-an-index-and-sorting-on-it) +- [Accessing columns](./01%20Tutorial.ipynb#Accessing-the-columns) +- [Basic sorting and filtering](./01%20Tutorial.ipynb#Sorting-and-filtering) +- [Groupby](./02%20Blackbirds.ipynb#Introducing-groupby) +- [Filter with 'like'](./04%20Baby%20names.ipynb#The-filter-function-with-like) +- [loc and iloc](./04a%20Exam%20results.ipynb#Indexing-with-loc,-iloc) +- [Filter with 'regex'](./04a%20Exam%20results.ipynb#The-filter-function-with-regex) +- [The query function](./04a%20Exam%20results.ipynb#The-query-function) +- [The (numpy) where function](./04a%20Exam%20results.ipynb#The-numpy-where-function) +- [Categorising data with cut](./04a%20Exam%20results.ipynb#Categorising-data-with-cut) +- 
[Crosstabulation](./05%20Hypothesis%20Tests.ipynb#Chi-squared) +- [Select by data type](./09%20The%20World.ipynb#Selecting-numerical-columns) +- [Get the index back as a column](./09%20The%20World.ipynb#How-to-get-your-index-back-into-a-column) + +## Concatenate and join + +- [Make new columns from existing columns](./01%20Tutorial.ipynb#Time-Series) +- [Make a new, constant, column](./04%20Baby%20names.ipynb#Making-a-new,-constant,-column) +- [Combining by concatenation](./04%20Baby%20names.ipynb#Combining-data-frames---concatenation) +- [Combining by joining on a common column](./04%20Baby%20names.ipynb#Join-two-dataframes-on-a-common-column) +- [When the shared column has a different name](./09%20The%20World.ipynb#Merging-on-a-common-column-with-different-names) + +## Measure and summarise + +- [Summary statistics](./01%20Tutorial.ipynb#Summary-statistics) +- [Correlation](./01%20Tutorial.ipynb#Investigating-relationships) +- [Describe subsets](./02%20Blackbirds.ipynb#Q10) + +## Visualise + +- [Overview of visualisation libraries](./06a%20Overview%20of%20visualisation%20libraries.ipynb) +- [Scatter plots with pandas and seaborn](./01%20Tutorial.ipynb#Investigating-relationships) +- [Time series](./01%20Tutorial.ipynb#Time-Series) +- [Also time series](./02%20Blackbirds.ipynb#Time-series) +- [Scatter plot with seaborn with markers and colours](./02%20Blackbirds.ipynb#Q9) +- [Seaborn distribution plot](./02%20Blackbirds.ipynb#Distribution-plots) +- [Grouped box plots](./02%20Blackbirds.ipynb#Boxplots) +- [Seaborn catplot for grouped count plots](./03%20Titanic.ipynb#Seaborn's-catplot) +- [Adding data labels in pyplot/seaborn](./04a%20Exam%20results.ipynb#Adding-labels-in-pyplot/seaborn) +- [Interactivity in notebooks](./05a%20numpy%20and%20interact.ipynb#The-interact-widget) +- [Chartify from Spotify](./06%20Chartify%20Tutorial.ipynb) +- [Holoviews](./06b%20Holoviews.ipynb) +- [Grouped plots in seaborn - relplot](./09%20The%20World.ipynb#Seaborn-relplot) +- 
[Multiple scatter plots - pairplot](./09%20The%20World.ipynb#Seaborn-pairplot) +- [Maps with geopandas](./09%20The%20World.ipynb#Introduction-to-geopandas) + + +## Test + +- [t-tests](./02%20Blackbirds.ipynb#Hypothesis-testing) +- [An example of a non-parametric test on a skewed distribution](./03%20Titanic.ipynb#A-non-parametric-test) +- [Chi squared test](./05%20Hypothesis%20Tests.ipynb#Chi-squared) +- [An interactive t-test](./05%20Hypothesis%20Tests.ipynb#t-test) +- [Binomial hypothesis testing](./05%20Hypothesis%20Tests.ipynb#Binomial) +- [Groundhog Day case study](./07%20Groundhog%20Day.ipynb) + +## Simulate + +- [Generating some simple data to understand a problem better](./03b%20Toytanic.ipynb#03b-Toytanic) +- [Simulate normally distributed data](./04a%20Exam%20results.ipynb#Fake-exam-results) +""" \ No newline at end of file