Data analysis using the Python ecosystem

Before pandas

Scientific Python

Python has a scientific computing ecosystem that has existed since the mid-1990s:

- NumPy (as Numeric since 1995, in its current incarnation since 2006)
- SciPy (2001)
- Matplotlib (early 2000s)
- SymPy (2007)

This ecosystem is meant to emulate MATLAB and is geared toward numerical data.

Natural Language Processing

Python has a mature library for symbolic and statistical natural language processing: NLTK (Natural Language Toolkit).

What was lacking

This ecosystem provided tools for data munging and analysis, but didn't necessarily make it easy.

- No container for heterogeneous data types (à la data.frame in R) that can be easily manipulated
  - Lists and dicts exist, but something simpler was needed
  - Metadata (labeling rows and columns, referencing by labels)
  - Manipulation and extraction using array, label, or component syntax
- Easy handling of missing data
  - Masked arrays in NumPy are available
  - Simple imputation
  - Easy way to get complete or partially complete cases
- Easy data munging capabilities
  - Reshaping data from wide to long and vice versa
  - Subsetting
  - Split-apply-combine
  - Aggregation
- Exploratory data analysis, summaries
- Statistical modeling and machine learning
  - Was rudimentary c. 2009
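To make the missing-data point concrete, here is a minimal sketch of mean imputation and complete-case extraction using NumPy masked arrays alone. The array values are made up for illustration.

```python
import numpy as np

# Illustrative values only; NaN marks a missing observation
raw = np.array([5.0, np.nan, 4.0, 8.0, np.nan, 9.0])

# Mask the invalid (NaN) entries so that reductions skip them
masked = np.ma.masked_invalid(raw)
mean_observed = masked.mean()  # mean of the four observed values

# Simple imputation: replace masked entries with the observed mean
imputed = masked.filled(mean_observed)

# Complete cases: keep only the unmasked entries
complete = masked.compressed()
```

This works, but everything is addressed by position: there are no row labels, and mixed column types would need separate arrays.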

pandas

pandas (Python data analysis toolbox) was first released in 2008. The current version is 0.12, released in July 2013.

- Puts R in the bullseye
- Aims to emulate R's capabilities in a more efficient computing environment
- Provides a rich data analysis environment that can be easily integrated into production and web infrastructures

R makes users forget how fast a computer really is

John Myles White, SPDC, October 2013

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)

import numpy as np
import pandas as pd
pd.__version__

`pandas` has two main data structures: `Series` and `DataFrame`

Series

values = [5,3,4,8,2,9]
vals = pd.Series(values)
vals

Each value is now associated with an index. The index itself is an object of class Index and can be manipulated directly.

vals.index

vals.values
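As a small sketch of what "manipulated directly" can mean, the block below relabels a Series by assigning a new `Index` and then applies set-like operations to it. The names `s` and `other` are illustrative and not part of the talk's code.

```python
import pandas as pd

s = pd.Series([5, 3, 4, 8, 2, 9])

# The default index is positional: 0, 1, ..., 5
default_labels = list(s.index)

# Assigning a new Index relabels the same values
s.index = pd.Index(['a', 'b', 'c', 'd', 'e', 'f'])

# Index objects support set-like operations
other = pd.Index(['e', 'f', 'g'])
shared = s.index.intersection(other)  # labels present in both
combined = s.index.union(other)       # labels present in either
```

After relabeling, `s['c']` looks a value up by label rather than by position.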

vals2 = pd.Series(values, index=['a','b','c','d','e','f'])
vals2

vals2[['b','d']]

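# 'g' is not in the index: reindex returns NaN for it, whereas
# list indexing with a missing label raises KeyError in recent pandas
vals2.reindex(['e','f','g'])

vals3 = vals2.reindex(['a','b','c','f','g','h'])
vals3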

vals3.isnull()

vals3.dropna()

vals3.fillna(0)

vals3.fillna(vals3.mean())

vals3.ffill()  # forward fill; fillna(method='ffill') is deprecated in recent pandas

vals3.describe()

DataFrame

vals.index = pd.Index(['a','b','c','d','e','f'])
# reindex keeps the labels that exist and fills the missing ones with NaN
vals3 = vals3.reindex(['a','c','d','e','z'])

dat = pd.DataFrame({'orig':vals,'new':vals3})
dat

dat.fillna(1)

cars = pd.read_csv('mtcars.csv')
cars[:10]

cars.cyl

cars.loc[[5,7]]  # select rows by label; .ix has been removed from pandas

cars['kmpg']=cars['mpg']*1.6
cars[:4]

del cars['kmpg']
cars[:4]

cars.mpg[:10]

cars3=cars.stack()
cars3[:20]

cars3.unstack()[:5]

grouped=cars.groupby(['cyl','gear','carb'])
grouped['mpg'].mean()

stats = ['count','mean','median']
# Select the columns first so only mpg and disp are aggregated
grouped[['mpg','disp']].agg(stats)

grouped.first()

cars[cars.cyl==4]

tips = pd.read_csv('tips.csv')
tips[:5]

tips['tip_pct'] = tips['tip']/tips['total_bill']*100

groupedtips = tips.groupby(['sex','smoker'])
groupedtips['tip_pct'].agg('mean')

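result = tips.groupby('smoker')['tip_pct'].describe()

# Recent pandas returns one row per smoker group here;
# transpose to show statistics as rows and smoker groups as columns
result.T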

states = ['Ohio','New York','Vermont','Oregon','Washington','Nevada']
group_key=['East']*3 + ['West']*3
data = pd.Series(np.random.randn(6), index=states)
data[['Vermont','Washington']]=np.nan
data

data.groupby(group_key).mean()

fill_mean=lambda g: g.fillna(g.mean())

data.groupby(group_key).apply(fill_mean)
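Because the example above draws random numbers, its output differs on every run. The following sketch repeats the group-wise imputation with fixed, made-up values so the result can be checked by hand. It uses `transform` rather than `apply`, which is an assumption about preference rather than the talk's choice; `transform` returns a result aligned to the original index.

```python
import numpy as np
import pandas as pd

states = ['Ohio', 'New York', 'Vermont', 'Oregon', 'Washington', 'Nevada']
group_key = ['East'] * 3 + ['West'] * 3

# Fixed illustrative values instead of random draws
data = pd.Series([1.0, 3.0, np.nan, 10.0, np.nan, 20.0], index=states)

# Fill each missing value with the mean of its own group
filled = data.groupby(group_key).transform(lambda g: g.fillna(g.mean()))
```

Here Vermont receives the East mean (2.0) and Washington the West mean (15.0).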

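# rows=/cols= were renamed to index=/columns= in later pandas releases;
# listing the numeric columns keeps the text columns out of the mean
tips.pivot_table(['total_bill','tip','size','tip_pct'], index=['sex','smoker'])

tips.pivot_table(['tip_pct','size'], index=['sex','day'], columns='smoker')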
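Since `tips.csv` is not reproduced here, the sketch below uses a few made-up rows to show what a pivot table computes: a group-wise mean laid out with one grouping variable on the rows and another on the columns. The frame `toy` and its values are purely illustrative.

```python
import pandas as pd

# Made-up rows for illustration; not taken from tips.csv
toy = pd.DataFrame({
    'sex':     ['F', 'F', 'M', 'M', 'M', 'F'],
    'smoker':  ['No', 'Yes', 'No', 'No', 'Yes', 'No'],
    'tip_pct': [10.0, 20.0, 15.0, 25.0, 30.0, 14.0],
})

# One row per sex, one column per smoker level, each cell the mean tip_pct
wide = toy.pivot_table(values='tip_pct', index='sex', columns='smoker',
                       aggfunc='mean')

# The same numbers via an explicit groupby followed by unstack
same = toy.groupby(['sex', 'smoker'])['tip_pct'].mean().unstack('smoker')
```

In other words, `pivot_table` can be read as split-apply-combine followed by a reshape.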

Statistical modeling

statsmodels

Methods provided include:

- Linear regression
- Generalized linear models
- ANOVA
- Nonparametric methods
- A few others

import statsmodels.api as sm
import statsmodels.formula.api as smf
cars[:5]

# C(cyl) treats cyl as categorical; -1 drops the intercept
mod1 = smf.ols('mpg~disp+hp+C(cyl)-1', data=cars)
mod1.fit().summary()
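To show what the formula `mpg~disp+hp+C(cyl)-1` asks for, the sketch below builds an equivalent design matrix by hand: one 0/1 indicator column per `cyl` level, no intercept column, plus `disp` and `hp`. It then fits the model by least squares with NumPy. It does not use statsmodels, and the data are made up so that the true coefficients are known in advance.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration; column names mirror the formula
toy = pd.DataFrame({
    'cyl':  [4, 4, 6, 6, 8, 8],
    'disp': [100.0, 120.0, 160.0, 200.0, 300.0, 350.0],
    'hp':   [60.0, 90.0, 110.0, 105.0, 150.0, 220.0],
})
# Construct mpg exactly from known coefficients so the fit can be checked
toy['mpg'] = (30.0 * (toy['cyl'] == 4) + 25.0 * (toy['cyl'] == 6)
              + 20.0 * (toy['cyl'] == 8)
              - 0.02 * toy['disp'] - 0.01 * toy['hp'])

# C(cyl) - 1: one indicator column per cylinder level, no intercept
indicators = pd.get_dummies(toy['cyl'], prefix='cyl').astype(float)
design = pd.concat([indicators, toy[['disp', 'hp']]], axis=1)

# Ordinary least squares via NumPy
coef, *_ = np.linalg.lstsq(design.to_numpy(), toy['mpg'].to_numpy(),
                           rcond=None)
fitted = pd.Series(coef, index=design.columns)
```

With the intercept dropped, each `cyl_*` coefficient is that group's own baseline rather than an offset from a reference level.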

Machine learning

scikit-learn

Methods include:

- Cluster analysis
- Dimension reduction
- Generalized linear models
- Support Vector Machines
- Nearest neighbors
- Decision Trees
- Ensemble methods
- Discriminant analysis
- Cross-validation
- Transformations

import sklearn as learn
from sklearn.ensemble import RandomForestRegressor
X = cars.values[:,1:]
y = cars.values[:,0]

rf = RandomForestRegressor(n_estimators=100)
rf = rf.fit(X, y)
ypred=rf.predict(X)

import matplotlib as mpl
import matplotlib.pyplot as plt
plt.plot(y,ypred,'.')

import seaborn as sns
sns.set(palette="Purples_r")
mpl.rc("figure", figsize=(5, 5))

d = pd.DataFrame({'y':y,'ypred':ypred})

# Recent seaborn requires keyword arguments and uses hue= for color grouping
sns.lmplot(x='y', y='ypred', data=d)

sns.lmplot(x="total_bill", y="tip", data=tips, col="sex", hue="time")

from IPython.display import HTML
def css_styling():
    # Load the custom stylesheet and return it for display in the notebook
    with open("styles/custom.css", "r") as f:
        styles = f.read()
    return HTML(styles)
css_styling()
