
Data analysis using the Python ecosystem

Before pandas

Scientific Python

Python has a scientific computing ecosystem that has existed since the mid-90s

  • NumPy (as Numeric since 1995, current incarnation since 2006)
  • SciPy (2001)
  • Matplotlib (early 2000s)
  • SymPy (2007)

This ecosystem is modeled largely on MATLAB and is geared toward numerical data.

Natural Language Processing

Python has a mature natural language processing library, NLTK (Natural Language ToolKit), for symbolic and statistical natural language processing.

What was lacking

This ecosystem provided tools for data munging and analysis, but didn't necessarily make it easy.

  • No container for heterogeneous data types (a la data.frame in R) that can be easily manipulated

    • Lists and dicts exist, but something simpler to work with was needed
    • Metadata (labeling rows and columns, referencing by labels)
    • Manipulation and extraction using either array or label or component syntax
  • Easy handling of missing data

    • Masked arrays in numpy are available
    • Simple imputation
    • Easy way to get complete or partially complete cases
  • Easy data munging capabilities

    • reshaping data from wide to long and vice versa
    • subsetting
    • split-apply-combine
    • aggregation
  • Exploratory data analysis, summaries

  • Statistical modeling and machine learning

    • Was rudimentary c. 2009
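To make the missing-data point above concrete, here is a minimal sketch of what skipping and imputing a missing value looked like with NumPy alone. The readings are invented for illustration.

```python
import numpy as np

# Hypothetical measurements; np.nan marks a missing reading.
readings = np.array([5.0, 3.0, np.nan, 8.0])

# A plain mean propagates the missing value.
plain_mean = readings.mean()

# A masked array hides invalid entries so reductions skip them.
masked = np.ma.masked_invalid(readings)
masked_mean = masked.mean()

# Simple imputation has to be spelled out by hand.
imputed = masked.filled(masked_mean)
```

It works, but the array carries no row labels, and every step has to be written out explicitly.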

pandas

pandas (the Python Data Analysis Library) has been in development since 2008. The current version is 0.12, released in July 2013.

  • Puts R in the bullseye
  • Wants to emulate R's capabilities in a more efficient computing environment
  • Provides a rich data analysis environment that can be easily integrated into production and web infrastructures

R makes users forget how fast a computer really is

John Myles White, SPDC, October 2013

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
import numpy as np
import pandas as pd
pd.__version__

pandas has two main data structures: Series and DataFrame
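As a rough sketch of how the two structures relate, the toy example below builds a Series, places it in a DataFrame next to a column of another type, and pulls it back out. The column names `n` and `kind` are made up for illustration.

```python
import pandas as pd

# A Series is a one-dimensional array of values with an index of labels.
s = pd.Series([5, 3, 4], index=['a', 'b', 'c'])

# A DataFrame holds several columns, possibly of different types,
# that all share one row index.
df = pd.DataFrame({'n': s, 'kind': ['x', 'y', 'z']}, index=['a', 'b', 'c'])

# Selecting one column gives back a Series.
col = df['n']
```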

Series

values = [5,3,4,8,2,9]
vals = pd.Series(values)
vals

Each value is now associated with an index. The index itself is an object of class Index and can be manipulated directly.

vals.index
vals.values
vals2 = pd.Series(values, index=['a','b','c','d','e','f'])
vals2
vals2[['b','d']]
vals2.reindex(['e','f','g'])  # 'g' is not in the index, so it comes back as NaN
vals3 = vals2.reindex(['a','b','c','f','g','h'])
vals3
vals3.isnull()
vals3.dropna()
vals3.fillna(0)
vals3.fillna(vals3.mean())
vals3.ffill()  # forward fill: carry the last observed value forward
vals3.describe()
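Because every value carries a label, arithmetic between two Series lines values up by index rather than by position. A small sketch with invented values:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

# Addition matches values by label, not by position.
# Labels present in only one operand ('x' and 'w') come out as NaN.
total = a + b

# fill_value treats a label missing from one side as 0 instead.
total_filled = a.add(b, fill_value=0)
```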

DataFrame

vals.index=pd.Index(['a','b','c','d','e','f'])
vals3 = vals3.reindex(['a','c','d','e','z'])  # labels not in vals3 become NaN
dat = pd.DataFrame({'orig':vals,'new':vals3})
dat
dat.fillna(1)
cars = pd.read_csv('mtcars.csv')
cars[:10]
cars.cyl
cars.loc[[5,7]]  # .ix has been removed from pandas; .loc selects rows by label
cars['kmpg']=cars['mpg']*1.6  # miles per gallon to km per gallon (1 mile ≈ 1.6 km)
cars[:4]
del cars['kmpg']
cars[:4]
cars.mpg[:10]
cars3=cars.stack()
cars3[:20]
cars3.unstack()[:5]
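The wide-to-long round trip is easier to see on a tiny table. The sketch below uses two invented cars rather than the mtcars file.

```python
import pandas as pd

# Toy wide table: one row per car, one column per variable.
wide = pd.DataFrame({'cyl': [6.0, 4.0], 'mpg': [21.0, 22.8]},
                    index=['car_a', 'car_b'])

# stack moves the column labels into an inner index level,
# giving one value per (car, variable) pair.
long_form = wide.stack()

# unstack reverses the operation and restores the wide layout.
back = long_form.unstack()
```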
grouped=cars.groupby(['cyl','gear','carb'])
grouped['mpg'].mean()
stats = ['count','mean','median']
grouped[['mpg','disp']].agg(stats)  # select the columns first, then aggregate
grouped.first()
cars[cars.cyl==4]
tips = pd.read_csv('tips.csv')
tips[:5]
tips['tip_pct'] = tips['tip']/tips['total_bill']*100
groupedtips = tips.groupby(['sex','smoker'])
groupedtips['tip_pct'].agg('mean')
result=tips.groupby('smoker')['tip_pct'].describe()
result.unstack('smoker')
states = ['Ohio','New York','Vermont','Oregon','Washington','Nevada']
group_key=['East']*3 + ['West']*3
data = pd.Series(np.random.randn(6), index=states)
data[['Vermont','Washington']]=np.nan
data
data.groupby(group_key).mean()
fill_mean=lambda g: g.fillna(g.mean())

data.groupby(group_key).apply(fill_mean)
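An alternative way to write the same group-wise fill is `transform`, which returns a result aligned to the original index. The sketch below uses fixed, invented numbers instead of random draws so the outcome is easy to check by eye; the names `obs` and `regions` are hypothetical.

```python
import numpy as np
import pandas as pd

places = ['Ohio', 'New York', 'Vermont', 'Oregon', 'Washington', 'Nevada']
regions = ['East'] * 3 + ['West'] * 3
obs = pd.Series([1.0, 3.0, np.nan, 10.0, np.nan, 20.0], index=places)

# Split by region, compute each region's mean, and broadcast it back
# to the original rows, so the result keeps the original index.
region_means = obs.groupby(regions).transform('mean')

# Fill each missing value with the mean of its own region.
filled = obs.fillna(region_means)
```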
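# pivot_table's rows/cols arguments are now called index/columns
tips.pivot_table(values=['total_bill','tip','size','tip_pct'], index=['sex','smoker'])
tips.pivot_table(['tip_pct','size'], index=['sex','day'], columns='smoker')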
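A pivot table is essentially split-apply-combine followed by a reshape. The sketch below shows the equivalence on a toy stand-in for the tips data; the values are invented.

```python
import pandas as pd

# Toy stand-in for the tips data.
toy = pd.DataFrame({
    'sex':    ['F', 'F', 'M', 'M', 'M'],
    'smoker': ['No', 'Yes', 'No', 'No', 'Yes'],
    'tip':    [2.0, 3.0, 1.0, 3.0, 4.0],
})

# pivot_table: rows from `index`, columns from `columns`, mean by default.
pivoted = toy.pivot_table(values='tip', index='sex', columns='smoker')

# The same table built by hand: group, average, then unstack one level.
by_hand = toy.groupby(['sex', 'smoker'])['tip'].mean().unstack('smoker')
```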

Statistical modeling

statsmodels

Methods provided include

  • Linear regression
  • Generalized linear models
  • ANOVA
  • Nonparametric methods
  • A few others

import statsmodels.api as sm
import statsmodels.formula.api as smf
cars[:5]
mod1 = smf.ols('mpg~disp+hp+C(cyl)-1', data=cars) # C(cyl) treats cyl as categorical; -1 drops the intercept
mod1.fit().summary()
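In the formula above, `C(cyl)` expands a variable into one indicator column per level, and `-1` removes the separate intercept so each level gets its own baseline. The sketch below is not statsmodels itself: it builds that kind of design matrix by hand with pandas and solves it with plain NumPy least squares, on invented data where the true coefficients are known.

```python
import numpy as np
import pandas as pd

# Invented data: y depends on a numeric x and a three-level factor g.
demo = pd.DataFrame({
    'x': [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    'g': [4, 4, 6, 6, 8, 8],
})
demo['y'] = 2.0 * demo['x'] + demo['g'].map({4: 10.0, 6: 20.0, 8: 30.0})

# One indicator column per level of g and no separate intercept column.
dummies = pd.get_dummies(demo['g'], prefix='g').astype(float)
design = pd.concat([dummies, demo[['x']]], axis=1)

# Ordinary least squares on that design matrix.
coef, *_ = np.linalg.lstsq(design.to_numpy(), demo['y'].to_numpy(), rcond=None)
estimates = dict(zip(design.columns, coef))
```

Since the toy response has no noise, the estimates recover the per-level baselines and the slope on `x`.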

Machine learning

scikit-learn

Methods include:

  • Cluster analysis
  • Dimension reduction
  • Generalized linear models
  • Support Vector Machines
  • Nearest neighbors
  • Decision Trees
  • Ensemble methods
  • Discriminant analysis
  • Cross-validation
  • Transformations

import sklearn as learn
from sklearn.ensemble import RandomForestRegressor
X = cars.values[:,1:]
y = cars.values[:,0]

rf = RandomForestRegressor(n_estimators=100)
rf = rf.fit(X, y)
ypred=rf.predict(X)
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.plot(y,ypred,'.')
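The plot above compares predictions with the same observations the forest was fitted to, which tends to look optimistic. A minimal sketch of a held-out check follows, on synthetic data invented for this example. The `_demo` names are hypothetical and chosen so they do not overwrite the notebook's `X`, `y` and `rf`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: a linear signal in two of three features, plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Hold out a quarter of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

rf_demo = RandomForestRegressor(n_estimators=100, random_state=0)
rf_demo.fit(X_train, y_train)

# R^2 on the rows used for fitting versus rows the forest has not seen.
train_r2 = rf_demo.score(X_train, y_train)
test_r2 = rf_demo.score(X_test, y_test)
```

Comparing `train_r2` with `test_r2` gives a more honest picture of predictive accuracy than the in-sample plot alone.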
import seaborn as sns
sns.set(palette="Purples_r")
mpl.rc("figure", figsize=(5, 5))

d = pd.DataFrame({'y':y,'ypred':ypred})
sns.lmplot(x='y', y='ypred', data=d)
sns.lmplot(x="total_bill", y="tip", data=tips, col="sex", hue="time")  # hue replaces the old color argument
from IPython.display import HTML
def css_styling():
    # Load the notebook's custom stylesheet and return it for display
    with open("styles/custom.css", "r") as f:
        styles = f.read()
    return HTML(styles)
css_styling()