Python has had a scientific computing ecosystem since the mid-1990s:
- NumPy (as Numeric since 1995, current incarnation since 2006)
- SciPy (2001)
- Matplotlib (early 2000s)
- SymPy (2007)
This ecosystem is meant to emulate Matlab and is geared toward numerical data.
Python also has a mature natural language processing library, NLTK (Natural Language ToolKit), for symbolic and statistical natural language processing.
This ecosystem provided tools for data munging and analysis, but didn't necessarily make them easy to use.
- No container for heterogeneous data types (a la data.frame in R) that can be easily manipulated
  - Lists and dicts are around, but needed to be simpler
  - Metadata (labeling rows and columns, referencing by labels)
  - Manipulation and extraction using either array or label or component syntax
- Easy handling of missing data
  - Masked arrays in numpy are available
  - Simple imputation
  - Easy way to get complete or partially complete cases
- Easy data munging capabilities
  - reshaping data from wide to long and vice versa
  - subsetting
  - split-apply-combine
  - aggregation
- Exploratory data analysis, summaries
- Statistical modeling and machine learning
  - Was rudimentary c. 2009
pandas (Python data analysis toolbox) was first released in 2008. The current version is 0.12, released in July 2013.
- Puts R in the bullseye
- Wants to emulate R's capabilities in a more efficient computing environment
- Provides a rich data analysis environment that can be easily integrated into production and web infrastructures
R makes users forget how fast a computer really is
John Myles White, SPDC, October 2013
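To make the missing-data wishes from the list above concrete (simple imputation, complete or partially complete cases, label-based access on a heterogeneous container), here is a minimal sketch. The column names, row labels and values are made up for illustration, and the calls assume a reasonably recent pandas release rather than any particular version.

```python
import numpy as np
import pandas as pd

# Hypothetical records mixing strings, integers and floats, with some gaps
df = pd.DataFrame(
    {
        "name": ["a", "b", "c", "d"],
        "count": [1, 2, 3, 4],
        "score": [0.5, np.nan, 1.5, np.nan],
        "weight": [10.0, 20.0, np.nan, np.nan],
    },
    index=["r1", "r2", "r3", "r4"],  # row labels
)

# Each column keeps its own dtype
print(df.dtypes)

# Complete cases: rows with no missing values at all
complete = df.dropna()

# Partially complete cases: rows with at least 3 non-missing values
partial = df.dropna(thresh=3)

# Simple imputation: replace gaps with the column mean
imputed = df.fillna(
    {"score": df["score"].mean(), "weight": df["weight"].mean()}
)

# Label-based access to a single cell
print(imputed.loc["r4", "weight"])
```

Passing a dict to `fillna` keeps the imputation rule explicit per column; a single scalar would fill every column with the same value.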
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)

import numpy as np
import pandas as pd
pd.__version__

values = [5,3,4,8,2,9]
vals = pd.Series(values)
vals

Each value is now associated with an index. The index itself is an object of class Index and can be manipulated directly.
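As a small illustration of what working with the index looks like, the sketch below replaces a default index with labels and applies a set-like operation to it. It assumes a recent pandas release; the names `relabeled`, `other` and `common` are hypothetical and exist only for this example. Note that an Index can be swapped out or combined with other indexes, but its individual elements cannot be changed in place.

```python
import pandas as pd

vals = pd.Series([5, 3, 4, 8, 2, 9])

# Without an explicit index, pandas supplies integer positions 0..n-1
print(list(vals.index))

# Replace the whole index with labels
relabeled = vals.copy()
relabeled.index = pd.Index(list("abcdef"), name="letter")

# Index objects support set-like operations
other = pd.Index(["e", "f", "g"])
common = relabeled.index.intersection(other)

# Index objects are immutable: item assignment raises TypeError
try:
    relabeled.index[0] = "z"
    mutated = True
except TypeError:
    mutated = False

print(relabeled["b"], list(common), mutated)
```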
vals.index

vals.values

vals2 = pd.Series(values, index=['a','b','c','d','e','f'])
vals2

vals2[['b','d']]

vals2[['e','f','g']]

vals3 = vals2[['a','b','c','f','g','h']]
vals3

vals3.isnull()

vals3.dropna()

vals3.fillna(0)

vals3.fillna(vals3.mean())

vals3.fillna(method='ffill')

vals3.describe()

vals.index=pd.Index(['a','b','c','d','e','f'])
vals3=vals3[['a','c','d','e','z']]

dat = pd.DataFrame({'orig':vals,'new':vals3})
dat

dat.fillna(1)

cars = pd.read_csv('mtcars.csv')
cars[:10]

cars.cyl

cars.ix[[5,7]]

cars['kmpg']=cars['mpg']*1.6
cars[:4]

del cars['kmpg']
cars[:4]

cars.mpg[:10]

cars3=cars.stack()
cars3[:20]

cars3.unstack()[:5]

grouped=cars.groupby(['cyl','gear','carb'])
grouped['mpg'].mean()

stats = ['count','mean','median']
grouped.agg(stats)[['mpg','disp']]

grouped.first()

cars[cars.cyl==4]

tips = pd.read_csv('tips.csv')
tips[:5]

tips['tip_pct'] = tips['tip']/tips['total_bill']*100

groupedtips = tips.groupby(['sex','smoker'])
groupedtips['tip_pct'].agg('mean')

result=tips.groupby('smoker')['tip_pct'].describe()

result.unstack('smoker')

states = ['Ohio','New York','Vermont','Oregon','Washington','Nevada']
group_key=['East']*3 + ['West']*3
data = pd.Series(np.random.randn(6), index=states)
data[['Vermont','Washington']]=np.nan
data

data.groupby(group_key).mean()

fill_mean=lambda g: g.fillna(g.mean())
data.groupby(group_key).apply(fill_mean)

tips.pivot_table(rows=['sex','smoker'])

tips.pivot_table(['tip_pct','size'],rows=['sex','day'], cols='smoker')

Methods provided by statsmodels include:
- Linear regression
- Generalized linear models
- ANOVA
- Nonparametric methods
- Few others
import statsmodels.api as sm
import statsmodels.formula.api as smf
cars[:5]

mod1 = smf.ols('mpg~disp+hp+C(cyl)-1', data=cars) # C(cyl) treats cyl as categorical; -1 drops the intercept
mod1.fit().summary()

Methods provided by scikit-learn include:
- Cluster analysis
- Dimension reduction
- Generalized linear models
- Support Vector Machines
- Nearest neighbors
- Decision Trees
- Ensemble methods
- Discriminant analysis
- Cross-validation
- Transformations
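To show how the cross-validation entry in the list above fits together with an ensemble method, here is a minimal sketch on synthetic data. It assumes a recent scikit-learn release, where `cross_val_score` lives in `sklearn.model_selection`; the dataset comes from `make_regression` and is not the cars data used elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression problem (illustrative only)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

rf = RandomForestRegressor(n_estimators=50, random_state=0)

# 5-fold cross-validated R^2: each fold is scored on data the model did not see
scores = cross_val_score(rf, X, y, cv=5, scoring="r2")

# R^2 on the training data itself, for comparison
train_r2 = rf.fit(X, y).score(X, y)

print("mean CV R^2:", scores.mean())
print("training R^2:", train_r2)
```

Scoring a model on the data it was fitted to tends to overstate how well it predicts, so the cross-validated score is usually the more honest number to report.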
import sklearn as learn
from sklearn.ensemble import RandomForestRegressor
X = cars.values[:,1:]
y = cars.values[:,0]
rf = RandomForestRegressor(n_estimators=100)
rf = rf.fit(X, y)
ypred=rf.predict(X)

import matplotlib as mpl
import matplotlib.pyplot as plt
plt.plot(y,ypred,'.')

import seaborn as sns
sns.set(palette="Purples_r")
mpl.rc("figure", figsize=(5, 5))
d = pd.DataFrame({'y':y,'ypred':ypred})

sns.lmplot('y','ypred',d)

sns.lmplot("total_bill","tip",tips, col="sex",color="time")

from IPython.core.display import HTML
def css_styling():
    styles = open("styles/custom.css", "r").read()
    return HTML(styles)
css_styling()