Pandas data wrangling

Pandas in a Jupyter notebook is my favorite way these days to inspect and wrangle data. Here are some common operations:

Import Pandas:
import pandas as pd
import numpy as np
Retrieve or read data:
df = pd.read_csv(filepath)
Data size (rows, columns):
df.shape
Show a few rows (.ix was removed in pandas 1.0; use .loc for labels and .iloc for positions):
df.head(), df.tail(), df[n1:n2], df.loc[index_as_label]
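Here is a minimal sketch of the reading and inspecting steps above. The inline CSV text and the column names (subject_id, name, score) are made up for illustration; in practice you would pass a real file path to read_csv.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """subject_id,name,score
1,Ann,90
2,Bob,85
3,Cid,70
4,Dee,95
"""

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text), index_col="subject_id")

shape = df.shape          # (number of rows, number of columns)
first_two = df.head(2)    # first two rows
last_one = df.tail(1)     # last row
middle = df[1:3]          # integer slice: rows at positions 1 and 2
by_label = df.loc[3]      # row whose index label is 3
by_position = df.iloc[0]  # row at position 0
```

Note that df[1:3] slices by position while df.loc[3] looks up the index label, so the two can point at different rows when the index does not start at 0.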
Statistics of a data column:
df[col].describe()
df[col].unique()
df[col].value_counts()
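A small sketch of the three column summaries, using a made-up color column. For a text column, describe() reports count, unique, top and freq rather than mean and quartiles.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"color": ["red", "blue", "red", "green", "red"]})
col = "color"

summary = df[col].describe()     # count / unique / top / freq for text data
distinct = df[col].unique()      # distinct values, in order of first appearance
counts = df[col].value_counts()  # occurrences per value, most frequent first
```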
Any missing values?
df[col].isnull().sum()
Fill missing values (fillna needs a fill value):
df.fillna(value)
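A quick sketch of counting and filling missing values. The frame and the fill values (0.0 and "unknown") are arbitrary choices for the example, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing entry per column.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, "z"]})

n_missing_a = df["a"].isnull().sum()  # missing count in one column
missing_per_col = df.isnull().sum()   # missing count for every column

# fillna takes a scalar, or a dict mapping column name -> fill value.
filled = df.fillna({"a": 0.0, "b": "unknown"})
```

fillna returns a new dataframe here; df itself still contains the missing values.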
Do two dataframes have the same index? (shows the rows of df1 whose index labels are missing from df2):
df1[~df1.index.isin(df2.index)]
Align df2 to the index of df1, back-filling the gaps:
filled_df2 = df2.reindex(df1.index, method='bfill')
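A sketch of the index check and the reindex step on two tiny made-up frames. Index.equals is added here as a direct yes/no test; it is not part of the snippets above. Reindexing with method='bfill' assumes the index of df2 is sorted.

```python
import pandas as pd

# Hypothetical frames: df2 only has some of df1's index labels.
df1 = pd.DataFrame({"x": [1, 2, 3, 4]}, index=[1, 2, 3, 4])
df2 = pd.DataFrame({"y": [20.0, 40.0]}, index=[2, 4])

same_index = df1.index.equals(df2.index)        # direct yes/no answer
only_in_df1 = df1[~df1.index.isin(df2.index)]   # rows of df1 absent from df2

# Give df2 the index of df1; each gap takes the next available value.
filled_df2 = df2.reindex(df1.index, method="bfill")
```

With back-fill, label 1 takes the value from label 2 and label 3 takes the value from label 4.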
Merge two dataframes:
df_new = pd.concat([df_a, df_b], axis=1)
pd.merge(df_new, df_n, left_on='subject_id', right_on='subject_id')
pd.merge(df_a, df_b, on='subject_id', how='inner')
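To see the difference between concat and merge, here is a sketch with two made-up frames sharing a subject_id column. concat with axis=1 only lines rows up by index, while merge matches rows on the key values.

```python
import pandas as pd

# Hypothetical frames that overlap on subject_id 2 and 3.
df_a = pd.DataFrame({"subject_id": [1, 2, 3], "name": ["Ann", "Bob", "Cid"]})
df_b = pd.DataFrame({"subject_id": [2, 3, 4], "score": [85, 70, 95]})

# Side by side, aligned on the row index (0, 1, 2), not on subject_id.
side_by_side = pd.concat([df_a, df_b], axis=1)

# Keep only subject_ids present in both frames.
inner = pd.merge(df_a, df_b, on="subject_id", how="inner")

# Keep every subject_id; unmatched cells become NaN.
outer = pd.merge(df_a, df_b, on="subject_id", how="outer")
```

The concat result ends up with two subject_id columns, which is usually a sign that merge was the operation you wanted.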
Split a dataframe into groups:
grouped = df.groupby([key1, key2, ...])
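A groupby object does nothing until you aggregate it. A short sketch with made-up keys and values:

```python
import pandas as pd

# Hypothetical data: two grouping keys and one numeric column.
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b"],
    "key2": ["x", "y", "x", "x"],
    "value": [1, 2, 3, 4],
})

grouped = df.groupby(["key1", "key2"])

sums = grouped["value"].sum()  # Series indexed by (key1, key2) pairs
sizes = grouped.size()         # number of rows in each group
```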
Reshape a dataframe into a pivot table with aggregation:
table = pd.pivot_table(df, values='D', index=['A', 'B'],
              columns=['C'], aggfunc=np.sum)
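A sketch of the pivot table on a made-up frame with the same column names A, B, C and D. It passes the string "sum" instead of np.sum, since, as far as I know, recent pandas versions emit a FutureWarning for the NumPy callable; the result is the same.

```python
import pandas as pd

# Hypothetical data using the column names from the snippet above.
df = pd.DataFrame({
    "A": ["foo", "foo", "bar", "bar"],
    "B": ["one", "one", "two", "two"],
    "C": ["small", "large", "small", "small"],
    "D": [1, 2, 3, 4],
})

# Rows are (A, B) pairs, columns are the values of C, cells are sums of D.
table = pd.pivot_table(df, values="D", index=["A", "B"],
                       columns=["C"], aggfunc="sum")
```

Combinations that never occur in the data, such as (bar, two, large) here, show up as NaN unless you pass fill_value.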
Compute a simple cross-tabulation of two (or more) factors. By default, this computes a frequency table of the factors:
pd.crosstab(df.A, df.B).apply(lambda r: r/r.sum(), axis=1)  # each row as proportions summing to 1
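A sketch of the cross-tabulation with made-up factors. It also shows the normalize="index" option of crosstab, which should give the same row proportions as the apply version without the lambda.

```python
import pandas as pd

# Hypothetical two-factor data.
df = pd.DataFrame({
    "A": ["x", "x", "y", "y", "y"],
    "B": ["p", "q", "p", "p", "q"],
})

counts = pd.crosstab(df.A, df.B)  # raw frequency table

# Divide each row by its total so every row sums to 1.
row_share = counts.apply(lambda r: r / r.sum(), axis=1)

# Built-in alternative that normalizes over each row.
row_share_builtin = pd.crosstab(df.A, df.B, normalize="index")
```

Multiply by 100 if you want actual percentages rather than fractions.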