Pandas data wrangling
Pandas in a Jupyter notebook is my favorite way these days to inspect and wrangle data. Here are some common usages:
Import Pandas:
import pandas as pd
import numpy as np

Retrieve, or read, data:
df = pd.read_csv(path_to_csv)  # path_to_csv is a placeholder for your file path

Data size:
df.shape

Show a couple of rows:
df.head(), df.tail(), df[n1:n2], df.loc[index_as_label]  # .ix was removed in pandas 1.0; use .loc for labels

Statistics of a data column:
df[col].describe()
df[col].unique()
df[col].value_counts()

Any missing values?
df[col].isnull().sum()  # number of missing values in the column

Fill missing values:
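df.fillna(value)  # a fill value (or a fill method) is required

Do two dataframes have the same index? (rows of df1 whose index is missing from df2):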
df1[~df1.index.isin(df2.index)]

Set two dataframes to the same index:
filled_df2 = df2.reindex(df1.index, method='bfill')  # method='bfill' needs a sorted (monotonic) index on df2

Merge two dataframes:
df_new = pd.concat([df_a, df_b], axis=1)  # place side by side, aligned on the index
pd.merge(df_new, df_n, left_on='subject_id', right_on='subject_id')
pd.merge(df_a, df_b, on='subject_id', how='inner')  # keep only keys present in both

Split a dataframe into groups:
grouped = df.groupby([key1, key2])  # add more keys to the list as needed

Reshape a dataframe into a pivot table with aggregation:
table = pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'], aggfunc=np.sum)

Compute a simple cross-tabulation of two (or more) factors. By default this computes a frequency table of the factors:
pd.crosstab(df.A, df.B).apply(lambda r: r/r.sum(), axis=1)  # each row as a fraction of its row total
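
To see several of the snippets above working together, here is a minimal sketch on toy data. The dataframes and the column names `group`, `score` and `passed` are made up for illustration; only `subject_id` comes from the merge example above. It also uses `normalize='index'`, a built-in `pd.crosstab` option that should give the same row fractions as the `apply` version.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names other than subject_id are assumptions.
df_a = pd.DataFrame({
    "subject_id": [1, 2, 3, 4],
    "group": ["x", "x", "y", "y"],
    "score": [10.0, np.nan, 30.0, 40.0],
})
df_b = pd.DataFrame({
    "subject_id": [2, 3, 4, 5],
    "passed": ["yes", "no", "yes", "yes"],
})

# Any missing values? Count them, then fill with the column mean.
n_missing = df_a["score"].isnull().sum()
df_a["score"] = df_a["score"].fillna(df_a["score"].mean())

# Rows of df_a whose index label (subject_id) does not appear in df_b.
a_idx = df_a.set_index("subject_id")
b_idx = df_b.set_index("subject_id")
only_in_a = a_idx[~a_idx.index.isin(b_idx.index)]

# Inner merge keeps only the subject_ids present in both frames.
merged = pd.merge(df_a, df_b, on="subject_id", how="inner")

# Split into groups and aggregate.
mean_by_group = merged.groupby("group")["score"].mean()

# Pivot table: one row per group, one column per value of 'passed'.
table = pd.pivot_table(merged, values="score", index="group",
                       columns="passed", aggfunc="sum")

# Cross-tabulation with each row normalized to fractions of its row total.
fractions = pd.crosstab(merged["group"], merged["passed"], normalize="index")

print(n_missing)
print(only_in_a)
print(merged)
print(mean_by_group)
print(table)
print(fractions)
```

An inner merge silently drops subjects 1 and 5 here, so checking the indexes first (as with `only_in_a`) is a cheap way to see what a merge is about to lose.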