Pandas data wrangling

Pandas in a Jupyter notebook is my favorite way these days to inspect and wrangle data. Here are some common operations:

  1. Import Pandas:

    import pandas as pd
    import numpy as np
    
  2. Retrieve or read data:

    df = pd.read_csv('data.csv')  # 'data.csv' is a placeholder path
    
  3. Data size:

    df.shape  # (number of rows, number of columns)
    
  4. Show a few rows:

    df.head(), df.tail(), df[n1:n2], df.loc[index_as_label]  # .ix was removed in pandas 1.0; use .loc
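
A small sketch of the row selectors above, on a made-up frame (the column name `score` and the labels `a` to `d` are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical toy frame with string labels as the index.
df = pd.DataFrame({"score": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

first_two = df.head(2)      # first 2 rows
last_one = df.tail(1)       # last row
middle = df[1:3]            # positional slice: rows at positions 1 and 2
by_label = df.loc["c"]      # label-based lookup, returns a Series
by_position = df.iloc[2]    # position-based lookup of the same row
```

`.loc` works on labels and `.iloc` on integer positions, which together cover what `.ix` used to do.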
    
  5. Statistics of a data column:

    df[col].describe()
    df[col].unique()
    df[col].value_counts()
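
To see what these three calls return, here is a minimal example on an invented `color` column (the data is hypothetical):

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"color": ["red", "blue", "red", "green", "red"]})
col = "color"

summary = df[col].describe()      # count, unique, top, freq for a non-numeric column
distinct = df[col].unique()       # distinct values in order of first appearance
counts = df[col].value_counts()   # frequency of each value, most common first
```

For a numeric column, `describe()` reports mean, standard deviation, min, max, and quartiles instead.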
    
  6. Any missing values?

    df[col].isnull().sum()
    
  7. Fill missing values:

    df = df.fillna(0)  # a fill value is required; 0 is a placeholder
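
A sketch of counting and filling missing values together, assuming a single float column `x` (the data and the choice of fill strategies are only illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
df = pd.DataFrame({"x": [1.0, np.nan, 3.0, np.nan]})

n_missing = df["x"].isnull().sum()    # number of NaN entries in the column
filled_const = df.fillna(0)           # replace every NaN with a constant
filled_mean = df.fillna(df.mean())    # replace NaN with each column's mean
filled_ffill = df.ffill()             # carry the last valid value forward
```

Each call returns a new dataframe; the original `df` keeps its NaN values unless you reassign.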
    
  8. Do two dataframes have the same index? (Shows the rows of df1 whose index is missing from df2):

    df1[~df1.index.isin(df2.index)]
    
  9. Reindex df2 to match df1's index, back-filling the gaps:

    filled_df2 = df2.reindex(df1.index, method='bfill')
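
The index check and the reindex can be tried on two tiny frames. The frames and the `same_index` check via `Index.equals` are my own additions for illustration:

```python
import pandas as pd

# Hypothetical frames: df2 only has rows for labels 1 and 3.
df1 = pd.DataFrame({"a": [1, 2, 3, 4]}, index=[0, 1, 2, 3])
df2 = pd.DataFrame({"b": [10.0, 30.0]}, index=[1, 3])

only_in_df1 = df1[~df1.index.isin(df2.index)]   # rows of df1 absent from df2
same_index = df1.index.equals(df2.index)        # True only if the indexes match exactly

# Each new label takes the value of the next existing label at or after it.
filled_df2 = df2.reindex(df1.index, method="bfill")
```

Note that `method='bfill'` requires the index of `df2` to be sorted (monotonic); otherwise `reindex` raises an error.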
    
  10. Merge two dataframes:

    df_new = pd.concat([df_a, df_b], axis=1)
    pd.merge(df_new, df_n, left_on='subject_id', right_on='subject_id')
    pd.merge(df_a, df_b, on='subject_id', how='inner')
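
To make the difference between `concat` and `merge` concrete, here is a sketch with two invented frames sharing a `subject_id` column (the `name` and `score` columns are assumptions):

```python
import pandas as pd

# Hypothetical frames that overlap on subject_id 2 and 3 only.
df_a = pd.DataFrame({"subject_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
df_b = pd.DataFrame({"subject_id": [2, 3, 4], "score": [88, 92, 75]})

# concat with axis=1 glues columns together by row index, ignoring subject_id.
side_by_side = pd.concat([df_a, df_b], axis=1)

# merge matches rows on the key column.
inner = pd.merge(df_a, df_b, on="subject_id", how="inner")   # keys present in both
outer = pd.merge(df_a, df_b, on="subject_id", how="outer")   # keys present in either
```

With `how='outer'`, rows that have no match on one side get NaN in that side's columns.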
    
  11. Split a dataframe into groups:

    grouped = df.groupby([key1, key2, ...])
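
A `groupby` object does nothing until you aggregate it. A minimal sketch, assuming two key columns and one numeric `value` column (all names are hypothetical):

```python
import pandas as pd

# Hypothetical data with two grouping keys.
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b"],
    "key2": ["x", "y", "x", "x"],
    "value": [1, 2, 3, 4],
})

grouped = df.groupby(["key1", "key2"])
sums = grouped["value"].sum()   # Series indexed by (key1, key2) pairs
sizes = grouped.size()          # number of rows in each group
```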
    
  12. Reshape a dataframe into a pivot table with aggregation:

    table = pd.pivot_table(df, values='D', index=['A', 'B'],
                           columns=['C'], aggfunc='sum')
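
Here is the same pivot on a tiny invented frame, so the shape of the result is easier to picture (the values in columns `A` to `D` are made up):

```python
import pandas as pd

# Hypothetical data: A and B become the row index, C becomes the columns.
df = pd.DataFrame({
    "A": ["foo", "foo", "bar", "bar"],
    "B": ["one", "one", "two", "two"],
    "C": ["small", "large", "small", "small"],
    "D": [1, 2, 3, 4],
})

# Sum D for every (A, B) row and C column combination.
table = pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"], aggfunc="sum")
```

Combinations that never occur in the data, such as (`bar`, `two`) with `large`, show up as NaN unless you pass `fill_value`.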
    
  13. Compute a simple cross-tabulation of two (or more) factors. By default this produces a frequency table of the factors:

    pd.crosstab(df.A, df.B).apply(lambda r: r/r.sum(), axis=1)  # row-wise proportions
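
A sketch of the cross-tabulation on invented data. It also shows `normalize='index'`, a built-in `crosstab` option that should give the same row-wise proportions as the `apply` version:

```python
import pandas as pd

# Hypothetical factors A and B.
df = pd.DataFrame({"A": ["x", "x", "y", "y", "y"], "B": ["p", "q", "p", "p", "q"]})

counts = pd.crosstab(df.A, df.B)                            # raw frequency table
row_share = counts.apply(lambda r: r / r.sum(), axis=1)     # each row divided by its total
row_share_builtin = pd.crosstab(df.A, df.B, normalize="index")
```

Multiply by 100 if you want actual percentages rather than fractions that sum to 1 per row.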