Pandas data wrangling

Pandas in a Jupyter notebook is my favorite way these days to inspect and wrangle data. Here are some common operations:

  1. Import Pandas:

    import pandas as pd
    import numpy as np
  2. Read data:

    df = pd.read_csv(filepath)  # filepath: path or URL of the CSV file
  3. data size:

  4. show a few rows:

    # .ix was removed in pandas 1.0; use .loc for label-based lookup
    df.head(), df.tail(), df[n1:n2], df.loc[index_as_label]
  5. statistics of a data column:

  6. Any missing values?

  7. Fill missing values:

  8. Do two dataframes have the same index?

  9. Give two dataframes the same index:

    filled_df2 = df2.reindex(df1.index, method='bfill')
  10. merge two dataframes:

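    # place the frames side by side, aligning rows on the index
    df_new = pd.concat([df_a, df_b], axis=1)
    # join on a key column; how='inner' is the default
    pd.merge(df_new, df_n, left_on='subject_id', right_on='subject_id')
    pd.merge(df_a, df_b, on='subject_id', how='inner')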
  11. split dataframe into groups:

    grouped = df.groupby([key1, key2, ...])
  12. reshape dataframe to pivot table with aggregation:

    table = pd.pivot_table(df, values='D', index=['A', 'B'],
                           columns=['C'], aggfunc='sum')
  13. compute a simple cross-tabulation of two (or more) factors. By default computes a frequency table of the factors:

    # row-wise proportions; pd.crosstab(df.A, df.B, normalize='index') gives the same result
    pd.crosstab(df.A, df.B).apply(lambda r: r/r.sum(), axis=1)
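
Items 3 and 5 to 8 above list no code. Here is a minimal sketch of one way to do each. It assumes two small made-up dataframes, `df1` and `df2`; the column names `subject_id`, `score`, and `age` are illustrative and not taken from any real dataset. Filling with the column mean is just one of several possible strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; names and values are for illustration only.
df1 = pd.DataFrame(
    {"subject_id": [1, 2, 3, 4], "score": [10.0, np.nan, 30.0, np.nan]},
    index=["a", "b", "c", "d"],
)
df2 = pd.DataFrame({"age": [21, 35, 42, 58]}, index=["a", "b", "c", "d"])

# 3. Data size as (rows, columns)
n_rows, n_cols = df1.shape

# 5. Statistics of a data column: count, mean, std, min, quartiles, max
score_stats = df1["score"].describe()

# 6. Any missing values? Count NaNs per column, then reduce to a single flag.
missing_per_column = df1.isna().sum()
has_missing = df1.isna().any().any()

# 7. Fill missing values, here with the column mean (an assumed strategy)
filled = df1.fillna({"score": df1["score"].mean()})

# 8. Do the two dataframes have the same index (same labels, same order)?
same_index = df1.index.equals(df2.index)
```

`fillna` returns a new dataframe and leaves `df1` unchanged. `Index.equals` is order-sensitive, so two frames with the same labels in a different order compare as different, which is the case item 9's `reindex` addresses.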