Pandas data wrangling — Python Module 9 | LeadAfrik Public Economics Hub

Pandas data wrangling — filtering, grouping, merging, pivoting, and reshaping — is what you'll do for 80% of any real analysis. The vocabulary is small: five verbs cover almost everything.

Filter — select rows

python

# Boolean mask
tier1 = df[df['tier'] == 1]
big = df[df['assets_bn'] > 1000]
both = df[(df['tier'] == 1) & (df['assets_bn'] > 1000)]

# query() — readable for complex filters
df.query('tier == 1 and assets_bn > 1000')

Group-by aggregate — split, apply, combine

groupby is the most powerful single operation in pandas. It splits the DataFrame by a key, applies an aggregation, and combines the results.

python

df.groupby('tier')['assets_bn'].sum()
df.groupby('tier').agg({'assets_bn': 'sum', 'bank': 'count'})

# Multiple keys
df.groupby(['tier', 'region'])['assets_bn'].mean()

Merge — join two DataFrames

python

merged = pd.merge(df_left, df_right, on='bank', how='inner')
# how can be: 'inner', 'left', 'right', 'outer'

Pivot and melt — reshape wide↔long

python

# Long → wide
wide = df.pivot(index='month', columns='bank', values='rate')

# Wide → long
long = wide.melt(ignore_index=False, var_name='bank', value_name='rate').reset_index()

Method chaining

Pandas methods return DataFrames, which lets you chain operations into a pipeline. This is the single most readable way to express a multi-step transformation.

python

result = (
    df
    .query('tier == 1')
    .groupby('region')['assets_bn']
    .sum()
    .sort_values(ascending=False)
    .head(5)
)

Method chaining is the Pythonic data style

Each step does one thing. The pipeline reads top-to-bottom. Compare to nesting: head(5)(sort_values(False)(...)) is unreadable. Chain it.

Exercise

Using bankrates, group by month and compute the mean lending_rate.