Skip to content
Module 09 of 1265 min readBeginner

Pandas data wrangling

Filter, group-by, merge, pivot, melt — the five verbs that cover 90% of analysis work.

75%

Listen along

Read “Pandas data wrangling” aloud

Plays in your browser using on-device text-to-speech — nothing leaves the page.

Learning objectives

By the end of this module, you should be able to:

  • 01Filter rows with boolean masks and the .query() method
  • 02Apply groupby split-apply-combine for aggregations and transformations
  • 03Merge two DataFrames using inner, left, right, and outer joins
  • 04Reshape data between wide and long formats with pivot, melt, and stack/unstack

Pandas data wrangling — filtering, grouping, merging, pivoting, and reshaping — is what you'll do for 80% of any real analysis. The vocabulary is small: five verbs cover almost everything.

Filter — select rows

python
# Boolean mask
tier1 = df[df['tier'] == 1]
big = df[df['assets_bn'] > 1000]
both = df[(df['tier'] == 1) & (df['assets_bn'] > 1000)]
# query() — readable for complex filters
df.query('tier == 1 and assets_bn > 1000')

Group-by aggregate — split, apply, combine

groupby is the most powerful single operation in pandas. It splits the DataFrame by a key, applies an aggregation, and combines the results.

python
df.groupby('tier')['assets_bn'].sum()
df.groupby('tier').agg({'assets_bn': 'sum', 'bank': 'count'})
# Multiple keys
df.groupby(['tier', 'region'])['assets_bn'].mean()

Merge — join two DataFrames

python
merged = pd.merge(df_left, df_right, on='bank', how='inner')
# how can be: 'inner', 'left', 'right', 'outer'

Pivot and melt — reshape wide↔long

python
# Long → wide
wide = df.pivot(index='month', columns='bank', values='rate')
# Wide → long
long = wide.melt(ignore_index=False, var_name='bank', value_name='rate').reset_index()

Method chaining

Pandas methods return DataFrames, which lets you chain operations into a pipeline. This is the single most readable way to express a multi-step transformation.

python
result = (
df
.query('tier == 1')
.groupby('region')['assets_bn']
.sum()
.sort_values(ascending=False)
.head(5)
)

Method chaining is the Pythonic data style

Each step does one thing. The pipeline reads top-to-bottom. Compare to nesting: head(5)(sort_values(False)(...)) is unreadable. Chain it.

Exercise

Using bankrates, group by month and compute the mean lending_rate.

Key takeaways

  • groupby is the most powerful single operation in pandas — internalise the split-apply-combine pattern
  • Method chaining (df.query().groupby().agg().sort_values()) is the idiomatic pandas style
  • Long format is for analysis; wide format is for reporting. Convert at the boundaries
  • Five verbs cover 90% of analysis: filter, select, mutate, group_by+summarise, sort

Further reading

  1. 01

    Python for Data Analysis, Chapter 10: Data Aggregation and Group Operations

    Wes McKinney · O'Reilly · 2022

  2. 02

    Tidy Data

    Hadley Wickham · Journal of Statistical Software · 2014

  3. 03
Loading progress…
LeadAfrikPublic Economics Hub