Module 06 of 12 · 60 min read · Beginner

dplyr: filter, select, mutate, summarise

The five verbs of data manipulation, pipe %>% (and |>), and group_by.

Learning objectives

By the end of this module, you should be able to:

  • Apply the five dplyr verbs: filter, select, mutate, summarise, arrange
  • Use the pipe (|> or %>%) to chain operations into a top-to-bottom pipeline
  • Combine group_by with summarise for split-apply-combine analyses
  • Choose between |> (R 4.1+ native) and %>% (magrittr/tidyverse) based on context

dplyr is the data-manipulation grammar of the tidyverse. Five verbs — filter, select, mutate, summarise, arrange — plus group_by cover most analysis work. The pipe operator chains them into readable pipelines.

The pipe |> (or %>%)

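The pipe takes the result of the expression on its left and passes it as the first argument of the function call on its right, so x |> f(y) is the same as f(x, y). R 4.1+ has a built-in |>; the tidyverse %>% (from magrittr) predates it. For simple chains like the ones in this module the two behave the same; they differ mainly in placeholder syntax when the value has to go somewhere other than the first argument. Use whichever your team uses.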

r
library(dplyr)
# Without pipe — nested, hard to read
arrange(filter(bankrates, lending_rate > 0.13), desc(month))
# With pipe — top to bottom
bankrates |>
  filter(lending_rate > 0.13) |>
  arrange(desc(month))
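
To make the comparison between the two pipes concrete, here is a minimal sketch. The small bankrates data frame is a hypothetical stand-in with made-up values (the real dataset's columns and values may differ), and the lm() call is only there to illustrate piping into an argument that is not the first one.

```r
library(dplyr)  # dplyr also re-exports the magrittr pipe %>%

# Hypothetical toy data: illustrative values only, not the real bankrates.
bankrates <- data.frame(
  month        = c("2023-11", "2023-12", "2024-01", "2024-02"),
  lending_rate = c(0.125, 0.130, 0.135, 0.140),
  deposit_rate = c(0.070, 0.072, 0.075, 0.080)
)

# Simple chains: both pipes produce the same result.
native   <- bankrates |> filter(lending_rate > 0.13) |> arrange(desc(month))
magrittr <- bankrates %>% filter(lending_rate > 0.13) %>% arrange(desc(month))
identical(native, magrittr)

# Piping into a non-first argument is where the syntax differs.
# magrittr uses `.` as the placeholder:
fit_magrittr <- bankrates %>% lm(lending_rate ~ deposit_rate, data = .)

# The native pipe uses `_`, which needs R 4.2 or later and a named argument:
# fit_native <- bankrates |> lm(lending_rate ~ deposit_rate, data = _)
```

The native placeholder line is left commented out so the sketch still parses on R 4.1, where |> exists but `_` does not.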

filter — select rows

r
bankrates |> filter(lending_rate > 0.13)
bankrates |> filter(lending_rate > 0.13, month >= "2024-01") # comma-separated conditions are combined with AND

select — pick columns

r
bankrates |> select(month, lending_rate)
bankrates |> select(-deposit_rate) # exclude
bankrates |> select(starts_with("l")) # helpers

mutate — create new columns

r
bankrates |>
  mutate(spread = lending_rate - deposit_rate,
         spread_pct = spread * 100)

summarise + group_by — split-apply-combine

r
# Single summary
bankrates |> summarise(mean_lending = mean(lending_rate))
# Grouped
bankrates |>
  group_by(year) |>
  summarise(
    mean_lending = mean(lending_rate),
    mean_deposit = mean(deposit_rate),
    n = n()
  )
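
To see what group_by contributes on its own, the sketch below separates the grouping step from the verbs that act on it. The data frame is a hypothetical stand-in with made-up values, and dev_from_year_mean is an illustrative column name, not something from the real dataset.

```r
library(dplyr)

# Hypothetical toy data: illustrative values only, not the real bankrates.
bankrates <- data.frame(
  year         = c(2023, 2023, 2024, 2024),
  month        = c("2023-11", "2023-12", "2024-01", "2024-02"),
  lending_rate = c(0.12, 0.14, 0.13, 0.15),
  deposit_rate = c(0.07, 0.07, 0.08, 0.08)
)

# group_by() alone changes no values; it only records the grouping.
grouped <- bankrates |> group_by(year)
nrow(grouped) == nrow(bankrates)  # the rows are untouched
group_vars(grouped)               # the grouping column is remembered

# summarise() collapses each group to a single row.
by_year <- grouped |>
  summarise(mean_lending = mean(lending_rate), n = n())

# mutate() keeps every row and computes within each group.
with_dev <- grouped |>
  mutate(dev_from_year_mean = lending_rate - mean(lending_rate)) |>
  ungroup()  # drop the grouping so later steps are not grouped by accident
```

Here by_year has one row per year, while with_dev keeps all four rows and adds each month's deviation from its own year's mean.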

arrange — sort

r
bankrates |> arrange(lending_rate) # ascending
bankrates |> arrange(desc(lending_rate)) # descending
bankrates |> arrange(year, desc(lending_rate)) # multi-key

The five-verb workflow

filter rows → select columns → mutate to create new columns → group_by + summarise to aggregate → arrange to sort. Most analysis pipelines fit this template, though not every pipeline needs every step.
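
The template above can be sketched as a single pipeline. This is one possible arrangement, run on a hypothetical toy data frame with made-up values; the 0.12 threshold and the yearly_spread name are arbitrary choices for illustration.

```r
library(dplyr)

# Hypothetical toy data: illustrative values only, not the real bankrates.
bankrates <- data.frame(
  year         = c(2023, 2023, 2024, 2024, 2024),
  month        = c("2023-11", "2023-12", "2024-01", "2024-02", "2024-03"),
  lending_rate = c(0.12, 0.14, 0.13, 0.15, 0.16),
  deposit_rate = c(0.07, 0.07, 0.08, 0.09, 0.09)
)

yearly_spread <- bankrates |>
  filter(lending_rate > 0.12) |>                     # keep some rows
  select(year, lending_rate, deposit_rate) |>        # keep some columns
  mutate(spread = lending_rate - deposit_rate) |>    # add a column
  group_by(year) |>                                  # split by year
  summarise(mean_spread = mean(spread), n = n()) |>  # one row per year
  arrange(desc(mean_spread))                         # largest spread first

yearly_spread
```

Each line does one job, so steps can be removed or reordered without rewriting the rest of the pipeline.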

Exercise

Using bankrates, compute the spread (lending_rate − deposit_rate) and arrange in descending order.
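
One possible solution, sketched on a hypothetical toy data frame with made-up values so that it runs on its own; with the real bankrates loaded, only the last three lines are needed.

```r
library(dplyr)

# Hypothetical toy data: illustrative values only, not the real bankrates.
bankrates <- data.frame(
  month        = c("2024-01", "2024-02", "2024-03"),
  lending_rate = c(0.13, 0.15, 0.14),
  deposit_rate = c(0.08, 0.07, 0.08)
)

solution <- bankrates |>
  mutate(spread = lending_rate - deposit_rate) |>  # compute the spread
  arrange(desc(spread))                            # widest spread first
```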

Key takeaways

  • The five-verb workflow: filter rows → select columns → mutate to create new columns → group_by + summarise to aggregate → arrange to sort
  • A pipeline reads top to bottom, which is easier to follow than deeply nested function calls
  • group_by() only records the grouping; nothing is computed until a verb such as summarise() or mutate() acts on each group
  • |> and %>% are interchangeable for simple chains like these; use whichever your team uses
LeadAfrik Public Economics Hub