dplyr: filter, select, mutate, summarise — R Module 6

dplyr is the data-manipulation grammar of the tidyverse. Five verbs — filter, select, mutate, summarise, arrange — plus group_by cover most analysis work. The pipe operator chains them into readable pipelines.

The pipe |> (or %>%)

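The pipe takes the result of the expression on its left and passes it as the first argument of the function call on its right. R 4.1+ has a built-in |>; the tidyverse %>% (from the magrittr package) predates it. The two behave the same in simple chains like the ones in this module, but they are not identical: for example, the placeholder is _ for |> (R 4.2+) and . for %>%. Either is fine; use whichever your team uses.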

library(dplyr)

# Without pipe — nested, hard to read
arrange(filter(bankrates, lending_rate > 0.13), desc(month))

# With pipe — top to bottom
bankrates |>
    filter(lending_rate > 0.13) |>
    arrange(desc(month))
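
To make the "first argument" rule concrete, here is a minimal sketch that checks the nested and piped forms give the same result. It assumes a small made-up data frame, toy_rates, standing in for bankrates; its name and values are invented for illustration only.

```r
library(dplyr)

# Hypothetical stand-in for bankrates; the values are invented
toy_rates <- data.frame(
    month = c("2023-11", "2023-12", "2024-01", "2024-02"),
    lending_rate = c(0.125, 0.131, 0.135, 0.128),
    deposit_rate = c(0.070, 0.072, 0.075, 0.071)
)

# Nested form: read from the inside out
nested <- arrange(filter(toy_rates, lending_rate > 0.13), desc(month))

# Piped form: x |> f(y) is evaluated as f(x, y)
piped <- toy_rates |>
    filter(lending_rate > 0.13) |>
    arrange(desc(month))

# Both forms run the same calls, so the results match
identical(nested, piped)
```

Replacing |> with %>% in the piped form should give the same result for a chain this simple.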

filter — select rows

bankrates |> filter(lending_rate > 0.13)
bankrates |> filter(lending_rate > 0.13, month >= "2024-01")  # AND; assumes month is "YYYY-MM" text
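
The following sketch contrasts the comma (AND) form with an explicit OR, and shows how filter treats missing values. It uses a made-up toy_rates data frame rather than bankrates; the column values, including the NA, are assumptions chosen for illustration.

```r
library(dplyr)

# Hypothetical stand-in for bankrates; note the missing lending_rate
toy_rates <- data.frame(
    month = c("2023-11", "2023-12", "2024-01", "2024-02"),
    lending_rate = c(0.125, NA, 0.135, 0.128)
)

# Comma-separated conditions must all be TRUE (AND)
both <- toy_rates |> filter(lending_rate > 0.126, month >= "2024-01")

# Use | when either condition is enough (OR)
either <- toy_rates |> filter(lending_rate > 0.13 | month == "2023-11")

# Rows where the condition evaluates to NA are dropped
non_missing <- toy_rates |> filter(lending_rate > 0)

both$month
either$month
nrow(non_missing)
```

The comparison on month works here only because the values are "YYYY-MM" strings, which sort in calendar order; a Date column would need a Date on the right-hand side.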

select — pick columns

bankrates |> select(month, lending_rate)
bankrates |> select(-deposit_rate)              # exclude
bankrates |> select(starts_with("l"))           # helpers
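
As a quick check of what each select call keeps, the sketch below prints the resulting column names. The toy_rates data frame is a hypothetical stand-in with the three columns used in this module; the real bankrates may contain others.

```r
library(dplyr)

# Hypothetical stand-in for bankrates; values are invented
toy_rates <- data.frame(
    month = c("2024-01", "2024-02"),
    lending_rate = c(0.135, 0.128),
    deposit_rate = c(0.075, 0.071)
)

# Keep columns by name
by_name <- toy_rates |> select(month, lending_rate) |> names()

# Drop one column, keep the rest
dropped <- toy_rates |> select(-deposit_rate) |> names()

# Helpers match on column names, not on values
l_cols <- toy_rates |> select(starts_with("l")) |> names()
rate_cols <- toy_rates |> select(ends_with("_rate")) |> names()

by_name
dropped
l_cols
rate_cols
```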

mutate — create new columns

bankrates |>
    mutate(spread = lending_rate - deposit_rate,
           spread_pct = spread * 100)
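
Two details of the call above are worth seeing in action: a column created earlier in a mutate call can be used later in the same call, and mutate returns a new data frame rather than changing the original. A minimal sketch, assuming a made-up toy_rates data frame with rates stored as fractions:

```r
library(dplyr)

# Hypothetical stand-in for bankrates; values are invented
toy_rates <- data.frame(
    month = c("2024-01", "2024-02"),
    lending_rate = c(0.135, 0.128),
    deposit_rate = c(0.075, 0.071)
)

# spread_pct can refer to spread because mutate evaluates its arguments in order
with_spread <- toy_rates |>
    mutate(spread = lending_rate - deposit_rate,
           spread_pct = spread * 100)

names(toy_rates)     # the original is unchanged
names(with_spread)   # original columns plus spread and spread_pct
```

To keep the new columns, assign the result to a name, as done with with_spread here.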

summarise + group_by — split-apply-combine

# Single summary
bankrates |> summarise(mean_lending = mean(lending_rate))

# Grouped
bankrates |>
    group_by(year) |>
    summarise(
        mean_lending = mean(lending_rate),
        mean_deposit = mean(deposit_rate),
        n = n()
    )
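
The grouped summary can be tried on a small example where the answer is easy to verify by hand. This sketch assumes a made-up toy_rates data frame with a year column; whether bankrates already has such a column is not stated here, so treat the column layout as an assumption.

```r
library(dplyr)

# Hypothetical stand-in for bankrates; two rows per year, values invented
toy_rates <- data.frame(
    year = c(2023, 2023, 2024, 2024),
    lending_rate = c(0.12, 0.14, 0.13, 0.15),
    deposit_rate = c(0.06, 0.08, 0.07, 0.09)
)

# Split by year, apply the summaries, combine into one row per year
by_year <- toy_rates |>
    group_by(year) |>
    summarise(
        mean_lending = mean(lending_rate),
        mean_deposit = mean(deposit_rate),
        n = n()
    )

by_year
```

If a column contains missing values, mean() returns NA for that group unless na.rm = TRUE is passed.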

arrange — sort

bankrates |> arrange(lending_rate)               # ascending
bankrates |> arrange(desc(lending_rate))         # descending
bankrates |> arrange(year, desc(lending_rate))   # multi-key

The five-verb workflow

filter rows → select columns → mutate to create new columns → group_by + summarise to aggregate → arrange to sort. Most analysis pipelines fit this template, though not every pipeline needs every step.
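
The template can be sketched end to end as a single pipeline. The data frame toy_rates, its values, and the 0.126 threshold are all invented for illustration; only the sequence of verbs is the point.

```r
library(dplyr)

# Hypothetical stand-in for bankrates; values are invented
toy_rates <- data.frame(
    year = c(2023, 2023, 2024, 2024, 2024),
    month = c("2023-11", "2023-12", "2024-01", "2024-02", "2024-03"),
    lending_rate = c(0.125, 0.131, 0.135, 0.128, 0.140),
    deposit_rate = c(0.070, 0.072, 0.075, 0.071, 0.078)
)

yearly_spread <- toy_rates |>
    filter(lending_rate > 0.126) |>                    # keep rows of interest
    select(year, lending_rate, deposit_rate) |>        # keep needed columns
    mutate(spread = lending_rate - deposit_rate) |>    # derive a new column
    group_by(year) |>                                  # split by year
    summarise(mean_spread = mean(spread), n = n()) |>  # one row per year
    arrange(desc(mean_spread))                         # largest spread first

yearly_spread
```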

Exercise

Using bankrates, add a spread column (lending_rate − deposit_rate) and arrange the rows by spread in descending order.