Module 10 of 12 · 50 min read · Beginner

Factors and model formulas

Factors and levels, the model formula DSL (y ~ x + z), interactions with *, and contrasts.

Learning objectives

By the end of this module, you should be able to:

  • Create factors with explicit level ordering and use them in regression
  • Read the formula DSL: + for main effects, * for main + interaction, : for interaction only, - for exclusion
  • Recognise R's default treatment contrasts and how to change them
  • Use the margins package to translate factor-variable coefficients into marginal effects

Factors are R's way of representing categorical data. Model formulas (the y ~ x DSL) are how regressions are specified. Knowing how factors interact with formulas is the difference between competent and confused regression in R.

Factors — categorical with levels

r
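x <- c("low", "high", "low", "medium", "high")
# levels = fixes the level order; ordered = TRUE also makes this an ordered factor (low < medium < high)
f <- factor(x, levels = c("low", "medium", "high"), ordered = TRUE)
levels(f)  # levels in the order given above
table(f)   # counts per level, reported in level order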

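The order of levels matters. For an unordered factor (the default, without ordered = TRUE), the first level is the reference category in regression: lm() estimates one coefficient per non-reference level, each measuring the difference from the reference. An ordered factor such as f above is coded with polynomial contrasts by default, so its coefficients (labelled .L, .Q, and so on) are trend terms rather than differences from a reference level.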
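To see the reference category at work, here is a minimal sketch on made-up data. The data frame d and its columns level and income are hypothetical, chosen only so the group means are easy to read off. It fits lm() on an unordered factor and then moves the reference with relevel().

```r
# Hypothetical toy data: three observations per category
d <- data.frame(
  level  = factor(rep(c("low", "medium", "high"), each = 3),
                  levels = c("low", "medium", "high")),  # unordered, "low" first
  income = c(10, 11, 12, 20, 21, 22, 30, 31, 32)
)

fit <- lm(income ~ level, data = d)
coef(fit)
# (Intercept) is the mean of the reference level "low";
# levelmedium and levelhigh are differences from that mean.

# Make "high" the reference instead; the other levels keep their relative order
d$level2 <- relevel(d$level, ref = "high")
fit2 <- lm(income ~ level2, data = d)
names(coef(fit2))
```

The fitted values are identical in both models; only the baseline that the coefficients are measured against changes.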

Formula DSL

  • y ~ x — simple regression
  • y ~ x1 + x2 — multiple regression
  • y ~ x1 + x2 + x1:x2 — with interaction
  • y ~ x1 * x2 — main effects + interaction (shorthand for above)
  • y ~ . — all other variables in the data frame
  • y ~ x1 - 1 — no intercept
  • y ~ I(x^2) — use the literal expression (I() escapes formula special meaning)
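One way to check how R reads a formula is to expand it with terms(), which needs no data as long as the formula contains no dot. The sketch below uses placeholder variable names (y, x1, x2) and only base R.

```r
# terms() expands a formula into its individual terms
star  <- attr(terms(y ~ x1 * x2), "term.labels")
spelt <- attr(terms(y ~ x1 + x2 + x1:x2), "term.labels")
identical(star, spelt)   # * is shorthand for main effects plus interaction

# "- 1" switches the intercept off; the "intercept" attribute is 0 or 1
no_int <- attr(terms(y ~ x1 - 1), "intercept")

# Without I(), ^ is a formula operator: (x1 + x2)^2 means
# "all terms up to two-way interactions", not arithmetic squaring
crossed <- attr(terms(y ~ (x1 + x2)^2), "term.labels")

# With I(), ^ is ordinary arithmetic and the squared variable is one term
squared <- attr(terms(y ~ I(x1^2)), "term.labels")

star; no_int; crossed; squared
```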

Interactions in detail

r
# Interact a continuous and a categorical
lm(y ~ x * group, data = df)
# Equivalent to: y ~ x + group + x:group
# Reads as: 'effect of x is allowed to differ by group'
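The coefficients of an interaction model are easier to read against data whose answer is known. This sketch builds a hypothetical, noise-free df in which group "a" has slope 1 and group "b" has slope 3, then recovers both slopes from the fit. The data and the names slope_a and slope_b are illustrative only.

```r
# Hypothetical data: group "a" follows y = 2 + 1*x, group "b" follows y = 5 + 3*x
df <- data.frame(
  x     = rep(0:4, times = 2),
  group = factor(rep(c("a", "b"), each = 5))
)
df$y <- ifelse(df$group == "a", 2 + 1 * df$x, 5 + 3 * df$x)

fit <- lm(y ~ x * group, data = df)
round(coef(fit), 6)
# x        : slope of x in the reference group "a"
# groupb   : difference in intercept between "b" and "a"
# x:groupb : difference in slope between "b" and "a"

# Group-specific slopes are sums of coefficients
slope_a <- coef(fit)[["x"]]
slope_b <- coef(fit)[["x"]] + coef(fit)[["x:groupb"]]
c(a = slope_a, b = slope_b)
```

Note that the x coefficient alone is the slope for the reference group only, not an overall effect of x.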

Contrasts — what the coefficients mean

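By default, R uses treatment contrasts for unordered factors: the first level is the reference, and each remaining coefficient gives the difference from it. Ordered factors default to polynomial contrasts. You can switch to sum-to-zero, Helmert, or other codings by assigning to contrasts(f) for a single factor, by passing the contrasts argument to lm(), or session-wide with options(contrasts = ...).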
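The coding matrices behind these names can be inspected directly. The following sketch uses a small illustrative factor f and base R only. It prints the default coding, swaps in sum-to-zero coding for one factor, and shows what an ordered factor gets.

```r
f <- factor(c("low", "medium", "high"), levels = c("low", "medium", "high"))

contrasts(f)                       # treatment coding: "low" is the all-zero row

f_sum <- f
contrasts(f_sum) <- contr.sum(3)   # sum-to-zero coding, for this factor only
contrasts(f_sum)                   # each column sums to zero

f_ord <- as.ordered(f)
contrasts(f_ord)                   # polynomial coding: columns .L and .Q

getOption("contrasts")             # session defaults for unordered and ordered factors
```

Each row of a coding matrix is one level and each column becomes one coefficient in the model, so changing the matrix changes what the coefficients mean without changing the fitted values.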

Margins — making coefficients human-readable

r
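library(margins)
# df is assumed to be a data frame with columns y, x (numeric) and group (factor)
model <- lm(y ~ x * group, data = df)
margins(model) # average marginal effects
margins(model, at = list(x = c(0, 1, 2))) # marginal effects with x set to 0, 1 and 2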
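For a linear model like this one, the average marginal effect of x can be cross-checked by hand in base R, which helps show what the number means. The sketch below assumes a hypothetical noise-free df (slope 1 in group "a", slope 3 in group "b"). The names me_x and ame_x are illustrative, and the result should agree with the margins output for x up to numerical tolerance.

```r
# Hypothetical data: slope of x is 1 in group "a" and 3 in group "b"
df <- data.frame(
  x     = rep(0:4, times = 2),
  group = factor(rep(c("a", "b"), each = 5))
)
df$y <- ifelse(df$group == "a", 2 + df$x, 5 + 3 * df$x)
model <- lm(y ~ x * group, data = df)

# Marginal effect of x for each row: the slope that applies to that row's group
b <- coef(model)
me_x <- b[["x"]] + b[["x:groupb"]] * (df$group == "b")

# Average marginal effect: the mean of the row-level slopes
ame_x <- mean(me_x)
ame_x
```

Because half the rows belong to each group, the average lands midway between the two group slopes. With unbalanced groups it would be weighted toward the larger group.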

Factor levels in unexpected order

If you read 'low', 'medium', 'high' as character data, factor() defaults to alphabetical levels: high, low, medium. Always specify levels = ... explicitly when order matters.
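A short base-R check makes the pitfall concrete. The vector x is illustrative.

```r
x <- c("low", "medium", "high")

levels(factor(x))   # sorted alphabetically, not in the order the values appear

f <- factor(x, levels = c("low", "medium", "high"))
levels(f)           # the order requested

as.integer(f)       # the underlying integer codes follow the level order
```

Because the first level becomes the regression reference for an unordered factor, the alphabetical default would silently make "high" the baseline here.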

Exercise

Convert c('low', 'high', 'medium') to a factor with levels ordered low, medium, high.

Key takeaways

  • Factor levels matter: for an unordered factor, the first level is the regression reference category by default
  • y ~ x * z = y ~ x + z + x:z (main effects + interaction). Use * to interact, : for interaction only
  • Treatment contrasts (the default for unordered factors) give coefficients as differences from the reference category
  • Use I() to protect literal expressions: lm(y ~ I(x^2)) keeps ^ as exponentiation instead of formula meaning

Further reading

  1. Applied Regression Analysis and Generalized Linear Models (3rd Edition)

    John Fox · Sage · 2016
LeadAfrik Public Economics Hub