Factors are R's way of representing categorical data. Model formulas (the y ~ x DSL) are how regressions are specified. Knowing how factors interact with formulas is the difference between competent and confused regression in R.
Factors — categorical with levels
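x <- c("low", "high", "low", "medium", "high")
# Set the level order explicitly; ordered = TRUE makes this an ordered factor
f <- factor(x, levels = c("low", "medium", "high"), ordered = TRUE)
levels(f)   # "low" "medium" "high"
table(f)    # counts per level, shown in level order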
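The order of levels matters: for an unordered factor, the first level is the reference category in regression. lm() with such a predictor estimates one coefficient per non-reference level, each capturing the difference from the reference. An ordered factor (ordered = TRUE, as above) gets polynomial contrasts by default instead, so its coefficients describe linear, quadratic, and higher-order trends across the levels rather than differences from a reference.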
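As a minimal sketch of the reference-level behaviour, the following fits lm() on simulated data with an unordered factor. The variable names (score, level, level_high) and the group means are assumptions made up for illustration.

```r
# Simulated data: three groups with assumed true means 10, 12, 15
set.seed(1)
level <- factor(rep(c("low", "medium", "high"), each = 20),
                levels = c("low", "medium", "high"))
score <- unname(c(low = 10, medium = 12, high = 15)[as.character(level)]) +
  rnorm(60)

# "low" is the first level, so it is the reference category
fit <- lm(score ~ level)
names(coef(fit))   # "(Intercept)" "levelmedium" "levelhigh"

# The intercept is the mean of the reference group;
# levelmedium is mean(medium) - mean(low), and so on
coef(fit)

# relevel() moves a different level to the front to change the reference
level_high <- relevel(level, ref = "high")
fit_high <- lm(score ~ level_high)
names(coef(fit_high))   # "(Intercept)" "level_highlow" "level_highmedium"
```

Changing the reference does not change the fitted values, only which comparisons the coefficients express.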
Formula DSL
- y ~ x — simple regression
- y ~ x1 + x2 — multiple regression
- y ~ x1 + x2 + x1:x2 — with interaction
- y ~ x1 * x2 — main effects + interaction (shorthand for above)
- y ~ . — all other variables in the data frame
- y ~ x1 - 1 — no intercept
- y ~ I(x^2) — use the literal expression (I() escapes formula special meaning)
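One way to check what a formula actually expands to is to look at the columns of its design matrix with model.matrix(). The sketch below uses a small made-up data frame d; the column names y, x1, and x2 are illustrative only.

```r
# Tiny illustrative data frame (values are arbitrary)
d <- data.frame(y  = c(1, 2, 3, 4),
                x1 = c(0, 1, 2, 3),
                x2 = c(1, 0, 1, 0))

colnames(model.matrix(y ~ x1 + x2, d))   # "(Intercept)" "x1" "x2"
colnames(model.matrix(y ~ x1 * x2, d))   # "(Intercept)" "x1" "x2" "x1:x2"
colnames(model.matrix(y ~ ., d))         # "(Intercept)" "x1" "x2"
colnames(model.matrix(y ~ x1 - 1, d))    # "x1"

# Inside a formula, ^ means crossing, so x1^2 collapses to plain x1 ...
colnames(model.matrix(y ~ x1^2, d))      # "(Intercept)" "x1"
# ... while I() protects the arithmetic meaning
colnames(model.matrix(y ~ I(x1^2), d))   # "(Intercept)" "I(x1^2)"
```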
Interactions in detail
# Interact a continuous and a categorical predictor
lm(y ~ x * group, data = df)
# Equivalent to: y ~ x + group + x:group
# Reads as: 'effect of x is allowed to differ by group'
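To make the interaction concrete, here is a sketch on simulated data in which the slope of x is assumed to be 1 in group a and 3 in group b. The data, the group labels, and the names slope_a and slope_b are all invented for the example.

```r
set.seed(42)
df <- data.frame(
  x     = runif(100),
  group = factor(rep(c("a", "b"), each = 50))
)
# Assumed true slopes: 1 in group a, 3 in group b
df$y <- ifelse(df$group == "a", 1, 3) * df$x + rnorm(100, sd = 0.1)

fit <- lm(y ~ x * group, data = df)
names(coef(fit))   # "(Intercept)" "x" "groupb" "x:groupb"

# "x" is the slope in the reference group (a);
# "x:groupb" is how much the slope changes in group b
slope_a <- coef(fit)[["x"]]
slope_b <- coef(fit)[["x"]] + coef(fit)[["x:groupb"]]
c(a = slope_a, b = slope_b)
```

Because the model lets both intercept and slope vary by group, slope_b matches the slope from fitting y ~ x on group b alone.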
Contrasts — what the coefficients mean
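By default, R uses 'treatment contrasts' for unordered factors: the first factor level is the reference, and each remaining coefficient gives the difference from that reference. Ordered factors default to polynomial contrasts. You can switch to 'sum-to-zero' contrasts (contr.sum), Helmert contrasts (contr.helmert), and others by assigning to contrasts(f), or through the contrasts argument of lm().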
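The coding matrices themselves can be inspected directly. This sketch assumes R's default setting of options("contrasts"); the factor names f and o are illustrative.

```r
f <- factor(c("low", "medium", "high"),
            levels = c("low", "medium", "high"))

# Treatment coding: one column per non-reference level,
# and the reference row ("low") is all zeros
contrasts(f)

# Switch this factor to sum-to-zero coding
contrasts(f) <- contr.sum(3)
contrasts(f)   # each column now sums to zero

# An ordered factor picks up polynomial contrasts by default
o <- factor(c("low", "medium", "high"),
            levels = c("low", "medium", "high"), ordered = TRUE)
colnames(contrasts(o))   # ".L" ".Q" (linear and quadratic trends)
```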
Margins — making coefficients human-readable
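library(margins)   # CRAN package; install.packages("margins") if needed
model <- lm(y ~ x * group, data = df)
margins(model)                                 # average marginal effects
margins(model, at = list(x = c(0, 1, 2)))      # marginal effects evaluated at x = 0, 1, 2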
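To see what an average marginal effect means without the package, the following base-R sketch computes it two ways for a linear model: by nudging x and averaging the change in predictions, and in closed form from the coefficients. The simulated data and the names ame_numeric and ame_closed are assumptions for illustration; this is not how margins is implemented internally, only the quantity it is meant to summarize.

```r
set.seed(7)
df <- data.frame(
  x     = runif(80),
  group = factor(rep(c("a", "b"), times = c(30, 50)))
)
df$y <- 1 + 2 * df$x + ifelse(df$group == "b", 1.5 * df$x, 0) +
  rnorm(80, sd = 0.2)

model <- lm(y ~ x * group, data = df)

# Numerical version: shift x by a small step and average the slope
h <- 1e-4
shifted <- transform(df, x = x + h)
ame_numeric <- mean((predict(model, shifted) - predict(model, df)) / h)

# Closed form: slope in group a, plus the extra slope in group b
# weighted by the share of observations in group b
b <- coef(model)
ame_closed <- b[["x"]] + b[["x:groupb"]] * mean(df$group == "b")

c(numeric = ame_numeric, closed_form = ame_closed)
```

The two numbers agree up to floating-point error, which shows that the average marginal effect of x is a weighted blend of the group-specific slopes.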
Factor levels in unexpected order
If you read 'low', 'medium', 'high' as character data, factor() defaults to alphabetical levels: high, low, medium. Always specify levels = ... explicitly when order matters.
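A short sketch of the pitfall, using a made-up character vector x:

```r
x <- c("low", "medium", "high", "low")

# Default: levels are the sorted unique values
levels(factor(x))        # "high" "low" "medium"
as.integer(factor(x))    # 2 3 1 2, so "high" is coded 1

# Explicit levels give the intended order
f <- factor(x, levels = c("low", "medium", "high"))
levels(f)                # "low" "medium" "high"
as.integer(f)            # 1 2 3 1

# Values missing from levels silently become NA
factor(c("low", "Medium"), levels = c("low", "medium", "high"))
```

The last line is worth remembering: a typo or a capitalization mismatch in the data produces NA without a warning.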
Exercise
Convert c('low', 'high', 'medium') to a factor with levels ordered low, medium, high.