Factors and model formulas — R Module 10 | LeadAfrik Public Economics Hub

Factors are R's way of representing categorical data. Model formulas (the y ~ x DSL) are how regressions are specified. How a factor is coded decides which coefficients a formula produces and what they mean, so reading regression output correctly in R depends on understanding the two together.

Factors — categorical with levels

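x <- c("low", "high", "low", "medium", "high")
# Declare the levels explicitly; ordered = TRUE records that low < medium < high
f <- factor(x, levels = c("low", "medium", "high"), ordered = TRUE)
levels(f)  # "low" "medium" "high"
table(f)   # counts per level: low 2, medium 1, high 2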

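The order of levels matters: for an unordered factor, the first level is the reference category in regression. lm() with such a predictor estimates one coefficient per non-reference level, each capturing the difference from the reference. An ordered factor (ordered = TRUE, as in the code above) is handled differently: by default R codes it with polynomial contrasts, so its coefficients describe linear and quadratic trends across the levels rather than differences from a reference.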
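To make the reference-category idea concrete, here is a minimal sketch on simulated data. The names df_demo, income and band are made up for illustration, and the factor is deliberately unordered so that treatment coding applies.

```r
# Simulated data: df_demo, income and band are illustrative names only.
set.seed(1)
band <- factor(rep(c("low", "medium", "high"), each = 20),
               levels = c("low", "medium", "high"))  # unordered, "low" first
group_means <- c(low = 10, medium = 15, high = 25)   # arbitrary values
income <- unname(group_means[as.character(band)]) + rnorm(60)
df_demo <- data.frame(income = income, band = band)

fit <- lm(income ~ band, data = df_demo)
names(coef(fit))   # "(Intercept)" "bandmedium" "bandhigh"

# Change the reference category without touching the data values
df_demo$band2 <- relevel(df_demo$band, ref = "high")
fit2 <- lm(income ~ band2, data = df_demo)
names(coef(fit2))  # "(Intercept)" "band2low" "band2medium"
```

In the first fit the intercept is the mean of the 'low' group and the other two coefficients are differences from it. After relevel(), 'high' plays that role instead.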

Formula DSL

y ~ x — simple regression
y ~ x1 + x2 — multiple regression
y ~ x1 + x2 + x1:x2 — with interaction
y ~ x1 * x2 — main effects + interaction (shorthand for above)
y ~ . — all other variables in the data frame
y ~ x1 - 1 — no intercept
y ~ I(x^2) — use the literal expression (I() escapes formula special meaning)
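One way to check what a formula expands to is to look at the column names of its design matrix. The sketch below uses base R's model.matrix() on a made-up data frame (toy, x1 and x2 are illustrative names, not part of this module's data).

```r
# Toy data frame; toy, x1 and x2 are illustrative names.
toy <- data.frame(y = c(1, 3, 2, 5), x1 = c(0, 1, 2, 3), x2 = c(1, 0, 1, 0))

colnames(model.matrix(y ~ x1 * x2, data = toy))   # "(Intercept)" "x1" "x2" "x1:x2"
colnames(model.matrix(y ~ ., data = toy))         # "(Intercept)" "x1" "x2"
colnames(model.matrix(y ~ x1 - 1, data = toy))    # "x1"
colnames(model.matrix(y ~ I(x1^2), data = toy))   # "(Intercept)" "I(x1^2)"

# Without I(), ^ means formula crossing, so x1^2 collapses to plain x1
colnames(model.matrix(y ~ x1^2, data = toy))      # "(Intercept)" "x1"
```

The last line shows why I() is needed: inside a formula, ^ is an operator on terms, not arithmetic.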

Interactions in detail

# Interact a continuous and a categorical
lm(y ~ x * group, data = df)
# Equivalent to: y ~ x + group + x:group
# Reads as: 'effect of x is allowed to differ by group'
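A short simulation can show how the interaction coefficient translates into group-specific slopes. The data frame sim and the true slopes (1 in group a, 3 in group b) are assumptions chosen purely for illustration.

```r
set.seed(42)
n <- 100
sim <- data.frame(
  x = runif(2 * n),
  group = factor(rep(c("a", "b"), each = n))
)
# True slope is 1 in group a and 3 in group b (illustrative values)
sim$y <- ifelse(sim$group == "a", 1, 3) * sim$x + rnorm(2 * n, sd = 0.1)

fit_long  <- lm(y ~ x + group + x:group, data = sim)
fit_short <- lm(y ~ x * group, data = sim)
all.equal(coef(fit_long), coef(fit_short))   # TRUE: the two formulas give the same model

b <- coef(fit_short)
slope_a <- b[["x"]]                     # slope of x in reference group a
slope_b <- b[["x"]] + b[["x:groupb"]]   # slope of x in group b
c(a = slope_a, b = slope_b)
```

The coefficient on x:groupb is not the slope in group b itself; it is the difference between the group b slope and the group a slope.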

Contrasts — what the coefficients mean

By default, R uses 'treatment contrasts' for unordered factors: the first factor level is the reference, and each remaining coefficient gives the difference from that reference. Ordered factors default to polynomial contrasts instead. You can switch to 'sum-to-zero' contrasts or other codings with the contrasts() function, or through the contrasts argument of lm().
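The coding behind each scheme can be inspected directly. The following sketch uses only base R; the factor names g, g_sum and g_ord are illustrative.

```r
g <- factor(c("low", "medium", "high"), levels = c("low", "medium", "high"))

contrasts(g)               # treatment coding: columns "medium" and "high"
options("contrasts")       # the session defaults for unordered and ordered factors

# Switch this one factor to sum-to-zero coding
g_sum <- g
contrasts(g_sum) <- contr.sum(3)
colSums(contrasts(g_sum))  # each column sums to 0

# An ordered factor picks up polynomial contrasts by default
g_ord <- as.ordered(g)
colnames(contrasts(g_ord)) # ".L" ".Q"
```

With sum-to-zero coding the coefficients are deviations from the average of the level means rather than from a reference level, so the same data produce different numbers with a different interpretation.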

Margins — making coefficients human-readable

library(margins)  # add-on package, must be installed separately
model <- lm(y ~ x * group, data = df)
margins(model)            # average marginal effects
margins(model, at = list(x = c(0, 1, 2)))  # marginal effects with x held at 0, 1 and 2
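For a linear model with an interaction, the average marginal effect of x can also be worked out by hand in base R, which helps in reading the package output. This is a sketch on simulated data (d, m and the true coefficients are assumptions for illustration), not a description of how the margins package computes its results internally.

```r
# Simulated data; d, x, group and y are illustrative names.
set.seed(7)
d <- data.frame(x = rnorm(200),
                group = factor(sample(c("a", "b"), 200, replace = TRUE)))
d$y <- 1 + 2 * d$x + 0.5 * (d$group == "b") * d$x + rnorm(200)

m <- lm(y ~ x * group, data = d)
b <- coef(m)

# dy/dx for each observation: b_x, plus the interaction term when group == "b"
dydx <- b[["x"]] + b[["x:groupb"]] * (d$group == "b")
ame_analytic <- mean(dydx)   # average marginal effect of x over the sample

# Cross-check with a small finite difference on the fitted values
h <- 1e-6
d_shift <- transform(d, x = x + h)
ame_numeric <- mean((predict(m, d_shift) - predict(m, d)) / h)

c(analytic = ame_analytic, numeric = ame_numeric)
```

The two numbers should agree up to numerical error. The average is a weighted mix of the two group slopes, with weights given by how many observations fall in each group.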

Factor levels in unexpected order

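If 'low', 'medium', 'high' arrive as character data, factor() sorts the unique values, which gives the alphabetical levels high, low, medium. 'high' then becomes the reference category in a regression, which is rarely what you want. Always specify levels = ... explicitly when order matters.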
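The difference is easy to see side by side. In this small sketch the vector sizes is made up for illustration.

```r
sizes <- c("low", "medium", "high", "low")

default_f <- factor(sizes)
levels(default_f)    # "high" "low" "medium"

explicit_f <- factor(sizes, levels = c("low", "medium", "high"))
levels(explicit_f)   # "low" "medium" "high"

# The integer codes behind the labels differ as a result
as.integer(default_f)    # 2 3 1 2
as.integer(explicit_f)   # 1 2 3 1
```

The labels printed for each observation are the same in both cases, so the problem is easy to miss until the regression output shows an unexpected reference level.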

Exercise

Convert c('low', 'high', 'medium') to a factor with levels ordered low, medium, high.