Sometimes the dependent variable doesn't fit the linear-regression mould. It's binary (defaults yes/no), categorical (which insurance plan), censored (wages are observed only for employed people), count (number of doctor visits), or bounded (a share or rate between 0 and 1). Each calls for different machinery.
The linear probability model
When y is binary (0/1), running OLS gives you the linear probability model:
Pr(yᵢ = 1 | xᵢ) = β₀ + β₁ x₁ᵢ + ... + βₖ xₖᵢ
Coefficients are marginal effects on probability. Easy to interpret. But two issues:
- Predicted probabilities can fall outside [0, 1] — economically meaningless
- Errors are heteroskedastic by construction: Var(y | x) = p(1 − p) with p = Pr(y = 1 | x), so the variance changes with x
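Both issues show up even in a toy example. The sketch below fits an LPM by hand using the closed-form simple-regression formulas; the ten data points are made up for illustration and the variable names are hypothetical.

```python
# Linear probability model by hand: OLS of a binary y on one regressor.
# The data are invented for illustration only.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form simple-regression estimates.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta1 = s_xy / s_xx
beta0 = y_bar - beta1 * x_bar

fitted = [beta0 + beta1 * xi for xi in x]

# Issue 1: fitted "probabilities" can leave the unit interval.
out_of_range = [(xi, round(p, 3)) for xi, p in zip(x, fitted) if p < 0 or p > 1]

# Issue 2: the implied error variance p(1 - p) differs across observations.
implied_var = [p * (1 - p) for p in fitted if 0 <= p <= 1]

print(f"beta0 = {beta0:.3f}, beta1 = {beta1:.3f}")
print("fitted values outside [0, 1]:", out_of_range)
print("implied variance ranges from",
      round(min(implied_var), 3), "to", round(max(implied_var), 3))
```

In applied work you would normally let a regression package do the fitting and request heteroskedasticity-robust standard errors; the hand-rolled version here only makes the mechanics visible.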
LPM is more popular than it looks
Despite the issues, modern empirical economists often use LPM with robust SEs because: (1) coefficients ARE marginal effects without post-estimation transformations, (2) it plays nicely with fixed effects and IV, (3) substantive results are usually similar to logit/probit. The issues with predicted-probabilities-outside-[0,1] are mostly aesthetic in causal-inference contexts.
Logit
Constrains predicted probabilities to [0, 1] via the logistic function:
Pr(yᵢ = 1 | xᵢ) = exp(xᵢβ) / (1 + exp(xᵢβ))
Estimated by maximum likelihood. The coefficient β is a log-odds ratio, not a marginal effect on probability; exp(β) is the odds ratio. The marginal effect on probability is the derivative ∂P/∂xⱼ = βⱼ · P(1 − P), which varies with x, so it must be evaluated at a specific value of x or averaged over the sample.
Marginal effects: MEM vs AME
- Marginal effect at the means (MEM): plug in mean values of x, compute ∂P/∂x there
- Average marginal effect (AME): compute ∂P/∂x for each observation, then average
- AME is generally preferred — robust to skewed regressors and binary controls
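To see why the two can differ, here is a minimal sketch that computes both for a one-regressor logit. The coefficients and the skewed sample of x are assumptions chosen for illustration, not estimates from real data; the derivative used is the standard logit result ∂P/∂x = β · P(1 − P).

```python
import math

def logistic(z):
    """Logistic CDF."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted logit: intercept and one slope (illustrative values).
beta0, beta1 = -1.0, 0.8
x = [0, 0, 0, 1, 1, 2, 3, 10]  # deliberately skewed regressor

def marginal_effect(xi):
    """dP/dx for a logit at a given x: beta1 * p * (1 - p)."""
    p = logistic(beta0 + beta1 * xi)
    return beta1 * p * (1.0 - p)

# MEM: evaluate the derivative once, at the sample mean of x.
x_mean = sum(x) / len(x)
mem = marginal_effect(x_mean)

# AME: evaluate the derivative for every observation, then average.
ame = sum(marginal_effect(xi) for xi in x) / len(x)

print(f"mean of x = {x_mean:.3f}")
print(f"MEM = {mem:.4f}")
print(f"AME = {ame:.4f}")
```

With a skewed regressor the sample mean describes no actual observation, so the MEM is evaluated at a point where few people sit, while the AME averages over the observations that are really in the data.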
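In Stata: margins, dydx(*) after logit (AME by default; add the atmeans option for MEM). In R: marginaleffects::avg_slopes(model). In Python: results.get_margeff() in statsmodels, which reports the AME by default (at='overall').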
Probit
Same idea as logit but with the standard-normal CDF instead of the logistic:
Pr(yᵢ = 1 | xᵢ) = Φ(xᵢβ)
Logit and probit give nearly identical fits in practice. Choice is largely tradition: probit is common in IO and macro; logit dominates in epidemiology and machine learning. Coefficients are NOT directly comparable — logit β ≈ 1.6 × probit β as a rule of thumb.
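The rule of thumb can be checked numerically. The sketch below compares the standard-normal CDF at z with the logistic CDF at 1.6 × z over a grid; the grid range and the 1.6 scale factor are taken as assumptions from the rule of thumb, not fitted values.

```python
import math
from statistics import NormalDist

def logistic_cdf(z):
    """Logistic CDF."""
    return 1.0 / (1.0 + math.exp(-z))

std_normal = NormalDist()  # mean 0, standard deviation 1
scale = 1.6                # rule-of-thumb ratio of logit to probit coefficients

# Evaluate both CDFs on a grid of probit index values from -4 to 4.
grid = [i / 100 for i in range(-400, 401)]
gaps = [abs(logistic_cdf(scale * z) - std_normal.cdf(z)) for z in grid]
max_gap = max(gaps)
z_at_max = grid[gaps.index(max_gap)]

print(f"largest gap in predicted probability: {max_gap:.4f} at z = {z_at_max:+.2f}")
```

On this grid the two curves never differ by as much as 0.02 in predicted probability, which is why the fits look so similar once coefficients are rescaled.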
Tobit and censored regression
Use these when the latent variable is continuous but y is observed only above (or below) a threshold, so values beyond the threshold pile up at the limit. Examples:
- Wage offers below reservation wage → person is unemployed, hours = 0
- Tax-deductible expenses below threshold are not reported
- Loan amounts below minimum aren't disbursed
- Test scores ceiling- or floor-censored
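Tobit treats the censoring threshold as known and estimates the coefficients β and the error standard deviation σ jointly by maximum likelihood: censored observations contribute the probability of falling beyond the threshold, and uncensored observations contribute the normal density. The key assumption, a normal and homoskedastic latent error, is more restrictive than in OLS: if it fails, the Tobit estimates are inconsistent, not merely inefficient. The Heckman two-step (selection model) is an alternative when censoring is selection-driven (we observe wages only for those who chose to work, and that choice depends on potential wages).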
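The structure of the Tobit likelihood is easiest to see written out. This is a minimal sketch, assuming left-censoring at a limit of 0 that is treated as known, one regressor, and a normal latent error; the function name, data, and parameter values are all hypothetical. A real application would maximize this function numerically rather than compare two hand-picked parameter vectors.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def tobit_loglik(beta0, beta1, sigma, xs, ys, lower=0.0):
    """Log-likelihood of a Tobit model left-censored at a known limit."""
    total = 0.0
    for xi, yi in zip(xs, ys):
        index = beta0 + beta1 * xi
        if yi <= lower:
            # Censored: we only know the latent value fell at or below the limit.
            total += math.log(STD_NORMAL.cdf((lower - index) / sigma))
        else:
            # Uncensored: ordinary normal density of the residual.
            z = (yi - index) / sigma
            total += math.log(STD_NORMAL.pdf(z)) - math.log(sigma)
    return total

# Invented data: the first two outcomes are censored at 0.
xs = [0, 1, 2, 3, 4]
ys = [0.0, 0.0, 1.0, 2.1, 2.9]

ll_good = tobit_loglik(-1.0, 1.0, 0.5, xs, ys)  # close to the pattern in the data
ll_bad = tobit_loglik(0.0, 0.0, 0.5, xs, ys)    # ignores x entirely

print(f"log-likelihood, plausible parameters: {ll_good:.3f}")
print(f"log-likelihood, poor parameters:      {ll_bad:.3f}")
```

The two branches are the whole idea: a probability mass for censored observations and a density for the rest, both driven by the same β and σ.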
Multinomial logit
When y is unordered categorical with K > 2 outcomes. Travel mode (car / bus / bike / walk), party voted for, industry of employment. One category is the baseline; K−1 sets of coefficients describe relative log-odds.
IIA: independence of irrelevant alternatives
MNL assumes that adding or removing alternatives doesn't change the relative odds among the others. The classic counterexample is the red bus / blue bus problem. If commuters are indifferent between otherwise identical buses, adding a blue bus to a market with car and red bus should draw riders from the red bus only and leave the car share unchanged. MNL instead predicts that the new bus takes share proportionally from both car and red bus. Nested logit and mixed logit (random coefficients) relax IIA.
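The red bus / blue bus problem takes only a few lines to reproduce. In this sketch all alternatives are assigned the same utility of 0, which is an assumption made to keep the arithmetic transparent; the function name is hypothetical.

```python
import math

def mnl_probs(utilities):
    """Multinomial-logit choice probabilities from a dict of utilities."""
    denom = sum(math.exp(v) for v in utilities.values())
    return {alt: math.exp(v) / denom for alt, v in utilities.items()}

# Market with two equally attractive alternatives.
before = mnl_probs({"car": 0.0, "red_bus": 0.0})

# Add a blue bus that is identical to the red bus in every respect.
after = mnl_probs({"car": 0.0, "red_bus": 0.0, "blue_bus": 0.0})

odds_before = before["car"] / before["red_bus"]
odds_after = after["car"] / after["red_bus"]

print("before:", {k: round(v, 3) for k, v in before.items()})
print("after: ", {k: round(v, 3) for k, v in after.items()})
print("car / red-bus odds before and after:", odds_before, odds_after)
```

MNL splits the market into thirds and keeps the car-to-red-bus odds fixed, whereas intuition says the car share should stay at one half with the two buses splitting the other half.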
Ordered logit/probit
When y is ordered categorical (Likert: strongly disagree → strongly agree; bond ratings: AAA → D). The model estimates a single β vector plus K−1 cutoffs. Parallel-regressions (proportional-odds) assumption: the same β applies across all category cutoffs. Test it with the Brant test (ordered logit); if it fails, fall back to a generalized ordered logit or to multinomial logit and accept the loss of efficiency.
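A short sketch shows how one index and K−1 cutoffs turn into K category probabilities. It uses the common parameterization Pr(y ≤ k | x) = Λ(κₖ − xβ), where Λ is the logistic CDF; the cutoff values and index values are assumptions for illustration.

```python
import math

def logistic(z):
    """Logistic CDF."""
    return 1.0 / (1.0 + math.exp(-z))

def ordered_logit_probs(index, cutoffs):
    """Category probabilities given the index x*beta and increasing cutoffs."""
    # Cumulative probabilities Pr(y <= k); the top category closes at 1.
    cumulative = [logistic(c - index) for c in cutoffs] + [1.0]
    probs = []
    previous = 0.0
    for cum in cumulative:
        probs.append(cum - previous)
        previous = cum
    return probs

cutoffs = [-1.0, 0.0, 1.5]  # K - 1 = 3 cutoffs, so K = 4 ordered categories

low = ordered_logit_probs(-1.0, cutoffs)   # low index: mass on bottom categories
high = ordered_logit_probs(2.0, cutoffs)   # high index: mass on top categories

print("index = -1.0:", [round(p, 3) for p in low])
print("index = +2.0:", [round(p, 3) for p in high])
```

The parallel-regressions assumption is visible in the code: the same index shifts every cumulative probability, and only the cutoffs differ across categories.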
Count data: Poisson and negative binomial
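When y is a non-negative integer count (doctor visits, patents, accidents). Poisson regression assumes the conditional mean and variance are equal (equidispersion). Negative binomial relaxes this by allowing overdispersion (var > mean), which is almost universally observed in real count data. Poisson coefficient estimates remain consistent under overdispersion as long as the conditional mean is correctly specified, but the default standard errors are too small, so use robust SEs if you keep Poisson. Zero-inflated variants handle excess zeros (the population that never visits a doctor regardless of x).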
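A quick descriptive check for overdispersion and excess zeros can be sketched as follows. The visit counts are invented, and the check is unconditional (it ignores x), so treat it as a first look rather than a formal test of the model's conditional assumption.

```python
import math
from statistics import mean, variance

# Invented sample of doctor visits per person.
visits = [0, 0, 0, 0, 1, 1, 2, 3, 5, 12]

m = mean(visits)
v = variance(visits)   # sample variance (n - 1 denominator)
dispersion = v / m     # equals 1 under a Poisson with no covariates

# Share of zeros a Poisson with this mean would imply, versus what we see.
poisson_zero_share = math.exp(-m)
observed_zero_share = visits.count(0) / len(visits)

print(f"mean = {m:.2f}, variance = {v:.2f}, variance / mean = {dispersion:.2f}")
print(f"zeros: observed {observed_zero_share:.2f}, "
      f"Poisson-implied {poisson_zero_share:.2f}")
```

A variance-to-mean ratio well above 1 points toward negative binomial or robust standard errors, and an observed zero share far above the Poisson-implied one is the pattern that zero-inflated models are designed for.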
What estimator should you use?
- Binary y, causal inference: linear probability model with robust SEs (allows IV, FE, clean interpretation)
- Binary y, prediction: logit or probit (better calibration of predicted probabilities)
- Categorical y, K outcomes unordered: multinomial logit (test IIA)
- Categorical y, ordered: ordered logit/probit (test parallel regressions)
- Count y: negative binomial (default), Poisson if dispersion is well-behaved
- Censored y, latent continuous: Tobit (or Heckman if selection is the censoring mechanism)
Exercise
You have data on 5,000 SACCO members and want to estimate how monthly contribution affects loan-default probability. The dependent variable is a binary default flag. State (a) which estimator you'd use for causal inference, (b) what marginal effect to report, (c) what robustness check you'd add.