Module 11 of 12 · 60 min read · Beginner

Regression with statsmodels

OLS with statsmodels: the formula API, robust standard errors, prediction, and reading the summary table.

Learning objectives

By the end of this module, you should be able to:

  1. Fit OLS with statsmodels using both the formula API and the matrix API
  2. Read every column of the statsmodels summary table (coef, std err, t, p, CI, R-squared)
  3. Apply HC-robust and cluster-robust standard errors with cov_type
  4. Recognise why scikit-learn's LinearRegression is not the right tool for econometric inference

statsmodels is the workhorse for econometric-style regression in Python. Unlike scikit-learn (which is built for prediction and machine learning), statsmodels gives you the t-statistics, standard errors, R-squared, and diagnostic tests that a regression analysis actually requires.

OLS — the formula API

python
import statsmodels.formula.api as smf
model = smf.ols('lending_rate ~ deposit_rate + month', data=bankrates).fit()
print(model.summary())

The R-style formula syntax (y ~ x1 + x2) is concise and readable. statsmodels parses the formula, builds the design matrix, fits OLS, and returns a results object with everything you'll need for inference.
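To make the call above runnable without the course dataset, here is a minimal sketch that builds a synthetic stand-in for bankrates. The column values and the data-generating process (intercept 3.0, slope 1.2) are assumptions made up for illustration, not the real data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the course's bankrates data (all values are made up).
rng = np.random.default_rng(0)
n = 120
bankrates = pd.DataFrame({
    "deposit_rate": rng.uniform(1.0, 8.0, size=n),
    "month": np.arange(n),
})
# Assumed data-generating process: intercept 3.0, slope 1.2, a small time trend.
bankrates["lending_rate"] = (
    3.0
    + 1.2 * bankrates["deposit_rate"]
    + 0.01 * bankrates["month"]
    + rng.normal(0.0, 0.5, size=n)
)

model = smf.ols("lending_rate ~ deposit_rate + month", data=bankrates).fit()

# The formula API adds the intercept column to the design matrix for you.
design_columns = list(model.model.exog_names)
print(design_columns)
print(model.params)
```

Note that the formula never mentions an intercept, yet the design matrix contains one. Writing 'lending_rate ~ deposit_rate + month - 1' would drop it.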

Reading the summary table

  • coef: point estimate for each term (the intercept and each slope)
  • std err: standard error of the estimate
  • t: t-statistic = coef / std err
  • P>|t|: two-sided p-value (probability of a |t| at least this large under H0: coef = 0)
  • [0.025 0.975]: lower and upper bounds of the 95% confidence interval for the coefficient
  • R-squared: fraction of the variance in y explained by the model
  • F-statistic: joint test that all slope coefficients are zero (the intercept is not part of the test)
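Every entry in the summary table is also available as an attribute of the results object. The sketch below rebuilds the coefficient table by hand on synthetic data (the data and the names df, res, and table are assumptions for illustration) and checks that the t column really is coef divided by std err.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data, made up for illustration.
rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = 1.0 + 2.0 * df["x"] + rng.normal(size=200)

res = smf.ols("y ~ x", data=df).fit()

# Rebuild the coefficient block of the summary table from result attributes.
table = pd.DataFrame({
    "coef": res.params,
    "std err": res.bse,
    "t": res.tvalues,
    "P>|t|": res.pvalues,
})
ci = res.conf_int(alpha=0.05)  # columns 0 and 1 hold the lower and upper bounds
table["[0.025"] = ci[0]
table["0.975]"] = ci[1]
print(table)

print("R-squared:", res.rsquared)
print("F-statistic:", res.fvalue, "Prob (F-statistic):", res.f_pvalue)

# The t column is the coefficient divided by its standard error.
t_by_hand = res.params / res.bse
```

With a single regressor and the default (non-robust) covariance, the F-statistic equals the square of the slope's t-statistic, which is a handy sanity check when reading the table.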

Robust standard errors

python
# HC3: heteroskedasticity-robust standard errors
robust = smf.ols('y ~ x', data=df).fit(cov_type='HC3')
# Cluster-robust: allows errors to be correlated within each firm
clustered = smf.ols('y ~ x', data=df).fit(
    cov_type='cluster',
    cov_kwds={'groups': df['firm_id']}
)
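Changing cov_type changes only the standard errors (and everything derived from them), not the coefficients. The following sketch illustrates this on synthetic panel-like data. The firm structure, the heteroskedastic noise, and all variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: 40 hypothetical firms with 10 observations each.
rng = np.random.default_rng(2)
n_firms, n_per_firm = 40, 10
firm_id = np.repeat(np.arange(n_firms), n_per_firm)
x = rng.normal(size=n_firms * n_per_firm)
firm_shock = rng.normal(size=n_firms)[firm_id]       # shared within each firm
noise = rng.normal(size=x.size) * (0.5 + np.abs(x))  # variance grows with |x|
df = pd.DataFrame({
    "firm_id": firm_id,
    "x": x,
    "y": 1.0 + 2.0 * x + firm_shock + noise,
})

classical = smf.ols("y ~ x", data=df).fit()
robust = smf.ols("y ~ x", data=df).fit(cov_type="HC3")
clustered = smf.ols("y ~ x", data=df).fit(
    cov_type="cluster",
    cov_kwds={"groups": df["firm_id"]},
)

# Same point estimates, different standard errors.
comparison = pd.DataFrame({
    "classical": classical.bse,
    "HC3": robust.bse,
    "cluster": clustered.bse,
})
print(comparison)
```

Because the coefficients are identical across the three fits, any change in the t-statistics or p-values comes entirely from the standard errors.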

Predictions

python
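# new_data: a DataFrame with the same regressor columns used in the formula
model.predict(new_data)  # point predictions only
model.get_prediction(new_data).summary_frame()  # adds confidence and prediction intervals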
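As a minimal sketch of what the two prediction calls return, the example below fits a model on synthetic data and predicts at three hypothetical x values. The data and the names df and new_data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic training data, made up for illustration.
rng = np.random.default_rng(3)
df = pd.DataFrame({"x": rng.uniform(0, 10, size=100)})
df["y"] = 0.5 + 1.5 * df["x"] + rng.normal(size=100)
model = smf.ols("y ~ x", data=df).fit()

# New observations need the same regressor columns as the formula.
new_data = pd.DataFrame({"x": [2.0, 5.0, 8.0]})

point = model.predict(new_data)  # point predictions only
frame = model.get_prediction(new_data).summary_frame(alpha=0.05)
print(frame)
```

In the resulting frame, the mean_ci_* columns give the confidence interval for the expected value of y, while the obs_ci_* columns give the wider prediction interval for a single new observation.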

The matrix API for full control

If you need full control (custom design matrix, weights, etc.), use the matrix API instead of the formula API.

python
import statsmodels.api as sm
X = sm.add_constant(df[['x1', 'x2']])
y = df['y']
model = sm.OLS(y, X).fit()

scikit-learn LinearRegression is not for econometrics

scikit-learn's LinearRegression reports no standard errors, t-statistics, p-values, or confidence intervals. Its score() method does return R-squared, but there is no summary table and no built-in inference. It's built for prediction, not inference. For econometric work, reach for statsmodels.

Exercise

Run an OLS regression of lending_rate on deposit_rate using bankrates and print the summary.

Key takeaways

  • statsmodels.formula.api is the cleanest path: smf.ols('y ~ x + z', data=df).fit()
  • Always inspect coefficients, SEs, t-stats, p-values, R-squared, and the F-statistic
  • Robust SEs: cov_type='HC3' for heteroskedasticity, cov_type='cluster' for clustered groups
  • scikit-learn is for prediction; statsmodels is for inference — pick the right tool

Further reading

  1. 01
  2. 02

    An Introduction to Statistical Learning with Applications in Python

    Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani & Jonathan Taylor · Springer · 2023

  3. 03

    Mostly Harmless Econometrics: An Empiricist's Companion

    Joshua Angrist & Jörn-Steffen Pischke · Princeton University Press · 2009

    Language-agnostic but essential intuition for any regression analyst.

LeadAfrik Public Economics Hub