Regression with statsmodels — Python Module 11 | LeadAfrik Public Economics Hub

statsmodels is the workhorse for econometric-style regression in Python. Unlike scikit-learn (which is built for prediction and machine learning), statsmodels gives you the t-statistics, standard errors, R-squared, and diagnostic tests that a regression analysis actually requires.

OLS — the formula API

python

import statsmodels.formula.api as smf

model = smf.ols('lending_rate ~ deposit_rate + month', data=bankrates).fit()
print(model.summary())

The R-style formula syntax (y ~ x1 + x2) is concise and readable. statsmodels parses the formula, builds the design matrix, fits OLS, and returns a results object with everything you'll need for inference.
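
One detail of the formula parser is worth seeing in action: a numeric column enters as a single linear term, while wrapping it in C() requests a set of dummy variables. The sketch below illustrates this on simulated data. The bankrates frame here is a hypothetical stand-in with made-up values, not the course dataset, and the coefficients used to generate it are arbitrary.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in for the bankrates data: all values are simulated.
rng = np.random.default_rng(0)
n = 120
bankrates = pd.DataFrame({
    "deposit_rate": rng.uniform(2.0, 10.0, size=n),
    "month": np.tile(np.arange(1, 13), n // 12),
})
bankrates["lending_rate"] = (
    5.0 + 1.2 * bankrates["deposit_rate"] + rng.normal(0.0, 1.0, size=n)
)

# A numeric month column enters the design matrix as one linear term.
linear = smf.ols("lending_rate ~ deposit_rate + month", data=bankrates).fit()

# C(month) asks the parser for month dummies; one level is dropped as the baseline.
dummies = smf.ols("lending_rate ~ deposit_rate + C(month)", data=bankrates).fit()

print(linear.params.index.tolist())
print(len(dummies.params))
```

Whether month should be a trend or a set of seasonal dummies is a modelling decision; the point here is only that the formula string controls how the design matrix is built.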

Reading the summary table

coef: point estimate of each coefficient (the intercept and each slope)
std err: standard error of the estimate
t: t-statistic = coef / std err
P>|t|: two-sided p-value (probability of a |t| at least this large under H0: coef = 0)
[0.025 0.975]: 95% confidence interval for the coefficient
R-squared: fraction of variance in y explained by the model
F-statistic: joint test that all slope coefficients (everything except the intercept) equal 0
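
Each entry in the summary table is also available as an attribute of the results object, which helps when you want the numbers in a DataFrame rather than in printed text. The following is a minimal sketch on simulated data; the column names x and y and the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data, for illustration only
rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = 1.0 + 2.0 * df["x"] + rng.normal(size=200)

res = smf.ols("y ~ x", data=df).fit()

# Rebuild the coefficient table from the results attributes
table = pd.DataFrame({
    "coef": res.params,
    "std err": res.bse,
    "t": res.tvalues,
    "P>|t|": res.pvalues,
})
ci = res.conf_int(alpha=0.05)  # columns 0 and 1 hold the lower and upper bounds
table["[0.025"] = ci[0]
table["0.975]"] = ci[1]

print(table)
print("R-squared:", res.rsquared)
print("F-statistic:", res.fvalue, "p-value:", res.f_pvalue)

# The t column is simply coef divided by std err
ratio = res.params / res.bse
```

With a single regressor and the default covariance, the F-statistic equals the square of the slope's t-statistic, which is a quick sanity check on how the two tests relate.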

Robust standard errors

python

# HC3 robust to heteroskedasticity
robust = smf.ols('y ~ x', data=df).fit(cov_type='HC3')

# Cluster-robust
clustered = smf.ols('y ~ x', data=df).fit(
    cov_type='cluster',
    cov_kwds={'groups': df['firm_id']}
)
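
To see what the cov_type argument changes, the sketch below fits the same model three ways on simulated data whose noise grows with x. The data, the firm_id grouping, and the variable names are assumptions made for illustration. The coefficient estimates are identical across the three fits; only the standard errors (and everything derived from them) differ.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated panel: 40 hypothetical firms with 10 observations each
rng = np.random.default_rng(2)
n_firms, n_obs = 40, 10
df = pd.DataFrame({
    "firm_id": np.repeat(np.arange(n_firms), n_obs),
    "x": rng.uniform(1.0, 5.0, size=n_firms * n_obs),
})
# Noise whose spread grows with x, so the errors are heteroskedastic by construction
df["y"] = 1.0 + 0.5 * df["x"] + rng.normal(size=len(df)) * df["x"]

classical = smf.ols("y ~ x", data=df).fit()
robust = smf.ols("y ~ x", data=df).fit(cov_type="HC3")
clustered = smf.ols("y ~ x", data=df).fit(
    cov_type="cluster",
    cov_kwds={"groups": df["firm_id"]},
)

# Same point estimates, different standard errors
print(pd.DataFrame({
    "classical": classical.bse,
    "HC3": robust.bse,
    "cluster": clustered.bse,
}))
```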

Predictions

python

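# new_data: a DataFrame containing the regressors named in the formula
model.predict(new_data)  # point predictions
model.get_prediction(new_data).summary_frame()  # adds confidence and prediction intervals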
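
The two calls above can be tried end to end with the sketch below, which uses simulated data and an illustrative new_data frame (both are assumptions, not part of the course materials). For an OLS fit, summary_frame() reports two intervals: mean_ci_* for the expected value of y at the given x, and obs_ci_* for a single new observation, which is always wider.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated training data, for illustration only
rng = np.random.default_rng(3)
df = pd.DataFrame({"x": rng.uniform(0.0, 10.0, size=100)})
df["y"] = 2.0 + 0.8 * df["x"] + rng.normal(size=100)
model = smf.ols("y ~ x", data=df).fit()

# New observations must use the same column names as the formula
new_data = pd.DataFrame({"x": [2.0, 5.0, 8.0]})

point = model.predict(new_data)
frame = model.get_prediction(new_data).summary_frame(alpha=0.05)

print(frame[["mean", "mean_ci_lower", "mean_ci_upper",
             "obs_ci_lower", "obs_ci_upper"]])
```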

The matrix API for full control

If you need full control (custom design matrix, weights, etc.), use the matrix API instead of the formula API.

python

import statsmodels.api as sm

X = sm.add_constant(df[['x1', 'x2']])  # sm.OLS does not add an intercept on its own
y = df['y']
model = sm.OLS(y, X).fit()

scikit-learn LinearRegression is not for econometrics

scikit-learn's LinearRegression reports no standard errors, no p-values, and no confidence intervals. Its score() method does return R-squared, but that is the only one of the statistics above it provides. It's built for prediction, not inference. For econometric inference, reach for statsmodels.

Exercise

Run an OLS regression of lending_rate on deposit_rate using bankrates and print the summary.