statsmodels is the workhorse for econometric-style regression in Python. Unlike scikit-learn (which is built for prediction and machine learning), statsmodels gives you the t-statistics, standard errors, R-squared, and diagnostic tests that a regression analysis actually requires.
OLS — the formula API
import statsmodels.formula.api as smf
model = smf.ols('lending_rate ~ deposit_rate + month', data=bankrates).fit()
print(model.summary())
The R-style formula syntax (y ~ x1 + x2) is concise and readable. statsmodels parses the formula, builds the design matrix, fits OLS, and returns a results object with everything you'll need for inference.
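To see what the formula parser builds, here is a minimal sketch that fits the same formula on a synthetic stand-in for bankrates. The frame `bankrates_demo` and all of its values are invented for illustration; only the column names come from the formula above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for bankrates (made-up numbers, not real data).
rng = np.random.default_rng(0)
n = 120
deposit_rate = rng.uniform(0.5, 5.0, size=n)
month = np.arange(n)
lending_rate = 2.0 + 0.8 * deposit_rate + 0.01 * month + rng.normal(0, 0.3, size=n)
bankrates_demo = pd.DataFrame(
    {"lending_rate": lending_rate, "deposit_rate": deposit_rate, "month": month}
)

model_demo = smf.ols("lending_rate ~ deposit_rate + month", data=bankrates_demo).fit()

# The formula adds an intercept column automatically.
print(model_demo.params.index.tolist())  # ['Intercept', 'deposit_rate', 'month']
# The design matrix has one row per observation and one column per term.
print(model_demo.model.exog.shape)  # (120, 3)
```

Note that the intercept was never written in the formula. To drop it, you would write `y ~ x - 1`.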
Reading the summary table
- coef: point estimate of each coefficient (the intercept and each slope)
- std err: estimated standard error of the coefficient
- t: t-statistic = coef / std err
- P>|t|: two-sided p-value (probability of a |t| at least this large under H0: coef = 0)
- [0.025 0.975]: 95% confidence interval for the coefficient
- R-squared: fraction of the variance in y explained by the model
- F-statistic: joint test that all slope coefficients (everything except the intercept) are zero
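Every number in the summary table is also available as an attribute of the results object, which helps when you want to use the values rather than read them. The sketch below uses synthetic data (the frame `df_demo` is made up) and checks two of the relationships listed above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only.
rng = np.random.default_rng(1)
df_demo = pd.DataFrame({"x": rng.normal(size=200)})
df_demo["y"] = 1.0 + 0.5 * df_demo["x"] + rng.normal(size=200)

res = smf.ols("y ~ x", data=df_demo).fit()

print(res.params)      # coef
print(res.bse)         # std err
print(res.tvalues)     # t
print(res.pvalues)     # P>|t|
print(res.conf_int())  # [0.025 0.975]
print(res.rsquared, res.fvalue, res.f_pvalue)

# The t-statistic is the coefficient divided by its standard error.
assert np.allclose(res.tvalues, res.params / res.bse)
# With a single regressor and default standard errors, F equals the squared slope t.
assert np.isclose(res.fvalue, res.tvalues["x"] ** 2)
```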
Robust standard errors
# HC3 robust to heteroskedasticity
robust = smf.ols('y ~ x', data=df).fit(cov_type='HC3')
# Cluster-robust
clustered = smf.ols('y ~ x', data=df).fit(cov_type='cluster', cov_kwds={'groups': df['firm_id']})
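The choice of cov_type changes the standard errors but not the coefficients. The following sketch illustrates this on a synthetic panel. The frame `panel`, the number of firms, and the error structure are all assumptions chosen to produce heteroskedasticity and within-firm correlation.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic panel: 50 hypothetical firms with 10 observations each.
rng = np.random.default_rng(2)
n_firms, n_obs = 50, 10
firm_id = np.repeat(np.arange(n_firms), n_obs)
x = rng.normal(size=n_firms * n_obs)
# A firm-level shock creates within-cluster correlation;
# noise that scales with |x| creates heteroskedasticity.
firm_shock = rng.normal(size=n_firms)[firm_id]
noise = rng.normal(size=x.size) * (0.5 + np.abs(x))
panel = pd.DataFrame({"firm_id": firm_id, "x": x, "y": 1.0 + 2.0 * x + firm_shock + noise})

plain = smf.ols("y ~ x", data=panel).fit()
hc3 = smf.ols("y ~ x", data=panel).fit(cov_type="HC3")
clustered = smf.ols("y ~ x", data=panel).fit(
    cov_type="cluster", cov_kwds={"groups": panel["firm_id"]}
)

# Same point estimates in all three fits; only the standard errors differ.
comparison = pd.DataFrame(
    {"nonrobust": plain.bse, "HC3": hc3.bse, "cluster": clustered.bse}
)
print(comparison)
```

One thing to watch: the groups array must line up row for row with the data actually used in the fit, so if rows with missing values are dropped you may need to drop them from the grouping column as well.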
Predictions
model.predict(new_data)  # point predictions only
model.get_prediction(new_data).summary_frame()  # with standard errors, confidence intervals, and prediction intervals
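The two calls return different things: predict gives point predictions, while summary_frame adds an interval for the mean response and a wider interval for a new observation. Here is a small sketch on synthetic data, where `train` and `new_points` are made-up frames.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic training data for illustration only.
rng = np.random.default_rng(3)
train = pd.DataFrame({"x": rng.uniform(0, 10, size=100)})
train["y"] = 3.0 + 1.5 * train["x"] + rng.normal(size=100)
fitted = smf.ols("y ~ x", data=train).fit()

# New data needs the same column names that appear in the formula.
new_points = pd.DataFrame({"x": [2.0, 5.0, 8.0]})

point = fitted.predict(new_points)
frame = fitted.get_prediction(new_points).summary_frame(alpha=0.05)

# mean_ci_* bounds the expected value of y; obs_ci_* bounds a single new observation.
print(frame[["mean", "mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])
```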
The matrix API for full control
If you need full control (custom design matrix, weights, etc.), use the matrix API instead of the formula API.
import statsmodels.api as sm
X = sm.add_constant(df[['x1', 'x2']])
y = df['y']
model = sm.OLS(y, X).fit()
scikit-learn LinearRegression is not for econometrics
scikit-learn's LinearRegression reports no standard errors, p-values, or confidence intervals. Its score() method does return R-squared, but there is no summary table and there are no diagnostic tests. It's built for prediction, not inference. For econometrics, always reach for statsmodels.
Exercise
Run an OLS regression of lending_rate on deposit_rate using bankrates and print the summary.