Stata tips & idioms
The patterns, shortcuts, and gotchas that distinguish economists who use Stata fluently from those who fight it. Read once, internalise over a few months.
Reproducibility first
Three habits that make every analysis you ever do replicable. Adopt them once and never look back.
Always work in a do-file
Type interactively to explore, but commit your steps to a `.do` file before you trust the result. Run the do-file from the command line via `do myfile.do`. Future you will thank present you.
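// At the top of every do-file:
clear all
set more off
capture log close
// Build a date stamp such as 20260509 for the log file name
local today : display %tdCCYYNNDD date(c(current_date), "DMY")
log using "analysis_`today'.log", replace text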
Set seed for any randomness
Whenever a command involves randomness (sampling, simulation, bootstrap), set the seed first. Otherwise your results aren't replicable.
set seed 20260509
bsample 100
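A quick way to see what the seed buys you is to draw the same random numbers twice. This is a minimal sketch on a throwaway five-observation dataset; the variable names `u1` and `u2` are made up for illustration, and `clear` will drop whatever data you have in memory.

```stata
clear
set obs 5

// First draw under a fixed seed
set seed 20260509
generate double u1 = runiform()

// Reset to the same seed and draw again
set seed 20260509
generate double u2 = runiform()

// The two draws are identical, observation by observation
assert u1 == u2
```

If you comment out the second `set seed`, the `assert` should fail, which is the situation an unseeded bootstrap puts you in on every run.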
Use relative paths
Set the working directory once at the top of your do-file with `cd "<project_root>"` and use relative paths after. Hard-coded absolute paths break the moment anyone else runs your code.
cd "C:/Users/yourname/projects/kenya-pension"
use "data/raw/pension.dta", clear
Inspecting data before you trust it
Five minutes of inspection prevents five hours of bug-hunting downstream.
Three commands every time you load
Always run `describe`, `summarize`, and `list in 1/5` immediately after `use`. They take three seconds and catch about half the data problems you'd otherwise hit later.
use bankrates.dta, clear
describe
summarize
list in 1/5
Check for missingness explicitly
`misstable summarize` shows the count of missing values per variable. `misstable patterns` shows which combinations of variables are missing together — surprisingly informative.
misstable summarize
misstable patterns
Tabulate every categorical you'll touch
Before you condition on or interact with a categorical variable, `tab` it to see what's actually in there. Most 'unexpected coefficient' surprises trace back to a category you didn't know existed.
tab year
tab year half, missing
if and missing values
Where most subtle Stata bugs live. Get these right and your filters always do what you think.
Missing is the largest value
Stata represents missing as `.` and treats it as larger than any number. This means `if x > 100` includes missing values silently. Always be explicit when you mean missing.
// Wrong: the count includes missings
count if lending > 15
// Right: the count excludes missings
count if lending > 15 & !missing(lending)
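To see the effect on a dataset small enough to check by eye, here is a sketch with three made-up values of `lending`, one of them missing. The numbers are invented for illustration only.

```stata
clear
input lending
12
18
.
end

// Missing sorts above every number, so it passes the > 15 filter
count if lending > 15
display r(N)    // 2: the value 18 and the missing value

// Excluding missing explicitly leaves only the real observation
count if lending > 15 & !missing(lending)
display r(N)    // 1: just the value 18
```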
Use missing() for clarity
`missing(x)` is more readable than `x == .` and works for both numeric and string variables. Use it.
drop if missing(lending)
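One more reason to prefer `missing()`, sketched on a made-up three-row dataset: Stata also has extended missing values (`.a` through `.z`), and a comparison against `.` alone does not match them. The variable `x` is hypothetical.

```stata
clear
input x
1
.
.a
end

// Equality with . matches only the system missing value
count if x == .
display r(N)    // 1

// missing() matches system and extended missing values alike
count if missing(x)
display r(N)    // 2
```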
inrange and inlist
Two functions that beat chains of AND/OR: `inrange(x, lo, hi)` for inclusive ranges and `inlist(x, a, b, c)` for membership. Cleaner and quicker to write.
keep if inrange(year, 2018, 2024)
keep if inlist(year, 2018, 2020, 2022)
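As a check that these functions match the chains they replace, the sketch below builds a toy `year` variable running from 2015 to 2024 (invented for the example) and compares counts from each form.

```stata
clear
set obs 10
generate year = 2014 + _n          // years 2015 through 2024

// inrange() is inclusive at both ends, like >= and <=
count if inrange(year, 2018, 2024)
local n_range = r(N)
count if year >= 2018 & year <= 2024
assert r(N) == `n_range'

// inlist() is equivalent to a chain of == joined by |
count if inlist(year, 2018, 2020, 2022)
local n_list = r(N)
count if year == 2018 | year == 2020 | year == 2022
assert r(N) == `n_list'
```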
Reading regression output fluently
The single most important Stata skill. Practise reading these tables until they're as natural as reading prose.
Always pick a standard error story
Plain SEs assume i.i.d. errors — almost never true. Use `, robust` for cross-sectional data, `, vce(cluster id)` for grouped or panel data. State which you used in the paper.
regress lending deposit, robust
xtset fund_id          // declare the panel variable before using xtreg
xtreg lending deposit, fe vce(cluster fund_id)
Read coefficients in context
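How you read a coefficient involving logs depends on which side is logged. With a logged outcome, 100 times the coefficient is approximately the percentage change in the outcome per unit of the regressor. With a logged regressor and an outcome in levels, the coefficient divided by 100 is approximately the change in the outcome for a 1% increase in the regressor. With both logged, the coefficient is an elasticity. A coefficient on an interaction is the change in slope, not the slope itself. A coefficient on `i.year` is relative to the omitted year.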
regress lending c.deposit##i.year
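To make the "change in slope, not the slope itself" point concrete, here is a sketch using Stata's shipped `auto` dataset as a stand-in for the lending data. The variables `price`, `mpg`, and `foreign` come from that example dataset, not from the analysis above.

```stata
sysuse auto, clear
regress price c.mpg##i.foreign

// Slope of mpg for domestic cars (the base category)
lincom mpg

// Slope of mpg for foreign cars: base slope plus the interaction term
lincom mpg + 1.foreign#c.mpg

// The same two slopes in a single table
margins, dydx(mpg) at(foreign=(0 1))
```

The interaction coefficient on its own is only the difference between the two slopes that `margins` reports.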
Use lincom and test for any combination
Want the sum of two coefficients with its CI? `lincom`. Want to test a joint hypothesis? `test`. Don't compute these by hand — Stata propagates the standard errors correctly.
regress y x1 x2 x3
lincom x1 + x2
test x1 = x2 = 0
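Both commands leave their results in `r()`, so you can pull them into a table or a sentence without retyping numbers. This is a sketch on the shipped `auto` dataset; the regression itself is an arbitrary placeholder chosen only so the code runs.

```stata
sysuse auto, clear
regress price mpg weight length

// Sum of two coefficients, with its standard error and confidence interval
lincom mpg + weight
display "sum = " r(estimate) ", SE = " r(se)

// Joint test that both coefficients are zero
test mpg weight
display "F = " r(F) ", p = " r(p)
```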
Reporting that survives review
How to produce tables and figures that stand up to a referee report.
Use esttab for publication tables
Install once with `ssc install estout`. Then store estimates with `estimates store m1`, repeat for each model, and `esttab m1 m2 m3 using table.tex, ...` exports a clean side-by-side table in LaTeX, RTF, or HTML.
regress y x1, robust            // first model (illustrative specification)
estimates store m1
regress y x1 x2, robust
estimates store m2
esttab m1 m2 using "table.tex", replace b(3) se(3) star(* 0.10 ** 0.05 *** 0.01)
Save graphs as .gph and .png
Always save the editable Stata format alongside the export. The .gph lets you re-style later without re-running the analysis.
twoway line lending year, ytitle("Lending rate (%)")
graph save "lending.gph", replace
graph export "lending.png", replace
Use locals for parameterised analysis
Hard-coding a variable name once is fine; hard-coding it ten times across a do-file is a bug waiting to happen. Use a local at the top: `local outcome "lending"` and reference `` `outcome' `` everywhere.
local outcome "lending"
local controls "deposit savings i.year"
regress `outcome' `controls', robust
esttab using "`outcome'_results.tex", replace