Stata tips & idioms

The patterns, shortcuts, and gotchas that distinguish economists who use Stata fluently from those who fight it. Read once, internalise over a few months.

Reproducibility first

Three habits that make every analysis you ever do replicable. Adopt them once and never look back.

Always work in a do-file

Type interactively to explore, but commit your steps to a `.do` file before you trust the result. Run the whole file with `do myfile.do` from Stata's Command window. Future you will thank present you.

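// At the top of every do-file:
clear all
set more off
capture log close
// Build a date stamp such as 2026-05-09 (Stata does not expand shell-style $(date))
local today : display %tdCCYY-NN-DD date(c(current_date), "DMY")
log using "analysis_`today'.log", replace text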

Set seed for any randomness

Whenever a command involves randomness (sampling, simulation, bootstrap), set the seed first. Otherwise your results aren't replicable.

set seed 20260509
bsample 100
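One way to convince yourself that the seed is doing its job is to draw the same bootstrap sample twice and compare. The sketch below assumes Stata's bundled `auto` dataset purely for illustration; the dataset, the `price` variable, and the scalar names are not part of this guide.

```stata
* Illustration only: uses the auto dataset shipped with Stata.
sysuse auto, clear

set seed 20260509
preserve
bsample 50                      // bootstrap sample of 50 observations
summarize price, meanonly
scalar mean_a = r(mean)
restore

set seed 20260509               // same seed, so the same draw
preserve
bsample 50
summarize price, meanonly
scalar mean_b = r(mean)
restore

display "first draw: " mean_a "   second draw: " mean_b
```

Comment out the second `set seed` line and the two means will generally differ, which is exactly the non-replicability the seed protects against.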

Use relative paths

Set the working directory once at the top of your do-file with `cd "<project_root>"` and use relative paths after. Hard-coded absolute paths break the moment anyone else runs your code.

// The one machine-specific line; every path below is relative to it
cd "C:/Users/yourname/projects/kenya-pension"
use "data/raw/pension.dta", clear

Inspecting data before you trust it

Five minutes of inspection prevents five hours of bug-hunting downstream.

Three commands every time you load

Always run `describe`, `summarize`, and `list in 1/5` immediately after `use`. They take seconds and catch many of the data problems you'd otherwise hit later.

use bankrates.dta, clear
describe
summarize
list in 1/5

Check for missingness explicitly

`misstable summarize` shows the count of missing values per variable. `misstable patterns` shows which combinations of variables are missing together — surprisingly informative.

misstable summarize
misstable patterns

Tabulate every categorical you'll touch

Before you condition on or interact with a categorical variable, `tab` it to see what's actually in there. Most 'unexpected coefficient' surprises trace back to a category you didn't know existed.

tab year
tab year half, missing

if and missing values

Where most subtle Stata bugs live. Get these right and your filters always do what you think.

Missing is the largest value

Stata represents missing as `.` and treats it as larger than any number. This means `if x > 100` includes missing values silently. Always be explicit when you mean missing.

// Wrong — includes missings:
count if lending > 15

// Right — excludes missings:
count if lending > 15 & !missing(lending)
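The effect is easy to reproduce on a toy dataset. The four values below are made up for illustration; only the variable name `lending` comes from the example above.

```stata
* Toy data, invented for illustration.
clear
input lending
12
18
.
20
end

count if lending > 15                        // counts 3: the missing row is included
count if lending > 15 & !missing(lending)    // counts 2: only 18 and 20
```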

Use missing() for clarity

`missing(x)` is more readable than `x == .` and works for both numeric and string variables. Use it.

drop if missing(lending)
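There is a second reason to prefer `missing()`: Stata also has extended missing values (`.a` through `.z`), and `x == .` does not match them. A minimal sketch on made-up data, where the `bank` variable and all values are hypothetical:

```stata
* Toy data, invented for illustration.
clear
input lending str10 bank
10  "Alpha"
.   "Beta"
.a  ""
end

count if lending == .        // counts 1: misses the extended missing value .a
count if missing(lending)    // counts 2: catches both . and .a
count if missing(bank)       // counts 1: the empty string
```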

inrange and inlist

Two functions that beat chains of `&` and `|`: `inrange(x, lo, hi)` for ranges that include both endpoints and `inlist(x, a, b, c)` for membership. Cleaner and harder to get wrong.

keep if inrange(year, 2018, 2024)
keep if inlist(year, 2018, 2020, 2022)
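To check that these functions really match the longhand conditions, here is a small sketch on generated data. The ten-year range is an arbitrary choice for illustration.

```stata
* Generated data for illustration: one row per year, 2016 to 2025.
clear
set obs 10
generate year = 2015 + _n

count if year >= 2018 & year <= 2024
local n_chain = r(N)
count if inrange(year, 2018, 2024)
assert r(N) == `n_chain'       // both count 7

count if year == 2018 | year == 2020 | year == 2022
local n_or = r(N)
count if inlist(year, 2018, 2020, 2022)
assert r(N) == `n_or'          // both count 3
```

`inrange()` returns 0 when its first argument is missing, so `keep if inrange(year, 2018, 2024)` also drops rows with a missing year.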

Reading regression output fluently

The single most important Stata skill. Practise reading these tables until they're as natural as reading prose.

Always pick a standard error story

Plain SEs assume i.i.d. errors — almost never true. Use `, robust` for cross-sectional data, `, vce(cluster id)` for grouped or panel data. State which you used in the paper.

regress lending deposit, robust
xtset fund_id                      // declare the panel variable before xtreg
xtreg lending deposit, fe vce(cluster fund_id)
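It helps to see that the choice of standard errors leaves the point estimates alone and changes only the uncertainty around them. The sketch below assumes Stata's bundled `auto` dataset for illustration; the variables and scalar names are not from this guide.

```stata
* Illustration only: uses the auto dataset shipped with Stata.
sysuse auto, clear

regress price weight
scalar b_plain  = _b[weight]
scalar se_plain = _se[weight]

regress price weight, vce(robust)
display "coefficient: " b_plain  " (plain)  vs  " _b[weight]  " (robust)"
display "std. error:  " se_plain " (plain)  vs  " _se[weight] " (robust)"
```

The two coefficients are identical. Only the standard errors, and therefore the t-statistics and confidence intervals, move.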

Read coefficients in context

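With a logged outcome, a coefficient times 100 is approximately the percentage change in the outcome per unit of the regressor. With a logged regressor and an outcome in levels, the coefficient divided by 100 is approximately the change in the outcome for a 1% increase in the regressor. A coefficient on an interaction is the change in slope, not the slope itself. A coefficient on `i.year` is relative to the omitted (base) year.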

regress lending c.deposit##i.year
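To make "change in slope, not the slope itself" concrete, the sketch below recovers a group's slope from an interacted model with `lincom` (covered in the next tip) and checks it against a regression run on that group alone. It assumes Stata's bundled `auto` dataset, with `foreign` standing in for the categorical variable; none of these names come from this guide.

```stata
* Illustration only: uses the auto dataset shipped with Stata.
sysuse auto, clear
regress price c.weight##i.foreign

* weight              = slope for the base group (foreign == 0)
* 1.foreign#c.weight  = how much the slope differs when foreign == 1
lincom weight + 1.foreign#c.weight     // slope for foreign == 1, with SE and CI
scalar slope_foreign = r(estimate)

* Cross-check: the fully interacted model reproduces the subgroup slope.
regress price weight if foreign == 1
display "from lincom: " slope_foreign "   from subgroup: " _b[weight]
```

After the interacted regression, `margins, dydx(weight) at(foreign=(0 1))` reports both slopes in one table.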

Use lincom and test for any combination

Want the sum of two coefficients with its CI? `lincom`. Want to test a joint hypothesis? `test`. Don't compute these by hand — Stata propagates the standard errors correctly.

regress y x1 x2 x3
lincom x1 + x2
test x1 = x2 = 0
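Note that "both coefficients are zero" and "the two coefficients are equal" are different hypotheses, and `test` handles each. A short sketch, again assuming the bundled `auto` dataset and its variables for illustration:

```stata
* Illustration only: uses the auto dataset shipped with Stata.
sysuse auto, clear
regress price weight length mpg

test weight length            // joint test: both coefficients are zero
scalar p_joint = r(p)

test weight = length          // equality test: the coefficients are the same
scalar p_equal = r(p)

display "joint p-value = " p_joint "   equality p-value = " p_equal
```

Both commands leave their results in `r()`, so the p-values can be reused in later code instead of being copied from the output window.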

Reporting that survives review

How to produce tables and figures that stand up to a referee report.

Use esttab for publication tables

Install once with `ssc install estout`. Then store estimates with `estimates store m1`, repeat for each model, and `esttab m1 m2 m3 using table.tex, ...` exports a clean side-by-side table in LaTeX, RTF, or HTML.

regress y x1, robust                 // first model (illustrative specification)
estimates store m1
regress y x1 x2, robust
estimates store m2
esttab m1 m2 using "table.tex", replace b(3) se(3) star(* 0.10 ** 0.05 *** 0.01)

Save graphs as .gph and .png

Always save the editable Stata format alongside the export. The .gph lets you re-style later without re-running the analysis.

twoway line lending year, ytitle("Lending rate (%)")
graph save "lending.gph", replace
graph export "lending.png", replace

Use locals for parameterised analysis

Hard-coding a variable name once is fine; hard-coding it ten times across a do-file is a bug waiting to happen. Use a local at the top: `local outcome "lending"` and reference `` `outcome' `` everywhere. Note the quoting: a backtick opens the reference and an apostrophe closes it.

local outcome "lending"
local controls "deposit savings i.year"
regress `outcome' `controls', robust
esttab using "`outcome'_results.tex", replace
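The same idea extends to running one specification over several outcomes. The loop below is a minimal sketch that assumes Stata's bundled `auto` dataset; the outcome list, the controls, and the `m_` prefix for stored estimates are hypothetical choices for illustration.

```stata
* Illustration only: uses the auto dataset shipped with Stata.
sysuse auto, clear
local outcomes "price mpg"
local controls "weight i.foreign"

foreach outcome of local outcomes {
    regress `outcome' `controls', robust
    estimates store m_`outcome'     // stored as m_price, m_mpg
}
estimates dir                       // list what was stored
```

Locals exist only for the run in which they are defined. If you highlight and run part of a do-file, any local defined in lines you did not select is empty, so run the definitions together with the lines that use them.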