generate, replace, and egen are the three commands you'll use to create and modify variables. They cover almost every transformation you'll need.
generate — create a new variable
    generate spread = lending_rate - deposit_rate
    generate log_assets = ln(assets)
    generate is_tier1 = (tier == 1)   // boolean as 0/1
    generate year = year(date)
replace — modify an existing variable
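    * assumes month is a %tm monthly date (see Working with dates) and stance already exists as a string variable
    replace spread = . if month == tm(2024m12)   // set missing
    replace stance = "tight" if rate > 0.13 & !missing(rate)   // missing counts as > 0.13, so exclude it
    replace stance = "loose" if rate < 0.10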
egen — extended generate, for derived stats
egen handles transformations that need information from multiple rows: means, medians, ranks, totals, group operations.
    egen mean_rate = mean(lending_rate)                // overall mean
    egen mean_by_year = mean(lending_rate), by(year)   // by-group mean
    egen rank = rank(lending_rate)                     // rank order
    egen median_rate = median(lending_rate)
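A minimal sketch of why the by-group form is useful, using a made-up four-row dataset (the values and the variable name `dev_from_year` are illustrative assumptions, not part of the original material). It attaches each year's mean to every row of that year, then computes each row's deviation from it:

```stata
* Toy data: two observations per year (values are invented for illustration)
clear
input year lending_rate
2023 0.10
2023 0.12
2024 0.14
2024 0.18
end

* Each row receives the mean of its own year
egen mean_by_year = mean(lending_rate), by(year)

* Deviation of each observation from its year mean (hypothetical variable name)
generate dev_from_year = lending_rate - mean_by_year

list, sepby(year)
```

The same result can be written with a prefix, `bysort year: egen mean_by_year = mean(lending_rate)`. Which form to use is mostly a matter of taste.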
Recoding categorical variables
    recode tier (1 = 1 "Tier 1") (2/3 = 2 "Tier 2-3") (4/max = 3 "Tier 4+"), generate(tier_grouped)
Encoding text to numeric
    encode bank_name, generate(bank_id)
    * Stata stores numerics for analysis, with a label showing original text
Working with dates
    generate date = date(date_string, "YMD")
    format date %td
    generate month = mofd(date)
    format month %tm
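As a rough end-to-end sketch, the conversion can be tried on two invented date strings. The data below are assumptions for illustration only. The last line shows one way to select a month with the `tm()` pseudofunction, since `month` is a number with a display format rather than text:

```stata
* Toy data: dates stored as text (values are invented for illustration)
clear
input str10 date_string
"2024-12-31"
"2025-01-15"
end

* Convert text to a daily date, then derive the monthly date
generate date = date(date_string, "YMD")
format date %td
generate month = mofd(date)
format month %tm

list

* Compare against a monthly date literal, not a string
count if month == tm(2024m12)
```

If `date()` cannot parse a string under the given mask, it returns missing, so it is worth checking `count if missing(date)` after the conversion.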
Missing in Stata is dot, not blank
Missing numeric values are stored as `.`. Stata treats them as larger than any number, so they sort last and `if x > 0` includes them. Use `if x > 0 & !missing(x)` to exclude them.
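A small sketch of the pitfall, using three made-up observations (the data are an assumption for illustration). The first count includes the missing value and the second does not:

```stata
* Toy data: one negative value, one positive value, one missing
clear
input x
-1
2
.
end

count if x > 0                 // counts 2: the value 2 and the missing value
count if x > 0 & !missing(x)   // counts 1: only the value 2
```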
Exercise
Generate a new variable spread = lending_rate - deposit_rate.