Module 03 of 12 · 55 min read · Intermediate

Generate, replace, egen — variable transformation

Create variables, transform variables, recode variables, and use egen for the operations the basic generate cannot do.

Learning objectives

By the end of this module, you should be able to:

  • 01 Create new variables with generate and modify existing ones with replace
  • 02 Use egen for derived statistics that require by-group operations
  • 03 Recode categorical variables and encode string variables to numeric with labels
  • 04 Handle dates correctly using Stata's internal date format

generate, replace, and egen are the three commands you'll use to create and modify variables. They cover almost every transformation you'll need.

generate — create a new variable

stata
generate spread = lending_rate - deposit_rate
generate log_assets = ln(assets)
generate is_tier1 = (tier == 1) if !missing(tier) // 0/1 indicator; missing tier stays missing
generate year = year(date)
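
To see these commands on concrete numbers, here is a minimal sketch on a made-up three-row dataset. The variable names follow the examples above; the values are invented for illustration. It also shows one pitfall: `(tier == 1)` evaluates to 0 when `tier` is missing, so the `if !missing(tier)` qualifier is added to keep missing as missing.

```stata
clear
input lending_rate deposit_rate assets tier
0.14 0.09 1000 1
0.12 0.08 2500 2
0.11 0.07    . .
end

generate spread = lending_rate - deposit_rate
generate log_assets = ln(assets)                   // missing where assets is missing
generate is_tier1 = (tier == 1) if !missing(tier)  // 0/1; missing tier stays missing

list
```

Run it as a do-file; the third row should show missing values for both log_assets and is_tier1.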

replace — modify an existing variable

stata
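replace spread = . if month == tm(2024m12) // set to missing; month is a %tm monthly date
* stance must already exist as a string variable
replace stance = "tight" if rate > 0.13 & !missing(rate) // a missing rate also satisfies > 0.13
replace stance = "loose" if rate < 0.10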
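
A common pattern is to create the variable once with generate and then refine it with replace. The sketch below assumes a made-up `rate` variable and a hypothetical default category "neutral" that does not appear in the examples above. It guards the `>` comparison against missing values, since a missing rate would otherwise be classified as "tight".

```stata
clear
input rate
0.15
0.12
0.08
.
end

generate str7 stance = "neutral"                          // hypothetical default category
replace stance = "tight" if rate > 0.13 & !missing(rate)  // exclude missing explicitly
replace stance = "loose" if rate < 0.10                   // missing is never < 0.10
replace stance = ""      if missing(rate)                 // empty string is the string missing value

list
```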

egen — extended generate, for derived stats

egen handles transformations that need information from multiple rows: means, medians, ranks, totals, group operations.

stata
egen mean_rate = mean(lending_rate) // overall mean
egen mean_by_year = mean(lending_rate), by(year) // by-group mean
egen rank = rank(lending_rate) // rank order
egen median_rate = median(lending_rate)
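
One typical use of a by-group statistic is to compare each observation with its group. The following sketch uses invented data and the `bysort year:` prefix, which is equivalent to the `by(year)` option for these functions. The names `dev_from_mean` and `rank_in_year` are illustrative only.

```stata
clear
input year lending_rate
2023 0.10
2023 0.14
2024 0.12
2024 0.16
2024 0.11
end

bysort year: egen mean_by_year = mean(lending_rate)   // one mean per year, repeated on every row
generate dev_from_mean = lending_rate - mean_by_year  // deviation from the year's mean
bysort year: egen rank_in_year = rank(lending_rate)   // 1 = lowest rate within the year

list, sepby(year)
```

Because egen writes the group statistic onto every row of the group, the result can be used directly in a later generate, as the deviation line shows.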

Recoding categorical variables

stata
* 4/max stops at the largest nonmissing value, so a missing tier stays missing
recode tier (1 = 1 "Tier 1") (2/3 = 2 "Tier 2-3") (4/max = 3 "Tier 4+"), generate(tier_grouped)
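
It is worth checking a recode with a cross-tabulation before relying on it. This sketch uses invented tier values, including one missing, and the `4/max` range so that the missing value is not swept into the top group.

```stata
clear
input tier
1
2
3
4
6
.
end

recode tier (1 = 1 "Tier 1") (2/3 = 2 "Tier 2-3") (4/max = 3 "Tier 4+"), generate(tier_grouped)

* each original value should land in exactly one new group
tabulate tier tier_grouped, missing
```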

Encoding text to numeric

stata
encode bank_name, generate(bank_id)
* bank_id holds integer codes (assigned in alphabetical order) with value labels showing the original text
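
To inspect what encode produced, you can list the value label and round-trip back to text with decode. The bank names below are placeholders chosen for illustration, and `bank_name_check` is a hypothetical variable name.

```stata
clear
input str20 bank_name
"Zenith"
"Access"
"Zenith"
"Ecobank"
end

encode bank_name, generate(bank_id)
label list bank_id                 // codes follow alphabetical order: 1 Access, 2 Ecobank, 3 Zenith
list bank_name bank_id, nolabel    // show the underlying integer codes

decode bank_id, generate(bank_name_check)  // convert back to a string
assert bank_name == bank_name_check        // the round trip recovers the original text
```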

Working with dates

stata
generate date = date(date_string, "YMD")
format date %td
generate month = mofd(date)
format month %tm
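
The steps above can be tried on a few invented date strings. This sketch assumes the strings are in year-month-day order and includes one unparseable value to show that date() returns missing rather than an error, which is worth counting after every conversion.

```stata
clear
input str10 date_string
"2024-12-31"
"2025-01-15"
"n/a"
end

generate date = date(date_string, "YMD")   // days since 01jan1960
format date %td
generate month = mofd(date)                // months since 1960m1
format month %tm
generate year = year(date)

count if missing(date)                     // strings that failed to parse
list
```

Date literals such as `td(31dec2024)` and `tm(2024m12)` let you compare against these variables without computing the underlying integers yourself.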

Missing in Stata is dot, not blank

Missing numeric values are stored as `.` (a missing string is the empty string `""`). Stata treats numeric missing as larger than any number, so missings sort last and `if x > 0` includes them. Use `if x > 0 & !missing(x)` to exclude them.
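
A quick way to convince yourself is to count the same condition with and without the guard. The data here are invented for illustration.

```stata
clear
input x
-1
2
5
.
end

count if x > 0                  // 3: the missing row is counted
count if x > 0 & !missing(x)    // 2
count if x > 0 & x < .          // 2: same result, since every missing value is >= .
```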

Exercise

Generate a new variable spread = lending_rate - deposit_rate.

Key takeaways

  • generate creates; replace modifies. Stata enforces this strictly: generate fails if the variable already exists, and replace fails if it does not
  • egen handles by-group derived stats: mean, median, rank, total — operations needing multiple rows
  • encode converts strings to numeric while preserving the original text as labels — the right way to handle categorical text
  • Missing values are stored as `.` and sort as +infinity — `if x > 0` includes missings unless you add `& !missing(x)`

Further reading

    Data Management Using Stata

    Michael N. Mitchell · Stata Press · 2010

LeadAfrik Public Economics Hub