Before you model anything, you summarise it. Stata's summarize, tabulate, and table commands cover most descriptive needs; bysort lets you stratify them.
summarize — continuous variables
summarize lending_rate
summarize lending_rate, detail // with percentiles
summarize lending_rate if year == 2024
bysort year: summarize lending_rate
tabulate — categorical and cross-tabs
tabulate tier // one-way
tabulate tier region // two-way
tabulate tier, summarize(lending_rate) // mean by tier
* Common chi-square option
tabulate tier region, chi2 row
table — flexible cross-tabs (Stata 17+)
table tier, statistic(mean lending_rate) statistic(sd lending_rate)
table (tier) (year), statistic(mean lending_rate)
tabstat — multi-statistic summaries
tabstat lending_rate, by(year) statistics(mean sd min max n)
by and bysort
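by varlist: is a prefix that runs a command once per group of the by-variable. The data must already be sorted by that variable; otherwise Stata stops with a "not sorted" error. bysort varlist: sorts the data and applies by in one step.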
bysort year: summarize lending_rate
bysort year: egen mean_rate = mean(lending_rate)
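To make the relationship between by and bysort concrete, here is a minimal sketch on a toy dataset. The four observations are made up purely for illustration; only the variable names year and lending_rate come from the examples above.

```stata
* Toy data, invented for illustration only
clear
input year lending_rate
2023 5.1
2024 6.0
2023 4.9
2024 6.4
end

* Two-step form: sort first, then use the by prefix
sort year
by year: summarize lending_rate

* One-step form: bysort sorts and groups in a single command
bysort year: summarize lending_rate
```

Both forms should print the same per-year summaries. Skipping the sort before the plain by call on unsorted data would produce the "not sorted" error instead.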
collapse — aggregate to a smaller dataset
collapse (mean) lending_rate (sd) sd_rate=lending_rate, by(year)
* result: one row per year, with mean and SD
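Because collapse replaces the dataset in memory, it is often wrapped in preserve and restore so the original observations come back afterwards. The sketch below assumes a toy dataset invented for illustration; mean_rate and sd_rate are simply the names chosen here for the aggregated columns.

```stata
* Toy data, invented for illustration only
clear
input year lending_rate
2023 5.1
2023 4.9
2024 6.0
2024 6.4
end

preserve                     // keep a copy of the full dataset
collapse (mean) mean_rate=lending_rate (sd) sd_rate=lending_rate, by(year)
list                         // one row per year
restore                      // bring back the original observations
```

If the aggregated data is what you want to keep, drop the preserve and restore pair and save the collapsed dataset under a new file name instead.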
The exploration template
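describe → codebook → summarize → bysort tier: summarize → tabulate → correlate. About ten lines of code are enough to give you a solid first picture of a new dataset: its variables, their types and ranges, how they vary across groups, and how they move together.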
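One way the template might look as a short do-file is sketched below. The toy data and the numeric coding of tier are assumptions made for illustration; in practice you would load your own dataset with use and adjust the variable names.

```stata
* Toy data, invented for illustration only (tier coded 1/2 here by assumption)
clear
input year tier lending_rate
2023 1 5.1
2023 2 4.9
2024 1 6.0
2024 2 6.4
end

describe                               // variable names, types, labels
codebook, compact                      // one line per variable: obs, unique values, range
summarize lending_rate, detail         // distribution with percentiles
bysort tier: summarize lending_rate    // same summary within each tier
tabulate tier year                     // cell counts across tier and year
correlate lending_rate year            // pairwise correlation of numeric variables
```

The order moves from structure (describe, codebook) to distributions (summarize) to relationships (tabulate, correlate), so each step tells you what to look for in the next.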
Exercise
Compute the mean lending_rate by year using bysort.