Module 05 of 12 · 50 min read · Intermediate

Summarizing and tabulating

summarize, tabulate, table, tabstat, by:, bysort:, collapse. How to read a dataset before you model it.

Learning objectives

By the end of this module, you should be able to:

  • 01 Use summarize, summarize with the detail option, and bysort: for overall and grouped statistics
  • 02 Apply tabulate for one-way and two-way frequency tables with chi-square tests
  • 03 Use the modern table command (Stata 17+) for flexible cross-tabulations
  • 04 Use collapse to aggregate a dataset into a smaller summary dataset

Before you model anything, you summarise it. Stata's summarize, tabulate, and table commands cover most descriptive needs; bysort lets you stratify them.

summarize — continuous variables

stata
summarize lending_rate
summarize lending_rate, detail // with percentiles
summarize lending_rate if year == 2024
bysort year: summarize lending_rate
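
summarize also leaves its results behind in r(), which is handy when a later command needs the mean or median. The sketch below is illustrative only: the four observations are made-up toy values, and the variable names simply mirror the ones used in this module.

```stata
* Toy data (made-up values) so the example runs on its own
clear
input int year byte tier double lending_rate
2023 1 10
2023 2 14
2024 1 12
2024 2 16
end

* Default summarize stores N, mean, SD, min, and max in r()
summarize lending_rate
display "mean = " r(mean) "  sd = " r(sd) "  N = " r(N)

* detail adds percentiles; r(p50) is the median
summarize lending_rate, detail
display "median = " r(p50)
```

Run return list after either command to see everything that was stored.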

tabulate — categorical and cross-tabs

stata
tabulate tier // one-way
tabulate tier region // two-way
tabulate tier, summarize(lending_rate) // mean, SD, and frequency by tier
* Pearson chi-square test of independence, with row percentages
tabulate tier region, chi2 row
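
A two-way tabulate with chi2 stores the test statistic and p-value in r(), so they can be reused rather than copied from the output. This is a minimal sketch on six made-up observations; with cells this small the test itself is not meaningful, and it is shown only to illustrate the mechanics.

```stata
* Toy data (made-up values): 2 tiers x 2 regions
clear
input byte tier byte region
1 1
1 1
1 2
2 1
2 2
2 2
end

* Cross-tab with row percentages and the Pearson chi-square test
tabulate tier region, chi2 row

* Stored results: r(chi2), r(p), plus r(N), r(r) rows, r(c) columns
display "chi2 = " r(chi2) "  p = " r(p) "  N = " r(N)
```

For sparse tables like this one, tabulate's exact option (Fisher's exact test) is usually the safer choice.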

table — flexible cross-tabs (Stata 17+)

stata
table tier, statistic(mean lending_rate) statistic(sd lending_rate)
table (tier) (year), statistic(mean lending_rate)
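
As a hedged illustration of the Stata 17+ syntax, the sketch below puts tier in the rows and year in the columns, requests two statistics per cell, and formats the means to two decimals. It assumes Stata 17 or later (the version line will stop with an error on older releases) and uses made-up toy values.

```stata
version 17

* Toy data (made-up values) so the example runs on its own
clear
input int year byte tier double lending_rate
2023 1 10
2023 2 14
2024 1 12
2024 2 16
end

* Rows = tier, columns = year; mean and cell frequency in each cell
* nformat() applies the display format to the mean only
* nototals suppresses the row and column totals
table (tier) (year), statistic(mean lending_rate) statistic(frequency) ///
    nformat(%6.2f mean) nototals
```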

tabstat — multi-statistic summaries

stata
tabstat lending_rate, by(year) statistics(mean sd min max n)

by and bysort

by: prefixes a command and runs it once per group. The data must be sorted by the by-variable first. bysort: combines sort and by.

stata
bysort year: summarize lending_rate
bysort year: egen mean_rate = mean(lending_rate)
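
The difference between by: and bysort: is easiest to see on deliberately unsorted data. The following is a small sketch with made-up values; rank_in_year is a hypothetical variable name introduced only for this example.

```stata
* Toy data (made-up values), deliberately NOT sorted by year
clear
input int year byte tier double lending_rate
2024 1 12
2023 1 10
2024 2 16
2023 2 14
end

* by: alone fails on unsorted data ("not sorted", error 5)
* capture traps the error so the do-file keeps running
capture by year: summarize lending_rate
display "return code without sorting: " _rc

* bysort: sorts by year first, then runs the command once per year
bysort year: egen mean_rate = mean(lending_rate)

* Under by:, _n counts within each group; the variable in
* parentheses controls the sort order inside the group
bysort year (lending_rate): generate rank_in_year = _n

list, sepby(year)
```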

collapse — aggregate to a smaller dataset

stata
collapse (mean) lending_rate (sd) sd_rate=lending_rate, by(year)
* result: one row per year; lending_rate now holds the yearly mean, sd_rate the yearly SD
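
Because collapse replaces the data in memory, a common pattern is to wrap it in preserve and restore. The sketch below assumes made-up toy values; mean_rate, n_obs, and the tempfile name year_summary are hypothetical names chosen for illustration.

```stata
* Toy data (made-up values) so the example runs on its own
clear
input int year byte tier double lending_rate
2023 1 10
2023 2 14
2024 1 12
2024 2 16
end

* Snapshot the row-level data before aggregating
preserve

collapse (mean) mean_rate=lending_rate (sd) sd_rate=lending_rate ///
    (count) n_obs=lending_rate, by(year)
list

* Keep the summary in a temporary file for later use
tempfile year_summary
save `year_summary'

* Bring the original row-level data back into memory
restore
count
```

After restore, the four original rows are back and the yearly summary is still available through the temporary file.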

The exploration template

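describe → codebook → summarize → bysort tier: summarize → tabulate → correlate (corr for short). Run in that order, about ten lines of code show you the variable types, missing values, distributions, group differences, and pairwise correlations before you fit any model.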
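
Written out as a do-file, the sequence might look like the sketch below. It runs on made-up toy values; in practice you would replace the input block with a use command pointing at your own dataset.

```stata
* Toy data (made-up values) standing in for a real dataset
clear
input int year byte tier double lending_rate
2023 1 10
2023 2 14
2024 1 12
2024 2 16
end

describe                              // variable names, types, labels
codebook, compact                     // ranges, unique values, missing counts
summarize                             // N, mean, SD, min, max for every variable
bysort tier: summarize lending_rate   // the same statistics within each tier
tabulate tier                         // frequencies of the categorical variable
correlate lending_rate year           // pairwise correlation
```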

Exercise

Compute the mean lending_rate by year using bysort.

Key takeaways

  • summarize, detail adds percentiles, variance, skewness, and kurtosis to the default N, mean, SD, min, and max
  • bysort var: command runs the command once per group of var
  • tabulate var1 var2, chi2 row gives a cross-tab with chi-square independence test
  • collapse replaces the dataset in memory with the aggregated one; wrap it in preserve/restore if you need the row-level data back

Further reading

    A Gentle Introduction to Stata (7th Edition)

    Alan C. Acock · Stata Press · 2023

LeadAfrik Public Economics Hub