Module 04 of 12 · 40 min read · Intermediate

Filtering, sorting, and the if/in qualifiers

keep, drop, sort, gsort, the if and in syntax, _n and _N — the verbs of subsetting Stata data.

Learning objectives

By the end of this module, you should be able to:

01 Use keep and drop to subset rows and columns
02 Apply the if and in qualifiers to limit any command to a subset
03 Use preserve/restore to operate on a snapshot without destroying the original data
04 Reference observations positionally with _n and _N (current and total observation numbers)

keep, drop, sort, and the if/in qualifiers are how you subset and order data in Stata. They also explain the most common Stata gotcha: keep and drop change the dataset in memory with no undo, and every later command runs on whatever is left in memory.

keep and drop

stata

keep if year >= 2023                  // keep matching rows
drop if missing(lending_rate)         // drop missings

drop deposit_rate                      // drop this column
keep month lending_rate                // keep only these columns (run after the drop above; reversed, drop would fail because deposit_rate is already gone)

keep/drop are permanent

Once you keep or drop, the removed data is gone from memory and Stata has no undo. The file on disk is unaffected until you save over it. Protect the original in one of three ways: (1) save a backup first with save raw.dta, (2) wrap the analysis in preserve / restore, or (3) work on a copy.
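
Option (3), working on a copy, can be sketched as follows. This is a minimal illustration, assuming Stata's bundled auto dataset (loaded with sysuse auto) stands in for the lending-rate data, which is not available here. The macro name original is a hypothetical choice.

```stata
* Assumption: the bundled auto dataset replaces the module's lending data
sysuse auto, clear

tempfile original            // name for a temporary file that Stata erases automatically
save `original'              // write the untouched data to that file

keep if foreign == 1         // destructive: domestic cars leave memory
count                        // 22 observations remain

use `original', clear        // reload the untouched copy
count                        // back to 74 observations
```

A tempfile avoids cluttering the working directory. Use save raw.dta instead when the backup should outlive the session.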

preserve / restore

stata

preserve
    keep if year == 2024
    summarize lending_rate
restore
* data is back to its original state
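
To confirm that restore really undoes the subset, one can check _N before, during, and after. The sketch below again assumes the bundled auto dataset in place of the lending data.

```stata
* Assumption: the bundled auto dataset replaces the module's lending data
sysuse auto, clear
display _N                   // 74 observations in the full data

preserve
    keep if foreign == 1     // subset exists only until restore
    display _N               // 22 observations
    summarize price
restore

display _N                   // 74 again: the full data is back
```

If a do-file ends before reaching restore, Stata restores the preserved data automatically.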

if and in qualifiers

stata

summarize lending_rate if year == 2024
summarize lending_rate in 1/10           // first 10 observations
list if lending_rate > 0.13 & year == 2024
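
One detail worth checking when filtering with > is that Stata treats numeric missing values as larger than any number, so a condition such as lending_rate > 0.13 also matches rows where lending_rate is missing. The sketch below shows the effect on the bundled auto dataset (an assumption, since the lending data is not available here), and also shows that if and in can be combined.

```stata
* Assumption: the bundled auto dataset replaces the module's lending data
sysuse auto, clear

count if rep78 > 3                      // 34: includes the 5 missing values of rep78
count if rep78 > 3 & !missing(rep78)    // 29: missing values excluded explicitly

count if price > 5000 in 1/10           // if and in together: filter within the first 10 rows
```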

_n and _N

_n is the current observation number (the row index, after any sort); _N is the total number of observations. Combined with sort, they give you positional references.

stata

sort year month
generate first_obs = (_n == 1)
generate last_obs = (_n == _N)
generate prev_rate = lending_rate[_n - 1]   // lag: previous row's value (missing in row 1)
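
The same positional references can be tried on the bundled auto dataset, used here as a stand-in for the lending data. The by prefix in the last two commands goes slightly beyond this module: it is included only as a hedged illustration that _n and _N are counted within each group when a command runs under by. The variable names prev_price, rank_in_group, and group_size are hypothetical.

```stata
* Assumption: the bundled auto dataset replaces the module's lending data
sysuse auto, clear
sort foreign price

generate prev_price = price[_n - 1]        // missing in row 1: there is no row 0
by foreign: generate rank_in_group = _n    // restarts at 1 within each group
by foreign: generate group_size = _N       // 52 for domestic, 22 for foreign

list make foreign price prev_price rank_in_group group_size in 1/3
```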

sort and gsort

stata

sort year month
gsort -lending_rate          // descending; - prefix means reverse
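
gsort also accepts a sign on each variable separately, so one key can run descending while another runs ascending. A minimal sketch on the bundled auto dataset (an assumption, standing in for the lending data):

```stata
* Assumption: the bundled auto dataset replaces the module's lending data
sysuse auto, clear

gsort -foreign +price        // foreign cars first, then cheapest to most expensive
list make foreign price in 1/5

sort foreign price           // plain sort: ascending on every key
list make foreign price in 1/5
```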

Exercise

Keep only observations where year is 2024.

Key takeaways

◆ keep and drop are PERMANENT in memory: wrap exploratory subsets in preserve/restore
◆ if applies a logical filter; in applies a position-based slice (in 1/10 means observations 1 through 10)
◆ _n is the current observation; _N is the total. lending_rate[_n-1] gives the previous row's value (missing for the first row)
◆ sort orders ascending only; gsort accepts a - prefix on a variable to sort it descending