Once you have a frame, how do you select the sample — and how do you analyse the result correctly? This module covers sampling design and the often-misunderstood matter of survey weights. Getting these wrong produces biased estimates and wrong standard errors even from perfect data, and ignoring weights is one of the most common errors in applied work with survey data.
Probability sampling
Why known probabilities matter
Probability sampling means every unit in the frame has a KNOWN, NON-ZERO probability of selection. This is the foundation that lets you INFER from the sample to the population with quantifiable uncertainty (standard errors, confidence intervals) — because you know how the sample relates to the population, you can correct for the selection (via weights) and quantify the sampling error. NON-probability sampling (convenience samples, quota samples, 'whoever we could reach') breaks this: without known selection probabilities, you cannot validly infer to the population or quantify uncertainty — the sample may be arbitrarily biased and you can't tell. (This is the deep problem with many big-data and online samples — module 7 — which are large but non-probability, so size doesn't cure their bias.) Probability sampling is what makes a survey a basis for valid statistical inference, and it is the dividing line between data you can generalise from and data you can't.
Three sampling designs
- Simple random sampling (SRS): every unit has an equal chance of selection, and every possible sample of a given size is equally likely. The conceptual benchmark, but often impractical (you need a complete frame of individuals, and units may be geographically scattered, making face-to-face fieldwork expensive).
- Stratified sampling: divide the population into STRATA (regions, urban/rural, income groups) and sample within each. Benefits: GUARANTEES representation of each stratum (rather than relying on chance), ALLOWS OVERSAMPLING of small but important groups (e.g., sample the rich or a minority more heavily to study them, then weight back), and IMPROVES PRECISION (if strata are internally homogeneous, stratifying reduces sampling variance). Widely used.
- Cluster sampling: sample GROUPS (villages, enumeration areas) first, then units within them. Benefits: far CHEAPER for face-to-face surveys (interviewers visit a few clusters rather than scattered individuals across the whole country) and needs only a frame of clusters, not of all individuals. Cost: statistical inefficiency (see "The design effect" below).
Real surveys typically combine these in MULTISTAGE designs: stratify, then sample clusters within strata, then households within clusters — balancing cost, precision, and representativeness.
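The three designs can be sketched in a few lines of Python. This is a minimal illustration only, not a production sampler: the frame, its field names (`stratum`, `cluster`, `prob`), the sample sizes, and the helper functions are all hypothetical and chosen for the example.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical frame: 1,000 households in 50 clusters of 20.
# The first 300 households (clusters 0-14) are urban, the rest rural.
frame = [
    {"id": i, "stratum": "urban" if i < 300 else "rural", "cluster": i // 20}
    for i in range(1000)
]

def simple_random_sample(units, n):
    """Draw n units without replacement; each unit has probability n / N."""
    return rng.sample(units, n)

def stratified_sample(units, n_per_stratum):
    """Draw a separate simple random sample within each stratum."""
    sample = []
    for stratum, n in n_per_stratum.items():
        members = [u for u in units if u["stratum"] == stratum]
        for u in rng.sample(members, n):
            # Selection probability differs by stratum, so record it
            # for later weighting (weight = 1 / prob).
            sample.append({**u, "prob": n / len(members)})
    return sample

def cluster_sample(units, n_clusters, n_per_cluster):
    """Two-stage design: sample clusters, then households within each."""
    cluster_ids = sorted({u["cluster"] for u in units})
    chosen = rng.sample(cluster_ids, n_clusters)
    sample = []
    for c in chosen:
        members = [u for u in units if u["cluster"] == c]
        sample.extend(rng.sample(members, n_per_cluster))
    return sample

srs = simple_random_sample(frame, 100)
strat = stratified_sample(frame, {"urban": 60, "rural": 40})  # oversamples urban
clus = cluster_sample(frame, n_clusters=10, n_per_cluster=10)
```

All three samples contain 100 households, but only the cluster sample confines fieldwork to 10 locations, and only the stratified sample fixes the urban/rural split in advance. A multistage design would nest these steps: clusters drawn within strata, then households within clusters.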
The design effect
Why cluster samples carry less information
Cluster sampling is cheaper but statistically LESS EFFICIENT, because units within a cluster are CORRELATED — people in the same village share unobserved characteristics (climate, local economy, services, culture), so they are more alike than randomly-chosen individuals (the intra-cluster correlation, ICC). Because of this correlation, each additional unit within a cluster adds LESS NEW INFORMATION than an independent unit would — a cluster of 20 households does NOT give 20 independent observations' worth of information. The DESIGN EFFECT quantifies this: it is the factor by which the variance of an estimate is inflated (and the effective sample size reduced) relative to simple random sampling, and it grows with both the cluster size and the ICC. The practical consequences: (1) cluster samples need MORE total units to achieve a given precision (you must add CLUSTERS, not just units within clusters); and (2) standard errors computed as if the data were a simple random sample are TOO SMALL — you must account for the clustering (cluster-robust standard errors, survey-design-based variance estimation) or you'll overstate your precision and find spurious significance. This is exactly the same intra-cluster-correlation issue as cluster-randomised RCTs (the Impact Evaluation course) — a unifying concept across the data-and-methods specialization.
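To make the dependence on cluster size and ICC concrete, the sketch below uses a widely used approximation for equal-sized clusters, deff = 1 + (m - 1) x ICC, where m is the number of units per cluster. The formula is not stated in the text above and is only an approximation; the function names and the numbers (50 clusters of 20 households, ICC of 0.05) are illustrative assumptions.

```python
import math

def design_effect(cluster_size, icc):
    """Approximate design effect for equal-sized clusters: 1 + (m - 1) * icc."""
    return 1 + (cluster_size - 1) * icc

def effective_sample_size(n_total, cluster_size, icc):
    """Number of independent observations the clustered sample is 'worth'."""
    return n_total / design_effect(cluster_size, icc)

# Illustrative numbers (assumed): 50 clusters of 20 households, ICC = 0.05.
deff = design_effect(cluster_size=20, icc=0.05)
n_eff = effective_sample_size(n_total=1000, cluster_size=20, icc=0.05)
se_inflation = math.sqrt(deff)  # factor by which naive SRS standard errors are too small

print(f"design effect:         {deff:.2f}")          # 1.95
print(f"effective sample size: {n_eff:.0f}")         # roughly 513 out of 1,000
print(f"SE inflation factor:   {se_inflation:.2f}")  # about 1.40
```

Under these assumptions, 1,000 clustered households carry about as much information as 513 independent ones, and standard errors computed as if the data were a simple random sample would be too small by a factor of about 1.4. Holding the total at 1,000 but using 100 clusters of 10 lowers the approximate design effect to 1.45, which is why adding clusters helps more than adding units within clusters.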
Survey weights
Why ignoring weights gives wrong answers
When units are sampled with DIFFERENT probabilities (as in stratified designs that oversample some groups, or because of unequal cluster sizes and non-response), the raw sample is NOT representative of the population: some units stand for more of the population than others. SURVEY WEIGHTS correct this: each unit's weight is (roughly) the INVERSE of its selection probability (adjusted for non-response and calibrated to known population totals), so a unit sampled with low probability gets a high weight (it represents many population units) and vice versa. Weighted estimates recover population quantities; UNWEIGHTED estimates are generally BIASED whenever selection probabilities differ and are related to the variable being measured. Concretely: if you oversampled the rich (to study them) and then compute an unweighted average income, you'll get a number far too high (the rich are over-represented in the raw sample), so you MUST weight to get the population average. Ignoring survey weights is one of the most common and serious errors in applied work with survey microdata: it produces biased estimates of means, totals, and relationships whenever the design is not self-weighting. The rule: with complex survey data, ALWAYS use the survey weights for population estimates, and account for the design (strata, clusters) in standard errors. The weights and the design are not optional technicalities; they are required to get the right answer.
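The oversampled-rich example can be worked through numerically. The sketch below uses an invented population (9,000 non-rich households with income 100 and 1,000 rich households with income 1,000, so the true mean is 190) and a design that samples 100 households from each group. All numbers and field names are assumptions for illustration, and the weights are plain inverse selection probabilities with no non-response or calibration adjustment.

```python
# Hypothetical design: 100 households sampled from each group.
# Non-rich: 100 of 9,000 selected; rich: 100 of 1,000 selected (oversampled).
sample = (
    [{"income": 100.0, "prob": 100 / 9000} for _ in range(100)]
    + [{"income": 1000.0, "prob": 100 / 1000} for _ in range(100)]
)

# Base weight = inverse of the selection probability:
# each non-rich household stands for about 90 households, each rich one for 10.
for unit in sample:
    unit["weight"] = 1 / unit["prob"]

unweighted_mean = sum(u["income"] for u in sample) / len(sample)
weighted_mean = (
    sum(u["weight"] * u["income"] for u in sample)
    / sum(u["weight"] for u in sample)
)

print(f"unweighted mean: {unweighted_mean:.0f}")  # 550, far too high
print(f"weighted mean:   {weighted_mean:.0f}")    # 190, the population mean
```

The unweighted mean treats the sample as if half the population were rich; the weighted mean restores each group to its population share. In practice the weights usually come with the survey's microdata and should be used as supplied rather than recomputed.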
Exercise
A national survey uses a stratified, clustered design: it divides the country into urban and rural strata, OVERSAMPLES urban areas (to study them in detail), then samples villages/neighbourhoods (clusters) and households within them. An analyst computes the national poverty rate as the simple unweighted average across all sampled households, using simple-random-sample standard errors. (1) Explain why the unweighted poverty rate is wrong and its likely direction. (2) Explain how survey weights fix it. (3) Explain why the standard errors are also wrong. (4) Explain the precision cost of the clustering and how to design for it.