Skip to content
Module 08 of 845 min readIntermediate

Trusting queries — sanity checks and NULL traps

Pre-share checklist, the three NULL traps that cause silent wrong answers, UNION/JOIN/EXISTS by intent.

100%

Listen along

Read “Trusting queries — sanity checks and NULL traps” aloud

Plays in your browser using on-device text-to-speech — nothing leaves the page.

Learning objectives

By the end of this module, you should be able to:

  • 01Sanity-check a query before shipping it: count rows, check NULL behaviour, compare to a known total
  • 02Recognise the three NULL traps that cause silent wrong answers
  • 03Choose between UNION, UNION ALL, JOIN, and EXISTS with intent

An analyst's reputation is built on queries other people can trust. The difference between an intermediate analyst and a junior one is not query-writing speed — it's the discipline of checking the query before sharing the answer.

The pre-share checklist

  1. Row count of the final result — does it match a back-of-envelope expectation?
  2. Row count of each input table — did you accidentally lose half the data with an inner join?
  3. Sum of a key metric — does it tie to a known total (last month's revenue, last week's signups)?
  4. NULL handling — did any of your aggregates silently ignore NULLs you cared about?
  5. Duplicates — did a JOIN explode the row count? (Common when joining a 1:many or many:many without thinking.)
  6. Boundary conditions — what does the query return for the empty case (no rows match)? For NULL inputs? For the most recent day?

The three NULL traps

NULLs cause more silent wrong answers than any other SQL feature. Three patterns to know cold:

Trap 1: NULL != NULL

sql
-- Looks like it filters NULL country, but doesn't:
SELECT * FROM customers WHERE country = NULL; -- always empty
-- Correct:
SELECT * FROM customers WHERE country IS NULL;

Trap 2: aggregates skip NULLs silently

sql
-- AVG(salary) ignores NULL salaries — your 'average' is over only known salaries.
-- Sometimes that's what you want. Sometimes it's a bug.
SELECT AVG(salary) FROM employees; -- silently skips NULLs
-- Treat NULL as zero if that's the intent:
SELECT AVG(COALESCE(salary, 0)) FROM employees;

Trap 3: NOT IN with NULL in the subquery → empty result

sql
-- If any row in excluded_users.user_id is NULL,
-- this returns ZERO rows — not what you want.
SELECT * FROM users
WHERE user_id NOT IN (SELECT user_id FROM excluded_users);
-- Safer:
SELECT * FROM users u
WHERE NOT EXISTS (
SELECT 1 FROM excluded_users e WHERE e.user_id = u.user_id
);

Trap 3 is the assassin

NOT IN + NULL is the single most-common cause of 'why is my query empty?' in support tickets. Use NOT EXISTS or LEFT JOIN … WHERE … IS NULL instead. The reason: SQL three-valued logic treats x NOT IN (1, 2, NULL) as 'NOT (x = 1 OR x = 2 OR x = NULL)' — and the NULL leg evaluates to UNKNOWN, which propagates through OR to UNKNOWN overall.

UNION vs UNION ALL vs JOIN vs EXISTS

  • UNION — combines two result sets with the same columns, removes duplicate rows (slow: implicit sort + dedup)
  • UNION ALL — combines but keeps duplicates (fast: just concatenates). Use unless you specifically need dedup.
  • JOIN — pairs rows from two tables on a key. Use when you need columns from both sides.
  • EXISTS — boolean test of 'does any matching row exist'. Use when you only need to filter, not pull columns. Often faster than JOIN+DISTINCT or IN(subquery).
sql
-- Customers who have placed at least one order — three ways, all equivalent in result:
-- 1. JOIN + DISTINCT (often slowest)
SELECT DISTINCT c.* FROM customers c JOIN orders o ON o.customer_id = c.id;
-- 2. IN subquery
SELECT * FROM customers c WHERE c.id IN (SELECT customer_id FROM orders);
-- 3. EXISTS (usually fastest because the engine can stop at the first match)
SELECT * FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);

Tying back to a known total

The single highest-ROI sanity check: pick one number from your final result and verify it matches a different source. If your monthly revenue says $4.2M and the finance team's GL says $3.9M, you have a discrepancy to investigate before you publish.

Two queries are better than one

When in doubt, write the same answer two different ways (e.g., once aggregating from orders, once aggregating from invoices). If they agree, ship. If they don't, you've found a bug — either in your query or in the data.

Exercise

You wrote a query that says 'we had 4,212 unique customers in March.' The CRM team is sure the number is closer to 4,500. (1) List five hypotheses for the discrepancy. (2) For each, what query would you write to test it? (3) Which one is the most common root cause in practice?

Key takeaways

  • Run a pre-share checklist: row counts, NULL behaviour, dedup, boundary conditions, tie to known total
  • Three NULL traps: = NULL (use IS NULL), AVG-skips-NULL, NOT IN + NULL = empty result (use NOT EXISTS)
  • UNION ALL ≫ UNION when you don't need dedup; EXISTS often faster than JOIN + DISTINCT
  • Two queries from two different angles is the strongest sanity check there is

Further reading

  1. 01

    Joe Celko's SQL for Smarties

    Joe Celko · Morgan Kaufmann · 2010The canonical book on advanced SQL technique. Chapter on three-valued logic is the definitive treatment of NULLs.

Loading progress…
LeadAfrikPublic Economics Hub