Trusting queries — sanity checks and NULL traps — SQL Intermediate Module 8

An analyst's reputation is built on queries other people can trust. The difference between an intermediate analyst and a junior one is not query-writing speed — it's the discipline of checking the query before sharing the answer.

The pre-share checklist

Row count of the final result — does it match a back-of-envelope expectation?
Row count of each input table — did you accidentally lose half the data with an inner join?
Sum of a key metric — does it tie to a known total (last month's revenue, last week's signups)?
NULL handling — did any of your aggregates silently ignore NULLs you cared about?
Duplicates — did a JOIN explode the row count? (Common when joining a 1:many or many:many without thinking.)
Boundary conditions — what does the query return for the empty case (no rows match)? For NULL inputs? For the most recent day?

The three NULL traps

NULLs cause more silent wrong answers than any other SQL feature. Three patterns to know cold:

Trap 1: NULL != NULL

sql

-- Looks like it filters NULL country, but doesn't:
SELECT * FROM customers WHERE country = NULL;  -- always empty

-- Correct:
SELECT * FROM customers WHERE country IS NULL;

Trap 2: aggregates skip NULLs silently

sql

-- AVG(salary) ignores NULL salaries — your 'average' is over only known salaries.
-- Sometimes that's what you want. Sometimes it's a bug.
SELECT AVG(salary) FROM employees;  -- silently skips NULLs

-- Treat NULL as zero if that's the intent:
SELECT AVG(COALESCE(salary, 0)) FROM employees;

Trap 3: NOT IN with NULL in the subquery → empty result

sql

-- If any row in excluded_users.user_id is NULL,
-- this returns ZERO rows — not what you want.
SELECT * FROM users
WHERE user_id NOT IN (SELECT user_id FROM excluded_users);

-- Safer:
SELECT * FROM users u
WHERE NOT EXISTS (
  SELECT 1 FROM excluded_users e WHERE e.user_id = u.user_id
);

Trap 3 is the assassin

NOT IN + NULL is the single most-common cause of 'why is my query empty?' in support tickets. Use NOT EXISTS or LEFT JOIN … WHERE … IS NULL instead. The reason: SQL three-valued logic treats x NOT IN (1, 2, NULL) as 'NOT (x = 1 OR x = 2 OR x = NULL)' — and the NULL leg evaluates to UNKNOWN, which propagates through OR to UNKNOWN overall.

UNION vs UNION ALL vs JOIN vs EXISTS

UNION — combines two result sets with the same columns, removes duplicate rows (slow: implicit sort + dedup)
UNION ALL — combines but keeps duplicates (fast: just concatenates). Use unless you specifically need dedup.
JOIN — pairs rows from two tables on a key. Use when you need columns from both sides.
EXISTS — boolean test of 'does any matching row exist'. Use when you only need to filter, not pull columns. Often faster than JOIN+DISTINCT or IN(subquery).

sql

-- Customers who have placed at least one order — three ways, all equivalent in result:

-- 1. JOIN + DISTINCT (often slowest)
SELECT DISTINCT c.* FROM customers c JOIN orders o ON o.customer_id = c.id;

-- 2. IN subquery
SELECT * FROM customers c WHERE c.id IN (SELECT customer_id FROM orders);

-- 3. EXISTS (usually fastest because the engine can stop at the first match)
SELECT * FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);

Tying back to a known total

The single highest-ROI sanity check: pick one number from your final result and verify it matches a different source. If your monthly revenue says $4.2M and the finance team's GL says $3.9M, you have a discrepancy to investigate before you publish.

Two queries are better than one

When in doubt, write the same answer two different ways (e.g., once aggregating from orders, once aggregating from invoices). If they agree, ship. If they don't, you've found a bug — either in your query or in the data.

Exercise

You wrote a query that says 'we had 4,212 unique customers in March.' The CRM team is sure the number is closer to 4,500. (1) List five hypotheses for the discrepancy. (2) For each, what query would you write to test it? (3) Which one is the most common root cause in practice?