CASE expressions, strings, and dates — SQL Module 8

Real datasets are messy. CASE expressions, string functions, and date arithmetic are the daily tools for cleaning, classifying, and time-bucketing.

CASE WHEN

sql

SELECT first_name,
       CASE 
         WHEN total_spent > 1000 THEN 'high'
         WHEN total_spent > 100  THEN 'mid'
         ELSE 'low'
       END AS segment
FROM customers;

If/else inside a query. Often used inside aggregations to count specific categories:

sql

SELECT 
  COUNT(*) AS total,
  COUNT(CASE WHEN country = 'Kenya' THEN 1 END) AS kenya_count,
  AVG(CASE WHEN country = 'Kenya' THEN total_amount END) AS kenya_avg_amount
FROM orders;

String functions

LOWER(s), UPPER(s) — case conversion
TRIM(s), LTRIM, RTRIM — strip whitespace
CONCAT(a, b) or a || b — string concatenation
SUBSTRING(s FROM x FOR y) — extract substring
REPLACE(s, old, new) — find-and-replace
LENGTH(s) — character count
POSITION(sub IN s) — find substring location
REGEXP_REPLACE(s, pattern, replacement) — full regex (Postgres)

Date functions

sql

-- Truncate to month: 2024-03-15 → 2024-03-01
SELECT date_trunc('month', created_at) AS month, COUNT(*)
FROM orders
GROUP BY 1
ORDER BY 1;

-- Difference in days
SELECT id, AGE(NOW(), created_at) AS days_since_signup
FROM customers;

Different databases use different functions (date_trunc in Postgres, strftime in SQLite, DATE_TRUNC in BigQuery). The concepts are universal: truncate to a period, compute differences, extract parts (year, month, dow).

Always store timestamps in UTC

Time zone confusion is a top-10 source of bugs in analytics. Store everything in UTC, convert at display time. If you have to mix zones, use TIMESTAMPTZ in Postgres and be explicit.

Exercise

Bucket customers into 'new' (<30 days), 'active' (<180 days), or 'inactive' based on their last order date.