Real datasets are messy. CASE expressions, string functions, and date arithmetic are the daily tools for cleaning, classifying, and time-bucketing.
CASE WHEN
SELECT first_name,CASEWHEN total_spent > 1000 THEN 'high'WHEN total_spent > 100 THEN 'mid'ELSE 'low'END AS segmentFROM customers;
If/else inside a query. Often used inside aggregations to count specific categories:
SELECTCOUNT(*) AS total,COUNT(CASE WHEN country = 'Kenya' THEN 1 END) AS kenya_count,AVG(CASE WHEN country = 'Kenya' THEN total_amount END) AS kenya_avg_amountFROM orders;
String functions
- LOWER(s), UPPER(s) — case conversion
- TRIM(s), LTRIM, RTRIM — strip whitespace
- CONCAT(a, b) or a || b — string concatenation
- SUBSTRING(s FROM x FOR y) — extract substring
- REPLACE(s, old, new) — find-and-replace
- LENGTH(s) — character count
- POSITION(sub IN s) — find substring location
- REGEXP_REPLACE(s, pattern, replacement) — full regex (Postgres)
Date functions
-- Truncate to month: 2024-03-15 → 2024-03-01SELECT date_trunc('month', created_at) AS month, COUNT(*)FROM ordersGROUP BY 1ORDER BY 1;-- Difference in daysSELECT id, AGE(NOW(), created_at) AS days_since_signupFROM customers;
Different databases use different functions (date_trunc in Postgres, strftime in SQLite, DATE_TRUNC in BigQuery). The concepts are universal: truncate to a period, compute differences, extract parts (year, month, dow).
Always store timestamps in UTC
Time zone confusion is a top-10 source of bugs in analytics. Store everything in UTC, convert at display time. If you have to mix zones, use TIMESTAMPTZ in Postgres and be explicit.
Exercise
Bucket customers into 'new' (<30 days), 'active' (<180 days), or 'inactive' based on their last order date.