JSON and semi-structured data — SQL Advanced Module 6

Modern apps log a lot of semi-structured data: event payloads, webhook bodies, configuration. SQL has evolved to query it natively. Postgres JSONB, MySQL 8 JSON, Snowflake VARIANT, BigQuery JSON, DuckDB JSON — all give you operators to read into the structure without unmarshalling in application code.

-> vs ->> in Postgres

sql

-- events.payload is JSONB
SELECT
  payload -> 'user'           AS user_json,        -- returns JSONB
  payload -> 'user' ->> 'id'  AS user_id_text,     -- returns text
  (payload -> 'user' ->> 'id')::int AS user_id_int -- cast to int
FROM events
WHERE payload ->> 'event_type' = 'purchase';

-> returns JSONB (typed sub-value)
->> returns text (the JSON scalar coerced to a string)
Chain -> for nested access, finish with ->> to compare to a string in WHERE
Cast with ::type when you need a numeric or date comparison

Containment with @>

sql

-- Find events whose payload contains user.country = 'Kenya' at any nesting level
SELECT * FROM events
WHERE payload @> '{"user": {"country": "Kenya"}}'::jsonb;

The @> operator tests JSON containment: does the left value contain the right structure as a subset? It's the natural fit for 'has this set of attributes' queries and — critically — it's GIN-indexable.

GIN indexes for JSONB

sql

-- Default GIN: supports @>, ?, ?&, ?| operators
CREATE INDEX idx_events_payload_gin ON events USING GIN (payload);

-- Path-ops GIN: smaller index, only supports @> (most common operator)
CREATE INDEX idx_events_payload_pathops ON events USING GIN (payload jsonb_path_ops);

-- Expression index on a specific JSON path (cheapest if you only query one path)
CREATE INDEX idx_events_user_country ON events ((payload -> 'user' ->> 'country'));

jsonb_path_ops is usually the right choice

It's about 30% the size of the default GIN and faster to maintain. Only switch to default GIN if you need ? operators (key-existence tests) or full-text style searches inside JSON.

JSON path expressions (SQL/JSON, Postgres 12+)

sql

-- Returns true if any event in items[] has type='premium'
SELECT id FROM events
WHERE payload @? '$.items[*] ? (@.type == "premium")';

-- Extract all amounts from items[] as a JSONB array
SELECT jsonb_path_query_array(payload, '$.items[*].amount') FROM events;

SQL/JSON path is more powerful than -> / ->> for queries over arrays and conditional traversal. Standardised in SQL:2016, supported by Postgres 12+, Oracle 12c+, SQL Server 2017+, MySQL 8+.

When to denormalise to columns

JSON columns are flexible — schema-on-read, no migrations to add a field. But for fields you query frequently, denormalising them into real columns is faster and lets you use proper B-tree indexes. The break-even is roughly:

Read >100 times for every write of the JSON blob → denormalise
Filter or sort by the field in most queries → denormalise
Field is part of foreign-key relationships → must denormalise (FKs don't work on JSON)
Field is rarely read or only read by ad-hoc analytics → keep in JSON
Schema is genuinely uncertain or differs by record type → keep in JSON

The 'just put it in JSON' anti-pattern

Schemas that should be relational often start as 'we'll put it in JSON for flexibility' and end up unmaintainable. If you find yourself writing payload ->> 'customer_id' in every query, that field deserves to be a column.

Exercise

Your app stores webhook payloads in events(id, received_at, source, payload JSONB). Engineers query the table in three patterns: (a) all events where payload.event_type IN ('purchase', 'refund') in the last 7 days, (b) all events for a given external user — payload.user.external_id — across all time, (c) ad-hoc 'what's in here?' exploration. (1) Design the indexes. (2) Argue for/against extracting event_type to a real column.