Skip to content
Module 06 of 850 min readAdvanced

JSON and semi-structured data

Postgres ->/->>, containment with @>, GIN indexing, when to denormalise to columns.

75%

Listen along

Read “JSON and semi-structured data” aloud

Plays in your browser using on-device text-to-speech — nothing leaves the page.

Learning objectives

By the end of this module, you should be able to:

  • 01Read and write JSONB columns in Postgres with -> and ->> operators
  • 02Index JSONB columns with GIN for efficient containment queries
  • 03Decide when to denormalise JSON into columns vs keep it semi-structured

Modern apps log a lot of semi-structured data: event payloads, webhook bodies, configuration. SQL has evolved to query it natively. Postgres JSONB, MySQL 8 JSON, Snowflake VARIANT, BigQuery JSON, DuckDB JSON — all give you operators to read into the structure without unmarshalling in application code.

-> vs ->> in Postgres

sql
-- events.payload is JSONB
SELECT
payload -> 'user' AS user_json, -- returns JSONB
payload -> 'user' ->> 'id' AS user_id_text, -- returns text
(payload -> 'user' ->> 'id')::int AS user_id_int -- cast to int
FROM events
WHERE payload ->> 'event_type' = 'purchase';
  • -> returns JSONB (typed sub-value)
  • ->> returns text (the JSON scalar coerced to a string)
  • Chain -> for nested access, finish with ->> to compare to a string in WHERE
  • Cast with ::type when you need a numeric or date comparison

Containment with @>

sql
-- Find events whose payload contains user.country = 'Kenya' at any nesting level
SELECT * FROM events
WHERE payload @> '{"user": {"country": "Kenya"}}'::jsonb;

The @> operator tests JSON containment: does the left value contain the right structure as a subset? It's the natural fit for 'has this set of attributes' queries and — critically — it's GIN-indexable.

GIN indexes for JSONB

sql
-- Default GIN: supports @>, ?, ?&, ?| operators
CREATE INDEX idx_events_payload_gin ON events USING GIN (payload);
-- Path-ops GIN: smaller index, only supports @> (most common operator)
CREATE INDEX idx_events_payload_pathops ON events USING GIN (payload jsonb_path_ops);
-- Expression index on a specific JSON path (cheapest if you only query one path)
CREATE INDEX idx_events_user_country ON events ((payload -> 'user' ->> 'country'));

jsonb_path_ops is usually the right choice

It's about 30% the size of the default GIN and faster to maintain. Only switch to default GIN if you need ? operators (key-existence tests) or full-text style searches inside JSON.

JSON path expressions (SQL/JSON, Postgres 12+)

sql
-- Returns true if any event in items[] has type='premium'
SELECT id FROM events
WHERE payload @? '$.items[*] ? (@.type == "premium")';
-- Extract all amounts from items[] as a JSONB array
SELECT jsonb_path_query_array(payload, '$.items[*].amount') FROM events;

SQL/JSON path is more powerful than -> / ->> for queries over arrays and conditional traversal. Standardised in SQL:2016, supported by Postgres 12+, Oracle 12c+, SQL Server 2017+, MySQL 8+.

When to denormalise to columns

JSON columns are flexible — schema-on-read, no migrations to add a field. But for fields you query frequently, denormalising them into real columns is faster and lets you use proper B-tree indexes. The break-even is roughly:

  • Read >100 times for every write of the JSON blob → denormalise
  • Filter or sort by the field in most queries → denormalise
  • Field is part of foreign-key relationships → must denormalise (FKs don't work on JSON)
  • Field is rarely read or only read by ad-hoc analytics → keep in JSON
  • Schema is genuinely uncertain or differs by record type → keep in JSON

The 'just put it in JSON' anti-pattern

Schemas that should be relational often start as 'we'll put it in JSON for flexibility' and end up unmaintainable. If you find yourself writing payload ->> 'customer_id' in every query, that field deserves to be a column.

Exercise

Your app stores webhook payloads in events(id, received_at, source, payload JSONB). Engineers query the table in three patterns: (a) all events where payload.event_type IN ('purchase', 'refund') in the last 7 days, (b) all events for a given external user — payload.user.external_id — across all time, (c) ad-hoc 'what's in here?' exploration. (1) Design the indexes. (2) Argue for/against extracting event_type to a real column.

Key takeaways

  • -> returns JSONB; ->> returns text. Chain -> for nested, finish ->> for comparison
  • @> is the containment operator; GIN-indexable; usually the right choice for 'has these attributes' queries
  • Expression indexes for known JSON paths beat full GIN for size and speed
  • Denormalise hot fields to columns; keep flexible-schema fields in JSON

Further reading

  1. 01

    PostgreSQL JSON Types Documentation

    Postgres's JSON / JSONB reference is the most thorough free guide on the topic.

Loading progress…
LeadAfrikPublic Economics Hub