Nested Data

Real-world data is not flat. A JSON API returns nested objects. A Kafka event has arrays of tags. A user profile has a map of preferences. Modern data warehouses (BigQuery, Snowflake, Databricks) can store this nested data natively using STRUCT, ARRAY, and MAP types. This lesson teaches you when to nest vs normalize, how each nested type works, the performance implications for columnar storage, and the UNNEST pitfalls that silently corrupt aggregations.

The Nesting Decision

Daily Life

Interviews

Decide when to embed vs. normalize

When to Nest vs When to Normalize

The fundamental question: should related data live inside the parent row (nested) or in a separate table (normalized)? Both are valid. The choice depends on how the data is queried, how often it changes, and whether the nested data needs to be joined to independently.

Nesting works well when the child data is always read alongside the parent. A user's shipping addresses are almost always retrieved when loading the user record. Normalizing this into a separate addresses table forces a JOIN on every user load. Nesting the addresses as an ARRAY inside the user row eliminates that JOIN.

NestNormalizeHybrid

Nest

Embed in the parent row

Best when child data is always read with the parent, rarely queried independently, and has bounded cardinality (a user has 1-5 addresses, not 10,000).

Normalize

Separate table with FK

Best when child data is queried independently, has high cardinality, changes frequently, or is shared across multiple parent tables.

Hybrid

Nest for reads, normalize for writes

Store normalized in the source system, nest during ETL into the analytical model. Common pattern for denormalized read-optimized tables.

Decision Framework

✓Nest (Embed)

Always read with parent
Low cardinality (1-10 items)
Rarely queried independently
Preserves source JSON structure
Eliminates JOINs for common queries

✓Normalize (Separate Table)

Queried independently of parent
High cardinality (100+ items)
Shared across multiple tables
Changes independently of parent
Needs FK constraints and indexing

A common mistake: nesting everything because the source data is JSON. Just because the API returns nested JSON does not mean your analytical table should be nested. If you frequently filter or JOIN on the nested fields, normalize them.

TIP

The hybrid pattern is the most common in production: source systems store normalized data in PostgreSQL, the ETL pipeline denormalizes and nests it into BigQuery or Snowflake for fast analytical reads. Write-path normalization, read-path nesting.

STRUCT: Embedded Objects

Daily Life

Interviews

Group related fields into one column

Grouping Related Fields Into One Column

A STRUCT is a named group of fields embedded inside a row. Instead of storing address_line1, address_city, address_state, address_zip as four separate columns, you store them as one address STRUCT with four sub-fields. Accessing them uses dot notation: address.city.

STRUCTs do not add rows. They add structure within a row. The fact table row count stays the same. This is the key difference from ARRAY (which can add logical rows via UNNEST).

Approach	Columns	Query
Flat columns	address_line1, address_city, address_state, address_zip	WHERE address_city = 'Seattle'
STRUCT column	address STRUCT<line1, city, state, zip>	WHERE address.city = 'Seattle'

Functionally, both approaches work. The STRUCT version is cleaner when you have many related field groups: shipping_address, billing_address, contact_info. Without STRUCTs, you end up with 12 flat columns that are hard to visually group. With STRUCTs, you have 3 columns, each with sub-fields.

Schema evolution trap in Snowflake: VARIANT columns (Snowflake's semi-structured type) accept any structure but cannot be indexed or used for partition pruning. If you query a VARIANT field frequently, extract it into a typed column for performance.

TIP

Use STRUCTs for logical grouping of related fields that are always read together. Do not use them for fields that need independent indexing, filtering, or JOIN operations. Those should be top-level columns.

ARRAY: Ordered Collections

Daily Life

Interviews

Store lists of values inside a row

Storing Lists Inside a Row

An ARRAY column stores an ordered list of values inside a single row. A product can have an array of tags: ['electronics', 'sale', 'clearance']. An order can have an array of line item STRUCTs. ARRAYs eliminate the need for a separate junction or child table for simple one-to-many relationships.

To query individual elements of an ARRAY, you use UNNEST. UNNEST flattens the array into rows, producing one output row per array element. This is powerful but dangerous: if you UNNEST and then aggregate without accounting for the multiplication, your metrics inflate.

orders

PKorder_idBIGINT

FKcustomer_idBIGINT

amountNUMERIC

order_line_items

PKline_idBIGINT

FKorder_idBIGINT

product_idBIGINT

qtyINT

The normalized form an ARRAY column replaces: instead of a separate order_line_items child table joined per query, line items can live as an ARRAY<STRUCT> inside the order row, read together with the parent (UNNEST to query elements).

The UNNEST Trap

A product has 3 tags and costs $100. If you UNNEST the tags array and then SUM(price), you get $300 instead of $100. The UNNEST multiplied the product row by 3 (one per tag), and the SUM counted the price 3 times. This is the fan-out trap for nested data.

product_id	price	tag (after UNNEST)
P1	$100	electronics
P1	$100	sale
P1	$100	clearance

If you UNNEST just to filter (WHERE 'sale' IN UNNEST(tags)), aggregate BEFORE or use EXISTS instead.
Use ARRAY_LENGTH(tags) instead of UNNEST + COUNT(*). No row multiplication.
If you must UNNEST and aggregate, GROUP BY the original row key first to avoid inflating metrics from other columns.

The CROSS JOIN UNNEST pattern drops rows where the array is NULL or empty. Use LEFT JOIN UNNEST to preserve rows with empty arrays. This is the second most common UNNEST bug after fan-out inflation.

TIP

ARRAYs are excellent for low-cardinality, read-with-parent data (tags, labels, small lists). For high-cardinality or independently-queryable data (order line items, transaction history), normalize into a separate table.

MAP: Dynamic Key-Value Pairs

Daily Life

Interviews

Handle flexible attributes per record

Flexible Attributes Without Schema Changes

A MAP column stores key-value pairs where the keys are not predefined in the schema. A user preferences column might contain {'theme': 'dark', 'language': 'en', 'notifications': 'off'}. Each user can have different keys. Adding a new preference does not require ALTER TABLE.

MAPs are the escape hatch for data that does not have a fixed schema. Event properties, configuration settings, A/B test parameters, custom metadata. Anything where the set of keys varies per row and evolves over time.

The Tradeoff

✓MAP Column

Flexible: any keys, no schema change
Good for sparse, varying attributes
No column explosion for 100+ optional fields
Cannot be indexed or used for partition pruning

✓Typed Columns

Fixed: schema change to add a field
Good for frequently-queried attributes
Can be indexed and used for partitioning
Column explosion if many optional fields

The hybrid pattern: store the full event payload as a MAP or JSON column for flexibility, and extract frequently-queried fields as top-level typed columns for performance. The JSON column preserves everything; the typed columns make the hot-path queries fast.

TIP

If you query a MAP field in more than 30% of your queries against the table, extract it into a typed top-level column. The query performance difference between map_col['key'] and a native column can be 10-100x due to predicate pushdown and columnar encoding.

Columnar Storage & Nesting

Daily Life

Interviews

Optimize nested data for query engines

How Nested Data Interacts with Columnar Engines

Modern analytical databases (BigQuery, Snowflake, Databricks) use columnar storage: data is stored column-by-column, not row-by-row. A query that reads only 3 columns out of 50 only scans those 3 columns. This is why analytical queries are fast. But nested data adds complexity.

STRUCT sub-fields are stored as individual columns in columnar format. address.city is stored alongside address.state and address.zip as separate column chunks. Accessing one sub-field does not read the others. This is efficient.

ARRAY elements are stored using repetition and definition levels (Parquet's Dremel encoding). This is more complex than flat columns. Deeply nested arrays (arrays of structs of arrays) create encoding overhead that degrades scan performance.

Performance Implications

STRUCT: sub-fields are independent columns. Same performance as flat columns for individual field access.

ARRAY with few elements: minimal overhead. A tags array with 3-5 elements is fine.

ARRAY with many elements: encoding overhead grows. A 10,000-element array per row defeats the purpose of columnar storage.

Deeply nested structures (ARRAY<STRUCT<ARRAY<...>>>): each nesting level adds encoding complexity. Flatten if possible.

MAP/VARIANT: stored as semi-structured blobs. Cannot use columnar encoding, predicate pushdown, or min/max statistics.

Best Practices

Use STRUCTs freely for logical grouping. They have zero performance cost in columnar engines.

Keep arrays small (under ~100 elements). Large arrays should be normalized into separate tables.

Extract frequently-filtered fields from MAPs into typed columns for predicate pushdown.

Limit nesting depth to 2-3 levels. Deeper nesting degrades both query performance and developer comprehension.

The general rule: columnar engines are optimized for flat, typed columns. Nesting is a convenience feature that trades some query performance for schema flexibility. Use it where it makes the data model cleaner, but do not over-nest. When in doubt, flatten.

❯❯❯PUTTING IT ALL TOGETHER

> You are modeling e-commerce order data in BigQuery. Each order has line items, each with a product reference and a list of applied discounts.

You nest line items as ARRAY<STRUCT<product_id, quantity, price>> because they are always read with the order and have bounded cardinality (1-50 items).

You normalize products into a separate dim_product table because product attributes are shared across orders and queried independently.

You store discount codes as ARRAY<STRING> on the line item STRUCT. When querying 'orders using discount X,' you use EXISTS with UNNEST instead of CROSS JOIN UNNEST to avoid inflating order totals.

KEY TAKEAWAYS

Nest for reads, normalize for writes: embed data that is always read with its parent; normalize data queried independently

STRUCT = logical grouping: zero performance cost, sub-fields stored as independent columns

ARRAY = embedded one-to-many: keep small; UNNEST multiplies rows and can inflate aggregates

MAP = schema flexibility: extract hot fields into typed columns for performance

Do not over-nest: columnar engines are optimized for flat typed columns; nest only where it simplifies the model

When flat tables meet nested reality

Category: Data Modeling
Duration: 15 minutes
Challenges: 12 hands-on challenges

Topics covered: The Nesting Decision, STRUCT: Embedded Objects, ARRAY: Ordered Collections, MAP: Dynamic Key-Value Pairs, Columnar Storage & Nesting

Lesson Sections

The Nesting Decision (concepts: dmDenormalization)
When to Nest vs When to Normalize The fundamental question: should related data live inside the parent row (nested) or in a separate table (normalized)? Both are valid. The choice depends on how the data is queried, how often it changes, and whether the nested data needs to be joined to independently. Nesting works well when the child data is always read alongside the parent. A user's shipping addresses are almost always retrieved when loading the user record. Normalizing this into a separate ad
STRUCT: Embedded Objects (concepts: dmStructType)
Grouping Related Fields Into One Column A STRUCT is a named group of fields embedded inside a row. Instead of storing address_line1, address_city, address_state, address_zip as four separate columns, you store them as one address STRUCT with four sub-fields. Accessing them uses dot notation: address.city. STRUCTs do not add rows. They add structure within a row. The fact table row count stays the same. This is the key difference from ARRAY (which can add logical rows via UNNEST). Functionally, b
ARRAY: Ordered Collections (concepts: dmArrayType)
Storing Lists Inside a Row An ARRAY column stores an ordered list of values inside a single row. A product can have an array of tags: ['electronics', 'sale', 'clearance']. An order can have an array of line item STRUCTs. ARRAYs eliminate the need for a separate junction or child table for simple one-to-many relationships. To query individual elements of an ARRAY, you use UNNEST. UNNEST flattens the array into rows, producing one output row per array element. This is powerful but dangerous: if yo
MAP: Dynamic Key-Value Pairs (concepts: dmMapType)
Flexible Attributes Without Schema Changes A MAP column stores key-value pairs where the keys are not predefined in the schema. A user preferences column might contain {'theme': 'dark', 'language': 'en', 'notifications': 'off'}. Each user can have different keys. Adding a new preference does not require ALTER TABLE. MAPs are the escape hatch for data that does not have a fixed schema. Event properties, configuration settings, A/B test parameters, custom metadata. Anything where the set of keys v
Columnar Storage & Nesting (concepts: dmColumnarStorage)
How Nested Data Interacts with Columnar Engines Modern analytical databases (BigQuery, Snowflake, Databricks) use columnar storage: data is stored column-by-column, not row-by-row. A query that reads only 3 columns out of 50 only scans those 3 columns. This is why analytical queries are fast. But nested data adds complexity. STRUCT sub-fields are stored as individual columns in columnar format. address.city is stored alongside address.state and address.zip as separate column chunks. Accessing on