Nested Data

Real-world data is not flat. A JSON API returns nested objects. A Kafka event has arrays of tags. A user profile has a map of preferences. Modern data warehouses (BigQuery, Snowflake, Databricks) can store this nested data natively using STRUCT, ARRAY, and MAP types. This lesson teaches you when to nest vs normalize, how each nested type works, the performance implications for columnar storage, and the UNNEST pitfalls that silently corrupt aggregations.

The Nesting Decision

Daily Life
Interviews

Decide when to embed vs. normalize

When to Nest vs When to Normalize

The fundamental question: should related data live inside the parent row (nested) or in a separate table (normalized)? Both are valid. The choice depends on how the data is queried, how often it changes, and whether the nested data needs to be joined to independently.
Nesting works well when the child data is always read alongside the parent. A user's shipping addresses are almost always retrieved when loading the user record. Normalizing this into a separate addresses table forces a JOIN on every user load. Nesting the addresses as an ARRAY inside the user row eliminates that JOIN.
NestNormalizeHybrid
Nest
Embed in the parent row
Best when child data is always read with the parent, rarely queried independently, and has bounded cardinality (a user has 1-5 addresses, not 10,000).
Normalize
Separate table with FK
Best when child data is queried independently, has high cardinality, changes frequently, or is shared across multiple parent tables.
Hybrid
Nest for reads, normalize for writes
Store normalized in the source system, nest during ETL into the analytical model. Common pattern for denormalized read-optimized tables.

Decision Framework

Nest (Embed)
  • Always read with parent
  • Low cardinality (1-10 items)
  • Rarely queried independently
  • Preserves source JSON structure
  • Eliminates JOINs for common queries
Normalize (Separate Table)
  • Queried independently of parent
  • High cardinality (100+ items)
  • Shared across multiple tables
  • Changes independently of parent
  • Needs FK constraints and indexing

A common mistake: nesting everything because the source data is JSON. Just because the API returns nested JSON does not mean your analytical table should be nested. If you frequently filter or JOIN on the nested fields, normalize them.

TIP
The hybrid pattern is the most common in production: source systems store normalized data in PostgreSQL, the ETL pipeline denormalizes and nests it into BigQuery or Snowflake for fast analytical reads. Write-path normalization, read-path nesting.

STRUCT: Embedded Objects

Daily Life
Interviews

Group related fields into one column

Grouping Related Fields Into One Column

A STRUCT is a named group of fields embedded inside a row. Instead of storing address_line1, address_city, address_state, address_zip as four separate columns, you store them as one address STRUCT with four sub-fields. Accessing them uses dot notation: address.city.
STRUCTs do not add rows. They add structure within a row. The fact table row count stays the same. This is the key difference from ARRAY (which can add logical rows via UNNEST).
ApproachColumnsQuery
Flat columnsaddress_line1, address_city, address_state, address_zipWHERE address_city = 'Seattle'
STRUCT columnaddress STRUCT<line1, city, state, zip>WHERE address.city = 'Seattle'
Functionally, both approaches work. The STRUCT version is cleaner when you have many related field groups: shipping_address, billing_address, contact_info. Without STRUCTs, you end up with 12 flat columns that are hard to visually group. With STRUCTs, you have 3 columns, each with sub-fields.

Schema evolution trap in Snowflake: VARIANT columns (Snowflake's semi-structured type) accept any structure but cannot be indexed or used for partition pruning. If you query a VARIANT field frequently, extract it into a typed column for performance.

TIP
Use STRUCTs for logical grouping of related fields that are always read together. Do not use them for fields that need independent indexing, filtering, or JOIN operations. Those should be top-level columns.

ARRAY: Ordered Collections

Daily Life
Interviews

Store lists of values inside a row

Storing Lists Inside a Row

An ARRAY column stores an ordered list of values inside a single row. A product can have an array of tags: ['electronics', 'sale', 'clearance']. An order can have an array of line item STRUCTs. ARRAYs eliminate the need for a separate junction or child table for simple one-to-many relationships.
To query individual elements of an ARRAY, you use UNNEST. UNNEST flattens the array into rows, producing one output row per array element. This is powerful but dangerous: if you UNNEST and then aggregate without accounting for the multiplication, your metrics inflate.

The UNNEST Trap

A product has 3 tags and costs $100. If you UNNEST the tags array and then SUM(price), you get $300 instead of $100. The UNNEST multiplied the product row by 3 (one per tag), and the SUM counted the price 3 times. This is the fan-out trap for nested data.
product_idpricetag (after UNNEST)
P1$100electronics
P1$100sale
P1$100clearance
  • If you UNNEST just to filter (WHERE 'sale' IN UNNEST(tags)), aggregate BEFORE or use EXISTS instead.
  • Use ARRAY_LENGTH(tags) instead of UNNEST + COUNT(*). No row multiplication.
  • If you must UNNEST and aggregate, GROUP BY the original row key first to avoid inflating metrics from other columns.

The CROSS JOIN UNNEST pattern drops rows where the array is NULL or empty. Use LEFT JOIN UNNEST to preserve rows with empty arrays. This is the second most common UNNEST bug after fan-out inflation.

TIP
ARRAYs are excellent for low-cardinality, read-with-parent data (tags, labels, small lists). For high-cardinality or independently-queryable data (order line items, transaction history), normalize into a separate table.

MAP: Dynamic Key-Value Pairs

Daily Life
Interviews

Handle flexible attributes per record

Flexible Attributes Without Schema Changes

A MAP column stores key-value pairs where the keys are not predefined in the schema. A user preferences column might contain {'theme': 'dark', 'language': 'en', 'notifications': 'off'}. Each user can have different keys. Adding a new preference does not require ALTER TABLE.
MAPs are the escape hatch for data that does not have a fixed schema. Event properties, configuration settings, A/B test parameters, custom metadata. Anything where the set of keys varies per row and evolves over time.

The Tradeoff

MAP Column
  • Flexible: any keys, no schema change
  • Good for sparse, varying attributes
  • No column explosion for 100+ optional fields
  • Cannot be indexed or used for partition pruning
Typed Columns
  • Fixed: schema change to add a field
  • Good for frequently-queried attributes
  • Can be indexed and used for partitioning
  • Column explosion if many optional fields
The hybrid pattern: store the full event payload as a MAP or JSON column for flexibility, and extract frequently-queried fields as top-level typed columns for performance. The JSON column preserves everything; the typed columns make the hot-path queries fast.
TIP
If you query a MAP field in more than 30% of your queries against the table, extract it into a typed top-level column. The query performance difference between map_col['key'] and a native column can be 10-100x due to predicate pushdown and columnar encoding.

Columnar Storage & Nesting

Daily Life
Interviews

Optimize nested data for query engines

How Nested Data Interacts with Columnar Engines

Modern analytical databases (BigQuery, Snowflake, Databricks) use columnar storage: data is stored column-by-column, not row-by-row. A query that reads only 3 columns out of 50 only scans those 3 columns. This is why analytical queries are fast. But nested data adds complexity.
STRUCT sub-fields are stored as individual columns in columnar format. address.city is stored alongside address.state and address.zip as separate column chunks. Accessing one sub-field does not read the others. This is efficient.
ARRAY elements are stored using repetition and definition levels (Parquet's Dremel encoding). This is more complex than flat columns. Deeply nested arrays (arrays of structs of arrays) create encoding overhead that degrades scan performance.

Performance Implications

check
STRUCT: sub-fields are independent columns. Same performance as flat columns for individual field access.
check
ARRAY with few elements: minimal overhead. A tags array with 3-5 elements is fine.
alert
ARRAY with many elements: encoding overhead grows. A 10,000-element array per row defeats the purpose of columnar storage.
alert
Deeply nested structures (ARRAY<STRUCT<ARRAY<...>>>): each nesting level adds encoding complexity. Flatten if possible.
alert
MAP/VARIANT: stored as semi-structured blobs. Cannot use columnar encoding, predicate pushdown, or min/max statistics.

Best Practices

compress
Use STRUCTs freely for logical grouping. They have zero performance cost in columnar engines.
filter
Keep arrays small (under ~100 elements). Large arrays should be normalized into separate tables.
partition
Extract frequently-filtered fields from MAPs into typed columns for predicate pushdown.
table
Limit nesting depth to 2-3 levels. Deeper nesting degrades both query performance and developer comprehension.
The general rule: columnar engines are optimized for flat, typed columns. Nesting is a convenience feature that trades some query performance for schema flexibility. Use it where it makes the data model cleaner, but do not over-nest. When in doubt, flatten.
PUTTING IT ALL TOGETHER

> You are modeling e-commerce order data in BigQuery. Each order has line items, each with a product reference and a list of applied discounts.

You nest line items as ARRAY<STRUCT<product_id, quantity, price>> because they are always read with the order and have bounded cardinality (1-50 items).
You normalize products into a separate dim_product table because product attributes are shared across orders and queried independently.
You store discount codes as ARRAY<STRING> on the line item STRUCT. When querying 'orders using discount X,' you use EXISTS with UNNEST instead of CROSS JOIN UNNEST to avoid inflating order totals.
KEY TAKEAWAYS
Nest for reads, normalize for writes: embed data that is always read with its parent; normalize data queried independently
STRUCT = logical grouping: zero performance cost, sub-fields stored as independent columns
ARRAY = embedded one-to-many: keep small; UNNEST multiplies rows and can inflate aggregates
MAP = schema flexibility: extract hot fields into typed columns for performance
Do not over-nest: columnar engines are optimized for flat typed columns; nest only where it simplifies the model

When flat tables meet nested reality

Category
Data Modeling
Duration
15 minutes
Challenges
12 hands-on challenges

Topics covered: The Nesting Decision, STRUCT: Embedded Objects, ARRAY: Ordered Collections, MAP: Dynamic Key-Value Pairs, Columnar Storage & Nesting

Lesson Sections

  1. The Nesting Decision

    When to Nest vs When to Normalize The fundamental question: should related data live inside the parent row (nested) or in a separate table (normalized)? Both are valid. The choice depends on how the data is queried, how often it changes, and whether the nested data needs to be joined to independently. Nesting works well when the child data is always read alongside the parent. A user's shipping addresses are almost always retrieved when loading the user record. Normalizing this into a separate ad

  2. STRUCT: Embedded Objects (concepts: dmStructType)

    Grouping Related Fields Into One Column A STRUCT is a named group of fields embedded inside a row. Instead of storing address_line1, address_city, address_state, address_zip as four separate columns, you store them as one address STRUCT with four sub-fields. Accessing them uses dot notation: address.city. STRUCTs do not add rows. They add structure within a row. The fact table row count stays the same. This is the key difference from ARRAY (which can add logical rows via UNNEST). Functionally, b

  3. ARRAY: Ordered Collections (concepts: dmArrayType)

    Storing Lists Inside a Row An ARRAY column stores an ordered list of values inside a single row. A product can have an array of tags: ['electronics', 'sale', 'clearance']. An order can have an array of line item STRUCTs. ARRAYs eliminate the need for a separate junction or child table for simple one-to-many relationships. To query individual elements of an ARRAY, you use UNNEST. UNNEST flattens the array into rows, producing one output row per array element. This is powerful but dangerous: if yo

  4. MAP: Dynamic Key-Value Pairs (concepts: dmMapType)

    Flexible Attributes Without Schema Changes A MAP column stores key-value pairs where the keys are not predefined in the schema. A user preferences column might contain {'theme': 'dark', 'language': 'en', 'notifications': 'off'}. Each user can have different keys. Adding a new preference does not require ALTER TABLE. MAPs are the escape hatch for data that does not have a fixed schema. Event properties, configuration settings, A/B test parameters, custom metadata. Anything where the set of keys v

  5. Columnar Storage & Nesting (concepts: dmColumnarStorage)

    How Nested Data Interacts with Columnar Engines Modern analytical databases (BigQuery, Snowflake, Databricks) use columnar storage: data is stored column-by-column, not row-by-row. A query that reads only 3 columns out of 50 only scans those 3 columns. This is why analytical queries are fast. But nested data adds complexity. STRUCT sub-fields are stored as individual columns in columnar format. address.city is stored alongside address.state and address.zip as separate column chunks. Accessing on