Data Types: Advanced

Snowflake's ARRAY and OBJECT types let engineering teams store every attribute of a user session in a single row rather than normalizing events into a separate table with one row per attribute. A session with two hundred event properties becomes one wide row instead of two hundred narrow rows, and queries that previously required a multi-way join now read a single scan. Engineering teams at companies like Spotify have measured storage cost reductions of 60 percent compared to fully normalized designs, alongside query speed improvements from reduced I/O. This lesson teaches you the native collection types that make this kind of schema design possible.

Type optimization at scale

Daily Life
Interviews

Pick types that save storage and speed

Choosing the right data type isn't just about correctness. It's also about performance and storage efficiency. At scale, poor type choices can waste terabytes of disk space and slow queries by orders of magnitude.

Storage Impact

Different types occupy different amounts of space. On a billion-row table, the difference between INT and BIGINT is 4GB of storage.

INTEGER (4 bytes)
Range: -2.1B to 2.1B. Use for most IDs and counts.
BIGINT (8 bytes)
Range: -9 quintillion to 9 quintillion. Use for timestamps or very large counts.
SMALLINT (2 bytes)
Range: -32K to 32K. Use for age, quantity, status codes.
BOOLEAN (1 byte)
Always use for flags. Never store as VARCHAR('true').

VARCHAR vs CHAR

VARCHAR stores only the actual string content, while CHAR pads short values to fixed length. For variable-length data, VARCHAR saves space.

1SELECT
2 LENGTH('USA') AS len,
3 LENGTH(CAST('USA' AS CHAR(10))) AS char_len,
4 LENGTH(
5 CAST('USA' AS VARCHAR(10))
6 ) AS varchar_len
Result
lenchar_lenvarchar_len
3103

CHAR(10) wastes 7 bytes per row. On 100M rows, that's 700MB of wasted space.

Indexing and Type Choice

Indexes work best when the type matches the data distribution. String indexes on numeric data are slow because string comparisons are byte-by-byte.
String ID
  • Stored as VARCHAR
  • Index size: 2x larger
  • Comparison: byte-by-byte
  • Range scans: slow
Integer ID
  • Stored as INTEGER
  • Index size: compact
  • Comparison: single op
  • Range scans: fast

> Choose the most storage-efficient type for a "status" column that holds values 1-5.

SELECT
  product_id,
  CAST(status AS ___) AS status
FROM products
SMALLINT
INTEGER
VARCHAR
BIGINT

The difference between INTEGER and SMALLINT is 2 bytes per row. On a table with one billion rows, that adds up to 2 GB of storage. At scale, type precision is never a trivial choice.

String-based comparisons on numeric IDs require byte-by-byte evaluation, while integer comparisons happen in a single CPU operation. This is why numeric types are preferred for join keys and indexed columns.
Columnar storage engines like Parquet and ORC benefit most from appropriate numeric types because they apply dictionary encoding and bit-packing more effectively on smaller types.

Storage calculations and VARCHAR

Daily Life
Interviews

Estimate how much space your data uses

VARCHAR length limits are critical for storage planning. Choosing VARCHAR(1000) when you need VARCHAR(100) wastes index space and slows queries.

How VARCHAR Stores Data

VARCHAR(n) uses a length prefix (1-2 bytes) plus the actual string bytes. UTF-8 encoding means international characters use 2-4 bytes each.

1SELECT
2 LENGTH('Hello') AS chars,
3 OCTET_LENGTH('Hello') AS bytes,
4 LENGTH('你好') AS chars_utf8,
5 OCTET_LENGTH('你好') AS bytes_utf8
Result
charsbyteschars_utf8bytes_utf8
5526

The Chinese characters "你好" are 2 characters but 6 bytes. VARCHAR(100) can store 100 ASCII characters but only 33-50 Chinese characters depending on encoding.

Calculating Storage Needs

To estimate table size, multiply row count by sum of column sizes plus overhead (typically 20-40 bytes per row for metadata).
1B rows1B rows1B rows
1B rows
VARCHAR email
Avg 20 chars = ~22GB total
1B rows
BIGINT ts
8 bytes each = ~8GB total
1B rows
INT user_id
4 bytes each = ~4GB total
With these storage costs in mind, here are the key rules to follow when sizing your columns:
Do
  • Size VARCHAR to 2x the expected max value for safety margin
  • Use SMALLINT or INTEGER for codes/IDs instead of strings
  • Analyze actual data before choosing VARCHAR length
  • Use VARCHAR without a length limit for unbounded strings (blog posts, descriptions)
Don't
  • Don't use VARCHAR(max) everywhere because it wastes index space
  • Don't store numeric IDs as strings (waste space and slow queries)
  • Don't forget UTF-8 overhead for international content
  • Don't use CHAR for variable-length data

> Complete this query to find usernames that exceed a 50-character limit.

SELECT
  username,
  LENGTH(___) AS char_count
FROM users
WHERE ___(username) > 50
username
SIZE
LENGTH
CHAR

Auditing actual string lengths before choosing a VARCHAR size prevents wasted storage and avoids truncation surprises.

Remember that UTF-8 characters can use 1 to 4 bytes each, so character count and byte count may differ for international text.

A good practice is to size VARCHAR to roughly 2x the expected maximum to provide a safety margin without excessive waste.

MAP data type for key-value pairs

Daily Life
Interviews

Store flexible metadata in columns

A MAP stores key-value pairs, like a Python dictionary or JavaScript object. This is useful for sparse data where each row has different attributes.

Creating Maps

Use MAP with key-value pairs or MAP_FROM_ENTRIES to build maps from arrays.

1SELECT
2 MAP('name', 'Alice', 'age', '30') AS user_map
Result
user_map
{name=Alice, age=30}

Accessing Map Values

Access values with bracket notation or ELEMENT_AT. Missing keys return NULL.

1SELECT
2 user_map 'name' AS name,
3 user_map 'age' AS age,
4 user_map 'city' AS city
5FROM users
Result
nameagecity
Alice30NULL

When to Use Maps

Good Use Cases for MAP
  • User preferences with variable keys
  • Product attributes that vary by category
  • Configuration settings per entity
  • Sparse feature flags
TIP
MAPs work well when keys vary across rows. If every row has the same keys, use regular columns instead.

> Complete this query to extract the color value from a MAP column.

SELECT
  product_id,
  ___ ___ AS color
FROM products
MAP_GET
'color'
attributes
'name'
Bracket notation on a MAP column provides O(1) hash-based access, making it fast regardless of how many keys the map contains.

When a key does not exist in the map, the result is NULL rather than an error, so always account for missing keys in your logic.

Maps are ideal for sparse data where each row may have different attributes, avoiding the need for many nullable columns.

Accessing nested data (UNNEST)

Daily Life
Interviews

Query inside arrays and nested fields

When you have arrays or maps, UNNEST "explodes" them into separate rows, making nested data queryable with standard SQL.

UNNEST Arrays

UNNEST takes an array and produces one row per element.

1SELECT
2 customer_id,
3 tag
4FROM customers
5CROSS JOIN UNNEST(tags) AS t(tag)
Result
customer_idtag
C001premium
C001early_adopter
C002trial

Customer C001 had tags [premium, early_adopter], which UNNEST expanded into 2 rows.

Filtering Nested Data

After unnesting, you can filter on array elements as if they were regular columns.
1SELECT
2 customer_id,
3 COUNT(*) AS order_count
4FROM customers
5CROSS JOIN UNNEST(order_ids) AS o(order_id)
6WHERE order_id > 1000
7GROUP BY customer_id
Result
customer_idorder_count
C0012
C0021

Filtering After UNNEST

A common pattern: UNNEST to expand, then WHERE to filter the individual elements.

1SELECT
2 customer_id,
3 COUNT(*) AS premium_count
4FROM customers
5CROSS JOIN UNNEST(tags) AS t(tag)
6WHERE tag LIKE '%premium%'
7GROUP BY customer_id
Result
customer_idpremium_count
C0011
C0021

> Complete the query to count how many tags each customer has after unnesting.

SELECT
  customer_id,
  ___ AS tag_count
FROM customers
CROSS JOIN ___(___) AS t(tag)
GROUP BY customer_id
UNNEST
SUM
tags
COUNT(*)

UNNEST creates a Cartesian product between the original row and each array element. Be mindful of row explosion when arrays are large or when multiple arrays are unnested simultaneously.

After unnesting, you can filter, join, and aggregate using standard SQL syntax. The pattern of UNNEST followed by GROUP BY to re-aggregate is one of the most common in data engineering.

Advanced data types unlock powerful transformations. Practice with hands-on challenges to master nested data queries.

Compression and error handling

Daily Life
Interviews

Recover gracefully from type mismatches

Modern column-oriented databases use compression to dramatically reduce storage. The compression ratio depends on the data type and the distribution of values. Understanding these techniques helps you design schemas that compress well.
//

Compression Types

DICTIONARYRUN-LENGTHDELTA
DICTIONARY
Value mapping
Replaces repeats with IDs
RUN-LENGTH
Streak counts
Stores value and run count
DELTA
Offset storage
Stores diffs between rows

The chart below shows which compression technique works best for different data patterns. Dictionary compression excels with low-cardinality data (repeating values like status codes). Run-length encoding dominates when data is sorted with long consecutive runs. Delta encoding is ideal for sequential values like timestamps. High-cardinality data with unique values typically cannot be compressed effectively.

Dictionary
Run-Length
Delta
Uncompressed
1SELECT
2 COUNT(*) AS total_rows,
3 COUNT(DISTINCT country) AS unique_countries,
4 AVG(LENGTH(description)) AS avg_desc_len,
5 MAX(LENGTH(description)) AS max_desc_len,
6 SUM(LENGTH(description)) AS total_bytes,
7 COUNT(*) * 8 AS fixed_col_bytes
8FROM products
Result
total_rowsunique_countriesavg_desc_lenmax_desc_lentotal_bytesfixed_col_bytes
100000019545200450000008000000
Do
  • Sort by commonly filtered columns (date)
  • Use low-cardinality types for status columns
  • Define sort keys on large tables
  • Profile cardinality before schema design
Don't
  • Store sequential IDs as VARCHAR
  • Use random UUIDs if sorted access matters
  • Ignore compression in cost estimates
  • Skip sort keys on analytical tables

Robust Error Handling

Type conversion failures in production can crash entire pipelines. Robust error handling is essential for data engineering. These patterns handle progressively more complex scenarios.
ERROR HANDLING STRATEGIES
  • TRY_CAST: Returns NULL on failure instead of crashing
  • COALESCE: Chain multiple parsing strategies with fallbacks
  • Data Quality Metrics: Count valid vs invalid rows for monitoring
  • Quarantine Pattern: Route bad records to separate table for review
1SELECT
2 row_id,
3 price_text,
4 TRY_CAST(
5 price_text
6 AS DECIMAL(10, 2)
7 ) AS price_clean,
8 CASE
9 WHEN TRY_CAST(price_text AS DECIMAL(10, 2)) IS NULL
10 AND price_text IS NOT NULL THEN 'INVALID'
11 ELSE 'OK'
12 END AS status
13FROM raw_import
Result
row_idprice_textprice_cleanstatus
129.9929.99OK
2N/ANULLINVALID
315.0015.00OK
How should you handle type conversion on untrusted data?
CAST fails the entire query if any single row has invalid data. One bad record like "N/A" in a million rows will crash your pipeline and produce zero results.
1SELECT
2 row_id,
3 CAST(price_text AS DECIMAL(10, 2)) AS price
4FROM raw_import
//

Applying Error Handling Patterns

When working with untrusted data, providing a fallback value for NULL results keeps downstream queries safe. Use TRY_CAST to avoid crashes and COALESCE to provide sensible defaults.

> Complete the query to provide a default value of 0 when the rating column is NULL.

SELECT
  name,
  ___(rating, ___)
FROM products
NULLIF
COALESCE
NULL
0

Robust error handling with TRY_CAST and COALESCE is essential for production data pipelines. These patterns ensure queries complete gracefully even with invalid input data.

TRY_CAST returns NULL instead of raising an error when conversion fails. This makes it ideal for data from external sources where type consistency cannot be guaranteed.

Combining TRY_CAST with COALESCE is the canonical ETL pattern: attempt the conversion and substitute a safe default if the value cannot be cast.

PUTTING IT ALL TOGETHER

> You are a data engineer at Amplitude designing a PostgreSQL schema that must store both structured user metadata and flexible JSON event payloads from the mobile SDK. Column type choices directly affect query speed and storage costs at billions of rows.

Type optimization at scale determines whether status and country columns use SMALLINT dictionary encoding or VARCHAR, changing compression ratios and scan throughput significantly.
VARCHAR sizing based on actual data audits prevents index bloat on user_name and email columns, keeping range scans fast on the billion-row events table.
MAP columns store per-user feature flag assignments efficiently when each user has different active experiments, avoiding hundreds of nullable columns.
UNNEST expands the SDK payload tags array into individual rows for aggregation, then ARRAY_AGG recollects them after filtering out invalid entries.
KEY TAKEAWAYS
Use the smallest type that fits your data; every byte saved is GB at scale
INT indexes are 2x smaller than BIGINT; more index fits in RAM
Type-matched joins are 10-100x faster than implicit conversion joins
VARCHAR stores actual length + overhead, not the max size declared
MAP<K,V> stores key-value pairs; access with bracket notation or ELEMENT_AT
UNNEST expands arrays/maps into rows; beware of data explosion
Dictionary encoding achieves 95%+ compression on low-cardinality columns
Use TRY_CAST + COALESCE for robust error handling in ETL

Data Types: Advanced

Arrays, maps, and type optimization at scale

Category
SQL
Difficulty
advanced
Duration
21 minutes
Challenges
0 hands-on challenges

Topics covered: Type optimization at scale, Storage calculations and VARCHAR, MAP data type for key-value pairs, Accessing nested data (UNNEST), Compression and error handling

Lesson Sections

  1. Type optimization at scale (concepts: sqlStorageOptimization)

    Choosing the right data type isn't just about correctness. It's also about performance and storage efficiency. At scale, poor type choices can waste terabytes of disk space and slow queries by orders of magnitude. Storage Impact VARCHAR vs CHAR Indexing and Type Choice Indexes work best when the type matches the data distribution. String indexes on numeric data are slow because string comparisons are byte-by-byte. String-based comparisons on numeric IDs require byte-by-byte evaluation, while int

  2. Storage calculations and VARCHAR

    How VARCHAR Stores Data Calculating Storage Needs To estimate table size, multiply row count by sum of column sizes plus overhead (typically 20-40 bytes per row for metadata). With these storage costs in mind, here are the key rules to follow when sizing your columns: Remember that UTF-8 characters can use 1 to 4 bytes each, so character count and byte count may differ for international text.

  3. MAP data type for key-value pairs (concepts: sqlMapBasic)

    Creating Maps Accessing Map Values When to Use Maps Bracket notation on a MAP column provides O(1) hash-based access, making it fast regardless of how many keys the map contains. Maps are ideal for sparse data where each row may have different attributes, avoiding the need for many nullable columns.

  4. Accessing nested data (UNNEST)

    UNNEST Arrays Filtering Nested Data After unnesting, you can filter on array elements as if they were regular columns. Filtering After UNNEST Advanced data types unlock powerful transformations. Practice with hands-on challenges to master nested data queries.

  5. Compression and error handling

    Modern column-oriented databases use compression to dramatically reduce storage. The compression ratio depends on the data type and the distribution of values. Understanding these techniques helps you design schemas that compress well. Compression Types Robust Error Handling Type conversion failures in production can crash entire pipelines. Robust error handling is essential for data engineering. These patterns handle progressively more complex scenarios. How should you handle type conversion on