Data Types: Advanced

Snowflake's ARRAY and OBJECT types let engineering teams store every attribute of a user session in a single row rather than normalizing events into a separate table with one row per attribute. A session with two hundred event properties becomes one wide row instead of two hundred narrow rows, and queries that previously required a multi-way join now read a single scan. Engineering teams at companies like Spotify have measured storage cost reductions of 60 percent compared to fully normalized designs, alongside query speed improvements from reduced I/O. This lesson teaches you the native collection types that make this kind of schema design possible.

Type optimization at scale

Daily Life

Interviews

Pick types that save storage and speed

Choosing the right data type isn't just about correctness. It's also about performance and storage efficiency. At scale, poor type choices can waste terabytes of disk space and slow queries by orders of magnitude.

Storage Impact

Different types occupy different amounts of space. On a billion-row table, the difference between INT and BIGINT is 4GB of storage.

INTEGER (4 bytes)

Range: -2.1B to 2.1B. Use for most IDs and counts.

BIGINT (8 bytes)

Range: -9 quintillion to 9 quintillion. Use for timestamps or very large counts.

SMALLINT (2 bytes)

Range: -32K to 32K. Use for age, quantity, status codes.

BOOLEAN (1 byte)

Always use for flags. Never store as VARCHAR('true').

VARCHAR vs CHAR

VARCHAR stores only the actual string content, while CHAR pads short values to fixed length. For variable-length data, VARCHAR saves space.

	SELECT
	LENGTH('USA') AS len,
	LENGTH(CAST('USA' AS CHAR(10))) AS char_len,
	LENGTH(
	CAST('USA' AS VARCHAR(10))
	) AS varchar_len

Result

len	char_len	varchar_len
3	10	3

CHAR(10) wastes 7 bytes per row. On 100M rows, that's 700MB of wasted space.

Indexing and Type Choice

Indexes work best when the type matches the data distribution. String indexes on numeric data are slow because string comparisons are byte-by-byte.

•String ID

Stored as VARCHAR
Index size: 2x larger
Comparison: byte-by-byte
Range scans: slow

•Integer ID

Stored as INTEGER
Index size: compact
Comparison: single op
Range scans: fast

> Choose the most storage-efficient type for a "status" column that holds values 1-5.

SELECT
  product_id,
  CAST(status AS ___) AS status
FROM products

SMALLINT

BIGINT

VARCHAR

INTEGER

The difference between INTEGER and SMALLINT is 2 bytes per row. On a table with one billion rows, that adds up to 2 GB of storage. At scale, type precision is never a trivial choice.

String-based comparisons on numeric IDs require byte-by-byte evaluation, while integer comparisons happen in a single CPU operation. This is why numeric types are preferred for join keys and indexed columns.

Columnar storage engines like Parquet and ORC benefit most from appropriate numeric types because they apply dictionary encoding and bit-packing more effectively on smaller types.

Storage calculations and VARCHAR

Daily Life

Interviews

Estimate how much space your data uses

VARCHAR length limits are critical for storage planning. Choosing VARCHAR(1000) when you need VARCHAR(100) wastes index space and slows queries.

How VARCHAR Stores Data

VARCHAR(n) uses a length prefix (1-2 bytes) plus the actual string bytes. UTF-8 encoding means international characters use 2-4 bytes each.

	SELECT
	LENGTH('Hello') AS chars,
	OCTET_LENGTH('Hello') AS bytes,
	LENGTH('你好') AS chars_utf8,
	OCTET_LENGTH('你好') AS bytes_utf8

Result

chars	bytes	chars_utf8	bytes_utf8
5	5	2	6

The Chinese characters "你好" are 2 characters but 6 bytes. VARCHAR(100) can store 100 ASCII characters but only 33-50 Chinese characters depending on encoding.

Calculating Storage Needs

To estimate table size, multiply row count by sum of column sizes plus overhead (typically 20-40 bytes per row for metadata).

1B rows1B rows1B rows

1B rows

VARCHAR email

Avg 20 chars = ~22GB total

1B rows

BIGINT ts

8 bytes each = ~8GB total

1B rows

INT user_id

4 bytes each = ~4GB total

With these storage costs in mind, here are the key rules to follow when sizing your columns:

✓Do

Size VARCHAR to 2x the expected max value for safety margin
Use SMALLINT or INTEGER for codes/IDs instead of strings
Analyze actual data before choosing VARCHAR length
Use VARCHAR without a length limit for unbounded strings (blog posts, descriptions)

✗Don't

Don't use VARCHAR(max) everywhere because it wastes index space
Don't store numeric IDs as strings (waste space and slow queries)
Don't forget UTF-8 overhead for international content
Don't use CHAR for variable-length data

> Complete this query to find usernames that exceed a 50-character limit.

SELECT
  username,
  LENGTH(___) AS char_count
FROM users
WHERE ___(username) > 50

LENGTH

username

SIZE

CHAR

Auditing actual string lengths before choosing a VARCHAR size prevents wasted storage and avoids truncation surprises.

Remember that UTF-8 characters can use 1 to 4 bytes each, so character count and byte count may differ for international text.

A good practice is to size VARCHAR to roughly 2x the expected maximum to provide a safety margin without excessive waste.

MAP data type for key-value pairs

Daily Life

Interviews

Store flexible metadata in columns

A MAP stores key-value pairs, like a Python dictionary or JavaScript object. This is useful for sparse data where each row has different attributes.

Creating Maps

Use MAP with key-value pairs or MAP_FROM_ENTRIES to build maps from arrays.

	SELECT
	MAP('name', 'Alice', 'age', '30') AS user_map

Result

user_map
{name=Alice, age=30}

Accessing Map Values

Access values with bracket notation or ELEMENT_AT. Missing keys return NULL.

	SELECT
	user_map 'name' AS name,
	user_map 'age' AS age,
	user_map 'city' AS city
	FROM users

Result

name	age	city
Alice	30	NULL

When to Use Maps

Good Use Cases for MAP

User preferences with variable keys
Product attributes that vary by category
Configuration settings per entity
Sparse feature flags

TIP

MAPs work well when keys vary across rows. If every row has the same keys, use regular columns instead.

> Complete this query to extract the color value from a MAP column.

SELECT
  product_id,
  ___ ___ AS color
FROM products

attributes

'color'

MAP_GET

'name'

Bracket notation on a MAP column provides O(1) hash-based access, making it fast regardless of how many keys the map contains.

When a key does not exist in the map, the result is NULL rather than an error, so always account for missing keys in your logic.

Maps are ideal for sparse data where each row may have different attributes, avoiding the need for many nullable columns.

Accessing nested data (UNNEST)

Daily Life

Interviews

Query inside arrays and nested fields

When you have arrays or maps, UNNEST "explodes" them into separate rows, making nested data queryable with standard SQL.

UNNEST Arrays

UNNEST takes an array and produces one row per element.

	SELECT
	customer_id,
	tag
	FROM customers
	CROSS JOIN UNNEST(tags) AS t(tag)

Result

customer_id	tag
C001	premium
C001	early_adopter
C002	trial

Customer C001 had tags [premium, early_adopter], which UNNEST expanded into 2 rows.

Filtering Nested Data

After unnesting, you can filter on array elements as if they were regular columns.

	SELECT
	customer_id,
	COUNT(*) AS order_count
	FROM customers
	CROSS JOIN UNNEST(order_ids) AS o(order_id)
	WHERE order_id > 1000
	GROUP BY customer_id

Result

customer_id	order_count
C001	2
C002	1

Filtering After UNNEST

A common pattern: UNNEST to expand, then WHERE to filter the individual elements.

	SELECT
	customer_id,
	COUNT(*) AS premium_count
	FROM customers
	CROSS JOIN UNNEST(tags) AS t(tag)
	WHERE tag LIKE '%premium%'
	GROUP BY customer_id

Result

customer_id	premium_count
C001	1
C002	1

> Complete the query to count how many tags each customer has after unnesting.

SELECT
  customer_id,
  ___ AS tag_count
FROM customers
CROSS JOIN ___(___) AS t(tag)
GROUP BY customer_id

UNNEST

Compression and error handling

Daily Life

Interviews

Recover gracefully from type mismatches

Modern column-oriented databases use compression to dramatically reduce storage. The compression ratio depends on the data type and the distribution of values. Understanding these techniques helps you design schemas that compress well.

Compression Types

DICTIONARYRUN-LENGTHDELTA

DICTIONARY

Value mapping

Replaces repeats with IDs

RUN-LENGTH

Streak counts

Stores value and run count

DELTA

Offset storage

Stores diffs between rows

The chart below shows which compression technique works best for different data patterns. Dictionary compression excels with low-cardinality data (repeating values like status codes). Run-length encoding dominates when data is sorted with long consecutive runs. Delta encoding is ideal for sequential values like timestamps. High-cardinality data with unique values typically cannot be compressed effectively.

Dictionary

Run-Length

Delta

Uncompressed

	SELECT
	COUNT(*) AS total_rows,
	COUNT(DISTINCT country) AS unique_countries,
	AVG(LENGTH(description)) AS avg_desc_len,
	MAX(LENGTH(description)) AS max_desc_len,
	SUM(LENGTH(description)) AS total_bytes,
	COUNT() 8 AS fixed_col_bytes
	FROM products

Result

total_rows	unique_countries	avg_desc_len	max_desc_len	total_bytes	fixed_col_bytes
1000000	195	45	200	45000000	8000000

✓Do

Sort by commonly filtered columns (date)
Use low-cardinality types for status columns
Define sort keys on large tables
Profile cardinality before schema design

✗Don't

Store sequential IDs as VARCHAR
Use random UUIDs if sorted access matters
Ignore compression in cost estimates
Skip sort keys on analytical tables

Robust Error Handling

Type conversion failures in production can crash entire pipelines. Robust error handling is essential for data engineering. These patterns handle progressively more complex scenarios.

ERROR HANDLING STRATEGIES

TRY_CAST: Returns NULL on failure instead of crashing
COALESCE: Chain multiple parsing strategies with fallbacks
Data Quality Metrics: Count valid vs invalid rows for monitoring
Quarantine Pattern: Route bad records to separate table for review

	SELECT
	row_id,
	price_text,
	TRY_CAST(
	price_text
	AS DECIMAL(10, 2)
	) AS price_clean,
	CASE
	WHEN TRY_CAST(price_text AS DECIMAL(10, 2)) IS NULL
	AND price_text IS NOT NULL THEN 'INVALID'
	ELSE 'OK'
	END AS status
	FROM raw_import

Result

row_id	price_text	price_clean	status
1	29.99	29.99	OK
2	N/A	NULL	INVALID
3	15.00	15.00	OK

How should you handle type conversion on untrusted data?

CAST fails the entire query if any single row has invalid data. One bad record like "N/A" in a million rows will crash your pipeline and produce zero results.

	SELECT
	row_id,
	CAST(price_text AS DECIMAL(10, 2)) AS price
	FROM raw_import

Applying Error Handling Patterns

When working with untrusted data, providing a fallback value for NULL results keeps downstream queries safe. Use TRY_CAST to avoid crashes and COALESCE to provide sensible defaults.

> Complete the query to provide a default value of 0 when the rating column is NULL.

SELECT
  name,
  ___(rating, ___)
FROM products

COALESCE

NULLIF

NULL

Robust error handling with TRY_CAST and COALESCE is essential for production data pipelines. These patterns ensure queries complete gracefully even with invalid input data.

TRY_CAST returns NULL instead of raising an error when conversion fails. This makes it ideal for data from external sources where type consistency cannot be guaranteed.

Combining TRY_CAST with COALESCE is the canonical ETL pattern: attempt the conversion and substitute a safe default if the value cannot be cast.

❯❯❯PUTTING IT ALL TOGETHER

> You are a data engineer at Amplitude designing a PostgreSQL schema that must store both structured user metadata and flexible JSON event payloads from the mobile SDK. Column type choices directly affect query speed and storage costs at billions of rows.

Type optimization at scale determines whether status and country columns use SMALLINT dictionary encoding or VARCHAR, changing compression ratios and scan throughput significantly.

VARCHAR sizing based on actual data audits prevents index bloat on user_name and email columns, keeping range scans fast on the billion-row events table.

MAP columns store per-user feature flag assignments efficiently when each user has different active experiments, avoiding hundreds of nullable columns.

UNNEST expands the SDK payload tags array into individual rows for aggregation, then ARRAY_AGG recollects them after filtering out invalid entries.

KEY TAKEAWAYS

Use the smallest type that fits your data; every byte saved is GB at scale

INT indexes are 2x smaller than BIGINT; more index fits in RAM

Type-matched joins are 10-100x faster than implicit conversion joins

VARCHAR stores actual length + overhead, not the max size declared

MAP<K,V> stores key-value pairs; access with bracket notation or ELEMENT_AT

UNNEST expands arrays/maps into rows; beware of data explosion

Dictionary encoding achieves 95%+ compression on low-cardinality columns

Use TRY_CAST + COALESCE for robust error handling in ETL

Arrays, maps, and type optimization at scale

Category: SQL
Difficulty: advanced
Duration: 21 minutes
Challenges: 0 hands-on challenges

Topics covered: Type optimization at scale, Storage calculations and VARCHAR, MAP data type for key-value pairs, Accessing nested data (UNNEST), Compression and error handling

Lesson Sections

Type optimization at scale (concepts: sqlStorageOptimization)
Choosing the right data type isn't just about correctness. It's also about performance and storage efficiency. At scale, poor type choices can waste terabytes of disk space and slow queries by orders of magnitude. Storage Impact VARCHAR vs CHAR Indexing and Type Choice Indexes work best when the type matches the data distribution. String indexes on numeric data are slow because string comparisons are byte-by-byte. String-based comparisons on numeric IDs require byte-by-byte evaluation, while int
Storage calculations and VARCHAR (concepts: sqlStorageOptimization)
How VARCHAR Stores Data Calculating Storage Needs To estimate table size, multiply row count by sum of column sizes plus overhead (typically 20-40 bytes per row for metadata). With these storage costs in mind, here are the key rules to follow when sizing your columns: Remember that UTF-8 characters can use 1 to 4 bytes each, so character count and byte count may differ for international text.
MAP data type for key-value pairs (concepts: sqlMapBasic)
Creating Maps Accessing Map Values When to Use Maps Bracket notation on a MAP column provides O(1) hash-based access, making it fast regardless of how many keys the map contains. Maps are ideal for sparse data where each row may have different attributes, avoiding the need for many nullable columns.
Accessing nested data (UNNEST) (concepts: sqlUnnest)
UNNEST Arrays Filtering Nested Data After unnesting, you can filter on array elements as if they were regular columns. Filtering After UNNEST Advanced data types unlock powerful transformations. Practice with hands-on challenges to master nested data queries.
Compression and error handling (concepts: sqlTryCast)
Modern column-oriented databases use compression to dramatically reduce storage. The compression ratio depends on the data type and the distribution of values. Understanding these techniques helps you design schemas that compress well. Compression Types Robust Error Handling Type conversion failures in production can crash entire pipelines. Robust error handling is essential for data engineering. These patterns handle progressively more complex scenarios. How should you handle type conversion on