Data Types: Advanced
Snowflake's ARRAY and OBJECT types let engineering teams store every attribute of a user session in a single row rather than normalizing events into a separate table with one row per attribute. A session with two hundred event properties becomes one wide row instead of two hundred narrow rows, and queries that previously required a multi-way join now read a single scan. Engineering teams at companies like Spotify have measured storage cost reductions of 60 percent compared to fully normalized designs, alongside query speed improvements from reduced I/O. This lesson teaches you the native collection types that make this kind of schema design possible.
Type optimization at scale
Pick types that save storage and speed
Storage Impact
Different types occupy different amounts of space. On a billion-row table, the difference between INT and BIGINT is 4GB of storage.
VARCHAR vs CHAR
VARCHAR stores only the actual string content, while CHAR pads short values to fixed length. For variable-length data, VARCHAR saves space.
CHAR(10) wastes 7 bytes per row. On 100M rows, that's 700MB of wasted space.
Indexing and Type Choice
- Stored as VARCHAR
- Index size: 2x larger
- Comparison: byte-by-byte
- Range scans: slow
- Stored as INTEGER
- Index size: compact
- Comparison: single op
- Range scans: fast
> Choose the most storage-efficient type for a "status" column that holds values 1-5.
SELECT product_id, CAST(status AS ) AS status FROM products
The difference between INTEGER and SMALLINT is 2 bytes per row. On a table with one billion rows, that adds up to 2 GB of storage. At scale, type precision is never a trivial choice.
Storage calculations and VARCHAR
Estimate how much space your data uses
VARCHAR length limits are critical for storage planning. Choosing VARCHAR(1000) when you need VARCHAR(100) wastes index space and slows queries.
How VARCHAR Stores Data
VARCHAR(n) uses a length prefix (1-2 bytes) plus the actual string bytes. UTF-8 encoding means international characters use 2-4 bytes each.
The Chinese characters "你好" are 2 characters but 6 bytes. VARCHAR(100) can store 100 ASCII characters but only 33-50 Chinese characters depending on encoding.
Calculating Storage Needs
- Size VARCHAR to 2x the expected max value for safety margin
- Use SMALLINT or INTEGER for codes/IDs instead of strings
- Analyze actual data before choosing VARCHAR length
- Use VARCHAR without a length limit for unbounded strings (blog posts, descriptions)
- Don't use VARCHAR(max) everywhere because it wastes index space
- Don't store numeric IDs as strings (waste space and slow queries)
- Don't forget UTF-8 overhead for international content
- Don't use CHAR for variable-length data
> Complete this query to find usernames that exceed a 50-character limit.
SELECT username, LENGTH() AS char_count FROM users WHERE (username) > 50
Auditing actual string lengths before choosing a VARCHAR size prevents wasted storage and avoids truncation surprises.
A good practice is to size VARCHAR to roughly 2x the expected maximum to provide a safety margin without excessive waste.
MAP data type for key-value pairs
Store flexible metadata in columns
A MAP stores key-value pairs, like a Python dictionary or JavaScript object. This is useful for sparse data where each row has different attributes.
Creating Maps
Use MAP with key-value pairs or MAP_FROM_ENTRIES to build maps from arrays.
Accessing Map Values
Access values with bracket notation or ELEMENT_AT. Missing keys return NULL.
When to Use Maps
- User preferences with variable keys
- Product attributes that vary by category
- Configuration settings per entity
- Sparse feature flags
MAPs work well when keys vary across rows. If every row has the same keys, use regular columns instead.> Complete this query to extract the color value from a MAP column.
SELECT product_id, AS color FROM products
When a key does not exist in the map, the result is NULL rather than an error, so always account for missing keys in your logic.
Accessing nested data (UNNEST)
Query inside arrays and nested fields
When you have arrays or maps, UNNEST "explodes" them into separate rows, making nested data queryable with standard SQL.
UNNEST Arrays
UNNEST takes an array and produces one row per element.
Customer C001 had tags [premium, early_adopter], which UNNEST expanded into 2 rows.
Filtering Nested Data
Filtering After UNNEST
A common pattern: UNNEST to expand, then WHERE to filter the individual elements.
> Complete the query to count how many tags each customer has after unnesting.
SELECT customer_id, AS tag_count FROM customers CROSS JOIN () AS t(tag) GROUP BY customer_id
UNNEST creates a Cartesian product between the original row and each array element. Be mindful of row explosion when arrays are large or when multiple arrays are unnested simultaneously.
After unnesting, you can filter, join, and aggregate using standard SQL syntax. The pattern of UNNEST followed by GROUP BY to re-aggregate is one of the most common in data engineering.
Compression and error handling
Recover gracefully from type mismatches
Compression Types
The chart below shows which compression technique works best for different data patterns. Dictionary compression excels with low-cardinality data (repeating values like status codes). Run-length encoding dominates when data is sorted with long consecutive runs. Delta encoding is ideal for sequential values like timestamps. High-cardinality data with unique values typically cannot be compressed effectively.
- Sort by commonly filtered columns (date)
- Use low-cardinality types for status columns
- Define sort keys on large tables
- Profile cardinality before schema design
- Store sequential IDs as VARCHAR
- Use random UUIDs if sorted access matters
- Ignore compression in cost estimates
- Skip sort keys on analytical tables
Robust Error Handling
TRY_CAST: ReturnsNULLon failure instead of crashingCOALESCE: Chain multiple parsing strategies with fallbacks- Data Quality Metrics: Count valid vs invalid rows for monitoring
- Quarantine Pattern: Route bad records to separate table for review
Applying Error Handling Patterns
When working with untrusted data, providing a fallback value for NULL results keeps downstream queries safe. Use TRY_CAST to avoid crashes and COALESCE to provide sensible defaults.
> Complete the query to provide a default value of 0 when the rating column is NULL.
SELECT name, (rating, ) FROM products
Robust error handling with TRY_CAST and COALESCE is essential for production data pipelines. These patterns ensure queries complete gracefully even with invalid input data.
TRY_CAST returns NULL instead of raising an error when conversion fails. This makes it ideal for data from external sources where type consistency cannot be guaranteed.
Combining TRY_CAST with COALESCE is the canonical ETL pattern: attempt the conversion and substitute a safe default if the value cannot be cast.
> You are a data engineer at Amplitude designing a PostgreSQL schema that must store both structured user metadata and flexible JSON event payloads from the mobile SDK. Column type choices directly affect query speed and storage costs at billions of rows.
SMALLINT dictionary encoding or VARCHAR, changing compression ratios and scan throughput significantly.VARCHAR sizing based on actual data audits prevents index bloat on user_name and email columns, keeping range scans fast on the billion-row events table.MAP columns store per-user feature flag assignments efficiently when each user has different active experiments, avoiding hundreds of nullable columns.UNNEST expands the SDK payload tags array into individual rows for aggregation, then ARRAY_AGG recollects them after filtering out invalid entries.INT indexes are 2x smaller than BIGINT; more index fits in RAMVARCHAR stores actual length + overhead, not the max size declaredMAP<K,V> stores key-value pairs; access with bracket notation or ELEMENT_ATUNNEST expands arrays/maps into rows; beware of data explosionTRY_CAST + COALESCE for robust error handling in ETLData Types: Advanced
Arrays, maps, and type optimization at scale
- Category
- SQL
- Difficulty
- advanced
- Duration
- 21 minutes
- Challenges
- 0 hands-on challenges
Topics covered: Type optimization at scale, Storage calculations and VARCHAR, MAP data type for key-value pairs, Accessing nested data (UNNEST), Compression and error handling
Lesson Sections
- Type optimization at scale (concepts: sqlStorageOptimization)
Choosing the right data type isn't just about correctness. It's also about performance and storage efficiency. At scale, poor type choices can waste terabytes of disk space and slow queries by orders of magnitude. Storage Impact VARCHAR vs CHAR Indexing and Type Choice Indexes work best when the type matches the data distribution. String indexes on numeric data are slow because string comparisons are byte-by-byte. String-based comparisons on numeric IDs require byte-by-byte evaluation, while int
- Storage calculations and VARCHAR
How VARCHAR Stores Data Calculating Storage Needs To estimate table size, multiply row count by sum of column sizes plus overhead (typically 20-40 bytes per row for metadata). With these storage costs in mind, here are the key rules to follow when sizing your columns: Remember that UTF-8 characters can use 1 to 4 bytes each, so character count and byte count may differ for international text.
- MAP data type for key-value pairs (concepts: sqlMapBasic)
Creating Maps Accessing Map Values When to Use Maps Bracket notation on a MAP column provides O(1) hash-based access, making it fast regardless of how many keys the map contains. Maps are ideal for sparse data where each row may have different attributes, avoiding the need for many nullable columns.
- Accessing nested data (UNNEST)
UNNEST Arrays Filtering Nested Data After unnesting, you can filter on array elements as if they were regular columns. Filtering After UNNEST Advanced data types unlock powerful transformations. Practice with hands-on challenges to master nested data queries.
- Compression and error handling
Modern column-oriented databases use compression to dramatically reduce storage. The compression ratio depends on the data type and the distribution of values. Understanding these techniques helps you design schemas that compress well. Compression Types Robust Error Handling Type conversion failures in production can crash entire pipelines. Robust error handling is essential for data engineering. These patterns handle progressively more complex scenarios. How should you handle type conversion on