Redshift Interview Questions

Amazon Redshift interview questions for data engineer roles at AWS-native companies. 40+ questions covering Redshift internals (sort keys, dist keys, RA3 architecture, Spectrum, materialized views), query optimization, the on-demand vs Reserved Instance vs Serverless decision, and migration patterns from legacy DC2 nodes. Pair with the complete data engineer interview preparation framework and the how to pass the AWS Data Engineer interview.

Redshift Topics in Interviews

Topics ranked roughly by how often they appear in AWS DE loops.

Topic	Frequency	Depth Expected
Distribution strategies (KEY, ALL, EVEN)	Very common	Strategy choice and skew detection
Sort keys (compound vs interleaved)	Very common	When each is right; alignment with WHERE
RA3 architecture vs DC2 / DS2	Common	Compute-storage separation, when to choose
Redshift Spectrum (S3 external tables)	Common	When to use vs loading into Redshift
Materialized views with auto-refresh	Common	AWS late-2022 feature, now production-grade
VACUUM and ANALYZE	Common	Maintenance, sortkey re-sort, statistics refresh
WLM (workload management) queues	Occasional	Concurrency control, query queueing
Concurrency Scaling clusters	Occasional	Auto-scale read capacity for burst load
Result cache	Occasional	Query-string match cache, free reads
UNLOAD / COPY for data movement	Common	S3 integration patterns
Redshift Serverless	Common	Newer (2022 GA), spiky vs sustained workload trade-off
Federated Query (Postgres, MySQL)	Occasional	Cross-engine queries
Cross-database queries	Occasional	Multi-database architecture patterns
Redshift ML	Rare	In-warehouse SageMaker integration

Distribution Strategies: The Foundation

Redshift distributes table rows across cluster compute nodes. The distribution strategy determines whether joins require data movement (slow) or can run locally (fast). Three options.

KEY distribution: rows with the same value of the dist key land on the same node. Best for fact tables with a high-cardinality column that aligns with common joins (e.g., dist on customer_id when fact_orders joins frequently to dim_customer also dist on customer_id). Risk: skew if the dist key has low cardinality or hot values.

ALL distribution: every node gets a copy of the table. Best for small dimensions (<3M rows) joined frequently. Cost: extra storage per node. Joins to ALL-distributed tables don't require any data movement.

EVEN distribution: rows distributed round-robin across nodes. Default for tables without a clear dist key. Joins always require data movement. Acceptable for staging tables and small fact tables; suboptimal for production analytical workloads.

Six Real Redshift Interview Questions

Choose distribution strategy for fact_orders (5B rows) and dim_customer (10M rows)

fact_orders dist KEY on customer_id (high cardinality, aligns with join). dim_customer also dist KEY on customer_id so joins are co-located (no data movement). Sort key on order_date for typical date-range queries. Discuss the alternative: dim_customer at 10M rows is borderline; ALL distribution would also work and eliminate join redistribution but doubles storage on every node.

EXPLAIN shows DS_DIST_BOTH for a join. How would you fix it?

DS_DIST_BOTH means both sides of the join are being redistributed. Both tables have wrong dist keys for this join. Fix options. (1) Re-distribute one side: if the smaller table has a dist key that doesn’t match the join, change it to match. (2) Switch the smaller side to ALL distribution: replicates to every node, eliminates redistribution but adds storage. (3) Pre-join in a materialized view: avoid the runtime redistribution by precomputing the join.

When would you use compound vs interleaved sort keys?

Compound sort key: rows sorted by the first key, then the second within ties of the first, etc. Strong pruning when WHERE filters on the first key. Interleaved: rows sorted by all keys equally; better pruning when WHERE filters on any single key but weaker than compound for first-key filters. Compound is the default and usually right; interleaved adds vacuum cost (re-sorting interleaved is expensive) and is justified only when query patterns truly require equal pruning across multiple columns.

When would you use Redshift Spectrum vs loading data into Redshift?

Spectrum: query S3 data via external tables without loading. Best for: cold or rarely-queried data, exploratory analysis, joining warm Redshift data with cold S3 history. Cost: per-byte-scanned, similar to Athena. Loading into Redshift: best for frequently-queried data, low-latency requirements, complex queries that benefit from local compute. Discuss the hybrid pattern: hot 30 days in Redshift, older history in S3 queried via Spectrum.

Should this workload run on Redshift Serverless or Provisioned?

Redshift Serverless: pay per second of RPU (Redshift Processing Unit). Best for spiky workloads where idle time is significant, ad-hoc analytics with unpredictable load, dev / staging environments. Provisioned: pay per node hour. Best for sustained workloads with predictable load (24/7 production reporting), large historical datasets, when you can saturate the cluster regularly. The cost crossover varies by workload but typically: Serverless wins below ~20 hours of compute per day; Provisioned wins above.

Design a Redshift architecture for a multi-tenant SaaS analytics product

Multi-tenancy options. (1) Single cluster with tenant_id partition column on every table; row-level security via RLS policies. Pro: simplest. Con: noisy-neighbor risk, harder to scale per-tenant. (2) Per-tenant Redshift Serverless workgroup: each tenant gets its own compute, shared via federated query if needed. Pro: isolation, per-tenant cost attribution. Con: many workgroups to manage. (3) Tiered model: shared cluster for free tier, dedicated workgroups for enterprise tier. Discuss the trade-offs.

Redshift vs Snowflake vs BigQuery

Redshift remains AWS-native dominant despite Snowflake's encroachment. The choice depends on existing AWS investment, predictability of workload, and operational tolerance.

Dimension	Redshift	Snowflake	BigQuery
AWS native	Yes (only)	Multi-cloud	GCP native
Pricing model	Per node hour or RPU-second	Credit per warehouse second	Slot or per-byte
Compute-storage separation	Yes (RA3) or No (DC2)	Yes (always)	Yes (always)
Operational overhead	Higher: VACUUM, dist key tuning	Low: managed	Lowest: fully serverless feel
Concurrency model	WLM + Concurrency Scaling	Multi-cluster warehouses	Slot queueing
Time travel	No native (snapshots only)	1 to 90 days	7 days default
Best fit	AWS-heavy, sustained load	Multi-cloud, mixed load	GCP-heavy, ad-hoc

How Redshift Connects to the Rest of the Cluster

Redshift is the warehouse component in any how to pass the AWS Data Engineer interview stack. The broader the complete SQL interview problem set hub covers SQL fluency at the dialect-portable level; this guide is the Redshift-specific layer on top.

For comparison, see Google BigQuery interview prep (the GCP equivalent) and the how to pass the Snowflake Data Engineer interview guide for multi-cloud warehouse depth. The Snowflake or Databricks Data Engineer interview differences decision page covers the broader warehouse-vs- lakehouse choice.

Prepare for the interview

01 / Open invite

02min.

Know Redshift the way the interviewer who asks it knows it.

a Redshift query, the same shape a screen would give you.

The diff against expected. Where ties broke. What you missed.

sandbox

1SELECT user_id,

2 COUNT(*) AS sessions

3FROM events

4WHERE ts >= NOW() - INTERVAL '7 day'

Execute your solution0.4s avg.

GoogleInterview question

Solve a Redshift problem

Data engineer interview prep FAQ

Is Redshift still relevant in 2026 with Snowflake’s growth?+

Yes, especially in AWS-native shops where deep AWS integration matters (IAM, VPC, Lake Formation, fine-grained access control). Redshift Serverless and the RA3 architecture closed most of the historical operational gap with Snowflake. New deployments split roughly between the two, with the choice driven by existing cloud investment more than features.

Should I learn DC2 / DS2 architecture if RA3 is the future?+

Light familiarity is enough. Many production deployments still run DC2 (compute-storage coupled), and migration questions come up. Know that RA3 separates compute from storage and that migration from DC2 to RA3 is the modernization path.

How important is VACUUM and ANALYZE?+

Important for any production Redshift role. Redshift doesn’t auto-maintain sorted order; VACUUM SORT re-sorts data after heavy inserts or updates. ANALYZE refreshes statistics for the query optimizer. Both are typically scheduled nightly. Skipping them silently degrades query performance.

What’s the difference between Redshift and Redshift Serverless?+

Redshift Provisioned: cluster of nodes you pay for whether or not in use. Predictable cost, sustained workloads. Redshift Serverless: per-RPU-second pricing, scales to zero when idle. Variable cost, spiky workloads. Same SQL, same features, different cost models.

Should I use Redshift Spectrum or just Athena for S3 queries?+

Spectrum: when you need to join S3 data with data in Redshift. Athena: when querying purely S3 without Redshift involvement. Both use the AWS Glue Data Catalog and similar pricing. Athena is often simpler if you don’t need the join.

How does query optimization work in Redshift?+

Cost-based optimizer using table statistics. The query planner chooses join strategies (broadcast, redistribute) based on stats. Stale stats lead to bad plans, hence ANALYZE matters. The EXPLAIN plan shows the chosen strategy; reading EXPLAIN is a senior signal in any Redshift role.

Are materialized views production-ready in Redshift?+

Yes since the 2022 GA of auto-refresh. They support a subset of SQL (similar limitations to BigQuery materialized views). The query optimizer can rewrite queries to use materialized views automatically when applicable. Best for high-frequency aggregates with low-frequency source changes.

What’s the migration path from Redshift to a different warehouse?+

Common paths: Redshift to Snowflake (most common in 2024-2026), Redshift to BigQuery (rare, requires cloud migration), Redshift modernization (DC2 to RA3 to Serverless). Each has different complexity. Snowflake migration typically takes 6-12 months end-to-end for a multi-team analytics platform.

02 / Why practice

Practice Redshift-Flavored SQL

01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition

Start with a problem

More data engineer interview prep guides

the complete SQL interview problem set→

The full SQL interview problem set, indexed by topic, difficulty, and company.

Google BigQuery interview prep→

BigQuery internals, slot-based pricing, partitioning, and clustering interview prep.

PostgreSQL interview prep→

Postgres MVCC, indexing, partitioning, and replication interview prep.

Apache Flink interview prep→

Apache Flink stateful streaming, watermarks, exactly-once, checkpointing interview prep.

Hadoop ecosystem interview prep→

Hadoop ecosystem (HDFS, MapReduce, YARN, Hive) interview prep, including modern relevance.

Glue interview prep for AWS Data Engineer roles→

AWS Glue ETL jobs, crawlers, Data Catalog, and PySpark-on-Glue interview prep.