Cloud Specialization Guide

AWS Data Engineer Interview

AWS is the most common cloud platform in 2026 data engineering, and AWS data engineer roles are correspondingly the highest-volume specialization. The interview tests standard data engineering fundamentals plus deep familiarity with the AWS data stack: Glue, Redshift, Kinesis, EMR, S3, Lambda. Companies that hire AWS DEs heavily include Amazon itself, Netflix (mostly AWS with some in-house tooling), Stripe, Airbnb, Lyft, DoorDash, Robinhood, and most non-tech-industry data engineering teams. Loops typically run 4 to 5 weeks. This page is part of the complete data engineer interview preparation framework.

The Short Answer
Expect a 5-round AWS data engineer loop: recruiter screen, technical phone screen, system design (an AWS-native pipeline architecture), live coding (Python with boto3 patterns or PySpark on Glue), and behavioral. Distinctive emphasis: S3 partitioning and storage classes, Redshift internals (sort keys, dist keys, RA3, query optimization), Kinesis Data Streams vs Firehose vs MSK decision, Glue ETL job patterns, Lambda for serverless transformations, IAM-aware design, and explicit attention to AWS-specific cost optimization.
Updated April 2026 · By The DataDriven Team

AWS Services Tested in Data Engineer Loops

Frequency from 156 reported AWS data engineer loops in 2024-2026.

| Service | Test Frequency | Depth Expected |
| --- | --- | --- |
| S3 | 100% | Storage classes, lifecycle, partitioning, transfer acceleration, intelligent tiering |
| Glue (ETL + Data Catalog) | 84% | ETL jobs in PySpark, crawlers, Data Catalog, job bookmarking |
| Redshift | 76% | Sort keys, dist keys, RA3 architecture, materialized views, Spectrum |
| Kinesis Data Streams | 67% | Sharding, ordering, exactly-once via KCL, vs MSK trade-off |
| Kinesis Data Firehose | 54% | Delivery to S3, Redshift, OpenSearch with transformation |
| MSK (Managed Kafka) | 47% | When to choose vs Kinesis Data Streams |
| Lambda | 62% | Serverless transformations, S3 event triggers, fan-out patterns |
| EMR (managed Spark/Hadoop) | 58% | When to use vs Glue, ephemeral clusters, EMR Serverless |
| Athena | 61% | Serverless SQL on S3, Iceberg integration, partition projection |
| DynamoDB | 47% | OLTP serving for low-latency lookups, streams to data lake |
| Step Functions | 38% | Orchestration vs Airflow trade-off |
| MWAA (Managed Airflow) | 44% | Hosted Airflow alternative to self-managed |
| DMS / DataSync | 32% | Migration and CDC patterns |
| IAM and VPC patterns | 67% | Security-aware design, especially senior roles |

S3 Patterns: The Foundation of AWS Data Engineering

Every AWS data engineering pipeline lands in S3 at some point. The interview probes for S3 fluency at multiple depths: storage classes (Standard for hot, IA for warm, Glacier for cold), partitioning patterns (year=YYYY/month=MM/day=DD for Hive-compatible partitioning), file format choice (Parquet for analytics, Avro for evolution-heavy ingestion, JSON for human debugging only), and transfer optimization (transfer acceleration, multipart upload, S3 Transfer Family for ingestion).
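The Hive-compatible partitioning pattern above is mechanical enough to sketch in code. A minimal illustration of building an S3 key prefix from an event timestamp (the `table` and `source` names are hypothetical, not AWS conventions):

```python
from datetime import datetime, timezone

def s3_partition_prefix(table: str, event_time: datetime, source: str = "") -> str:
    """Build a Hive-compatible S3 key prefix: table/year=YYYY/month=MM/day=DD/.

    Zero-padded month and day keep prefixes lexically sortable, which is
    what Glue crawlers and Athena partition pruning expect.
    """
    prefix = (
        f"{table}/"
        f"year={event_time.year:04d}/"
        f"month={event_time.month:02d}/"
        f"day={event_time.day:02d}/"
    )
    if source:
        # Optional sub-partition for query pruning on a second dimension.
        prefix += f"source={source}/"
    return prefix

print(s3_partition_prefix("clickstream", datetime(2026, 4, 7, tzinfo=timezone.utc), "web"))
# clickstream/year=2026/month=04/day=07/source=web/
```

Objects would then be written under this prefix (for example via `boto3`), and the Glue Data Catalog picks up `year`, `month`, `day`, and `source` as partition columns.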

Storage class trade-offs come up in cost optimization rounds. Standard: $0.023/GB/month, instant access, expensive at petabyte scale. IA: $0.0125/GB/month, instant access, retrieval cost. Glacier Instant Retrieval: $0.004/GB/month, instant access, higher retrieval cost. Glacier Deep Archive: $0.00099/GB/month, 12-hour retrieval. Lifecycle policies move data automatically; Intelligent Tiering does it dynamically based on access patterns. Knowing the cost-vs-latency curve is the senior signal.
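The cost-vs-latency curve can be made concrete with a back-of-envelope blended-cost model. The sketch below uses the per-GB-month prices from the text and hypothetical age-distribution fractions; retrieval and request fees are ignored for simplicity:

```python
# Per-GB-month prices from the text; retrieval/request fees omitted.
PRICES = {"standard": 0.023, "ia": 0.0125, "glacier_ir": 0.004}

def monthly_cost(gb_by_class: dict) -> float:
    """Monthly storage bill given GB stored in each class."""
    return sum(PRICES[cls] * gb for cls, gb in gb_by_class.items())

# 100 TB archive under a 30/90-day lifecycle policy. The age-distribution
# fractions are illustrative assumptions, not measured values.
total_gb = 100 * 1024
tiered = monthly_cost({
    "standard": total_gb * 0.08,    # assume ~8% of data is <30 days old
    "ia": total_gb * 0.17,          # ~17% is 30-90 days old
    "glacier_ir": total_gb * 0.75,  # ~75% is older than 90 days
})
flat = monthly_cost({"standard": total_gb})
print(f"tiered ${tiered:,.0f}/mo vs flat ${flat:,.0f}/mo "
      f"({1 - tiered / flat:.0%} saved)")
```

Under these assumptions the lifecycle policy lands near the ~70% saving cited later in the 100TB clickstream question; the exact figure depends entirely on the data's age distribution.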

File format choice has even larger cost implications. Parquet vs CSV at petabyte scale: Parquet typically 5-10x smaller, scans 10-100x less when only specific columns queried. Iceberg over Parquet: adds ACID, time travel, schema evolution; the 2024-2026 frontier for AWS data lakes. Athena queries Iceberg tables natively, so the stack increasingly converges on S3 + Iceberg + Athena + Glue Data Catalog as the open lakehouse pattern.

Redshift Internals: The Most-Tested AWS Topic

Redshift questions show up in 76% of AWS DE loops. The expected depth: sort keys (sorted columns within each block; determine scan efficiency on filters), dist keys (determine how rows are distributed across nodes; determine join efficiency), RA3 architecture (compute separated from storage, so you can scale them independently; replaced legacy DC2 for most workloads), Redshift Spectrum (query S3 data without loading; useful for cold data), materialized views (precomputed aggregates with auto-refresh), and concurrency scaling (extra clusters for burst load).

Common interview prompt: a query is slow; here is the EXPLAIN plan; what would you do? The answer involves reading the join distribution strategy (DS_DIST_NONE is the co-located ideal; DS_BCAST_INNER and DS_DIST_BOTH signal redistribution), seeing whether sort keys align with WHERE filters, checking for skew on dist keys (one node has 10x the rows of others), and proposing fixes (re-distribute with a different dist key, add a sort key, add a materialized view for the hot aggregate).
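The skew check is easy to demonstrate. A rough sketch of the diagnostic (in practice you would pull per-slice row counts from Redshift system tables such as `svv_table_info`; the list below is hypothetical):

```python
def dist_key_skew(rows_per_node: list) -> float:
    """Ratio of the busiest node's row count to the mean.

    A rough proxy for dist-key skew: a healthy distribution sits near 1.0,
    while a hot dist-key value pushes the ratio well above it.
    """
    mean = sum(rows_per_node) / len(rows_per_node)
    return max(rows_per_node) / mean

# One node holding ~10x the others' rows -> a clear skew signal.
print(f"{dist_key_skew([1_000_000, 95_000, 98_000, 102_000]):.2f}")
```

A ratio this far above 1.0 is the cue to pick a higher-cardinality dist key or switch a small table to ALL distribution.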

Redshift Serverless changed the cost model significantly in 2023; expect questions about when serverless vs provisioned is right (serverless better for spiky workloads, provisioned better for sustained load). Strong candidates discuss the cost crossover point.
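A crossover-point discussion can be grounded in simple arithmetic. The sketch below uses assumed rates (both the hourly node price and the RPU-hour price are placeholders, not current AWS list prices):

```python
# Hypothetical rates for illustration only -- check current AWS pricing.
PROVISIONED_HOURLY = 3.26    # e.g. one ra3.4xlarge node, on-demand (assumed)
SERVERLESS_RPU_HOUR = 0.375  # price per RPU-hour (assumed)
BASE_RPUS = 8                # assumed base capacity for the workload

def monthly_provisioned(hours: float = 730) -> float:
    """Provisioned cluster bills for every hour it runs, busy or idle."""
    return PROVISIONED_HOURLY * hours

def monthly_serverless(active_hours: float, rpus: int = BASE_RPUS) -> float:
    """Serverless bills only for RPU-hours while queries are running."""
    return SERVERLESS_RPU_HOUR * rpus * active_hours

# Spiky workload (~4 active hours/day) vs the always-on provisioned bill:
spiky = monthly_serverless(active_hours=4 * 30)
print(f"provisioned ${monthly_provisioned():,.0f}/mo, "
      f"serverless (spiky) ${spiky:,.0f}/mo")
```

The crossover is wherever `monthly_serverless(active_hours)` meets the provisioned bill for your RPU count; the strong-candidate move is to state that formula, not a memorized number.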

Six Real AWS Data Engineer Interview Questions

S3 · L4

Design the partitioning scheme for a 100TB clickstream archive on S3

Partition by event_date for primary partitioning (queries always filter by date). Sub-partition by source for query pruning if queries often filter by source. File format Parquet with snappy compression. File size target 128MB to 1GB per file (avoid small-file problem). Lifecycle: Standard for first 30 days, IA for 30-90 days, Glacier IR for 90+ days. Cost reduction from this lifecycle is roughly 70% vs Standard-only.
Redshift · L5

This Redshift query takes 5 minutes; the EXPLAIN shows a DS_DIST_BOTH. How would you fix it?

DS_DIST_BOTH means both sides of the join are being redistributed across nodes. Fix options: align the dist key on both tables to the join column (so they co-locate, eliminating redistribution); if one table is small (<3M rows), set its diststyle to ALL so every node has a copy. Walk through the trade-off: ALL distribution duplicates storage; co-location forces one specific dist key choice that may hurt other queries.
Glue · L5

Design a Glue ETL pipeline for a 1TB daily incremental load

PySpark Glue job triggered by EventBridge schedule. Use Glue Job Bookmarks for incremental processing (tracks which S3 files have been processed). Output to S3 in Parquet with date partitioning. Update Glue Data Catalog post-write for downstream Athena and Redshift Spectrum queries. Failure handling: enable retries with idempotent writes (use partition overwrite, not append); send failures to dead-letter S3 prefix.
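The incremental-processing core of Job Bookmarks can be sketched in plain Python. This is a simplified file-level model (real bookmarks track timestamps and offsets per source, managed by Glue itself); the S3 keys are hypothetical:

```python
def unprocessed_files(listed: set, bookmarked: set) -> set:
    """Mimic Glue Job Bookmark semantics at the file level: only keys
    not recorded by a prior successful run are picked up this run."""
    return listed - bookmarked

# State recorded by yesterday's successful run (illustrative keys).
bookmarked = {"s3://lake/raw/day=2026-04-06/part-0.parquet"}
# Today's S3 listing includes one new file.
listed = bookmarked | {"s3://lake/raw/day=2026-04-07/part-0.parquet"}

print(unprocessed_files(listed, bookmarked))
```

Pairing this with partition overwrite (rather than append) is what makes retries idempotent: re-running the job rewrites the same `day=` partition instead of duplicating rows.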
Kinesis · L5

When would you choose Kinesis Data Streams vs MSK vs Kinesis Firehose?

Kinesis Data Streams: when you need a Kafka-like abstraction with AWS-managed scaling and lower operational overhead than MSK. MSK: when you need full Kafka compatibility (specific tooling, specific consumer protocols), or when your team has Kafka expertise. Firehose: when the destination is S3, Redshift, or OpenSearch and you want a fully managed delivery without writing consumer code. The honest answer: Firehose for simple delivery, Kinesis Data Streams for custom processing, MSK only when Kafka-specific.
System Design · L5

Design a real-time anomaly detection pipeline on AWS

Source events -> Kinesis Data Streams (sharded by entity_id) -> Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) for stateful aggregation -> SageMaker endpoint for ML scoring -> SNS for alerts on anomalies + S3 audit log. Cover: shard count sizing (1MB/sec or 1,000 records/sec write limit per shard), exactly-once processing via Flink checkpointing on the consumer side (the KPL alone gives at-least-once delivery), hot-key handling via mod-N re-sharding, model versioning via SageMaker endpoints with traffic shifting.
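Shard count sizing is a calculation interviewers expect you to do on the spot. A minimal sketch using the per-shard write limits cited above (the example throughput numbers are hypothetical):

```python
import math

def kinesis_shard_count(peak_mb_per_sec: float, peak_records_per_sec: float) -> int:
    """Minimum shards for a stream, given the per-shard write limits of
    1 MB/sec and 1,000 records/sec -- whichever constraint binds first.
    Real designs add headroom for traffic spikes and hot keys."""
    return max(
        math.ceil(peak_mb_per_sec / 1.0),
        math.ceil(peak_records_per_sec / 1000.0),
        1,  # a stream always needs at least one shard
    )

print(kinesis_shard_count(peak_mb_per_sec=12.5, peak_records_per_sec=40_000))  # 40
```

Here the record-rate constraint binds (40 shards) even though bandwidth alone would need only 13; stating which limit binds is the senior signal.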
Cost · L5

How would you reduce AWS DE infrastructure cost by 40%?

Audit the largest cost lines: typically Redshift cluster runtime, Glue job runtime, S3 storage, and data transfer. For Redshift: pause idle clusters, consider Serverless for spiky workloads, optimize high-spend queries (the top 20 queries usually account for 80% of compute). For Glue: right-size worker types (G.1X for most jobs, G.2X for compute-heavy), use Glue 4.0 for performance improvements, schedule jobs to avoid over-provisioning. For S3: lifecycle to IA / Glacier, Intelligent Tiering for variable-access data, Parquet to compress data 5-10x. For data transfer: same-region resources, VPC endpoints to avoid NAT Gateway costs, CloudFront for cross-region delivery.

AWS Data Engineer Compensation (2026)

Total comp ranges. US-based, sourced from levels.fyi and verified offers.

| Company | Senior AWS DE range | Notes |
| --- | --- | --- |
| Amazon (internal) | $280K - $420K | L6 / Sr. DE, AWS-native by definition |
| Netflix | $450K - $650K | Mostly AWS; all-cash comp philosophy |
| Stripe | $300K - $450K | AWS-heavy infrastructure |
| Airbnb | $320K - $480K | AWS-heavy with custom in-house tools |
| Lyft / DoorDash / Uber | $240K - $370K | AWS-heavy production stack |
| Robinhood | $240K - $370K | AWS-heavy with regulated-industry constraints |
| Mid-size SaaS on AWS | $190K - $290K | AWS knowledge a baseline expectation |
| Non-tech industry | $150K - $230K | AWS skills transfer well; comp lower than tech |

How AWS Connects to the Rest of the Cluster

AWS knowledge is the foundation for AWS Redshift interview prep, Glue interview prep for AWS Data Engineer roles, and most company guides since most large companies run on AWS: how to pass the Netflix Data Engineer interview, how to pass the Stripe Data Engineer interview, how to pass the Airbnb Data Engineer interview, how to pass the Lyft Data Engineer interview, how to pass the Robinhood Data Engineer interview.

The system design framework from how to pass the system design round applies, but substitute AWS service names (S3 for object storage, Redshift for warehouse, Kinesis or MSK for message broker, Glue or EMR for ETL, Lambda for serverless transformations, MWAA or Step Functions for orchestration). For the cloud comparison, see the how to pass the GCP Data Engineer interview and how to pass the Azure Data Engineer interview guides.

Data Engineer Interview Prep FAQ

Should I learn Glue or EMR?
Glue first; EMR as secondary. Glue is AWS's managed serverless ETL service and the more-tested system in interviews. EMR is the right choice when the team has existing Spark workloads to migrate or needs custom Hadoop ecosystem tools (Hudi, Hive, HBase). Glue 4.0 (released 2023) closed most of the historical gap.
Are AWS certifications useful for DE roles?
The AWS Certified Data Engineer – Associate (launched 2024) is the most directly relevant credential; it replaced the now-retired Data Analytics Specialty. The Solutions Architect Associate still signals foundational AWS knowledge. Neither is required for senior roles, but both can help unlock interviews for early-career candidates without AWS work experience.
How important is Redshift vs Snowflake knowledge for AWS DE roles?
Redshift first if the company is AWS-native. Snowflake knowledge transfers (both columnar warehouses with similar concepts) but the cost models and internals differ significantly. If the company uses Snowflake on AWS, prep Snowflake; otherwise prep Redshift.
What's the difference between Kinesis and MSK?
Kinesis Data Streams: AWS-proprietary, simpler, lower operational overhead, sharding model. MSK: AWS-managed Kafka, full Kafka API compatibility, more operational complexity. Pick Kinesis when you don't need Kafka-specific compatibility; pick MSK when you do.
How does AWS DE comp compare to GCP DE comp?
Roughly equivalent at the same level. The cloud platform doesn't significantly affect comp; the company does. AWS DE jobs are higher-volume because more companies run on AWS, but per-role comp is similar.
Is Lambda real for data engineering?
Yes, for specific patterns. S3 event triggers for small file ingestion, fan-out from one source to many consumers, lightweight transformations on streaming data. Not appropriate for: long-running batch jobs (15-min limit), heavy compute (memory ceiling), stateful processing. Know when Lambda fits and when Glue or EMR is the right call.
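The S3-event-trigger pattern is worth being able to write from memory. A minimal handler sketch that parses the S3 event payload; the bucket and key below are hypothetical, and the actual processing call (e.g. via `boto3`) is deliberately omitted so the parsing logic stays self-contained:

```python
def lambda_handler(event: dict, context=None) -> list:
    """Extract (bucket, key) pairs from an S3 event trigger payload.

    A real handler would hand each pair to boto3 for processing; that
    side effect is omitted here so the logic is testable offline.
    """
    return [
        (rec["s3"]["bucket"]["name"], rec["s3"]["object"]["key"])
        for rec in event.get("Records", [])
    ]

# Shape of the S3 notification payload Lambda receives (trimmed to the
# fields the handler reads; names here are illustrative).
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "raw-zone"},
                "object": {"key": "day=2026-04-07/f.json"}}}
    ]
}
print(lambda_handler(sample_event))
```

Remember that S3 notifications can batch multiple records into one invocation, which is why the handler iterates rather than assuming a single object.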
How important is IAM in the interview?
Increasingly important at L5+. AWS-native security questions show up in system design rounds: how do you scope IAM roles, when do you use VPC endpoints vs NAT Gateway, how do you implement column-level security in Redshift, how do you audit data access. Know the basics; senior roles probe deeper.
Is AWS DE hiring strong in 2026?
Yes, the strongest of the cloud platforms by volume. AWS-native companies hire steadily; non-tech industry hiring is increasingly AWS-focused; Amazon itself hires consistently. The opportunity is broader than the FAANG focus suggests.

Practice AWS-Native System Design

Drill S3 patterns, Redshift internals, Glue ETL, and Kinesis architectures in our practice sandbox.

Start Practicing

More Data Engineer Interview Prep Guides


50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.
