AWS is the most common cloud platform in 2026 data engineering, and AWS data engineer roles are correspondingly the highest-volume specialization. The interview tests standard data engineering fundamentals plus deep familiarity with the AWS data stack: Glue, Redshift, Kinesis, EMR, S3, Lambda. Companies that hire AWS DEs heavily: Amazon itself, Netflix (mostly AWS with some in-house tooling), Stripe, Airbnb, Lyft, DoorDash, Robinhood, and most non-tech-industry data engineering teams. Loops run 4 to 5 weeks. This page is part of the complete data engineer interview preparation framework.
Service frequencies below come from 156 reported AWS data engineer loops, 2024-2026.
| Service | Test Frequency | Depth Expected |
|---|---|---|
| S3 | 100% | Storage classes, lifecycle, partitioning, transfer acceleration, intelligent tiering |
| Glue (ETL + Data Catalog) | 84% | ETL jobs in PySpark, crawlers, Data Catalog, job bookmarking |
| Redshift | 76% | Sort keys, dist keys, RA3 architecture, materialized views, Spectrum |
| Kinesis Data Streams | 67% | Sharding, ordering, at-least-once delivery with KCL checkpointing, vs MSK trade-off |
| Kinesis Data Firehose | 54% | Delivery to S3, Redshift, OpenSearch with transformation |
| MSK (Managed Kafka) | 47% | When to choose vs Kinesis Data Streams |
| Lambda | 62% | Serverless transformations, S3 event triggers, fan-out patterns |
| EMR (managed Spark/Hadoop) | 58% | When to use vs Glue, ephemeral clusters, EMR Serverless |
| Athena | 61% | Serverless SQL on S3, Iceberg integration, partition projection |
| DynamoDB | 47% | OLTP serving for low-latency lookups, streams to data lake |
| Step Functions | 38% | Orchestration vs Airflow trade-off |
| MWAA (Managed Airflow) | 44% | Hosted Airflow alternative to self-managed |
| DMS / DataSync | 32% | Migration and CDC patterns |
| IAM and VPC patterns | 67% | Security-aware design, especially senior roles |
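The Kinesis sharding row above hinges on one mechanic worth internalizing: Kinesis MD5-hashes each record's partition key and routes the record to the shard whose hash-key range contains the result, which is why all records for one key stay in order on one shard. A minimal sketch, assuming evenly split hash-key ranges (real streams can have uneven ranges after resharding); `shard_for_key` is an illustrative name, not an SDK function:

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index the way Kinesis routes records:
    MD5-hash the key to a 128-bit integer, then find the shard whose
    hash-key range contains it (here: equal-width ranges)."""
    hashed = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = (2 ** 128) // num_shards  # each shard owns one slice of the hash space
    return min(hashed // range_size, num_shards - 1)

# Same key -> same shard, which is what preserves per-key ordering.
assert shard_for_key("user-42", 4) == shard_for_key("user-42", 4)
```

This also explains the classic hot-shard interview follow-up: a low-cardinality partition key hashes to few distinct values, so some shards sit idle while one throttles.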
Every AWS data engineering pipeline lands in S3 at some point. The interview probes S3 fluency at multiple depths: storage classes (Standard for hot, IA for warm, Glacier for cold), partitioning patterns (year=YYYY/month=MM/day=DD for Hive-compatible partitioning), file format choice (Parquet for analytics, Avro for evolution-heavy ingestion, JSON for human debugging only), and transfer optimization (transfer acceleration, multipart upload, AWS Transfer Family for SFTP-style ingestion).
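The Hive-compatible partitioning convention above is simple enough to sketch directly; this is an illustrative helper (the `hive_partition_key` name and `s3://lake/events` prefix are made up), with zero-padded months and days so lexicographic order matches chronological order and Glue crawlers or Athena can discover partitions:

```python
from datetime import date

def hive_partition_key(prefix: str, dt: date) -> str:
    """Build a Hive-compatible key prefix (year=/month=/day=).
    Zero-padding keeps lexicographic sort == chronological sort."""
    return f"{prefix}/year={dt.year:04d}/month={dt.month:02d}/day={dt.day:02d}/"

assert (hive_partition_key("s3://lake/events", date(2026, 3, 7))
        == "s3://lake/events/year=2026/month=03/day=07/")
```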
Storage class trade-offs come up in cost optimization rounds. Standard: $0.023/GB/month, instant access, expensive at petabyte scale. Standard-IA: $0.0125/GB/month, instant access, plus a per-GB retrieval fee. Glacier Instant Retrieval: $0.004/GB/month, instant access, higher retrieval fee. Glacier Deep Archive: $0.00099/GB/month, retrieval in up to 12 hours. Lifecycle policies move data automatically on a schedule; Intelligent-Tiering moves it dynamically based on observed access patterns. Knowing the cost-vs-latency curve is the senior signal.
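A back-of-envelope calculator using the per-GB prices quoted above (us-east-1 list prices; retrieval fees and request charges deliberately omitted) makes the petabyte-scale gap concrete; `monthly_cost` is an illustrative helper:

```python
# Monthly storage price per GB, from the figures quoted above (us-east-1 list).
PRICE_PER_GB = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER_IR": 0.004,
    "DEEP_ARCHIVE": 0.00099,
}

def monthly_cost(tb: float, storage_class: str) -> float:
    """Storage-only monthly cost; ignores retrieval and request fees."""
    return tb * 1024 * PRICE_PER_GB[storage_class]

# 1 PB of cold data: Standard vs Deep Archive
standard = monthly_cost(1024, "STANDARD")      # ~ $24,117/month
archive = monthly_cost(1024, "DEEP_ARCHIVE")   # ~ $1,038/month
```

Roughly a 23x storage saving, which is the shape of answer cost optimization rounds want, followed by the caveat: Deep Archive only pays off if you rarely retrieve and can tolerate the hours-long restore.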
File format choice has even larger cost implications. Parquet vs CSV at petabyte scale: Parquet typically 5-10x smaller, scans 10-100x less when only specific columns queried. Iceberg over Parquet: adds ACID, time travel, schema evolution; the 2024-2026 frontier for AWS data lakes. Athena queries Iceberg tables natively, so the stack increasingly converges on S3 + Iceberg + Athena + Glue Data Catalog as the open lakehouse pattern.
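The "scans 10-100x less" claim above is just arithmetic on compression and column pruning, which is worth being able to do on a whiteboard. A sketch assuming Athena's pay-per-TB-scanned model (the $5/TB price, the 8x compression ratio, and the 2-of-20-columns workload are illustrative assumptions):

```python
def athena_scan_cost(data_tb: float, price_per_tb: float = 5.0) -> float:
    """Athena-style billing: you pay for bytes scanned (price assumed $5/TB)."""
    return data_tb * price_per_tb

# Full scan of 1 PB of CSV: every query reads every byte.
csv_scan = athena_scan_cost(1024)  # $5,120 per query

# Parquet at an assumed 8x compression, reading 2 of 20 columns:
# ~1/80 of the bytes touched per query.
parquet_scan = athena_scan_cost(1024 / 8 * (2 / 20))  # ~ $64 per query
```

The same arithmetic is why partition pruning stacks multiplicatively on top: filter to one day of a year-partitioned table and the scanned bytes drop by another ~365x.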
Redshift questions show up in 76% of AWS DE loops. The depth: sort keys (define the physical row order on disk; zone maps let Redshift skip blocks, so they determine scan efficiency on filters), dist keys (determine how rows are distributed across nodes; determine join efficiency), RA3 architecture (compute separated from storage; lets you scale them independently; replaced legacy DC2 for most workloads), Redshift Spectrum (query S3 data without loading; useful for cold data), materialized views (precomputed aggregates with auto-refresh), and concurrency scaling (extra clusters for burst load).
Common interview prompt: a query is slow; here is the EXPLAIN plan; what would you do? The answer involves reading the join distribution strategy (DS_BCAST_INNER broadcasts the inner table to every node; DS_DIST_NONE means no redistribution was needed), checking whether sort keys align with the WHERE filters, checking for skew on dist keys (one slice holds 10x the rows of the others), and proposing fixes (redistribute on a different dist key, add a sort key, add a materialized view for the hot aggregate).
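The skew check in that answer reduces to one ratio: the busiest slice's row count over the mean. A minimal sketch (the `dist_key_skew` helper is illustrative; in a real cluster you'd read this from Redshift's `SVV_TABLE_INFO` rather than compute it yourself):

```python
def dist_key_skew(rows_per_slice: list) -> float:
    """Skew ratio: max slice row count over the mean.
    ~1.0 means an even distribution; a large ratio means one slice
    does most of the work, the symptom behind the slow EXPLAIN plan."""
    mean = sum(rows_per_slice) / len(rows_per_slice)
    return max(rows_per_slice) / mean

# One hot slice holding ~10x the rows of the others:
assert dist_key_skew([100, 100, 100, 1000]) > 3
assert dist_key_skew([100, 100, 100, 100]) == 1.0
```

A high ratio usually traces back to a low-cardinality or heavily skewed dist key (e.g. distributing on a country column where one country dominates), which is the fix interviewers expect you to name.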
Redshift Serverless changed the cost model significantly in 2023; expect questions about when serverless vs provisioned is right (serverless better for spiky workloads, provisioned better for sustained load). Strong candidates discuss the cost crossover point.
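The crossover point is another piece of whiteboard arithmetic. A hedged sketch: the $0.375/RPU-hour serverless rate and the ~$1.086/hour ra3.xlplus node rate are assumed illustrative prices, and `serverless_monthly` / `provisioned_monthly` are made-up helper names:

```python
def serverless_monthly(rpus: int, busy_hours: float,
                       rpu_hour_price: float = 0.375) -> float:
    """Serverless: pay only for hours with queries running (price assumed)."""
    return rpus * busy_hours * rpu_hour_price

def provisioned_monthly(nodes: int, node_hour_price: float = 1.086,
                        hours: float = 730) -> float:
    """Provisioned: pay for every hour the cluster exists (price assumed)."""
    return nodes * node_hour_price * hours

# Spiky workload, 4 busy hours/day on 32 RPUs, vs a 2-node always-on cluster:
spiky = serverless_monthly(32, 4 * 30)   # $1,440/month
steady = provisioned_monthly(2)          # ~ $1,586/month
```

At ~4 busy hours a day the two are close under these assumed prices; push utilization toward 24/7 and provisioned wins decisively, which is the crossover reasoning strong candidates walk through.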
Total comp ranges. US-based, sourced from levels.fyi and verified offers.
| Company | Senior AWS DE range | Notes |
|---|---|---|
| Amazon (internal) | $280K - $420K | L6 / Sr. DE, AWS-native by definition |
| Netflix | $450K - $650K | Mostly AWS; all-cash comp philosophy |
| Stripe | $300K - $450K | AWS-heavy infrastructure |
| Airbnb | $320K - $480K | AWS-heavy with custom in-house tools |
| Lyft / DoorDash / Uber | $240K - $370K | AWS-heavy production stack |
| Robinhood | $240K - $370K | AWS-heavy with regulated-industry constraints |
| Mid-size SaaS on AWS | $190K - $290K | AWS knowledge a baseline expectation |
| Non-tech industry | $150K - $230K | AWS skills transfer well; comp lower than tech |
AWS knowledge is the foundation for AWS Redshift interview prep, Glue interview prep for AWS Data Engineer roles, and most of the company guides, since most large companies run on AWS: how to pass the Netflix Data Engineer interview, how to pass the Stripe Data Engineer interview, how to pass the Airbnb Data Engineer interview, how to pass the Lyft Data Engineer interview, how to pass the Robinhood Data Engineer interview.
The system design framework from how to pass the system design round applies, but substitute AWS service names (S3 for object storage, Redshift for the warehouse, Kinesis or MSK for the message broker, Glue or EMR for ETL, Lambda for serverless transformations, MWAA or Step Functions for orchestration). For the cloud comparison, see the how to pass the GCP Data Engineer interview and how to pass the Azure Data Engineer interview guides.
Drill S3 patterns, Redshift internals, Glue ETL, and Kinesis architectures in our practice sandbox.
Start Practicing
Senior Data Engineer interview process, scope-of-impact framing, technical leadership signals.
Staff Data Engineer interview process, cross-org scope, architectural decision rounds.
Principal Data Engineer interview process, multi-year vision rounds, executive influence signals.
Junior Data Engineer interview prep, fundamentals to drill, what gets cut from the loop.
Entry-level Data Engineer interview, what new-grad loops look like, projects that beat experience.
Analytics engineer interview, dbt and SQL focus, modeling-heavy take-homes.
Continue your prep
50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.