Cloud Specialization Guide

AWS Data Engineer Interview

AWS is the most common cloud platform in 2026 data engineering, and AWS data engineer roles are correspondingly the highest-volume specialization. The interview tests standard data engineering fundamentals plus deep familiarity with the AWS data stack: Glue, Redshift, Kinesis, EMR, S3, Lambda. Companies that hire AWS DEs heavily include Amazon itself, Netflix (mostly AWS with some in-house tooling), Stripe, Airbnb, Lyft, DoorDash, Robinhood, and most non-tech-industry data engineering teams. Loops typically run 4 to 5 weeks. This page is part of the complete data engineer interview preparation framework.

The Short Answer
Expect a 5-round AWS data engineer loop: recruiter screen, technical phone screen, system design (an AWS-native pipeline architecture), live coding (Python with boto3 patterns or PySpark on Glue), and behavioral. Distinctive emphasis: S3 partitioning and storage classes, Redshift internals (sort keys, dist keys, RA3, query optimization), Kinesis Data Streams vs Firehose vs MSK decision, Glue ETL job patterns, Lambda for serverless transformations, IAM-aware design, and explicit attention to AWS-specific cost optimization.
Updated April 2026 · By The DataDriven Team

AWS Services Tested in Data Engineer Loops

Frequency from 156 reported AWS data engineer loops in 2024-2026.

| Service | Test Frequency | Depth Expected |
| --- | --- | --- |
| S3 | 100% | Storage classes, lifecycle, partitioning, transfer acceleration, intelligent tiering |
| Glue (ETL + Data Catalog) | 84% | ETL jobs in PySpark, crawlers, Data Catalog, job bookmarking |
| Redshift | 76% | Sort keys, dist keys, RA3 architecture, materialized views, Spectrum |
| Kinesis Data Streams | 67% | Sharding, ordering, exactly-once via KCL, vs MSK trade-off |
| Kinesis Data Firehose | 54% | Delivery to S3, Redshift, OpenSearch with transformation |
| MSK (Managed Kafka) | 47% | When to choose vs Kinesis Data Streams |
| Lambda | 62% | Serverless transformations, S3 event triggers, fan-out patterns |
| EMR (managed Spark/Hadoop) | 58% | When to use vs Glue, ephemeral clusters, EMR Serverless |
| Athena | 61% | Serverless SQL on S3, Iceberg integration, partition projection |
| DynamoDB | 47% | OLTP serving for low-latency lookups, streams to data lake |
| Step Functions | 38% | Orchestration vs Airflow trade-off |
| MWAA (Managed Airflow) | 44% | Hosted Airflow alternative to self-managed |
| DMS / DataSync | 32% | Migration and CDC patterns |
| IAM and VPC patterns | 67% | Security-aware design, especially senior roles |

S3 Patterns: The Foundation of AWS Data Engineering

Every AWS data engineering pipeline lands in S3 at some point. The interview probes for S3 fluency at multiple depths: storage classes (Standard for hot, IA for warm, Glacier for cold), partitioning patterns (year=YYYY/month=MM/day=DD for Hive-compatible partitioning), file format choice (Parquet for analytics, Avro for evolution-heavy ingestion, JSON for human debugging only), and transfer optimization (transfer acceleration, multipart upload, S3 Transfer Family for ingestion).
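The Hive-compatible partitioning pattern above is mechanical enough to sketch in code. A minimal illustration of building an S3 key prefix from an event timestamp (the `table` and `source` names are hypothetical, not AWS conventions):

```python
from datetime import datetime, timezone

def s3_partition_prefix(table: str, event_time: datetime, source: str = "") -> str:
    """Build a Hive-compatible S3 key prefix: table/year=YYYY/month=MM/day=DD/.

    Zero-padded month and day keep prefixes lexically sortable, which is
    what Glue crawlers and Athena partition pruning expect.
    """
    prefix = (
        f"{table}/"
        f"year={event_time.year:04d}/"
        f"month={event_time.month:02d}/"
        f"day={event_time.day:02d}/"
    )
    if source:
        # Optional sub-partition for query pruning on a second dimension.
        prefix += f"source={source}/"
    return prefix

print(s3_partition_prefix("clickstream", datetime(2026, 4, 7, tzinfo=timezone.utc), "web"))
# clickstream/year=2026/month=04/day=07/source=web/
```

Objects would then be written under this prefix (for example via `boto3`), and the Glue Data Catalog picks up `year`, `month`, `day`, and `source` as partition columns.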

Storage class trade-offs come up in cost optimization rounds. Standard: $0.023/GB/month, instant access, expensive at petabyte scale. IA: $0.0125/GB/month, instant access, retrieval cost. Glacier Instant Retrieval: $0.004/GB/month, instant access, higher retrieval cost. Glacier Deep Archive: $0.00099/GB/month, 12-hour retrieval. Lifecycle policies move data automatically; Intelligent Tiering does it dynamically based on access patterns. Knowing the cost-vs-latency curve is the senior signal.
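The cost-vs-latency curve can be made concrete with a back-of-envelope blended-cost model. The sketch below uses the per-GB-month prices from the text and hypothetical age-distribution fractions; retrieval and request fees are ignored for simplicity:

```python
# Per-GB-month prices from the text; retrieval/request fees omitted.
PRICES = {"standard": 0.023, "ia": 0.0125, "glacier_ir": 0.004}

def monthly_cost(gb_by_class: dict) -> float:
    """Monthly storage bill given GB stored in each class."""
    return sum(PRICES[cls] * gb for cls, gb in gb_by_class.items())

# 100 TB archive under a 30/90-day lifecycle policy. The age-distribution
# fractions are illustrative assumptions, not measured values.
total_gb = 100 * 1024
tiered = monthly_cost({
    "standard": total_gb * 0.08,    # assume ~8% of data is <30 days old
    "ia": total_gb * 0.17,          # ~17% is 30-90 days old
    "glacier_ir": total_gb * 0.75,  # ~75% is older than 90 days
})
flat = monthly_cost({"standard": total_gb})
print(f"tiered ${tiered:,.0f}/mo vs flat ${flat:,.0f}/mo "
      f"({1 - tiered / flat:.0%} saved)")
```

Under these assumptions the lifecycle policy lands near the ~70% saving cited later in the 100TB clickstream question; the exact figure depends entirely on the data's age distribution.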

File format choice has even larger cost implications. Parquet vs CSV at petabyte scale: Parquet typically 5-10x smaller, scans 10-100x less when only specific columns queried. Iceberg over Parquet: adds ACID, time travel, schema evolution; the 2024-2026 frontier for AWS data lakes. Athena queries Iceberg tables natively, so the stack increasingly converges on S3 + Iceberg + Athena + Glue Data Catalog as the open lakehouse pattern.

Redshift Internals: The Most-Tested AWS Topic

Redshift questions show up in 76% of AWS DE loops. The expected depth: sort keys (sorted columns within each block; determine scan efficiency on filters), dist keys (determine how rows are distributed across nodes; determine join efficiency), RA3 architecture (compute separated from storage, so you can scale them independently; replaced legacy DC2 for most workloads), Redshift Spectrum (query S3 data without loading; useful for cold data), materialized views (precomputed aggregates with auto-refresh), and concurrency scaling (extra clusters for burst load).

Common interview prompt: a query is slow; here is the EXPLAIN plan; what would you do? The answer involves reading the join distribution strategy (DS_DIST_NONE is the co-located ideal; DS_BCAST_INNER and DS_DIST_BOTH signal redistribution), seeing whether sort keys align with WHERE filters, checking for skew on dist keys (one node has 10x the rows of others), and proposing fixes (re-distribute with a different dist key, add a sort key, add a materialized view for the hot aggregate).
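The skew check is easy to demonstrate. A rough sketch of the diagnostic (in practice you would pull per-slice row counts from Redshift system tables such as `svv_table_info`; the list below is hypothetical):

```python
def dist_key_skew(rows_per_node: list) -> float:
    """Ratio of the busiest node's row count to the mean.

    A rough proxy for dist-key skew: a healthy distribution sits near 1.0,
    while a hot dist-key value pushes the ratio well above it.
    """
    mean = sum(rows_per_node) / len(rows_per_node)
    return max(rows_per_node) / mean

# One node holding ~10x the others' rows -> a clear skew signal.
print(f"{dist_key_skew([1_000_000, 95_000, 98_000, 102_000]):.2f}")
```

A ratio this far above 1.0 is the cue to pick a higher-cardinality dist key or switch a small table to ALL distribution.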

Redshift Serverless changed the cost model significantly in 2023; expect questions about when serverless vs provisioned is right (serverless better for spiky workloads, provisioned better for sustained load). Strong candidates discuss the cost crossover point.
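A crossover-point discussion can be grounded in simple arithmetic. The sketch below uses assumed rates (both the hourly node price and the RPU-hour price are placeholders, not current AWS list prices):

```python
# Hypothetical rates for illustration only -- check current AWS pricing.
PROVISIONED_HOURLY = 3.26    # e.g. one ra3.4xlarge node, on-demand (assumed)
SERVERLESS_RPU_HOUR = 0.375  # price per RPU-hour (assumed)
BASE_RPUS = 8                # assumed base capacity for the workload

def monthly_provisioned(hours: float = 730) -> float:
    """Provisioned cluster bills for every hour it runs, busy or idle."""
    return PROVISIONED_HOURLY * hours

def monthly_serverless(active_hours: float, rpus: int = BASE_RPUS) -> float:
    """Serverless bills only for RPU-hours while queries are running."""
    return SERVERLESS_RPU_HOUR * rpus * active_hours

# Spiky workload (~4 active hours/day) vs the always-on provisioned bill:
spiky = monthly_serverless(active_hours=4 * 30)
print(f"provisioned ${monthly_provisioned():,.0f}/mo, "
      f"serverless (spiky) ${spiky:,.0f}/mo")
```

The crossover is wherever `monthly_serverless(active_hours)` meets the provisioned bill for your RPU count; the strong-candidate move is to state that formula, not a memorized number.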

Six Real AWS Data Engineer Interview Questions

S3 · L4

Design the partitioning scheme for a 100TB clickstream archive on S3

Partition by event_date for primary partitioning (queries always filter by date). Sub-partition by source for query pruning if queries often filter by source. File format Parquet with snappy compression. File size target 128MB to 1GB per file (avoid small-file problem). Lifecycle: Standard for first 30 days, IA for 30-90 days, Glacier IR for 90+ days. Cost reduction from this lifecycle is roughly 70% vs Standard-only.
Redshift · L5

This Redshift query takes 5 minutes; the EXPLAIN shows a DS_DIST_BOTH. How would you fix it?

DS_DIST_BOTH means both sides of the join are being redistributed across nodes. Fix options: align the dist key on both tables to the join column (so they co-locate, eliminating redistribution); if one table is small (<3M rows), set its diststyle to ALL so every node has a copy. Walk through the trade-off: ALL distribution duplicates storage; co-location forces one specific dist key choice that may hurt other queries.
Glue · L5

Design a Glue ETL pipeline for a 1TB daily incremental load

PySpark Glue job triggered by EventBridge schedule. Use Glue Job Bookmarks for incremental processing (tracks which S3 files have been processed). Output to S3 in Parquet with date partitioning. Update Glue Data Catalog post-write for downstream Athena and Redshift Spectrum queries. Failure handling: enable retries with idempotent writes (use partition overwrite, not append); send failures to dead-letter S3 prefix.
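The incremental-processing core of Job Bookmarks can be sketched in plain Python. This is a simplified file-level model (real bookmarks track timestamps and offsets per source, managed by Glue itself); the S3 keys are hypothetical:

```python
def unprocessed_files(listed: set, bookmarked: set) -> set:
    """Mimic Glue Job Bookmark semantics at the file level: only keys
    not recorded by a prior successful run are picked up this run."""
    return listed - bookmarked

# State recorded by yesterday's successful run (illustrative keys).
bookmarked = {"s3://lake/raw/day=2026-04-06/part-0.parquet"}
# Today's S3 listing includes one new file.
listed = bookmarked | {"s3://lake/raw/day=2026-04-07/part-0.parquet"}

print(unprocessed_files(listed, bookmarked))
```

Pairing this with partition overwrite (rather than append) is what makes retries idempotent: re-running the job rewrites the same `day=` partition instead of duplicating rows.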
Kinesis · L5

When would you choose Kinesis Data Streams vs MSK vs Kinesis Firehose?

Kinesis Data Streams: when you need a Kafka-like abstraction with AWS-managed scaling and lower operational overhead than MSK. MSK: when you need full Kafka compatibility (specific tooling, specific consumer protocols), or when your team has Kafka expertise. Firehose: when the destination is S3, Redshift, or OpenSearch and you want a fully managed delivery without writing consumer code. The honest answer: Firehose for simple delivery, Kinesis Data Streams for custom processing, MSK only when Kafka-specific.
System Design · L5

Design a real-time anomaly detection pipeline on AWS

Source events -> Kinesis Data Streams (sharded by entity_id) -> Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) for stateful aggregation -> SageMaker endpoint for ML scoring -> SNS for alerts on anomalies + S3 audit log. Cover: shard count sizing (1MB/sec or 1,000 records/sec write limit per shard), exactly-once processing via Flink checkpointing on the consumer side (the KPL alone gives at-least-once delivery), hot-key handling via mod-N re-sharding, model versioning via SageMaker endpoints with traffic shifting.
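Shard count sizing is a calculation interviewers expect you to do on the spot. A minimal sketch using the per-shard write limits cited above (the example throughput numbers are hypothetical):

```python
import math

def kinesis_shard_count(peak_mb_per_sec: float, peak_records_per_sec: float) -> int:
    """Minimum shards for a stream, given the per-shard write limits of
    1 MB/sec and 1,000 records/sec -- whichever constraint binds first.
    Real designs add headroom for traffic spikes and hot keys."""
    return max(
        math.ceil(peak_mb_per_sec / 1.0),
        math.ceil(peak_records_per_sec / 1000.0),
        1,  # a stream always needs at least one shard
    )

print(kinesis_shard_count(peak_mb_per_sec=12.5, peak_records_per_sec=40_000))  # 40
```

Here the record-rate constraint binds (40 shards) even though bandwidth alone would need only 13; stating which limit binds is the senior signal.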
Cost · L5

How would you reduce AWS DE infrastructure cost by 40%?

Audit the largest cost lines: typically Redshift cluster runtime, Glue job runtime, S3 storage, and data transfer. For Redshift: pause idle clusters, consider Serverless for spiky workloads, optimize high-spend queries (the top 20 queries usually account for 80% of compute). For Glue: right-size worker types (G.1X for most jobs, G.2X for compute-heavy), use Glue 4.0 for performance improvements, schedule jobs to avoid over-provisioning. For S3: lifecycle to IA / Glacier, Intelligent Tiering for variable-access data, Parquet to compress data 5-10x. For data transfer: same-region resources, VPC endpoints to avoid NAT Gateway costs, CloudFront for cross-region delivery.

AWS Data Engineer Compensation (2026)

Total comp ranges. US-based, sourced from levels.fyi and verified offers.

| Company | Senior AWS DE range | Notes |
| --- | --- | --- |
| Amazon (internal) | $280K - $420K | L6 / Sr. DE, AWS-native by definition |
| Netflix | $450K - $650K | Mostly AWS; all-cash comp philosophy |
| Stripe | $300K - $450K | AWS-heavy infrastructure |
| Airbnb | $320K - $480K | AWS-heavy with custom in-house tools |
| Lyft / DoorDash / Uber | $240K - $370K | AWS-heavy production stack |
| Robinhood | $240K - $370K | AWS-heavy with regulated-industry constraints |
| Mid-size SaaS on AWS | $190K - $290K | AWS knowledge a baseline expectation |
| Non-tech industry | $150K - $230K | AWS skills transfer well; comp lower than tech |

How AWS Connects to the Rest of the Cluster

AWS knowledge is the foundation for AWS Redshift interview prep, Glue interview prep for AWS Data Engineer roles, and most company guides since most large companies run on AWS: how to pass the Netflix Data Engineer interview, how to pass the Stripe Data Engineer interview, how to pass the Airbnb Data Engineer interview, how to pass the Lyft Data Engineer interview, how to pass the Robinhood Data Engineer interview.

The system design framework from how to pass the system design round applies, but substitute AWS service names (S3 for object storage, Redshift for warehouse, Kinesis or MSK for message broker, Glue or EMR for ETL, Lambda for serverless transformations, MWAA or Step Functions for orchestration). For the cloud comparison, see the how to pass the GCP Data Engineer interview and how to pass the Azure Data Engineer interview guides.

Data Engineer Interview Prep FAQ

Should I learn Glue or EMR?
Glue first; EMR as secondary. Glue is AWS's managed serverless ETL service and the more-tested system in interviews. EMR is the right choice when the team has existing Spark workloads to migrate or needs custom Hadoop ecosystem tools (Hudi, Hive, HBase). Glue 4.0 (released 2023) closed most of the historical gap.
Are AWS certifications useful for DE roles?
The AWS Certified Data Engineer – Associate (launched 2024) is the most directly relevant credential; it replaced the now-retired Data Analytics Specialty. The Solutions Architect Associate still signals foundational AWS knowledge. Neither is required for senior roles, but both can help unlock interviews for early-career candidates without AWS work experience.
How important is Redshift vs Snowflake knowledge for AWS DE roles?
Redshift first if the company is AWS-native. Snowflake knowledge transfers (both columnar warehouses with similar concepts) but the cost models and internals differ significantly. If the company uses Snowflake on AWS, prep Snowflake; otherwise prep Redshift.
What's the difference between Kinesis and MSK?
Kinesis Data Streams: AWS-proprietary, simpler, lower operational overhead, sharding model. MSK: AWS-managed Kafka, full Kafka API compatibility, more operational complexity. Pick Kinesis when you don't need Kafka-specific compatibility; pick MSK when you do.
How does AWS DE comp compare to GCP DE comp?
Roughly equivalent at the same level. The cloud platform doesn't significantly affect comp; the company does. AWS DE jobs are higher-volume because more companies run on AWS, but per-role comp is similar.
Is Lambda real for data engineering?
Yes, for specific patterns. S3 event triggers for small file ingestion, fan-out from one source to many consumers, lightweight transformations on streaming data. Not appropriate for: long-running batch jobs (15-min limit), heavy compute (memory ceiling), stateful processing. Know when Lambda fits and when Glue or EMR is the right call.
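The S3-event-trigger pattern is worth being able to write from memory. A minimal handler sketch that parses the S3 event payload; the bucket and key below are hypothetical, and the actual processing call (e.g. via `boto3`) is deliberately omitted so the parsing logic stays self-contained:

```python
def lambda_handler(event: dict, context=None) -> list:
    """Extract (bucket, key) pairs from an S3 event trigger payload.

    A real handler would hand each pair to boto3 for processing; that
    side effect is omitted here so the logic is testable offline.
    """
    return [
        (rec["s3"]["bucket"]["name"], rec["s3"]["object"]["key"])
        for rec in event.get("Records", [])
    ]

# Shape of the S3 notification payload Lambda receives (trimmed to the
# fields the handler reads; names here are illustrative).
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "raw-zone"},
                "object": {"key": "day=2026-04-07/f.json"}}}
    ]
}
print(lambda_handler(sample_event))
```

Remember that S3 notifications can batch multiple records into one invocation, which is why the handler iterates rather than assuming a single object.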
How important is IAM in the interview?
Increasingly important at L5+. AWS-native security questions show up in system design rounds: how do you scope IAM roles, when do you use VPC endpoints vs NAT Gateway, how do you implement column-level security in Redshift, how do you audit data access. Know the basics; senior roles probe deeper.
Is AWS DE hiring strong in 2026?
Yes, the strongest of the cloud platforms by volume. AWS-native companies hire steadily; non-tech industry hiring is increasingly AWS-focused; Amazon itself hires consistently. The opportunity is broader than the FAANG focus suggests.

Practice AWS-Native System Design

Drill S3 patterns, Redshift internals, Glue ETL, and Kinesis architectures in our practice sandbox.

Start Practicing

More Data Engineer Interview Prep Guides


50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.
