Amazon Data Engineer Interview Questions
Amazon-tagged data engineer interview questions with live grading.
Amazon data engineer interview questions, tagged by reported interview shape. The stack is AWS-native: Redshift, Glue, EMR, Kinesis, S3, Athena, DynamoDB. Technical rounds weight correctness and clean code heavily. Behavioral and design rounds are framed around the Leadership Principles. The bar-raiser round is the cultural-fit gate unique to Amazon.
Amazon's data engineer interview loop in 2026 has 5-6 rounds and centers on the AWS data stack: Redshift (columnar warehouse with DISTKEY and SORTKEY decisions, materialized views, COPY for bulk load, VACUUM and ANALYZE operational story), Glue (serverless ETL with crawlers and Glue Catalog as metastore), EMR (Spark and Hive at scale), Kinesis Data Streams and Firehose (streaming ingest), S3 with Athena (data lake query layer), DynamoDB (operational reads). The data engineer SQL round is Redshift-flavored: DISTKEY and SORTKEY questions appear frequently, the COPY command for bulk loading is a typical sub-question, and the VACUUM/ANALYZE operational story comes up at L5+.
The Amazon data engineer Python round is pipeline-shaped: parse a malformed CSV, deduplicate events by composite key, implement retry with exponential backoff and jitter for a flaky Kinesis put or DynamoDB write. Vanilla Python is preferred; Pandas is allowed for SCD Type 2 merges and similar prompts. The bar is correctness plus error handling; silent failures in pipeline code are the explicit failure mode.
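The retry prompt above can be sketched in vanilla Python. This is a minimal illustration, not a reference solution: `TransientError` and `retry_with_backoff` are hypothetical names, the "full jitter" variant (sleep a random amount up to the capped exponential delay) is one common choice among several, and the `sleep` and `rng` parameters exist only to make the function testable without real waiting.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (throttling, timeouts)."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call operation() until it succeeds or the attempts run out.

    The delay cap doubles on each failed attempt (exponential backoff) and the
    actual sleep is a random fraction of that cap (full jitter).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                # Surface the failure to the caller instead of swallowing it.
                raise
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * cap)
```

In an interview setting, `operation` would wrap the flaky Kinesis put or DynamoDB write. Only errors classified as transient are retried, and the last failure is re-raised so nothing fails silently.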
The Amazon data engineer system design round expects an AWS-centric architecture. For 10B clickstream events per day: Kinesis Data Streams shard count sizing (10B events per day is about 116K events/sec on average; at an assumed ~1 KB per event that is about 116 MB/s, a 5x peak is about 580 MB/s, and at 1 MB/sec of ingest per shard that comes to about 580 shards), Firehose to S3 with Parquet conversion, Glue crawler for catalog updates, partitioned by date and hour, Athena for ad-hoc plus Redshift for BI workload, EMR for Spark on heavy joins. The cost question (how much would this run per month) comes up at L5+; rough back-of-envelope numbers matter (Kinesis at $X per shard-hour, S3 at $Y per GB, Redshift cluster pricing, Glue DPU pricing).
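The shard arithmetic above can be reproduced with a short back-of-envelope script. It is a sketch under stated assumptions: the ~1 KB average event size is inferred from the 580 MB/s peak figure rather than stated directly, `estimate_shards` is a hypothetical helper name, and the per-shard write limits (1 MB/s and 1,000 records/s) are treated as decimal units for simplicity.

```python
import math

SECONDS_PER_DAY = 86_400
SHARD_WRITE_BYTES_PER_SEC = 1_000_000  # 1 MB/s ingest per shard (decimal MB assumed)
SHARD_WRITE_RECORDS_PER_SEC = 1_000    # per-shard record-rate write limit


def estimate_shards(events_per_day, avg_event_bytes, peak_factor):
    """Return the shard count needed to absorb peak write traffic."""
    avg_events_per_sec = events_per_day / SECONDS_PER_DAY
    peak_events_per_sec = avg_events_per_sec * peak_factor
    peak_bytes_per_sec = peak_events_per_sec * avg_event_bytes
    # Size for whichever limit binds first: bytes/sec or records/sec.
    by_bytes = math.ceil(peak_bytes_per_sec / SHARD_WRITE_BYTES_PER_SEC)
    by_records = math.ceil(peak_events_per_sec / SHARD_WRITE_RECORDS_PER_SEC)
    return max(by_bytes, by_records)


# 10B events/day, ~1 KB each (assumed), 5x peak
print(estimate_shards(10_000_000_000, 1_000, 5))  # 579, i.e. roughly 580 shards
```

Checking both limits matters: with events much smaller than 1 KB, the records-per-second limit would bind before the throughput limit does.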
Leadership Principles (LP) framing is what makes Amazon distinct: 16 stated cultural values that Amazon's interview rubric explicitly maps every behavioral and design answer to. Ownership for "what happens when your pipeline breaks at 3am on a weekend". Frugality for "design this within a $5k/month budget". Bias for Action for "ship in 2 weeks versus 6 weeks with the right architecture". Insist on the Highest Standards for "why did you go back to fix the schema inconsistency when the team had moved on". Prepare 5-7 STAR-format stories that each map to 2-3 different LPs.
The bar-raiser round is unique to Amazon. The bar raiser is an interviewer from outside the hiring team, trained on Amazon's leveling and LPs, whose vote can veto a hire that the rest of the panel wants to make. The round is typically behavioral with deep LP probing, sometimes combined with a stretch technical question at a level above the target role. The bar-raiser's job is to ensure the hire would raise the bar for the company, meaning the candidate is better than 50 percent of current Amazon employees at that level.
Amazon's data engineer SQL bar is correctness-and-clean-code heavier than narrative-heavy companies like Meta. The rubric explicitly weights "produced a working solution" and "handled the obvious edge cases" above "articulated multiple trade-offs". For junior (L4) loops, that often means a correct, clean, well-named CTE structure scores higher than a verbose multi-approach discussion. For L5+ loops, the trade-off articulation comes back in the design round, but the SQL round stays correctness-focused.
- What is the typical Amazon data engineer loop structure?
- 5-6 rounds over an onsite: SQL (Redshift-flavored), Python, data modeling, system design (AWS stack expected), behavioral with explicit Leadership Principles mapping, and often a bar-raiser round (one interviewer from outside the team whose vote can veto a hire). Phone screens are usually SQL plus a 60-minute behavioral with LP framing.
- Do I need to know AWS specifically for Amazon data engineer interviews?
- Yes for the design rounds. The interviewer expects an AWS-centric architecture (Kinesis to Firehose to S3 to Glue/Athena/Redshift, EMR for heavy compute, DynamoDB for operational reads). Mention non-AWS alternatives when relevant (Kafka instead of Kinesis if you have an argument for it), but the default expected answer is AWS-native.
- What are Leadership Principles and how do they affect data engineer interviews?
- 16 stated cultural values that Amazon's interview rubric explicitly maps every behavioral and design answer to: Ownership, Customer Obsession, Frugality, Bias for Action, Insist on the Highest Standards, Are Right, A Lot, and 10 others. Prep 5-7 STAR-format stories that each map to 2-3 LPs. The interviewer asks 'tell me about a time you...' and silently maps your answer to specific LPs.
- What is the bar-raiser round?
- An interviewer from outside the hiring team, chosen for interviewing skill, trained on Amazon's leveling and LPs, whose vote can veto a hire that the rest of the panel wants to make. The bar-raiser round is typically behavioral with deep LP probing, sometimes combined with a stretch technical question at a level above the target role. The bar-raiser's job is to ensure the hire would raise the bar for the company.
- How is the Amazon SQL round different from Meta or Google?
- Amazon's data engineer SQL round weights correctness and clean code over multi-approach articulation. A correct, well-named CTE solution to a Medium problem scores well even without discussing alternatives. The Redshift dialect comes up: DISTKEY, SORTKEY, COPY command, VACUUM/ANALYZE operational story at L5+. Window functions, CTEs, and aggregation are the same as everywhere else; the dialect questions are Amazon-specific.
- What is the design round bar at Amazon?
- AWS-centric architecture with explicit cost reasoning (Frugality LP). For 10B events per day: Kinesis Data Streams shard count, Firehose to S3 with Parquet conversion, Glue crawler for catalog updates, partitioned by date/hour, Athena for ad-hoc plus Redshift for BI, EMR for Spark on heavy joins. The cost question comes up at L5+; rough back-of-envelope numbers matter.
- What is the Python round like at Amazon?
- Pipeline-shaped, similar to Meta's Python round. Common prompts: parse a malformed CSV without crashing, deduplicate Kinesis events by composite key (event_id, source) with tiebreaker, implement retry with exponential backoff and jitter for a flaky DynamoDB put, validate records with field-level errors and route bad ones to a DLQ. Vanilla Python preferred.
- What levels does Amazon hire data engineers at?
- L4 (entry/junior DE), L5 (mid/senior DE, most common hire), L6 (senior DE / principal DE for some orgs), L7 (principal DE / senior principal). L5 typically calls for 4-8 years of experience. Rubric depth increases per level: L5 expects trade-off articulation, L6 expects design ownership of cross-team systems, L7 expects org-level technical strategy.
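The deduplication prompt from the Python round answer above (composite key `(event_id, source)` with a tiebreaker) can be sketched as follows. The tiebreaker rule is not specified in the prompt description, so this sketch assumes the event with the latest `ingested_at` wins and that exact ties keep the first one seen; `ingested_at` and `dedupe_events` are hypothetical names chosen for illustration.

```python
def dedupe_events(events):
    """Keep one event per (event_id, source).

    Assumed tiebreaker: the largest ingested_at wins; on an exact tie the
    event seen first is kept. Output follows first-seen order of the keys.
    """
    best = {}
    for event in events:
        # A missing field raises KeyError here, which fails loudly rather
        # than silently dropping the record.
        key = (event["event_id"], event["source"])
        current = best.get(key)
        if current is None or event["ingested_at"] > current["ingested_at"]:
            best[key] = event
    return list(best.values())


events = [
    {"event_id": "e1", "source": "web", "ingested_at": 10, "v": "old"},
    {"event_id": "e1", "source": "app", "ingested_at": 5, "v": "app"},
    {"event_id": "e1", "source": "web", "ingested_at": 20, "v": "new"},
]
print([e["v"] for e in dedupe_events(events)])  # ['new', 'app']
```

A single pass with a dictionary keeps this O(n) in time. Stating the tiebreaker out loud before coding is usually part of what the prompt is checking.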
129 practice problems matching this filter. Domains: Data Modeling (8), SQL (82), Python (39). Difficulty: easy (63), medium (54), hard (12).
Data Modeling (8)
- A Number for the Seller - easy - They want a total. Give them the right schema first.
- Event Ticketing System Data Model - easy - JSON in. Reporting warehouse out. Design both ends.
- Food Truck Operations Data Model - medium - Mobile vendor, fixed menu, unpredictable locations.
- Marketplace Sales Warehouse - hard - No schema given. The interviewer is watching.
- The Last Mile - medium - Order placed. Now track it to the door.
- The Sales Architecture - medium - Numbers are easy. Making them queryable at scale is the real job.
- Two Wallets - medium - Two user types. Multiple payment methods. One messy billing table.
- The Transfer Request - medium - Apply, wait, get approved or denied. Track all of it.
SQL (82)
- API Calls With Matching Status - medium - Same status, same pattern. Coincidence?
- Average Session Duration by Device - easy - Session length, device by device.
- Best Selling Product by Month - hard - Every month has a winner.
- Best-Selling Reps Each Month - easy - In every category, a few sellers rise to the top.
- The Notification That Paid Off - hard - The message went out to thousands. A smaller number actually bit.
- CDN-Related DNS Lookups - easy - DNS lookups tied to the CDN.
- Character Position in Endpoint - easy - URL patterns, character by character.
- Cheapest Cost Per Region - easy - Lowest spend per region.
- Cheapest Transaction per User - easy - Everyone has a smallest purchase.
- Cloud Cost Trend Analysis - medium - Cost trends across billing periods.
- Cross-Variant User Pairs - medium - Same experiment. Different variants. Who overlaps?
- Customer Full Name Concat - easy - First name, last name. Combine them.
- Daily Net Revenue - hard - Net revenue, day by day. Refunds included.
- Device Type Serving Most Users - medium - One device type serves more users than the rest.
- Duplicate DQ Check Records - medium - Passed QA twice. That's the problem.
- Duplicated User Event Messages - medium - Duplicated messages from the alerts topic.
- Even-ID February Signups - easy - A very specific slice of a very specific cohort.
- Even-ID June Signups - easy - Odd IDs, even IDs. The filter is precise.
- Event Types Spanning Multiple Months - easy - Some events span seasons.
- Filtered User Roster - easy - A clean roster for the all-hands.
- Find the Fifth Largest Cost - medium - Not the biggest. Not the smallest. The fifth.
- First Half of Page Views - medium - Half the data. The first half.
- First Migration Record - easy - The very first migration. Where it all began.
- Frequent Message Senders - medium - Someone is sending too many messages.
- Health Checks per Service - easy - Some services get checked constantly.
- Highest and Lowest Cloud Costs - medium - The extremes in cloud spending.
- Highest Daily Spend - medium - Somewhere in that window, someone broke the spending record.
- Inactive Users in Date Range - medium - Ghost accounts. Active signup, zero sessions.
- Largest Single Cloud Cost - medium - One line item. The biggest bill of all.
- Last Five Batch Jobs - easy - The last five. A quick tail check.
- Last Migration Record - easy - The most recent migration. Is it the last?
- Latest Session Per User - easy - Everyone has a most recent session.
- Longest Deploy With Full Identifier - easy - The longest deployment. Full ID.
- Long Searches Containing 'er' - easy - Long queries with 'er'. A pattern?
- Low-Volume Stream Topics - medium - Quiet topics in the stream.
- Mid-CPU Nodes - easy - Not the heaviest. Not the lightest. The middle.
- Mid-Range Cost Allocations - easy - Not the cheapest. Not the priciest. The middle.
- The Floor Price - medium - Before the negotiation, find what each provider really charges at its cheapest.
- Monthly Revenue Change - hard - Revenue, month over month.
- The Tiebreaker - easy - One column wasn't enough. The second column settles it.
- Peak Activity by Device - easy - Activity windows, device by device.
- Power Users by Session Activity - medium - More sessions. More time. The power users.
- Priciest Item in Each Category - medium - The most expensive item per category.
- Production Deploys From April Onward - easy - After the cutoff, how many times did prod get a push?
- Product Name Letter Replace - easy - A quick text transform on product names.
- Product Name Prefix - easy - Just the first three characters. That is all.
- Regional Sales Growth QoQ - hard - Quarter-over-quarter growth. Region by region.
- Repeat Buyers Across Halves - medium - First half buyer. Second half buyer. Same person.
- Repeat Purchase Window - medium - The retention squad is looking for repeat purchasers.
- Returning Buyers - medium - They came back and bought again.
- Two Names on the Ledger - easy - Two accounts. One ledger. Watch the spend stack up.
- Rolling Revenue Average - hard - Smooth out the revenue bumps. The trend matters more.
- Runner-Up Cost Without ORDER BY - medium - The second highest. Without sorting.
- Second Highest Cloud Cost - medium - The second biggest bill on record.
- Session-Fit Content - easy - Content that fits the session length.
- Signups by Age Bucket Since April - easy - Recent signups by age.
- The Compliance Order - easy - Token scopes need to be in the right sequence before the audit.
- Teams Below Double Average Spend - medium - Teams spending under twice the average.
- The February Cohort - easy - One signup window. One cohort. Who joined the club?
- Third Highest Spender - medium - Bronze medal in spending.
- Three Lowest Distinct Cloud Cost Amounts - easy - The three cheapest bills on record.
- Titles Ending With S - easy - Naming conventions. Specifically the plurals.
- Keys That Never Die - medium - Some API keys have no expiry date at all. That should worry someone.
- Top 10 CPU-Heavy Nodes - medium - The ten hungriest nodes.
- Top API Caller - medium - One user triggered more API calls than anyone.
- Top API Token Scopes - easy - The highest-value token scopes.
- Top Campaign by Opens - medium - One campaign got all the opens.
- Top Cost Entry per Team - medium - The single biggest bill per team.
- Top Metric Values - easy - The five highest numbers. No duplicates.
- Top Services by Uptime - medium - Uptime is a competition. Which services never blink?
- Total Cost by Category - easy - Total spend per category.
- Total Hours Between Consecutive Events - hard - Hours between state changes.
- Total User Spend - easy - Each customer's total. Summarized.
- Transaction-Only Features - hard - Exclusive to one source. Missing from the other.
- Transaction Revenue by Customer - medium - One month, every customer, every dollar accounted for.
- Transaction Share of User Spend - medium - Each transaction's share of the whole.
- Trim Endpoints Right - easy - Trailing whitespace. Clean it up.
- Trim Search Terms Left - easy - Leading whitespace. Clean it up.
- US-East KV Store Entries - easy - KV store inventory. us-east-1.
- Users Without Sessions - medium - Account created. Never logged in.
- Weekly Build Status Report - hard - Every CI run, bucketed by week.
- Weekly Transaction Volume - easy - Weekly volume. The pulse.
Python (39)
- Batch Records - medium - Too many at once. Break them into groups.
- Batch With Metadata - easy - The list gets chopped.
- Column Max - easy - One value rules the column.
- Column Range - easy - From minimum to maximum. What is the spread?
- All Told - easy - Every shift leaves a number behind. Total the fleet.
- Cumulative Sum - medium - The total grows with every row.
- Diagonal Extract - medium - Not every value sits in a row or column.
- Explode List - easy - One row holds many values. Unpack it.
- Find Indices - medium - It is in there somewhere. Where exactly?
- Full Outer Zip - medium - Two sides. No value left behind.
- Greeting Formatter Class - easy - First impressions are formatted carefully.
- Null Counter - easy - How many holes in the data?
- Portfolio Profit Calculator - medium - Portfolio gain from purchase history and current prices.
- Quality Gate - easy - Not everything passes inspection.
- Full Circle - medium - Load has to keep moving. Pass it down the line.
- Run Length Encoding - easy - AAABBB becomes 3A3B. Compress it.
- Sort Descending - easy - Biggest first. No exceptions.
- Subarray Signal - medium - One stretch carries the strongest signal.
- The Change Data Capture - hard - Inserts, updates, deletes: all present.
- The Deep Config - medium - Nested config, dot-notation output.
- The Dependency Resolver - medium - Everything depends on everything.
- The Dominant Signal - easy - Hottest items in the transaction log. Ties included.
- The Event Aggregator - medium - Bucket a firehose of events into tidy time windows.
- The Log Pulse - easy - Some lines repeat themselves.
- The Nearest Value Mapper - medium - Close enough counts. Ties go low.
- The One-of-Each - easy - Strip the repeats, keep the originals.
- The Original Keeper - easy - Clean up duplicate events without losing the timeline.
- The Output Peak - hard - One stretch outpaced all the others.
- The Payload Flattener - medium - Turn a deeply nested API response into a flat row.
- The Record Reconciler - medium - Two versions of the same truth.
- The Running Total - easy - Each position holds the sum of everything before it.
- The Squeeze - easy - aaabbb gets old fast. Shrink it.
- The Streak Breaker - easy - It has a problem with repetition.
- The String Shrinker - easy - Compress the string. Shorter wins.
- The Target Hunt - medium - Pairs that hit a target. Every one of them.
- The Trade Signal - easy - Buy low, sell high. Identify the ideal moment.
- Transform Column - easy - Same data, new shape.
- Transpose Table - medium - Rows become columns. Columns become rows.
- Value Count - easy - How many of each? Count them.