Walk into your data engineering interview prepared

Practice the problems being asked by 920 companies in data engineering interviews

1,500 real questions

920 companies

2,817 interview reports

  • Practice that counts. Write code and design data infra the same way you will in the interview.
  • Built for your level. Practice adapts to where you are and pushes you toward where you need to be.
  • Know when you're ready. Track readiness across SQL, Python, data modeling, and pipeline architecture.
PostgreSQL (Source System) · Kafka (Kafka Queue) · Snowflake (Data Warehouse) · Redis (Realtime Cache)
Stage 0 SCAN: user_events (8.4M)

user_events
user_id  plan_tier   event_ts          rn
u1       free        2026-03-22 08:01  2
u5       free        2026-03-22 11:45  2
u2       pro         2026-03-22 08:03  1
u3       team        2026-03-22 10:00  1
u5       enterprise  2026-03-22 12:00  1
u1       pro         2026-03-22 09:15  1
u4       free        2026-03-22 11:30  1
u6       free        2026-03-22 13:15  1
8 rows
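
The rn values in the user_events sample look like a row number per user_id, ordered by event_ts from newest to oldest. That reading is inferred from the data, not stated on the page. Under that assumption, a minimal plain-Python sketch of the same ranking, with the helper name add_row_number being hypothetical, might look like this:

```python
from itertools import groupby
from operator import itemgetter

# Sample rows copied from the user_events table: (user_id, plan_tier, event_ts).
user_events = [
    ("u1", "free", "2026-03-22 08:01"),
    ("u5", "free", "2026-03-22 11:45"),
    ("u2", "pro", "2026-03-22 08:03"),
    ("u3", "team", "2026-03-22 10:00"),
    ("u5", "enterprise", "2026-03-22 12:00"),
    ("u1", "pro", "2026-03-22 09:15"),
    ("u4", "free", "2026-03-22 11:30"),
    ("u6", "free", "2026-03-22 13:15"),
]


def add_row_number(rows):
    """Number each user's events from newest (rn = 1) to oldest.

    Assumed equivalent of:
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts DESC)
    """
    # Timestamps in this fixed format sort correctly as plain strings.
    ordered = sorted(rows, key=itemgetter(2), reverse=True)  # newest first
    ordered.sort(key=itemgetter(0))  # stable sort: groups by user, keeps time order
    ranked = []
    for _, group in groupby(ordered, key=itemgetter(0)):
        for rn, (user_id, plan_tier, event_ts) in enumerate(group, start=1):
            ranked.append((user_id, plan_tier, event_ts, rn))
    return ranked


ranked = add_row_number(user_events)
# Keeping rn = 1 leaves one row per user: the most recent event.
latest = [row for row in ranked if row[3] == 1]
```

Filtering on rn = 1 is the usual deduplication step: each user keeps only their latest plan_tier.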

SQL, Spark, Python, Modeling, and Pipeline Architecture

SELECT
    source_system,
    event_date,
    COUNT(*) AS rows_landed,
    LAG(COUNT(*), 1) OVER (
        PARTITION BY source_system
        ORDER BY event_date
    ) AS prev_day
FROM raw_logs_complete
GROUP BY 1, 2
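
For readers who want to see what this query computes without a database, here is a rough plain-Python equivalent. It is only a sketch: the function name daily_counts_with_lag and the sample rows are hypothetical, and it assumes event_date values are ISO-formatted strings so that they sort chronologically.

```python
from collections import Counter


def daily_counts_with_lag(rows):
    """rows: iterable of (source_system, event_date) tuples.

    Returns (source_system, event_date, rows_landed, prev_day) tuples,
    mirroring COUNT(*) and LAG(COUNT(*), 1) partitioned by source_system.
    """
    counts = Counter(rows)  # GROUP BY source_system, event_date
    result = []
    prev_by_source = {}
    for source, day in sorted(counts):  # ordered by source, then date
        rows_landed = counts[(source, day)]
        # None for the first row of each source, like SQL NULL from LAG.
        prev_day = prev_by_source.get(source)
        result.append((source, day, rows_landed, prev_day))
        prev_by_source[source] = rows_landed
    return result


# Hypothetical sample data, for illustration only.
raw_logs = [
    ("crm", "2026-03-20"), ("crm", "2026-03-20"), ("crm", "2026-03-21"),
    ("web", "2026-03-20"), ("web", "2026-03-21"),
    ("web", "2026-03-21"), ("web", "2026-03-21"),
]
report = daily_counts_with_lag(raw_logs)
```

Note that LAG returns the previous row within the partition, not the previous calendar day. If a source skips a day, prev_day holds the count from the last day that had data.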

SQL

The queries interviewers actually write on the whiteboard

JOINs, self-joins & subqueries
Window functions
CTEs & recursive queries
Aggregations
NULL handling
Date functions & time series
Appears in 95% of DE interviews

Spark

The distributed compute questions that trip up senior candidates

RDD vs DataFrame vs Dataset
Shuffles & partitioning
Joins: broadcast, sort-merge, shuffle hash
Lazy evaluation & the DAG
Caching, persistence & spills
Skew, speculation & tuning
Appears in 70% of DE interviews
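from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Rank each user's events from newest to oldest.
w = (Window
     .partitionBy("user_id")
     .orderBy(F.desc("event_ts")))
df = spark.read.parquet("s3://events/")
ranked = df.withColumn(
    "rn", F.row_number().over(w))
# Keep only the most recent event per user.
deduped = (ranked
           .filter("rn = 1").drop("rn"))
(deduped.groupBy("source").agg(
    F.approx_count_distinct("user_id"),
    F.avg("lag_ms").alias("avg_lag")))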
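from datetime import datetime, timezone

def utc_now():
    return datetime.now(timezone.utc)

def validate_batch(records, schema):
    """Split records into clean rows and rows missing required fields."""
    clean, rejected = [], []
    for row in records:
        row["loaded_at"] = utc_now()  # stamps the caller's dict in place
        # Falsy values (None, "", 0) all count as missing here.
        nulls = [k for k in schema if not row.get(k)]
        if nulls:
            rejected.append({**row, "err": nulls})
        else:
            clean.append(row)
    return clean, rejected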

Python

The data transforms and pipeline logic interviewers test

Dictionaries & deduplication
List comprehensions & filtering
String slicing & time bucketing
Event stream processing
Idempotent data transforms
Aggregation without pandas
Appears in 78% of DE interviews

Data Modeling

The interview round that separates analysts from engineers

Schema design & normalization
Star & snowflake schemas
Slowly changing dimensions
Entity relationships & cardinality
Keys, constraints & indexing
Design patterns & trade-offs
Appears in 65% of DE interviews
dim_user_scd2: user_key (PK), plan_tier, valid_from, is_current
fact_sessions: session_key (PK), user_key (FK), device_key (FK), event_count
dim_device: device_key (PK), device_type, os, browser
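
To make the slowly changing dimension idea concrete, here is a minimal sketch of a Type 2 update against rows shaped like dim_user_scd2. It reuses the user_key, plan_tier, valid_from, and is_current columns from the schema. The natural key user_id and the function name apply_plan_change are assumptions added for illustration and do not appear in the schema.

```python
from datetime import date


def apply_plan_change(dim_rows, user_id, new_plan_tier, change_date):
    """Return dimension rows after a Type 2 change to a user's plan_tier.

    user_id is an assumed natural key; user_key is the surrogate key.
    """
    current = next(
        (r for r in dim_rows if r["user_id"] == user_id and r["is_current"]),
        None,
    )
    # No change in the tracked attribute: leave history untouched.
    if current is not None and current["plan_tier"] == new_plan_tier:
        return dim_rows
    # Expire the old version instead of overwriting it.
    updated = [{**r, "is_current": False} if r is current else r for r in dim_rows]
    next_key = max((r["user_key"] for r in dim_rows), default=0) + 1
    updated.append({
        "user_key": next_key,
        "user_id": user_id,
        "plan_tier": new_plan_tier,
        "valid_from": change_date,
        "is_current": True,
    })
    return updated


dim = apply_plan_change([], "u1", "free", date(2026, 3, 1))
dim = apply_plan_change(dim, "u1", "pro", date(2026, 3, 22))
```

After the second call the dimension holds two rows for u1: the expired free row and the current pro row. Fact rows that stored the older user_key still join to the plan the user had at the time.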
Pipeline diagram:
  • CDC: Postgres (sla: 2h)
  • REST: Stripe API
  • Kafka: events_topic (parallelism: 4)
  • Spark: clean_events (parallelism: 200)
  • Quality: row_count
  • Snowflake: analytics_dwh (idempotency: upsert)
  • Looker: exec_dashboard
  • dbt: feature_eng
  • SageMaker: ml_scoring (retry: 3 exp)

Architecture

Design the systems that move data at scale

Scheduling & orchestration
Batch vs streaming
Data quality & validation
Idempotent pipelines
Schema evolution
Monitoring & alerting
Appears in 52% of DE interviews
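
As one illustration of the idempotent pipelines topic, the sketch below contrasts an upsert-style load with a plain append. It is a toy in-memory model under stated assumptions: the table is a dict keyed by a hypothetical event_id column, and the function name upsert is invented for this example.

```python
def upsert(table, batch, key="event_id"):
    """Merge batch rows into table (a dict keyed by primary key).

    Rows with a new key are inserted; rows with an existing key replace
    the old version. Replaying the same batch changes nothing.
    """
    merged = dict(table)  # copy so the input table is not mutated
    for row in batch:
        merged[row[key]] = row
    return merged


# Hypothetical batch with one late correction to event 1.
batch = [
    {"event_id": 1, "amount": 10},
    {"event_id": 2, "amount": 25},
    {"event_id": 1, "amount": 12},
]

once = upsert({}, batch)
twice = upsert(once, batch)  # simulates a retry after a partial failure

# An append-only load would double the rows on retry.
appended = batch + batch
```

Here once and twice are identical, which is what makes a retry safe. The append-only version grows from three rows to six when the same batch is replayed.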
Precision Focus

There Are Hundreds of Data Skills. You Don't Need All of Them.

Prescribed difficulty

Every question optimizes your odds of success

Learn the patterns

Interview questions fall into predictable patterns you can study

Defend your solution

Explaining your reasoning is just as important as building it

Study what your company asks

Focus your time on the problems that will actually come up

Databricks · Airbnb · Stripe · Google · Uber

Know what your target company tests

Every question is tagged to the companies that ask it. Stop guessing whether window functions or Delta Lake matters for Databricks. Filter to your target company and study exactly what shows up in their loops.

SELECT user_id, RANK() OVER (
    PARTITION BY user_id

Learn the pattern, not the answer

Getting the right answer isn't enough. Every problem includes a full written solution that explains the reasoning, the tradeoffs, and the mistake most candidates make. So when they ask a variant, you're not starting from scratch.

Airbnb · Incremental load: SCD Type 2
Stripe · Event dedup across shards (pro)
Netflix · Late-arriving data: watermarks (pro)

The real questions, not stand-ins

LeetCode is built for software engineers. The questions that show up at Airbnb, Stripe, and Netflix's data engineering loops are different. Practice on the questions you'll hit in the loop.

Databricks · SQL · 45 min
27:34

Go in having already done it once

The first time you sit in a timed interview is the worst time to discover you freeze under pressure. Mocks scoped to your target company's format mean the real thing feels like a repeat, not a surprise.

Why DataDriven

The most efficient path to interview-ready.

Every hour you spend preparing directly increases your chance of getting the offer. No grinding through problems that won't show up.

$220K+

Data Engineer Median Compensation

The offer is worth preparing for correctly.

  • Focus

    Define your target companies and level. DataDriven cuts the scope of your focus areas by up to 60%, stripping away the topics interviewers don’t ask about.

  • Sharpen

    Every challenge narrows in on the area that optimally improves your interview success rate, so every minute that you spend is impactful.

  • Practice

    Master the SQL, Python, data modeling, and pipeline design that matters in one place. Write real code against real data. No round you haven’t rehearsed.

  • Ready

    A readiness score tracks how prepared you are for every topic interviewers ask about. When it’s green across the board, you’ll ace it. No guessing.

Details

I write SQL every day and I still bombed a technical screen. What happened?
Production work and interview performance are different skills. You don't fail on knowledge. You fail on structuring an answer under time pressure with unfamiliar tables and someone watching. Every challenge here is timed and live so you build the muscle of producing correct code when it counts.

I have no idea what my target company actually tests. How do I not waste a month?
Every session targets your weakest topic against the pattern mix your target company tests most heavily. You're not working through a generic top-100 list. You're closing the specific gaps that would cost you the offer, so every hour of prep counts.

The data modeling round scares me and I can't find anywhere to practice it.
That round cuts more senior candidates than any other, and most people just re-read the Kimball book and hope. You get a product scenario, build the schema from scratch, and get evaluated on your grain, dimensions, and SCD strategies before you're doing it live.

I keep telling myself 'one more week of prep' and it's been three months.
That loop never ends on its own. A readiness score per target company shows exactly which rounds you'd pass today and which ones would cost you the offer. When you can see the gap closing, you stop guessing and start scheduling.

Every company seems to test something completely different. How do I prep for that?
They do. Databricks leans hard on Spark internals, Meta on SQL windows, Stripe on idempotent pipelines. Your practice set is weighted to your target company's actual pattern distribution, not a one-size-fits-all set of canned problems.

Start the DataDriven 75

Ace the Data Engineering Interview

Your day job does not prepare you for what they actually ask in the interview. Practice the real rounds. Find your gaps before the interviewer does. Free forever.

About DataDriven

DataDriven is a free web application for data engineering interview preparation. It is not a generic coding platform. It is built exclusively for data engineering interviews.

What DataDriven Is

DataDriven is the only platform that simulates all four rounds of a data engineering interview: SQL, Python, Data Modeling, and Pipeline Architecture. Each round can be practiced in two modes: Problem mode and Interview mode.

Problem Mode

Problem mode is self-paced practice with clear problem statements and instant grading. For SQL, your query runs against a real database and gets graded automatically. For Python, your code executes for real with automatic grading. For Data Modeling, you build schemas on an interactive canvas with structural validation. For Pipeline Architecture, you design pipelines on an interactive canvas with component evaluation and cost estimation.

Interview Mode

Interview mode simulates a real interview from start to finish. It has four phases.

  • Phase 1 (Think): you receive a deliberately vague prompt and ask clarifying questions to an AI interviewer, who responds like a real hiring manager.
  • Phase 2 (Code/Design): you write SQL or Python, or build a schema or pipeline on the interactive canvas. Your code executes for real.
  • Phase 3 (Discuss): the AI interviewer asks follow-up questions about your solution, one question at a time. You respond, and it asks another. This continues for up to 8 exchanges. The interviewer probes edge cases, optimization, alternative approaches, and may introduce curveball requirements that change the problem mid-interview.
  • Phase 4 (Verdict): you receive a hire/no-hire decision with specific feedback on what you did well, where your reasoning had gaps, and what to study next.

Platform Features Explained

  • Adaptive difficulty: problems get harder when you answer correctly and easier when you struggle, targeting the difficulty level that maximally improves your interview readiness.
  • Spaced repetition: concepts you struggle with resurface at optimal intervals before you forget them, while mastered topics fade from rotation.
  • Readiness score: a per-topic tracker that shows exactly which concepts are strong and which have gaps, across every topic interviewers test.
  • Company-specific filtering: filter questions by target company (Google, Amazon, Meta, Stripe, Databricks, and more) and seniority level (Junior through Staff), weighted by real interview frequency data.

All features are 100% free with no trial, no credit card, and no paywall.
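
The adaptive difficulty and spaced repetition ideas described here can be illustrated with a small sketch. This is not DataDriven's actual algorithm, which the page does not specify. It is a generic interval-doubling scheme, and the 1 to 5 difficulty scale, the 30-day cap, and all names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TopicState:
    difficulty: int = 1     # assumed scale: 1 (easiest) to 5 (hardest)
    interval_days: int = 1  # days until the topic resurfaces


def update_topic(state, answered_correctly, max_level=5, max_interval=30):
    """Return the next state for a topic after one practice attempt."""
    if answered_correctly:
        # Correct: step difficulty up and wait longer before the next review.
        return TopicState(
            difficulty=min(state.difficulty + 1, max_level),
            interval_days=min(state.interval_days * 2, max_interval),
        )
    # Incorrect: step difficulty down and bring the topic back tomorrow.
    return TopicState(
        difficulty=max(state.difficulty - 1, 1),
        interval_days=1,
    )


state = TopicState()
history = []
for correct in [True, True, False]:
    state = update_topic(state, correct)
    history.append((state.difficulty, state.interval_days))
```

Two correct answers push the topic to harder problems at longer intervals. A single miss pulls it back into the next day's rotation, which is how struggling topics resurface while mastered ones fade.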

Four Interview Domains

  • SQL: 850+ questions with real SQL execution. Topics include joins, window functions, GROUP BY, CTEs, subqueries, COALESCE, CASE WHEN, pivot, rank, and partition by.
  • Python: 388+ questions with real code execution. Topics include data transformation, dictionary operations, file parsing, ETL logic, PySpark, error handling, and debugging.
  • Data Modeling: interactive schema design canvas. Topics include star schema, snowflake schema, dimensional modeling, slowly changing dimensions, data vault, grain definition, and conformed dimensions.
  • Pipeline Architecture: interactive pipeline design canvas. Topics include ETL vs ELT, batch vs streaming, Spark, Kafka, Airflow, dbt, storage architecture, fault tolerance, and incremental loading.



