I keep telling myself 'one more week of prep' and it's been three months.

That loop never ends on its own. A readiness score per target company shows exactly which rounds you'd pass today and which ones would cost you the offer. When you can see the gap closing, you stop guessing and start scheduling.

Walk into your data engineering interview prepared

Practice the problems being asked by 920 companies in data engineering interviews

1,587 real questions
920 companies
2,817 interview reports

Practice that counts. Write code and design data infra the same way you will in the interview.
Built for your level. Practice adapts to where you are and pushes you toward where you need to be.
Know when you're ready. Track readiness across SQL, Python, data modeling, and pipeline architecture.

query.sql

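-- Keep only the latest staged event per user (rn = 1 marks the newest row)
WITH deduped AS (
  SELECT
    user_id,
    plan_tier,
    event_ts,
    rn
  FROM user_events
  WHERE rn = 1
)
SELECT s.user_id, s.plan_tier, s.event_ts
FROM deduped s
LEFT JOIN users d ON s.user_id = d.user_id
WHERE d.user_id IS NULL            -- new user: no match in the dimension table
  OR d.hash_diff != s.plan_tier    -- plan changed (as written, assumes users.hash_diff stores the plan_tier value)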

fact_sessions: session_key, user_key, device_key, event_count
dim_user_scd2: user_key, plan_tier, valid_from, is_current
dim_device: device_key, device_type, browser

Source System · Kafka Queue · Data Warehouse · Realtime Cache

CDC Delta Detection

Write a CDC query that deduplicates staging events by keeping only the latest row per user, then finds users that are new or whose plan has changed vs. the dimension table.

Question 1 of 5

Stage 0: SCAN

user_events 8.4M

user_events
user_id	plan_tier	event_ts	rn
u1	free	2026-03-22 08:01	2
u5	free	2026-03-22 11:45	2
u2	pro	2026-03-22 08:03	1
u3	team	2026-03-22 10:00	1
u5	enterprise	2026-03-22 12:00	1
u1	pro	2026-03-22 09:15	1
u4	free	2026-03-22 11:30	1
u6	free	2026-03-22 13:15	1
8 rows
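
The CDC prompt above can also be worked through in plain Python, as a rough sketch rather than a reference solution. The staging rows mirror the sample user_events table. The `dim_users` contents and the `detect_changes` helper are assumptions made up for illustration and are not part of the original exercise.

```python
# Staging rows copied from the sample user_events table. The rn column is
# omitted; the latest row per user is recomputed from event_ts instead.
staging = [
    ("u1", "free", "2026-03-22 08:01"),
    ("u5", "free", "2026-03-22 11:45"),
    ("u2", "pro", "2026-03-22 08:03"),
    ("u3", "team", "2026-03-22 10:00"),
    ("u5", "enterprise", "2026-03-22 12:00"),
    ("u1", "pro", "2026-03-22 09:15"),
    ("u4", "free", "2026-03-22 11:30"),
    ("u6", "free", "2026-03-22 13:15"),
]

# Hypothetical dimension table: user_id -> plan_tier currently on record.
dim_users = {"u1": "free", "u2": "pro", "u3": "team", "u4": "free"}


def detect_changes(rows, dim):
    """Return the latest staging row per user that is new or has a changed plan."""
    latest = {}
    for user_id, plan_tier, event_ts in rows:
        # Zero-padded "YYYY-MM-DD HH:MM" strings sort chronologically.
        if user_id not in latest or event_ts > latest[user_id][2]:
            latest[user_id] = (user_id, plan_tier, event_ts)
    return sorted(
        row for row in latest.values()
        if row[0] not in dim or dim[row[0]] != row[1]
    )


changes = detect_changes(staging, dim_users)
for row in changes:
    print(row)
```

With this made-up dimension table, u1 comes back as a plan change (free to pro), and u5 and u6 come back as new users.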

SQL, Spark, Python, Data Modeling, & Pipeline Architecture

SELECT
  source_system,
  event_date,
  COUNT(*) AS rows_landed,
  LAG(COUNT(*), 1) OVER (
    PARTITION BY source_system
    ORDER BY event_date
  ) AS prev_day
FROM raw_logs_complete
GROUP BY 1, 2

SQL

The queries interviewers actually write on the whiteboard

JOINs, self-joins & subqueries

Window functions

CTEs & recursive queries

Aggregations

NULL handling

Date functions & time series

Appears in 95% of DE interviews

Spark

The distributed compute questions that trip up senior candidates

RDD vs DataFrame vs Dataset

Shuffles & partitioning

Joins: broadcast, sort-merge, shuffle hash

Lazy evaluation & the DAG

Caching, persistence & spills

Skew, speculation & tuning

Appears in 70% of DE interviews

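from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# Rank each user's events, newest first
w = (Window
     .partitionBy("user_id")
     .orderBy(F.desc("event_ts")))
df = spark.read.parquet("s3://events/")
ranked = df.withColumn("rn", F.row_number().over(w))
# Keep only the latest event per user
deduped = ranked.filter("rn = 1").drop("rn")
# Approximate distinct users and average lag per source
summary = deduped.groupBy("source").agg(
    F.approx_count_distinct("user_id"),
    F.avg("lag_ms").alias("avg_lag"))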

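from datetime import datetime, timezone


def validate_batch(records, schema):
    """Split records into clean rows and rows missing required fields."""
    clean, rejected = [], []
    for row in records:
        row["loaded_at"] = datetime.now(timezone.utc)  # stamp load time in UTC
        # Flag only missing or None fields; falsy values such as 0 stay valid
        nulls = [k for k in schema if row.get(k) is None]
        if nulls:
            rejected.append({**row, "err": nulls})
        else:
            clean.append(row)
    return clean, rejected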

Python

The data transforms and pipeline logic interviewers test

Dictionaries & deduplication

List comprehensions & filtering

String slicing & time bucketing

Event stream processing

Idempotent data transforms

Aggregation without pandas

Appears in 78% of DE interviews

Data Modeling

The interview round that separates analysts from engineers

Schema design & normalization

Star & snowflake schemas

Slowly changing dimensions

Entity relationships & cardinality

Keys, constraints & indexing

Design patterns & trade-offs

Appears in 65% of DE interviews
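
Slowly changing dimensions, one of the topics listed above, are easier to picture with a small example. The following is a minimal Type 2 sketch in Python. It assumes rows shaped like the dim_user_scd2 table shown earlier (user_key, plan_tier, valid_from, is_current). The `apply_scd2` helper is hypothetical, and `user_key` is treated as the user identifier for brevity, whereas a real dimension would usually carry a separate surrogate key per version.

```python
def apply_scd2(dim_rows, user_key, new_plan, change_date):
    """Expire the current row if the plan changed, then append a new current row."""
    current = next(
        (r for r in dim_rows if r["user_key"] == user_key and r["is_current"]),
        None,
    )
    if current is not None and current["plan_tier"] == new_plan:
        return dim_rows  # nothing changed, history stays as it is
    if current is not None:
        current["is_current"] = False  # close out the old version
    dim_rows.append({
        "user_key": user_key,
        "plan_tier": new_plan,
        "valid_from": change_date,
        "is_current": True,
    })
    return dim_rows


# Hypothetical starting state: one user on the free plan.
dim = [{"user_key": 1, "plan_tier": "free",
        "valid_from": "2026-01-01", "is_current": True}]

apply_scd2(dim, 1, "pro", "2026-03-22")  # plan change: adds a second version
apply_scd2(dim, 1, "pro", "2026-03-23")  # same plan again: no new row

for row in dim:
    print(row)
```

The old row is kept and flagged as no longer current, so the table preserves the plan history instead of overwriting it.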

CDC: Postgres (sla: 2h)
REST: Stripe API
Kafka: events_topic (parallelism: 4)
Spark: clean_events (parallelism: 200)
Quality: row_count
Snowflake: analytics_dwh (idempotency: upsert)
Looker: exec_dashboard
dbt: feature_eng
SageMaker: ml_scoring (retry: 3 exp)

Architecture

Design the systems that move data at scale

Scheduling & orchestration

Batch vs streaming

Data quality & validation

Idempotent pipelines

Schema evolution

Monitoring & alerting

Appears in 52% of DE interviews
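
Idempotent pipelines, listed above, are easier to reason about with a concrete case. This is a minimal sketch, assuming an in-memory dict stands in for the warehouse table. The `upsert_batch` helper and the sample batch are hypothetical. The point it illustrates is that replaying the same batch leaves the target unchanged.

```python
def upsert_batch(target, batch, key="user_id"):
    """Insert or overwrite rows by key so that replays do not create duplicates."""
    for row in batch:
        target[row[key]] = dict(row)  # copy so later edits to the batch don't leak in
    return target


# Hypothetical batch; the same batch may be delivered again after a retry.
batch = [
    {"user_id": "u1", "plan_tier": "pro"},
    {"user_id": "u5", "plan_tier": "enterprise"},
]

table = {}
upsert_batch(table, batch)
first_load = {k: dict(v) for k, v in table.items()}

upsert_batch(table, batch)  # replay the same batch

print(table == first_load)
print(len(table))
```

An append-only load would have doubled the row count on the replay; keying the write on `user_id` is what makes the second run a no-op.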

Precision Focus

There Are Hundreds of Data Skills.
You Don't Need All of Them.

Prescribed difficulty

Every question optimizes your odds of success

Learn the patterns

Interview questions fall into predictable patterns you can study

Defend your solution

Explaining your reasoning is just as important as building it

Study what your company asks

Focus your time on the problems that will actually come up

Databricks · Airbnb · Stripe · Google · Uber

Know what your target company tests

Every question is tagged to the companies that ask it. Stop guessing whether window functions or Delta Lake matters for Databricks. Filter to your target company and study exactly what shows up in their loops.

SELECT user_id, RANK() OVER (
  PARTITION BY user_id ORDER BY event_ts DESC

Learn the pattern, not the answer

Getting the right answer isn't enough. Every problem includes a full written solution that explains the reasoning, the tradeoffs, and the mistake most candidates make. So when they ask a variant, you're not starting from scratch.

Airbnb · Incremental load: SCD Type 2

Stripe · Event dedup across shards (pro)

Netflix · Late-arriving data: watermarks (pro)

The real questions, not stand-ins

LeetCode is built for software engineers. The questions that show up in the data engineering loops at Airbnb, Stripe, and Netflix are different. Practice on the questions you'll hit in the loop.

Databricks · SQL · 45 min


Go in having already done it once

The first time you sit in a timed interview is the worst time to discover you freeze under pressure. Mocks scoped to your target company's format mean the real thing feels like a repeat, not a surprise.

Why DataDriven

The most efficient path to interview-ready.

Every hour you spend preparing directly increases your chance of getting the offer. No grinding through problems that won't show up.

$220K+

Data Engineer Median Compensation

The offer is worth preparing for correctly.

Focus
Define your target companies and level. DataDriven cuts the scope of your focus areas by up to 60%, stripping away the topics interviewers don’t ask about.
Sharpen
Every challenge targets the area where practice will raise your interview success rate the most, so every minute you spend counts.
Practice
Master the SQL, Python, data modeling, and pipeline design that matters in one place. Write real code against real data. No round you haven’t rehearsed.
Ready
A readiness score tracks how prepared you are for every topic interviewers ask about. When it’s green across the board, you’ll ace it. No guessing.

Interview Prep

Data Engineer Interview Prep Guide

50+ guides covering every round, company, and role

Interview Rounds

SQL Round · Python Round · System Design · Data Modeling · Behavioral · Live Coding

By Company

Stripe · Airbnb · Netflix · Uber · Databricks · Snowflake

By Role

Senior DE · Staff DE · Analytics Engineer · ML Data Engineer · Junior DE

Question Sets

Top 100 Questions · FAANG Questions · SQL Questions · 50 Questions · Take-Home Examples

Start the DataDriven 75

Ace the Data Engineering Interview

Your day job does not prepare you for what they actually ask in the interview. Practice the real rounds. Find your gaps before the interviewer does. Free forever.

About DataDriven

DataDriven is a free web application for data engineering interview preparation. It is not a generic coding platform. It is built exclusively for data engineering interviews.

What DataDriven Is

DataDriven is the only platform that simulates all four rounds of a data engineering interview: SQL, Python, Data Modeling, and Pipeline Architecture. Each round can be practiced in two modes: Problem mode and Interview mode.

Problem Mode

Problem mode is self-paced practice with clear problem statements and instant grading. For SQL, your query runs against a real database and gets graded automatically. For Python, your code executes for real with automatic grading. For Data Modeling, you build schemas on an interactive canvas with structural validation. For Pipeline Architecture, you design pipelines on an interactive canvas with component evaluation and cost estimation.

Interview Mode

Interview mode simulates a real interview from start to finish. It has four phases.
Phase 1 (Think): you receive a deliberately vague prompt and ask clarifying questions to an AI interviewer, who responds like a real hiring manager.
Phase 2 (Code/Design): you write SQL or Python, or build a schema or pipeline on the interactive canvas. Your code executes for real.
Phase 3 (Discuss): the AI interviewer asks follow-up questions about your solution, one question at a time. You respond, and it asks another. This continues for up to 8 exchanges. The interviewer probes edge cases, optimization, and alternative approaches, and may introduce curveball requirements that change the problem mid-interview.
Phase 4 (Verdict): you receive a hire/no-hire decision with specific feedback on what you did well, where your reasoning had gaps, and what to study next.

Platform Features Explained

Adaptive difficulty: problems get harder when you answer correctly and easier when you struggle, targeting the difficulty level that maximally improves your interview readiness. Spaced repetition: concepts you struggle with resurface at optimal intervals before you forget them, while mastered topics fade from rotation. Readiness score: a per-topic tracker that shows exactly which concepts are strong and which have gaps, across every topic interviewers test. Company-specific filtering: filter questions by target company (Google, Amazon, Meta, Stripe, Databricks, and more) and seniority level (Junior through Staff), weighted by real interview frequency data. All features are 100% free with no trial, no credit card, and no paywall.
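
The spaced repetition idea described above can be illustrated with a simple doubling rule. This is only a sketch of one common scheme, not a description of how DataDriven schedules reviews, which the page does not specify. The `next_review` function, the doubling factor, and the one-day reset are all assumptions.

```python
from datetime import date, timedelta


def next_review(interval_days, correct, today):
    """Double the gap after a correct answer, reset to one day after a miss."""
    new_interval = interval_days * 2 if correct else 1
    return new_interval, today + timedelta(days=new_interval)


# A concept last reviewed on a 2-day interval, answered correctly.
interval, due = next_review(2, True, date(2026, 3, 22))
print(interval, due)

# The same concept missed on its next review: the interval resets.
interval, due = next_review(interval, False, due)
print(interval, due)
```

Under this rule, concepts answered correctly drift toward long gaps between reviews, while anything missed comes back the next day.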

Four Interview Domains

SQL: 850+ questions with real SQL execution. Topics include joins, window functions, GROUP BY, CTEs, subqueries, COALESCE, CASE WHEN, pivot, rank, and partition by.
Python: 388+ questions with real code execution. Topics include data transformation, dictionary operations, file parsing, ETL logic, PySpark, error handling, and debugging.
Data Modeling: interactive schema design canvas. Topics include star schema, snowflake schema, dimensional modeling, slowly changing dimensions, data vault, grain definition, and conformed dimensions.
Pipeline Architecture: interactive pipeline design canvas. Topics include ETL vs ELT, batch vs streaming, Spark, Kafka, Airflow, dbt, storage architecture, fault tolerance, and incremental loading.

Skills You Will Practice

SQL

The queries interviewers actually write on the whiteboard. Appears in 95% of DE interviews.

JOINs, self-joins and subqueries
Window functions
CTEs and recursive queries
Aggregations
NULL handling
Date functions and time series

Data Modeling

The interview round that separates analysts from engineers. Appears in 65% of DE interviews.

Schema design and normalization
Star and snowflake schemas
Slowly changing dimensions
Entity relationships and cardinality
Keys, constraints and indexing
Design patterns and trade-offs

Python

The data transforms and pipeline logic interviewers test. Appears in 78% of DE interviews.

Dictionaries and deduplication
List comprehensions and filtering
String slicing and time bucketing
Event stream processing
Idempotent data transforms
Aggregation without pandas

Pipeline Architecture

Design the systems that move data at scale. Appears in 52% of DE interviews.

Scheduling and orchestration
Batch vs streaming
Data quality and validation
Idempotent pipelines
Schema evolution
Monitoring and alerting

Platform Features

Adaptive Difficulty: Problems get harder when you answer correctly and easier when you struggle. The system targets the difficulty level that maximally improves your interview readiness.
Readiness Score: A per-topic tracker that shows exactly which concepts are strong and which have gaps, across every topic interviewers test. When all topics are green, you are ready.
Company-Specific Prep: Filter questions by target company (Google, Amazon, Meta, Stripe, Databricks) and seniority level (Junior through Staff), weighted by real interview frequency data.
Spaced Repetition: Concepts you struggle with resurface at optimal intervals before you forget them. Mastered topics fade from rotation.
Real Code Execution: SQL runs against a real database with automatic grading. Python runs with real execution and automatic grading. No multiple choice.
AI Mock Interview Simulation: Interview mode has four phases (Think, Code, Discuss, Verdict). An AI interviewer asks follow-up questions one at a time for up to 8 exchanges, probes edge cases and optimization, introduces curveball requirements, and delivers a hire/no-hire verdict with detailed feedback.

How DataDriven Works

Focus: Define your target companies and level. DataDriven cuts the scope of your focus areas by up to 60%, stripping away the topics interviewers do not ask about.
Sharpen: Every challenge targets the area where practice will raise your interview success rate the most, so every minute you spend counts.
Practice: Master the SQL, Python, data modeling, and pipeline design that matters in one place. Write real code against real data. No round you have not rehearsed.
Ready: A readiness score tracks how prepared you are for every topic interviewers ask about. When it is green across the board, you will ace it. No guessing.

Frequently Asked Questions

I write SQL every day and I still bombed a technical screen. What happened?

Production work and interview performance are different skills. You do not fail on knowledge. You fail on structuring an answer under time pressure with unfamiliar tables and someone watching. Every challenge here is timed and live so you build the muscle of producing correct code when it counts.

I have no idea what my target company actually tests. How do I not waste a month?

Every session targets your weakest topic against the pattern mix your target company tests most heavily. You are not working through a generic top-100 list. You are closing the specific gaps that would cost you the offer, so every hour of prep counts.

The data modeling round scares me and I cannot find anywhere to practice it.

That round cuts more senior candidates than any other, and most people just re-read the Kimball book and hope. You get a product scenario, build the schema from scratch, and get evaluated on your grain, dimensions, and SCD strategies before you have to do it live.

I keep telling myself one more week of prep and it has been three months.

That loop never ends on its own. A readiness score per target company shows exactly which rounds you would pass today and which ones would cost you the offer. When you can see the gap closing, you stop guessing and start scheduling.

Every company seems to test something completely different. How do I prep for that?

They do. Databricks leans hard on Spark internals, Meta on SQL windows, Stripe on idempotent pipelines. Your practice set is weighted to your target company's actual pattern distribution, not a one-size-fits-all set of canned problems.
