Data Engineering Roadmap (2026)

SQL is the most frequently tested skill in data engineering interviews, followed by Python. Data modeling rounds appear in roughly a third of loops. This roadmap is sequenced by interview frequency so you learn the highest-value skills first.

Five phases, 18 weeks, specific milestones. Built for people who want to get hired, not people who want to read about data engineering.

18 weeks total · 5 phases · 45 minutes of daily practice

The 5-Phase Roadmap

Phase 1: SQL Foundations
4 weeks
  • SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY
  • JOINs: INNER, LEFT, RIGHT, FULL OUTER, CROSS, self-joins
  • Subqueries: scalar, correlated, EXISTS, IN
  • CTEs and recursive CTEs
  • Window functions: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, frame clauses
  • NULL handling: COALESCE, NULLIF, IS NULL, three-valued logic
  • Date functions: DATE_TRUNC, DATE_DIFF, EXTRACT, intervals
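The milestone for this phase combines a filter, an aggregate, and a window function in one query. Here is a minimal sketch of that shape using Python's built-in sqlite3 module (window functions require SQLite 3.25+; the `orders` table and all column names are invented for illustration):

```python
import sqlite3

# Hypothetical orders table: one row per order, with a status column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, region TEXT, amount REAL, status TEXT);
    INSERT INTO orders VALUES
        ('ana', 'west', 120.0, 'complete'),
        ('ana', 'west',  80.0, 'complete'),
        ('bo',  'west',  50.0, 'complete'),
        ('cy',  'east', 300.0, 'complete'),
        ('dee', 'east',  10.0, 'refunded');
""")
rows = conn.execute("""
    WITH totals AS (                      -- CTE wrapping the first two steps
        SELECT region, customer, SUM(amount) AS total
        FROM orders
        WHERE status = 'complete'         -- step 1: filter
        GROUP BY region, customer         -- step 2: aggregate
    )
    SELECT region, customer, total,
           RANK() OVER (PARTITION BY region ORDER BY total DESC) AS rnk
    FROM totals                           -- step 3: window function
    ORDER BY region, rnk;
""").fetchall()
for r in rows:
    print(r)
```

The refunded order never reaches the aggregate, and the rank restarts per region because of the PARTITION BY clause. If you can write this pattern cold, the milestone below is within reach.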

Milestone

You can solve a 3-step SQL problem (filter, aggregate, window function) in under 15 minutes without referencing documentation.

Phase 2: Python for Data Engineering
4 weeks
  • Data structures: lists, dicts, sets, tuples, and when to use each
  • String processing: split, join, regex, f-strings
  • File I/O: reading CSV, JSON, and line-delimited files
  • Error handling: try/except, custom exceptions, logging
  • Functions: closures, decorators, generators, itertools
  • Collections module: defaultdict, Counter, deque
  • Basic OOP: classes for data models, dataclasses
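A few of the listed tools (Counter, defaultdict, generators) fit in a short sketch. The event log and names here are invented for illustration:

```python
from collections import Counter, defaultdict

# Toy event log: (user, action) pairs.
events = [("ana", "click"), ("bo", "view"), ("ana", "view"), ("ana", "click")]

# Counter: frequency of each action in one pass.
action_counts = Counter(action for _, action in events)

# defaultdict: group actions by user without key-existence checks.
by_user = defaultdict(list)
for user, action in events:
    by_user[user].append(action)

# Generator: lazily yield users with more than one event.
heavy = (u for u, acts in by_user.items() if len(acts) > 1)
heavy_list = list(heavy)

print(action_counts)
print(dict(by_user))
print(heavy_list)    # ['ana']
```

The same grouping written with a plain dict needs an `if user not in by_user` branch; interviewers notice when you reach for the right container instead.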

Milestone

You can write a Python script that reads a JSON file, transforms the data, handles edge cases, and writes clean output. No copy-pasting from Stack Overflow.
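A minimal sketch of that milestone script, with the JSON inlined as a string so the example is self-contained (field names and edge-case rules are invented; a real script would read from and write to files):

```python
import json

# Inlined stand-in for the input file.
raw = '[{"name": " Ana ", "age": "34"}, {"name": "Bo"}, {"age": "oops"}]'

def clean(record):
    """Return a normalized record, or None if it is unusable."""
    name = record.get("name", "").strip()
    if not name:
        return None                      # edge case: missing or blank name
    try:
        age = int(record.get("age", ""))
    except (TypeError, ValueError):
        age = None                       # edge case: bad or missing age
    return {"name": name, "age": age}

cleaned = [c for r in json.loads(raw) if (c := clean(r)) is not None]
print(json.dumps(cleaned))
```

The point is not the specific rules but the shape: parse, normalize each record, decide explicitly what happens to bad rows, and emit clean output.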

Phase 3: Data Modeling and Schema Design
3 weeks
  • Normalization: 1NF through 3NF, when to denormalize
  • Star and snowflake schemas for analytics
  • Slowly changing dimensions (SCD Type 1, 2, 3)
  • Fact tables vs dimension tables
  • Cardinality reasoning: one-to-many, many-to-many, bridge tables
  • Schema trade-offs: read vs write optimization, storage vs query speed
  • Entity-relationship diagrams and how to whiteboard them
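The star-schema shape from the list above can be sketched as runnable DDL. This is an illustrative design, not a prescribed one: all table and column names are invented, and SQLite stands in for a real warehouse:

```python
import sqlite3

# One fact table keyed to two dimension tables: the classic star.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,
        name         TEXT NOT NULL
    );
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY,   -- e.g. 20260115
        year     INTEGER NOT NULL,
        month    INTEGER NOT NULL
    );
    CREATE TABLE fact_sales (           -- grain: one row per sale
        customer_key INTEGER NOT NULL REFERENCES dim_customer(customer_key),
        date_key     INTEGER NOT NULL REFERENCES dim_date(date_key),
        amount       REAL NOT NULL
    );
    INSERT INTO dim_customer VALUES (1, 'Ana');
    INSERT INTO dim_date VALUES (20260115, 2026, 1);
    INSERT INTO fact_sales VALUES (1, 20260115, 99.5);
""")
# An analytics query joins the fact to its dimensions, then aggregates.
row = conn.execute("""
    SELECT c.name, d.year, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_customer c USING (customer_key)
    JOIN dim_date d USING (date_key)
    GROUP BY c.name, d.year
""").fetchone()
print(row)
```

In the interview round, be ready to state the grain of the fact table first; most schema-design follow-ups (SCD handling, bridge tables, denormalization) hang off that decision.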

Milestone

Given a business scenario, you can design a normalized schema, explain your choices, and discuss trade-offs in a 30-minute interview round.

Phase 4: Pipeline Architecture
3 weeks
  • Batch vs streaming: when to use each, latency trade-offs
  • ETL vs ELT patterns and their implications
  • Orchestration: DAGs, dependencies, retries, idempotency
  • Schema evolution: backward compatibility, additive changes
  • Data quality: validation, monitoring, alerting, SLAs
  • Partitioning and bucketing strategies
  • Cost optimization: compute vs storage trade-offs
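Two of the orchestration ideas above, dependency resolution and idempotency, can be illustrated in a few lines of plain Python (no orchestrator required; the task names and the toy "load" are invented):

```python
# A DAG as a mapping from task to its dependencies.
dag = {"extract": [], "transform": ["extract"], "load": ["transform"]}

def run_order(dag):
    """Topologically sort tasks so each runs after its dependencies."""
    order, done = [], set()
    def visit(task):
        if task in done:
            return
        for dep in dag[task]:
            visit(dep)
        done.add(task)
        order.append(task)
    for task in dag:
        visit(task)
    return order

state = {}
def load(partition, rows):
    """Idempotent load: overwrite the partition instead of appending,
    so a retry after a partial failure cannot double-write."""
    state[partition] = rows

print(run_order(dag))        # ['extract', 'transform', 'load']
load("2026-01-15", [1, 2])
load("2026-01-15", [1, 2])   # retry: same end state, no duplicates
print(state)
```

Real orchestrators (Airflow and its peers) handle the sorting and retries for you, but interviewers often probe whether you understand why idempotent tasks make retries safe.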

Milestone

You can whiteboard a data pipeline for a given business requirement, name specific tools you would use, and explain why you made each design choice.

Phase 5: Interview Prep
4 weeks
  • Timed SQL practice: 5 questions in 60 minutes
  • Timed Python practice: 3 problems in 45 minutes
  • Schema design mock interviews with trade-off discussion
  • Pipeline design discussion practice
  • Behavioral questions: conflict resolution, project ownership, failure stories
  • Weak spot drills: focused practice on your lowest-scoring areas
  • Full mock interview simulations

Milestone

You can complete a full mock interview loop (SQL round, Python round, system design round, behavioral round) and pass all four.

What to Skip (and What Not To)

Skip these (for now)

Learning Spark or Hadoop first
SQL and Python are the two most-tested skills by far, and Spark is not among them. Learn Spark after you get the job. Your first month on the team is the right time, not during prep.
Building a portfolio project before you know the fundamentals
A portfolio project built on shaky SQL skills wastes time. Get your fundamentals solid first. The portfolio project is week 14, not week 1.
Memorizing syntax for 5 different cloud platforms
Pick one cloud (AWS, GCP, or Azure). Understand the services at a high level for system design discussions. Do not memorize CLI commands.
Reading about data engineering instead of practicing
Reading blog posts feels productive but does not build interview skills. The ratio should be 80% hands-on practice, 20% reading.

Do not skip these

Window functions. SQL is the most-tested interview skill, and window functions are the most common advanced topic within the SQL category.
NULL handling. Tricky NULL behavior is a favorite topic in the 32.7% of interview rounds that are phone-screen SQL.
Data modeling. Roughly a third of interview loops include it. Many candidates skip this and fail the schema design round.
Timed practice. Knowing the material and performing under pressure are different skills.
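The NULL point is worth seeing concretely. Here is the classic three-valued-logic trap, shown with Python's built-in sqlite3 (table names invented):

```python
import sqlite3

# NOT IN against a subquery that returns a NULL matches nothing,
# because every comparison with NULL evaluates to UNKNOWN, not false.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER);
    CREATE TABLE banned (id INTEGER);
    INSERT INTO users VALUES (1), (2), (3);
    INSERT INTO banned VALUES (2), (NULL);   -- a NULL sneaks in
""")
not_in = conn.execute(
    "SELECT id FROM users WHERE id NOT IN (SELECT id FROM banned)"
).fetchall()
print(not_in)    # [] -- surprising: zero rows, not users 1 and 3

# NOT EXISTS sidesteps three-valued logic and gives the intended answer.
not_exists = conn.execute("""
    SELECT id FROM users u
    WHERE NOT EXISTS (SELECT 1 FROM banned b WHERE b.id = u.id)
""").fetchall()
print(not_exists)
```

If you can explain why the first query returns nothing, you are ready for most phone-screen NULL questions.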

Data Engineering Roadmap FAQ

How long does it take to become a data engineer?
With focused daily practice (45-60 minutes per day), most people can be interview-ready in 16-20 weeks. This assumes you are starting with basic programming literacy. If you already have SQL or Python experience, the timeline is shorter. The roadmap above covers 18 weeks total across 5 phases.
Do I need a CS degree to become a data engineer?
No. Many successful data engineers come from analyst, business intelligence, or self-taught backgrounds. Interviews test your ability to write SQL, code in Python, and reason about data systems. A CS degree helps with some fundamentals but is not required at most companies.
Should I learn SQL or Python first?
Start with SQL. It is the most-tested skill in DE interviews, appearing more often than any other topic. SQL is also easier to learn if you have no programming background. Python comes in Phase 2 after you have a solid SQL foundation.
What tools should I learn for data engineering?
For interview prep: SQL (any dialect), Python, and schema design tools. For on-the-job readiness: one cloud platform (AWS, GCP, or Azure), one orchestration tool (Airflow is most common), and one data warehouse (Snowflake, BigQuery, or Redshift). Do not try to learn all of these before your first interview.

Start the Roadmap Today

DataDriven covers Phase 1 through Phase 5. Data engineers earn well above the tech industry median, with top performers earning nearly double, and much of that gap comes down to interview performance. Start practicing.