The Data Engineer Roadmap

What to learn, in what order, and what to skip. Written from the interviews that actually decide hiring outcomes, not from the job descriptions that list every tool the company has ever touched.

Every list of data engineering skills you'll find online is too long. Most working data engineers use four things daily: SQL, Python, a warehouse, and an orchestrator. The remaining tools on the famous Awesome-Data-Engineering lists are real but optional, and the order in which you learn them matters more than the count. This is the order.

If you can write a window function from memory and parse a malformed JSON file in Python, you're past stage one. If you've done either with a deadline on the line, you're past stage two. Most people who think they need this roadmap need stage three.
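
To make "parse a malformed JSON file" concrete, here is a minimal sketch. It assumes the file is newline-delimited JSON in which some lines are truncated or have the wrong shape; the function name `parse_json_lines` and the sample records are illustrative, not taken from any particular interview.

```python
import json
from typing import Any, Iterable


def parse_json_lines(
    lines: Iterable[str],
) -> tuple[list[dict[str, Any]], list[tuple[int, str]]]:
    """Parse newline-delimited JSON, collecting bad lines instead of crashing."""
    good: list[dict[str, Any]] = []
    bad: list[tuple[int, str]] = []
    for lineno, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            continue  # blank lines are skipped, not rejected
        try:
            record = json.loads(text)
        except json.JSONDecodeError as exc:
            bad.append((lineno, exc.msg))  # keep the line number for the report
            continue
        if not isinstance(record, dict):
            bad.append((lineno, "not a JSON object"))
            continue
        good.append(record)
    return good, bad


# Hypothetical sample input: two good records, one truncated, one wrong shape.
sample = [
    '{"user_id": 1, "event": "login"}',
    '{"user_id": 2, "event": "click"',  # truncated: missing closing brace
    "",
    "[1, 2, 3]",  # valid JSON, but not an object
    '{"user_id": 3, "event": "logout"}',
]
records, rejects = parse_json_lines(sample)
print(len(records), "parsed;", [n for n, _ in rejects], "rejected")
```

Returning the rejects with their line numbers, rather than silently dropping them, is one reasonable design choice; a real job might instead write them to a quarantine file.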

What's overrated and what's underrated

Overrated. Spark before SQL. Five cloud platforms instead of one. Kafka unless you're interviewing somewhere that actually runs on Kafka. Building a portfolio project in month one before you can write the queries the project depends on. Memorizing Airflow operators. Watching YouTube videos at 1.5x speed. Reading about a tool instead of running it. dbt before you understand what a fact table is.

Underrated. Writing the same query three different ways and benchmarking them. Reading other people's code in production warehouses. Doing the boring schema-drawing exercise out loud. Asking working engineers what they actually do on Tuesdays, which is rarely what the job description says. Pairing with someone on a real bug. The first three scale; the rest compound.
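
"The same query three different ways" might look like the sketch below, which fetches each user's latest event with a correlated subquery, a join to an aggregate, and a window function. It assumes SQLite (via Python's `sqlite3`) as a stand-in for a real warehouse, and the `events` table and its contents are made up for illustration. Timings on a toy in-memory table say little about a production planner; the habit being practiced is checking that the variants agree and then measuring.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, ts INTEGER, action TEXT)")
# Hypothetical data: 100 users with 20 events each, ts unique per user.
rows = [(u, t, f"a{t}") for u in range(100) for t in range(20)]
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

QUERIES = {
    "correlated subquery": """
        SELECT user_id, ts, action
        FROM events AS e
        WHERE ts = (SELECT MAX(ts) FROM events WHERE user_id = e.user_id)
        ORDER BY user_id
    """,
    "join to aggregate": """
        SELECT e.user_id, e.ts, e.action
        FROM events AS e
        JOIN (SELECT user_id, MAX(ts) AS max_ts
              FROM events GROUP BY user_id) AS m
          ON m.user_id = e.user_id AND m.max_ts = e.ts
        ORDER BY e.user_id
    """,
    "window function": """
        SELECT user_id, ts, action
        FROM (SELECT *, ROW_NUMBER() OVER (
                  PARTITION BY user_id ORDER BY ts DESC) AS rn
              FROM events) AS ranked
        WHERE rn = 1
        ORDER BY user_id
    """,
}

results = {}
for name, sql in QUERIES.items():
    start = time.perf_counter()
    results[name] = conn.execute(sql).fetchall()
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{name:20s} {len(results[name]):4d} rows  {elapsed_ms:8.2f} ms")

# The timings only mean something if all three variants return the same rows.
assert len({tuple(r) for r in results.values()}) == 1
```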

The path, in order

Each stage has a thing to learn and a thing to ship. Skip neither. The 'ship' part is what turns reading into recall.

  1. SQL to fluency

    SELECT, JOIN, GROUP BY with HAVING, window functions, CTEs, recursive CTEs, NULL handling, conditional aggregation with FILTER, the difference between COUNT(col) and COUNT(*). You're done when you can write a window-function query that handles ties correctly without thinking about it, and when you can articulate why an INNER JOIN can drop rows.

    • Don't read about SQL. Write SQL against a Postgres database and let the wrong answers tell you what you don't know.
    • Ship: one analysis project that touches a real dataset and ends with a query you can defend out loud.
  2. Python the way a data engineer writes it

    Pandas for groupby/merge/pivot, the standard library for file parsing (CSV, JSON, gzipped logs, fields-of-fields), enough OOP to write a class with three methods and not embarrass yourself, and the kind of error handling that distinguishes a script from a job. Skip LeetCode-style algorithms; they don't show up in the rounds you care about.

    • If pandas feels slow, you're holding it wrong. Learn vectorization before you learn Polars.
    • Ship: a script that ingests a messy CSV, validates it, and writes to a warehouse table.
  3. Dimensional modeling

    Star schema, snowflake, the difference, when to denormalize on purpose. Slowly changing dimensions Type 1 versus Type 2 versus Type 6, and which one a real product actually needs. Grain. Always grain. State the grain before you draw the table. The single biggest separator between mid-level and senior data engineers is whether dimensional thinking is automatic or effortful.

    • Read Kimball's Data Warehouse Toolkit. There is no shortcut. The first edition dates to 1996 and the core ideas are still right.
    • Ship: a five-table dimensional model for a product you understand (your gym, a side project, a hobby) with the grain stated for every fact.
  4. One warehouse, deeply

    Pick one of Snowflake, BigQuery, or Postgres. Learn it past surface depth: query planning, partitioning, clustering, materialized views, the dialect quirks that change which queries are cheap. Going wide across all three is what bootcamps teach. Going deep on one is what gets you hired and lets you contribute on day one.

    • BigQuery if you're targeting Google or analytics-heavy startups. Snowflake for most mid-market. Postgres for working knowledge that translates everywhere.
  5. Orchestration and the failure modes that come with it

    Airflow conceptually, because it's the default. Backfills, retries, idempotency, the difference between a pipeline that works and one that's safe to re-run. Late-arriving data. Schema drift. The phrase 'exactly-once' and why it's almost always actually 'at-least-once with deduplication.' This is the content of the system design round.

    • Dagster and Prefect are real. Airflow is what you'll interview on. Learn one of the alternatives eventually; learn Airflow first.
    • Ship: a DAG that ingests something on a schedule, handles a deliberate failure, and recovers without manual intervention.
  6. Interview reps

    By now you know the material. What you don't have yet is the speed and the calm. Time-box SQL problems at 20 minutes, Python at 30, modeling at 45, design at 60. Do at least three full mock loops out loud before any onsite. The gap between knowing the answer and saying the answer is closed only by reps.

    • Use the round-by-round guides on this site for company-specific patterns.
    • Ship: an offer. The point of the roadmap is the offer, not the roadmap.
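
The stage 5 distinction between a pipeline that works and one that is safe to re-run, and the "at-least-once with deduplication" idea, can be sketched as follows. This is a minimal illustration under stated assumptions: SQLite stands in for the warehouse, and the `daily_orders` table, the `load_partition` helper, the date, and the sample batch are all hypothetical. The load replaces one day's partition inside a single transaction and deduplicates on the order key, so running it twice leaves the table unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE daily_orders (
        order_id   INTEGER PRIMARY KEY,  -- key used for deduplication
        order_date TEXT NOT NULL,
        amount     REAL NOT NULL
    )
    """
)


def load_partition(conn, order_date, batch):
    """Replace one day's partition atomically so a re-run cannot double-count."""
    # Deduplicate within the batch: the last record seen for a key wins.
    deduped = {order_id: amount for order_id, amount in batch}
    with conn:  # commits on success, rolls back if anything raises
        conn.execute(
            "DELETE FROM daily_orders WHERE order_date = ?", (order_date,)
        )
        conn.executemany(
            "INSERT INTO daily_orders (order_id, order_date, amount) "
            "VALUES (?, ?, ?)",
            [(oid, order_date, amt) for oid, amt in deduped.items()],
        )


# Hypothetical batch: order 2 was delivered twice by an at-least-once source.
batch = [(1, 10.0), (2, 25.5), (2, 25.5), (3, 7.25)]
load_partition(conn, "2024-01-01", batch)
load_partition(conn, "2024-01-01", batch)  # simulated retry or backfill

count, total = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM daily_orders"
).fetchone()
print(count, total)
```

Delete-then-insert per partition is only one way to get idempotency; an upsert keyed on `order_id` is a common alternative when partitions are too large to rewrite.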

Cloud, exactly to the depth interviews require

S3 or GCS, what they're for, what an object store does that a filesystem doesn't. IAM at the level of 'I can reason about a role and a policy.' One compute primitive (Lambda or Cloud Functions). One pipeline service (Step Functions, Dataflow, Glue). That's the surface area. Nobody is going to ask you to write Terraform on a whiteboard.

The cloud platform you pick should match the company you're targeting. AWS for breadth, GCP for analytics shops, Azure for enterprise. If you have no preference, AWS has the most jobs and translates into the others.

Tools you don't need yet

Spark unless you're targeting Netflix, Databricks, Airbnb, or a large adtech shop. Kafka unless you're interviewing somewhere that genuinely needs sub-second latency. Flink at all, unless you're applying to one of the five companies in the industry that ask Flink questions. dbt unless the company is dbt-native. Terraform, Kubernetes, Helm, ArgoCD: these are platform tooling, not data engineering, and they show up in the loop maybe one time in fifty.

If a job listing names all of these, the listing was written by a recruiter who read another listing. The work itself is closer to SQL and Python than to the tool inventory.

When to start interviewing

Sooner than you think. Most engineers wait until they feel ready, which is never. The signal you're ready to interview is not "I know everything" but "I can pass a single SQL round without freezing." Once that's true, interview while you continue to study; rejected loops are the best diagnostic you'll get for what to learn next. If you wait until you feel ready, you'll spend three extra months learning things that don't show up in the rounds.

For the round-specific deep dives, start with the SQL round walkthrough and the modeling round walkthrough. For the full loop, see the interview prep pillar.

Questions people actually ask

How long does this take if I'm starting from scratch?
Twelve to sixteen months at forty-five minutes a day, with the caveat that the hours are focused-practice hours, not reading-blog-posts hours. Six to eight if you have a CS or analytics background. Three if you're a working software engineer pivoting in.
Can I become a data engineer without a CS degree?
Yes, and most working data engineers under 35 didn't take the traditional CS path. The interviews test SQL, Python, modeling, and pipeline reasoning. A degree helps with the recruiter screen and almost nothing after that. Demonstrated skill in the rounds is what determines the outcome.
Should I get a certification?
No, with one exception. Certifications are weak hiring signals at most companies because the rounds test problem-solving, not vendor knowledge. The exception is the Databricks or Snowflake certification if you're targeting a specific role that genuinely requires it; even then, the cert gets you the screen, not the offer.
What about a portfolio project?
Useful in month ten, useless in month one. A portfolio project built on shaky SQL is a liability. Wait until you can build the analyses the project depends on, then pick something with a real dataset and write up the design tradeoffs. The writeup matters more than the code.
What's the fastest path if I'm already a software engineer?
SQL to fluency in two weeks, dimensional modeling in three weeks, then start interviewing. Your algorithm prep is over-leveled for these loops; the deltas to close are modeling and the failure modes of running schedules instead of writing services.
Why practice

Stage 1 is SQL. Start there.

  1. Active recall beats re-reading by 50%

    The review by Dunlosky et al. (2013) rates practice testing as a high-utility study technique, while re-reading and highlighting are rated low-utility.

  2. 76% of hiring managers reject on the coding task, not the resume

    From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.

  3. Five problem shapes cover 80% of data engineer loops

    Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition.
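
As one worked instance of these shapes, the sketch below runs top-N-per-group three ways to show where ties break, which is the stage 1 test of writing a window-function query that "handles ties correctly." It assumes SQLite via Python's `sqlite3`; the `messages` table, its rows, and the `top_n` helper are hypothetical. The only thing that changes between runs is the ranking function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (channel TEXT, message_id INTEGER, replies INTEGER)"
)
# Hypothetical data: three messages in 'general' tie at 7 replies.
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        ("general", 1, 9), ("general", 2, 7), ("general", 3, 7),
        ("general", 4, 7), ("general", 5, 2),
        ("random", 6, 4), ("random", 7, 1),
    ],
)


def top_n(rank_fn, n=3):
    """Return the top-n messages per channel using the given ranking function.

    rank_fn is interpolated into the SQL, so pass only the fixed names
    ROW_NUMBER, RANK, or DENSE_RANK, never user input.
    """
    sql = f"""
        SELECT channel, message_id, replies
        FROM (
            SELECT *, {rank_fn}() OVER (
                PARTITION BY channel ORDER BY replies DESC
            ) AS rnk
            FROM messages
        ) AS ranked
        WHERE rnk <= ?
        ORDER BY channel, replies DESC, message_id
    """
    return conn.execute(sql, (n,)).fetchall()


for fn in ("ROW_NUMBER", "RANK", "DENSE_RANK"):
    print(f"{fn:10s} -> {len(top_n(fn))} rows")
```

`ROW_NUMBER` returns exactly three rows per channel but picks arbitrarily among the tied messages unless a tiebreaker is added to the `ORDER BY`. `RANK` keeps every tied row and leaves a gap after them. `DENSE_RANK` keeps the ties and also the next distinct value. Which one is "correct" depends on what the question means by "top three," which is worth asking out loud.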