What Is Data Engineering?

A data engineer is the person at a company whose job is to make sure the warehouse has the right numbers in it. That covers a wide range of work, from babysitting Fivetran connectors to architecting a Kafka topology, but the unifying thing is ownership of where the data lives and how it gets there.

What Data Engineers Build

Data Pipelines

Code that moves rows out of an API or a Postgres replica, runs them through some transformation, and lands them in a warehouse on a schedule. About half the job is dealing with the source changing shape on you without warning.
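One way to deal with a source changing shape is to compare each incoming record against the columns the pipeline expects before anything is loaded. A minimal sketch, assuming records arrive as dicts; `EXPECTED_COLUMNS` and `detect_drift` are hypothetical names made up for illustration, not part of any library:

```python
# Hypothetical contract for one source table: column name -> expected Python type.
EXPECTED_COLUMNS = {"id": int, "email": str, "created_at": str}


def detect_drift(record: dict, expected: dict = EXPECTED_COLUMNS) -> dict:
    """Compare one source record against the expected columns.

    Returns the columns that are missing, unexpected, or of the wrong type.
    None values are treated as nullable and are not flagged.
    """
    missing = sorted(set(expected) - set(record))
    unexpected = sorted(set(record) - set(expected))
    wrong_type = sorted(
        name
        for name, expected_type in expected.items()
        if name in record
        and record[name] is not None
        and not isinstance(record[name], expected_type)
    )
    return {"missing": missing, "unexpected": unexpected, "wrong_type": wrong_type}


# Illustrative record: the source renamed `email` and started sending ids as strings.
drift = detect_drift(
    {"id": "42", "email_address": "a@example.com", "created_at": "2024-01-01"}
)
print(drift)
```

In practice a check like this would probably run on a sample of each batch and fail the load loudly rather than let a renamed column land as NULLs.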

Data Warehouses

The Snowflake or BigQuery account where analysts run their queries. You own the schema. SCD2 dimensions, incremental fact loads, the surrogate key strategy, the partitioning. If a finance number is wrong on Monday, your name is on the table that produced it.
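For readers who have not met SCD2 (slowly changing dimension, type 2): when a tracked attribute changes, the current row is closed and a new version is appended, so history is preserved. A rough sketch in plain Python, assuming an in-memory list of dicts stands in for the dimension table; `apply_scd2` and the `valid_from`/`valid_to` column names are illustrative choices, not a standard API:

```python
from datetime import date


def apply_scd2(dimension: list, incoming: dict, key: str, tracked: list, as_of: date) -> list:
    """Return a copy of `dimension` with `incoming` applied as an SCD2 change.

    If a tracked attribute changed, the current row is closed (valid_to set)
    and a new open row is appended. If nothing changed, the copy is returned
    as-is. A key that has never been seen gets its first open row.
    """
    result = [dict(row) for row in dimension]
    current = next(
        (row for row in result if row[key] == incoming[key] and row["valid_to"] is None),
        None,
    )
    if current is not None:
        if all(current[col] == incoming[col] for col in tracked):
            return result  # no tracked attribute changed
        current["valid_to"] = as_of  # close the old version
    new_row = {key: incoming[key], "valid_from": as_of, "valid_to": None}
    new_row.update({col: incoming[col] for col in tracked})
    result.append(new_row)
    return result


# Illustrative data: customer 1 moves from the free plan to the pro plan.
dim = [{"customer_id": 1, "plan": "free", "valid_from": date(2024, 1, 1), "valid_to": None}]
dim = apply_scd2(dim, {"customer_id": 1, "plan": "pro"}, key="customer_id", tracked=["plan"], as_of=date(2024, 3, 1))
for row in dim:
    print(row)
```

A real warehouse would typically do this in a MERGE statement and assign a surrogate key per version; the sketch only shows the close-and-append logic.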

Data Platforms

Airflow or Dagster, the dbt project, the warehouse permissions model, a catalog that nobody updates, and whatever observability you can afford. This is the part of the job that grows the most as the team grows.

Quality and Testing

dbt tests, freshness checks, row count diffs against a control snapshot. The goal is that when something breaks at 3am, the page tells the on-call which table is bad before they open the laptop.
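The freshness check and the row count diff can both be sketched in a few lines. This is an illustration only, assuming the load timestamp and the row counts have already been fetched from the warehouse; the function names, the table names, and the 5% tolerance are assumptions, not dbt features:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def check_freshness(table: str, last_loaded_at: datetime, max_lag: timedelta, now: datetime) -> Optional[str]:
    """Return an alert message if the table has not loaded within max_lag."""
    lag = now - last_loaded_at
    if lag > max_lag:
        return f"{table}: stale, last load {lag} ago (limit {max_lag})"
    return None


def check_row_count(table: str, current_rows: int, snapshot_rows: int, tolerance: float = 0.05) -> Optional[str]:
    """Return an alert message if the row count moved more than `tolerance` vs the snapshot."""
    if snapshot_rows == 0:
        return None if current_rows == 0 else f"{table}: snapshot was empty, now {current_rows} rows"
    change = (current_rows - snapshot_rows) / snapshot_rows
    if abs(change) > tolerance:
        return f"{table}: row count moved {change:+.1%} vs snapshot"
    return None


# Illustrative values: one stale table, one shrunken table, one healthy table.
now = datetime(2024, 1, 2, 3, 0, tzinfo=timezone.utc)
checks = [
    check_freshness("fct_orders", datetime(2024, 1, 1, 20, 0, tzinfo=timezone.utc), timedelta(hours=6), now),
    check_row_count("fct_orders", current_rows=900, snapshot_rows=1000),
    check_row_count("dim_customers", current_rows=1010, snapshot_rows=1000),
]
alerts = [message for message in checks if message is not None]
for alert in alerts:
    print(alert)
```

Every message starts with the table name, which is the point: the page itself says which table is bad.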

Data Engineer vs Data Scientist vs Data Analyst

	Data Engineer	Data Scientist	Data Analyst
Owns	The warehouse and the things feeding it	A model and the notebooks behind it	A dashboard and the questions behind it
Languages	SQL, Python. Maybe Scala if Spark	Python. Sometimes R if they came from stats	SQL. Looker or Tableau on top
SQL depth	Writes the indexes	Reads from the marts you built	Lives in window functions
A normal day	Pipeline triaging and schema arguments	Fighting with sklearn or a notebook kernel	Slack threads about why a number moved
Interview focuses on	SQL, Python, schema design	Statistics, ML, a coding round	SQL, a product case, sometimes A/B testing

A Typical Day as a Data Engineer

First thing: Open Slack, scan the #data-alerts channel for anything that paged overnight. Usually it's a Fivetran connector that ran out of API quota or a dbt model that hit a NOT NULL on a new column nobody told you about.
Standup: Fifteen minutes, half of which is the analytics lead asking when the new attribution model lands.
After standup: Real work. Whatever you're shipping this sprint. A new dbt model, a backfill, ripping out a connector that's been flaky for a month, arguing with a vendor about a CDC bug.
Some point after lunch: A PR review for someone on the analytics team who wants to push raw event data straight into a customer-facing table. You say no nicely.
Late: If you're unlucky, an exec asks where a number on a board came from and you spend an hour reverse-engineering a query somebody wrote in 2022.

The Interview Process

A normal loop is a recruiter screen, a technical phone screen, then an onsite of three to five rounds. Almost every loop has a SQL round somewhere. After that the composition depends a lot on the company.

SQL Round: The bedrock. Around 69% of DE loops include one. Almost always window functions (LAG, LEAD, RANK), a CTE or two, an aggregation against a fact table, sometimes a self join over events. You type into a shared editor like CoderPad. Some interviewers run your output against hidden test cases, others just read your code and ask why.
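To make the shape of a typical question concrete, here is a small sketch that runs a CTE with LAG and RANK against an in-memory SQLite database from Python. The `orders` table and its rows are made up for illustration, and window functions need SQLite 3.25 or newer, which most current Python builds bundle:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (user_id INTEGER, order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2024-01-01", 20.0),
        (1, "2024-01-05", 35.0),
        (2, "2024-01-02", 50.0),
        (2, "2024-01-03", 10.0),
    ],
)

# For each order: the user's previous order amount (LAG) and the order's
# rank by amount within that user (RANK), computed inside a CTE.
query = """
WITH ordered AS (
    SELECT
        user_id,
        order_date,
        amount,
        LAG(amount) OVER (PARTITION BY user_id ORDER BY order_date) AS prev_amount,
        RANK() OVER (PARTITION BY user_id ORDER BY amount DESC) AS amount_rank
    FROM orders
)
SELECT user_id, order_date, amount, prev_amount, amount_rank
FROM ordered
ORDER BY user_id, order_date
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The first order per user has no previous row, so `prev_amount` comes back as NULL (None in Python). Interviewers often ask what you would do with that NULL.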

Python or Coding Round: Shows up in just over half of loops. The bar is lower than a software engineering interview, but the problem space is different. Expect to parse a messy log, deduplicate records with a fuzzy key, or build a small generator pipeline. Forty-five minutes is typical. Pandas is fine. NumPy is overkill and signals you did not read the room.
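A sketch of what such a problem can look like: parse messy log lines, normalize them, and deduplicate, all as chained generators. The log format, the regex, and the sample lines are invented for illustration, and the lowercased email is a deliberately simple stand-in for a fuzzy key:

```python
import re
from typing import Iterable, Iterator

# Assumed line shape: "<timestamp> <level> user=<email> <message>"
LINE_PATTERN = re.compile(
    r"^(?P<ts>\S+)\s+(?P<level>[A-Za-z]+)\s+user=(?P<email>\S+)\s*(?P<msg>.*)$"
)


def parse_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield a dict per line that matches the pattern; skip everything else."""
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match:
            yield match.groupdict()


def normalize(records: Iterable[dict]) -> Iterator[dict]:
    """Uppercase the level and lowercase the email so near-duplicates collide."""
    for record in records:
        yield {**record, "level": record["level"].upper(), "email": record["email"].lower()}


def dedupe(records: Iterable[dict], key: str) -> Iterator[dict]:
    """Keep the first record seen for each value of `key`."""
    seen = set()
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            yield record


raw = [
    "2024-01-01T00:00:01Z INFO user=Ana@Example.com login",
    "garbage line without structure",
    "2024-01-01T00:00:05Z  warn user=ana@example.com   retry",
    "2024-01-01T00:00:09Z ERROR user=bo@example.com timeout",
]
result = list(dedupe(normalize(parse_lines(raw)), key="email"))
for record in result:
    print(record)
```

Because each stage is a generator, nothing is materialized until `list()` pulls records through, which is usually the property the interviewer wants you to explain.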

Data Modeling: About a third of loops have an explicit modeling round. The prompt is usually a real product (Lyft rides, a payments ledger, a content feed) and you're asked to model it well enough to answer two or three analytical questions. SCD2 comes up. Star schema comes up less often than you'd think, in around 4.7% of loops, because most interviewers want to see how you reason about grain rather than which formal pattern you name.
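Reasoning about grain means stating what one row of a table represents and then checking that the data agrees. A minimal sketch, assuming a rides fact table declared at one row per `ride_id`; the `grain_violations` helper and the sample rows are hypothetical:

```python
from collections import Counter


def grain_violations(rows: list, grain: tuple) -> dict:
    """Return the grain keys that appear more than once, with their row counts."""
    counts = Counter(tuple(row[col] for col in grain) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Illustrative fact rows; the last one simulates a double-loaded ride.
rides = [
    {"ride_id": "r1", "rider_id": "u1", "fare": 12.5},
    {"ride_id": "r2", "rider_id": "u1", "fare": 8.0},
    {"ride_id": "r2", "rider_id": "u1", "fare": 8.0},
]
by_ride = grain_violations(rides, ("ride_id",))
print(by_ride)
```

An empty result means the table really is at the grain you claimed; anything else means a sum over `fare` would double count.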

Behavioral: Nobody loves these but everybody runs them. Have two stories ready about a data quality incident you owned end to end, and one story about a disagreement with a stakeholder. The number one failure mode is candidates who frame everything in terms of what their team did instead of what they personally did.

$94K to $110K: 10th to 25th percentile base (US)
$148K: National median base, all levels
$210K+: 75th percentile in SF, NYC, Seattle

Is Data Engineering Right for You?

Reasons it fits:
The pay band starts above the tech median and the senior bands compress less than software engineering, which is good for the middle of your career.
Your work is hard to fake. Either the table is fresh and correct or it is not. Performance reviews are easier to argue when the answer is in the warehouse.
Most teams are still understaffed, so even mid-level ICs end up owning systems end-to-end. That accelerates a resume.
The skillset transfers. SQL and dbt look basically identical at Stripe, at a 40-person Series B, and at a hospital system.

Honest downsides:
On-call. Pipelines fail at 2am because Snowflake had a regional incident, or because someone renamed a Salesforce field.
A lot of the job is data quality archaeology. Why this number does not match that number. Hours of work, one Slack message of reward.
Nobody notices when the warehouse is healthy. They notice loudly when it is not.
The vendor churn is real. dbt, then dbt Cloud, then Coalesce, then SDF, then a Rust rewrite. Half of what you learn this year is a line item on a resume in three years.
At some companies the data team is a service desk for whoever yells loudest. Read the org chart before signing.

Frequently Asked Questions

Do I need a computer science degree to become a data engineer?

Not at most companies. A lot of DEs came in through the analyst chair or via a bootcamp. Where a CS degree quietly helps is in the parts of the interview that look like software engineering: a tricky Python round, a question about hash joins, anything that touches concurrency. Without the degree you can read a copy of Designing Data-Intensive Applications and probably close the gap.

What programming languages do data engineers use?

SQL is almost the entire job. Python is the second language and you use it for orchestration glue, the occasional API integration, and tests. If you end up on a Spark-heavy team you may see Scala, but most JVM work is now PySpark. You can get hired without ever writing Java.

How much do data engineers make?

Base salaries in the US generally start in the low six figures for someone with a few years of experience and stretch into the mid-200s at senior levels in major metros. Total comp at a public tech company is usually 1.4 to 2x base once stock is factored in. Our salary guide has the percentiles broken out by location and tier.

What is the difference between a data engineer and a backend engineer?

A backend engineer ships features that the user clicks on. A data engineer ships the warehouse that the company reports on. The skills rhyme but the bug reports are different. A backend bug is usually a 500 error in a logger. A data bug is usually a quiet number that's wrong by 3%.

How long does it take to become job-ready as a data engineer?

If you already write SQL at work and know enough Python to script things, the gap to a junior or mid DE role is usually a couple of months of evening practice. Coming in cold (no SQL, no Python) it's closer to six months. The bottleneck is almost never how many hours you put in; it's whether you have somewhere to practice on real-shaped data instead of toy CSVs.

Practice on real interview shapes

Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.

76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.

System design is graded on the calls you defend out loud
Ingestion, batch vs streaming, the bronze/silver/gold layers, idempotency, backfill and replay. Sketching the pipeline and naming the failure modes is the signal, not the boxes.
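
Idempotency and replay are easier to defend with a concrete shape in mind. One common approach is to overwrite a whole partition on every run instead of appending to it, so a replayed backfill cannot double count. A toy sketch, assuming a dict keyed by date stands in for a partitioned table; `load_partition`, `backfill`, and `extract_day` are hypothetical names:

```python
from typing import Callable


def load_partition(table: dict, partition: str, rows: list) -> None:
    """Replace the whole partition rather than appending to it."""
    table[partition] = list(rows)


def backfill(table: dict, partitions: list, extract: Callable) -> None:
    """Re-extract and overwrite each partition in turn."""
    for partition in partitions:
        load_partition(table, partition, extract(partition))


def extract_day(day: str) -> list:
    """Stand-in for a source query that returns one day of rows."""
    return [{"day": day, "orders": 3}]


warehouse = {}
days = ["2024-01-01", "2024-01-02"]
backfill(warehouse, days, extract_day)
first_run = {day: list(rows) for day, rows in warehouse.items()}

backfill(warehouse, days, extract_day)  # replay the same window
print(warehouse == first_run)
```

Running the backfill twice leaves the table in the same state as running it once. An append-based load would have doubled every partition.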

