The Modern Data Stack
The five-layer architecture that won the 2020s, the tradeoffs that are getting un-done in 2026, and how interviewers actually want you to talk about both.
The modern data stack is what you get when you take the legacy ETL warehouse, separate storage from compute, replace the stored procedures with SQL inside the warehouse, and break the monolith into five tools that each do one thing. It worked. It is now being partly reversed, because eight SaaS bills with eight integration points is its own kind of pain. The 2026 version of the stack is fewer tools doing more, not more tools doing less.
For interview purposes, what matters is being able to name the layers, defend a tool choice for a given company size, and articulate the tradeoff the modern stack made and the one it's now correcting. The rest is vendor trivia.
What the stack got right
ELT over ETL. Land raw, transform in place. Cloud warehouse storage is cheap enough to treat as nearly free, and compute is elastic and billed only while queries run. There are few reasons left to filter at the edge. Loading everything raw preserves optionality: when the requirement changes in six months (it will), the raw data is still there.
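To make "land raw, transform in place" concrete, here is a minimal sketch that uses Python's built-in `sqlite3` as a stand-in for a cloud warehouse. The table names, the sample rows, and the `MODELS` dictionary are all hypothetical; the point is only that nothing is filtered at load time, so a second model can be built later from the same raw table.

```python
import sqlite3

# Hypothetical source rows; a real pipeline would receive these from an ingestion tool.
RAW_EVENTS = [
    (1, "acme", 1250, "paid"),
    (2, "acme", 400, "refunded"),
    (3, "globex", 9900, "paid"),
]

# Each model is plain SQL over the raw table (the "T" happens inside the database).
MODELS = {
    "customer_revenue": (
        "SELECT customer, SUM(amount_cents) AS total_cents "
        "FROM raw_orders WHERE status = 'paid' GROUP BY customer"
    ),
    "customer_refunds": (
        "SELECT customer, SUM(amount_cents) AS total_cents "
        "FROM raw_orders WHERE status = 'refunded' GROUP BY customer"
    ),
}


def load_raw(conn, events):
    """Land every row unfiltered (the "EL" step)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders "
        "(order_id INTEGER, customer TEXT, amount_cents INTEGER, status TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", events)


def build_model(conn, name):
    """Rebuild one modeled table from raw. `name` must be a key of MODELS."""
    conn.execute(f"DROP TABLE IF EXISTS {name}")
    conn.execute(f"CREATE TABLE {name} AS {MODELS[name]}")


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_raw(conn, RAW_EVENTS)
    build_model(conn, "customer_revenue")
    # Six months later the requirement changes: refunds matter too.
    # No re-ingestion is needed because the refunded rows were never dropped.
    build_model(conn, "customer_refunds")
    print(conn.execute("SELECT * FROM customer_refunds").fetchall())
```

Had the load step filtered to paid orders only, the refunds model would require going back to the source system.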
Storage and compute as separate primitives. The architectural shift that made everything else possible. Snowflake's virtual warehouses, BigQuery's slot model, Iceberg and Delta on top of object storage. Reporting workloads stop fighting ad-hoc workloads for the same CPU. ML training reads the same tables BI reads without copying them. Capacity planning, the bottleneck of the on-prem era, mostly evaporates.
SQL as the interface. dbt runs SQL. BI tools speak SQL. Streaming engines now speak SQL (ksqlDB, Flink SQL, RisingWave). The analyst who knows SQL can ship transformations. The data engineer who knows SQL can talk to the analyst. Everyone uses the same language for the same job.
Analytics as code. dbt put the transformation layer into Git. Code review, tests, CI, deploys. Before dbt, transformations lived in stored procedures and Jupyter notebooks; pull requests on data logic weren't a thing. This cultural change was larger than the tooling change.
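What "tests on data logic" means in practice can be sketched in a few lines. The idea below mirrors the convention dbt uses for data tests (a test is a query that selects failing rows and passes when it returns none), but it is not dbt's implementation. The function names, the `CHECK_SQL` templates, and the `dim_customer` table are hypothetical, and `sqlite3` stands in for the warehouse.

```python
import sqlite3

# Each check is a query that returns the offending values; zero rows means pass.
CHECK_SQL = {
    "not_null": "SELECT {column} FROM {table} WHERE {column} IS NULL",
    "unique": (
        "SELECT {column} FROM {table} WHERE {column} IS NOT NULL "
        "GROUP BY {column} HAVING COUNT(*) > 1"
    ),
}


def failing_rows(conn, check, table, column):
    """Run one check and return the rows that violate it.

    `table` and `column` are interpolated directly, so they must be
    trusted identifiers, never user input.
    """
    sql = CHECK_SQL[check].format(table=table, column=column)
    return conn.execute(sql).fetchall()


def run_checks(conn, checks):
    """Map each (check, table, column) to True when no failing rows exist."""
    return {
        f"{check}:{table}.{column}": not failing_rows(conn, check, table, column)
        for check, table, column in checks
    }


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dim_customer (customer_id INTEGER, email TEXT)")
    conn.executemany(
        "INSERT INTO dim_customer VALUES (?, ?)",
        [(1, "a@example.com"), (2, None), (2, "b@example.com")],
    )
    results = run_checks(
        conn,
        [
            ("unique", "dim_customer", "customer_id"),
            ("not_null", "dim_customer", "email"),
        ],
    )
    for name, passed in results.items():
        print(name, "PASS" if passed else "FAIL")
```

Because the checks are ordinary code, they can live in Git next to the models and run in CI on every pull request.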
The five layers
Stable across vendors. What changes is which company is winning each layer this quarter.
| Layer | What it does | Default in 2026 |
|---|---|---|
| Ingestion | Move data from sources into the warehouse. Schema detection, incremental loads, CDC. | Fivetran or Airbyte. Custom only when the source is unusual or PII forces it. |
| Warehouse | Store everything raw, query everything cheaply, scale compute independently of storage. | Snowflake or BigQuery. Databricks if you're ML-heavy. Redshift in legacy AWS shops. |
| Transformation | Turn raw into modeled tables. ELT, not ETL. SQL in the warehouse, not Python on the edge. | dbt. Almost no real competition for the default; SQLMesh exists. |
| Orchestration | Schedule the DAG. Retry. Alert. Backfill. Block downstream when upstream fails. | Airflow still wins by deployment count. Dagster is winning new builds. |
| BI | Serve dashboards. Define metrics once and have every team see the same number. | Looker for the metrics layer, Tableau or Power BI for execs, Metabase for everyone else. |
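The orchestration row packs several behaviors into one line: run the DAG in dependency order, retry, and block downstream when upstream fails. A toy sketch of those three behaviors follows, using the standard library's `graphlib`. It is not how Airflow or Dagster are implemented; `run_dag`, the task names, and the status strings are hypothetical, and scheduling, alerting, and backfills are left out.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, deps, max_attempts=2):
    """Run tasks in dependency order.

    tasks: dict of task name -> zero-argument callable.
    deps:  dict of task name -> set of upstream task names (all present in tasks).
    Returns a dict of task name -> "success", "failed", or "skipped".
    """
    graph = {name: set(deps.get(name, ())) for name in tasks}
    status = {}
    # static_order() yields every task after all of its upstream tasks.
    for name in TopologicalSorter(graph).static_order():
        # Block downstream: skip if any upstream did not succeed.
        if any(status[upstream] != "success" for upstream in graph[name]):
            status[name] = "skipped"
            continue
        status[name] = "failed"
        for _attempt in range(max_attempts):
            try:
                tasks[name]()
            except Exception:
                continue  # retry until attempts are exhausted
            status[name] = "success"
            break
    return status


if __name__ == "__main__":
    attempts = {"ingest": 0}

    def ingest():
        # Fails once, then succeeds, to exercise the retry path.
        attempts["ingest"] += 1
        if attempts["ingest"] < 2:
            raise RuntimeError("transient source error")

    def transform():
        raise RuntimeError("model failed to build")

    def dashboard():
        pass

    print(
        run_dag(
            {"ingest": ingest, "transform": transform, "dashboard": dashboard},
            {"transform": {"ingest"}, "dashboard": {"transform"}},
        )
    )
```

Skipping the dashboard refresh when the transform fails is the behavior that keeps stale or half-built tables from reaching the BI layer.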
What the stack got wrong
Tool sprawl. Eight vendors with eight billing models, eight auth systems, eight failure modes. Each is best-in-class at one job; the integration tax between them is nobody's job. A 50-person company running the canonical modern stack spends real engineering time on tool plumbing instead of data work.
Cost surprises. Usage-based pricing on the warehouse and the ingestion tool means one mistake can spike the bill tenfold in a week: a single bad query, an unwatched dbt model rebuilt on every run, or a Fivetran connector resyncing on schema change. The first thing every modern stack eventually adds is a FinOps tool, a query monitor, or an engineer whose job is just to watch the warehouse bill.
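The simplest version of that query monitor is a comparison of each day's spend against a trailing baseline. The sketch below is one way to do it, under stated assumptions: `flag_spend_spikes`, the seven-day window, and the 3x threshold are all hypothetical choices, and the spend figures are made-up illustration data rather than real billing output.

```python
from statistics import median


def flag_spend_spikes(daily_spend, window=7, threshold=3.0):
    """Return indexes of days whose spend exceeds `threshold` times the
    median of the preceding `window` days.

    The median is used as the baseline so that one earlier spike does not
    inflate the baseline and hide the next one.
    """
    flagged = []
    for i in range(window, len(daily_spend)):
        baseline = median(daily_spend[i - window:i])
        if baseline > 0 and daily_spend[i] > threshold * baseline:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Made-up daily warehouse spend in dollars; day 8 is a runaway query.
    spend = [100, 100, 100, 100, 100, 100, 100, 105, 1000, 110]
    print(flag_spend_spikes(spend))
```

A real setup would read spend from the warehouse's own usage or billing views and alert on flagged days instead of printing them.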
Streaming as an afterthought. The modern stack is batch-first. Real-time is a parallel architecture, typically Kafka plus Flink or Spark Streaming, that lives alongside the warehouse stack and rarely shares code with it. The "unified batch and streaming" pitch has been one quarter away for five years.
Modularity that didn't fully deliver. The promise was you'd swap any component cheaply. In practice, swapping warehouses means rewriting dbt models around dialect differences, redoing every BI connection, and revalidating every metric. Modularity is real but the switching cost is higher than the marketing said.
Where 2026 is going
The pendulum is swinging toward consolidation. Snowflake bought Streamlit and added Snowpark. Databricks added Delta Live Tables and an orchestrator. dbt added the semantic layer and metrics. Microsoft Fabric is the same idea from the other direction: one bill, one auth, one set of integration points, in exchange for less best-of-breed per component.
The bet for the next five years is that the best-of-suite platforms get good enough at the individual layers that the modularity premium isn't worth paying. The counter-bet is that they don't, and best-of-breed survives. Reasonable people disagree. In an interview, naming both directions and the tradeoff between them reads as senior; picking a winner reads as junior.
Practically, the stack most companies should run in 2026 is smaller than the canonical version. Snowflake or BigQuery plus dbt plus one orchestrator plus one BI tool. Ingestion via the warehouse-native option when it exists. Streaming only when the latency SLA actually requires it.
How interviewers ask about it
"Design the stack for a Series B with fifty people." The losing answer names eight tools. The winning answer names four, picks managed over self-hosted on every layer, and defends the choice with the sentence "this team doesn't have the engineering capacity to operate Airflow on Kubernetes." Concrete recommendation: Fivetran or warehouse-native ingestion, Snowflake or BigQuery, dbt Cloud, Metabase or Looker depending on existing tooling.
"ELT versus ETL, when would you still ETL?" Three conditions. PII regulations that require redaction before warehouse landing. Source volume that's genuinely too large to load raw (rare; usually means you should sample at the source). Transformation logic that's not expressible in SQL (ML model scoring, heavy text processing). Outside these, ELT wins by default in 2026.
"What are the limitations of the modern data stack?" Tool sprawl. Cost unpredictability. The streaming gap. The senior signal is also naming what the market is doing about each: consolidation, FinOps tooling, the emergence of warehouse-native streaming products. Mid candidates name the problems; senior candidates name the problems and the responses.
"What's overrated?" Real-time everything. Multi-cloud. Replacing dbt with whatever launched last quarter. The honest answer is that the boring parts (model design, testing, documentation, naming) determine outcomes more than the tool selection at any layer.