The modern data stack is a cloud-native, ELT-first, modular approach to data architecture. Instead of one monolithic platform, you use specialized tools for ingestion, warehousing, transformation, orchestration, and business intelligence. Each tool does one thing well and connects to the others through standard interfaces.
This page covers the five core components, the evolution from legacy ETL, the current consolidation trend, and how interviewers test your understanding of modern data architecture.
Core Components
Source: DataDriven analysis of 1,042 verified data engineering interview rounds.
Every modern data stack has these five layers. The specific tools vary by organization, but the architecture pattern is consistent. Data flows from left to right: sources into ingestion, ingestion into the warehouse, warehouse through transformation, orchestration coordinating everything, and BI serving the output.
Ingestion tools move data from source systems into a central store. In the modern data stack, ingestion is typically managed by dedicated tools that connect to APIs, databases, SaaS applications, and event streams. They handle schema detection, incremental loading, change data capture (CDC), and error recovery. The key principle: ingestion tools should be configuration-driven, not code-heavy. Define the source, the destination, and the sync schedule. The tool handles the rest.
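To make "incremental loading" concrete, here is a minimal sketch of the cursor-based sync a managed connector performs under the hood. It assumes a source whose rows carry a monotonically increasing `updated_at` field; all names (`Connector`, `cursor_field`) are illustrative, not any vendor's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Connector:
    """Toy incremental sync: only rows newer than the saved cursor are loaded."""
    cursor_field: str = "updated_at"
    state: dict = field(default_factory=dict)  # persisted between sync runs

    def sync(self, source_rows, destination):
        high_water = self.state.get("cursor", 0)
        # Incremental load: skip rows already seen on a previous run.
        new_rows = [r for r in source_rows if r[self.cursor_field] > high_water]
        destination.extend(new_rows)  # append-only load into the destination
        if new_rows:
            self.state["cursor"] = max(r[self.cursor_field] for r in new_rows)
        return len(new_rows)
```

The first sync pulls everything; subsequent syncs pull only rows whose cursor value exceeds the saved high-water mark, which is why re-running a connector does not duplicate data.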
The shift from custom ingestion scripts to managed connectors is one of the defining changes of the modern stack. A data engineer in 2018 spent weeks writing a Python script to pull data from Salesforce, handle pagination, manage OAuth tokens, and deal with rate limits. In 2026, they configure a connector in a UI and it handles those same concerns automatically. This frees engineers to work on transformation and modeling instead of plumbing. The tradeoff: managed connectors are opinionated about schema mapping and sync frequency. When you need custom logic (like filtering events at the source), you may still need code.
The warehouse is the central compute and storage layer. Unlike traditional on-prem databases, cloud warehouses separate storage from compute, enabling independent scaling. Load 10TB of data once, then spin up as many compute clusters as you need for different workloads (reporting, ad-hoc analysis, ML training) without copying the data. This architecture eliminated the capacity planning bottleneck that plagued on-prem warehouses.
Three warehouses dominate the market: Snowflake (multi-cloud, usage-based pricing, strong governance), BigQuery (serverless, auto-scaling, tight Google Cloud integration), and Redshift (AWS-native, RA3 instances for storage-compute separation). Each has different pricing models, concurrency handling, and ecosystem integrations. In interviews, knowing the architectural differences between these three shows you think about platform selection, not just SQL. Databricks with Delta Lake blurs the warehouse/lakehouse boundary and is increasingly common in ML-heavy organizations.
Transformation in the modern stack follows the ELT pattern: load raw data first, then transform it inside the warehouse using SQL. dbt (data build tool) is the standard transformation tool. It lets you write SELECT statements that define transformations, and it handles materialization (table, view, incremental), dependency resolution, testing, and documentation. The 'T in ELT' is where data engineers spend most of their modeling time.
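A toy sketch can show what dbt's core loop looks like: each model is a SELECT statement, `ref('name')` declares a dependency on another model, and the tool materializes models as tables in dependency order. This is a simplification under stated assumptions (sqlite stands in for the warehouse, and only table materialization is shown), not dbt's actual implementation.

```python
import re
import sqlite3

# Each model is a SELECT; {{ ref('other_model') }} declares a dependency.
MODELS = {
    "stg_orders": "SELECT id, amount FROM raw_orders WHERE amount > 0",
    "order_totals": "SELECT COUNT(*) AS n, SUM(amount) AS total "
                    "FROM {{ ref('stg_orders') }}",
}

REF = r"\{\{\s*ref\('(\w+)'\)\s*\}\}"

def build(conn):
    """Materialize every model as a table, parents before children."""
    done, order = set(), []

    def visit(name):
        if name in done:
            return
        for dep in re.findall(REF, MODELS[name]):  # resolve dependencies first
            visit(dep)
        sql = re.sub(REF, r"\1", MODELS[name])      # ref('x') -> table name x
        conn.execute(f"CREATE TABLE {name} AS {sql}")  # table materialization
        done.add(name)
        order.append(name)

    for name in MODELS:
        visit(name)
    return order
```

Running `build` against a warehouse containing `raw_orders` creates `stg_orders` first and `order_totals` second, because the dependency graph is derived from the `ref()` calls rather than declared by hand.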
dbt changed data engineering culture as much as it changed tooling. It brought software engineering practices to analytics: version control, code review, testing, CI/CD, and documentation are now standard in transformation workflows. Before dbt, transformations lived in stored procedures, Airflow operators, or Jupyter notebooks with no testing or version control. The cultural shift toward 'analytics as code' is arguably more impactful than the tool itself. Alternatives to dbt exist (SQLMesh, Dataform, custom Python), but dbt's ecosystem (packages, community, hiring market) makes it the default choice for most teams.
Orchestration tools schedule and coordinate pipeline tasks. They define the DAG (directed acyclic graph) of dependencies: ingest data, then transform, then run quality checks, then update dashboards. When a task fails, the orchestrator handles retries, alerts, and dependency blocking (do not run downstream tasks if upstream failed). Orchestration is the control plane of your data platform.
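The control-plane behavior described above (retries, then blocking downstream tasks when an upstream task ultimately fails) can be sketched in a few lines. This is a hypothetical minimal scheduler for illustration, not any real orchestrator's API.

```python
def run_dag(tasks, deps, max_retries=2):
    """tasks: {name: callable}; deps: {name: [upstream names]}.
    Returns a status per task: success, failed, or skipped."""
    status = {}

    def run(name):
        if name in status:
            return
        for up in deps.get(name, []):
            run(up)  # upstream tasks run first (DAG order)
        if any(status[up] != "success" for up in deps.get(name, [])):
            status[name] = "skipped"  # dependency blocking: upstream failed
            return
        for _attempt in range(max_retries + 1):
            try:
                tasks[name]()
                status[name] = "success"
                return
            except Exception:
                continue  # retry the task up to max_retries times
        status[name] = "failed"

    for name in tasks:
        run(name)
    return status
```

A task that fails transiently succeeds on retry, while a task that keeps failing is marked failed and its downstream tasks are skipped rather than run against missing data.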
Airflow is the most widely deployed orchestrator, but it is showing its age. Its scheduler is pull-based (polls for tasks), its UI is dated, and DAG authoring in Python is verbose. Newer tools (Dagster, Prefect, Mage) offer asset-based orchestration (define what data you want, not just what tasks to run), better local development experience, and built-in observability. In interviews, knowing Airflow is still important because most companies use it. But understanding why newer tools are gaining traction shows you think about the evolution of the space.
BI tools sit at the top of the stack and serve data to business users. They connect to the warehouse, let users build dashboards and reports, and (in the modern stack) push metric definitions down to a semantic layer. The trend is toward lighter, SQL-native BI tools that treat the warehouse as the compute engine instead of importing data into their own cache.
The BI layer is where the modern data stack meets the business. If the BI tool is slow, confusing, or shows stale data, none of the upstream engineering matters. Modern BI tools support live queries against the warehouse (no data extraction), version-controlled dashboards, embedded analytics (dashboards inside your product), and semantic layers (consistent metric definitions). Looker pioneered the 'metrics as code' approach with LookML. Newer tools like Lightdash and Hashboard extend this pattern. Traditional tools like Tableau and Power BI remain dominant by user count but are adding cloud-native features to compete.
The modern data stack did not appear overnight. It evolved over a decade as cloud infrastructure matured, warehouses became elastic, and the data engineering community adopted software engineering practices. Understanding this evolution helps you answer interview questions about why the modern stack exists and where it is heading.
In the legacy era, data lived in on-prem databases with limited storage and compute. Transformations happened before loading (ETL) because warehouse resources were scarce and expensive. Data engineers wrote stored procedures, SSIS packages, or Informatica workflows. Schema changes required DBA approval and multi-week migration cycles. Testing was manual. Version control was rare. If the ETL job failed at 3 AM, the on-call engineer connected to a VPN and restarted it manually.
Organizations moved to cloud warehouses (Redshift, then BigQuery and Snowflake). Storage became cheap. Compute became elastic. The ETL bottleneck shifted: you no longer needed to minimize what you loaded, so the industry moved toward ELT. Load everything raw, then transform inside the warehouse. Airflow replaced cron jobs. Python scripts replaced stored procedures. But many organizations just lifted-and-shifted their legacy patterns into the cloud without rethinking their architecture.
The modern data stack emerged as a modular, best-of-breed architecture. Fivetran for ingestion. Snowflake for storage and compute. dbt for transformation. Airflow for orchestration. Looker for BI. Each component was best-in-class and connected through standard interfaces (SQL, APIs, JDBC). This composability was the core innovation: swap any component without rebuilding the whole stack. The downside: integrating 6+ tools creates its own complexity.
The pendulum is swinging toward consolidation. Running 8 different SaaS tools with 8 different billing models and 8 different integration points is expensive and fragile. Platforms like Databricks and Snowflake are expanding to cover ingestion, transformation, orchestration, and ML in a single platform. dbt is adding semantic layer and metrics capabilities. The trend is not back to monolithic on-prem, but toward fewer, more integrated cloud-native platforms. The 'best-of-breed' era is being challenged by the 'best-of-suite' approach.
ELT over ETL. Load raw data first, transform inside the warehouse. Cloud warehouses have cheap storage and elastic compute, so there is no reason to transform before loading. Loading raw data preserves optionality: if you need a different transformation later, the raw data is already there. ETL permanently discards data that was filtered during the transform step.
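The optionality argument is easiest to see side by side. The sketch below (sqlite standing in for a cloud warehouse, with invented table names) loads every raw event, answers one question with SQL, and then, when a new question arrives later, answers it from the same raw table. Under ETL that filtered to purchases before loading, the click rows would already be gone.

```python
import sqlite3

# ELT: load everything raw, transform inside the warehouse with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (user TEXT, type TEXT, amount REAL)")
conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", [
    ("a", "purchase", 30.0),
    ("a", "click", 0.0),
    ("b", "click", 0.0),
    ("b", "purchase", 12.0),
])

# First transformation: revenue per user.
conn.execute("""CREATE TABLE revenue AS
    SELECT user, SUM(amount) AS total FROM raw_events
    WHERE type = 'purchase' GROUP BY user""")

# Months later, a new question: click counts per user. Because the
# raw table was preserved, this is just another SQL statement; no
# re-extraction from the source is needed.
conn.execute("""CREATE TABLE clicks AS
    SELECT user, COUNT(*) AS n FROM raw_events
    WHERE type = 'click' GROUP BY user""")
```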
Separation of storage and compute. Store data once. Spin up as many compute clusters as you need for different workloads. A reporting cluster, an ad-hoc analysis cluster, and an ML training cluster can all read the same data without copying it. Scale compute independently of storage. This was the architectural innovation that made cloud warehouses fundamentally different from on-prem databases.
SQL as the lingua franca. The modern stack standardized on SQL for transformation, analytics, and increasingly for ML feature engineering. dbt runs SQL. BI tools query SQL. Even streaming tools (ksqlDB, Flink SQL) are adding SQL interfaces. This lowers the barrier to entry: a data analyst who knows SQL can contribute to the transformation layer without learning Python or Scala.
Infrastructure as code. Transformations are version-controlled (dbt in Git). Orchestration DAGs are defined in code (Airflow, Dagster). Infrastructure is provisioned with Terraform or Pulumi. This enables CI/CD for data pipelines: run tests on a pull request, deploy to staging, validate, then promote to production. The manual, click-through approach of legacy tools is replaced by reproducible, auditable code.
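As a concrete example of "run tests on a pull request," here is a sketch of dbt-style data tests a CI job could execute: each check is a query that must return zero rows, and the build fails if any check returns one. The check names and table are illustrative, and sqlite again stands in for the warehouse.

```python
import sqlite3

# Each data test is a query that returns the offending rows;
# an empty result means the test passes.
CHECKS = {
    "orders_id_unique":
        "SELECT id FROM orders GROUP BY id HAVING COUNT(*) > 1",
    "orders_amount_not_null":
        "SELECT 1 FROM orders WHERE amount IS NULL",
}

def run_checks(conn):
    """Return the names of failing checks; CI fails if the list is non-empty."""
    return [name for name, sql in CHECKS.items()
            if conn.execute(sql).fetchone() is not None]
```

Because the checks live in version control next to the transformations, a pull request that breaks a uniqueness or not-null guarantee fails CI before it ever reaches production.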
Modularity with standard interfaces. Each tool in the stack communicates through standard interfaces: SQL, JDBC/ODBC, REST APIs, and file formats (Parquet, Avro). This lets you swap components without rebuilding the entire stack. If you outgrow your ingestion tool, replace it. If you switch warehouses, your dbt models still work (with minor dialect adjustments). The modular approach reduces vendor lock-in, though the current consolidation trend is testing this principle.
These questions appear in system design rounds and general knowledge interviews. Interviewers test whether you understand the architecture holistically, not just individual tools.
How to approach this
Define the modern data stack as cloud-native, ELT-first, and modular. Contrast with legacy: on-prem, ETL (transform before load), monolithic tools. Key differences: storage-compute separation, managed connectors for ingestion, SQL-based transformation in the warehouse, and BI tools that query the warehouse directly. Mention the tradeoff: modularity gives flexibility but adds integration complexity. Show you understand the current consolidation trend where platforms are expanding their scope.
How to approach this
Keep it simple. Managed ingestion tool for the 10 to 15 SaaS sources they probably use (Stripe, HubSpot, Segment). Snowflake or BigQuery as the warehouse (choose based on existing cloud provider). dbt for transformations. Start with dbt Cloud for orchestration instead of deploying Airflow (simpler, less infrastructure). Metabase or Looker for BI. Total: 4 tools, all managed services, no infrastructure to maintain. Explain why you chose managed over self-hosted: a 50-person startup does not have the bandwidth to operate Airflow and a Kubernetes cluster.
How to approach this
ELT: load raw data into the warehouse, then transform using SQL inside the warehouse. ETL: transform data before loading. ELT won because cloud warehouses have cheap storage and elastic compute, so there is no reason to minimize what you load. But ETL is still appropriate when: (1) the source data is too large to load raw (transform to reduce volume at the edge), (2) PII must be stripped before entering the warehouse for compliance, (3) the transformation requires non-SQL logic (Python, ML models) that runs better outside the warehouse.
How to approach this
Four limitations: (1) tool sprawl (8+ tools with different billing, auth, and integration points), (2) cost unpredictability (usage-based pricing on warehouse and ingestion tools can spike), (3) real-time gaps (the modern stack is batch-first; streaming requires a parallel architecture), (4) operational complexity (each tool needs monitoring, upgrades, and incident response). The market is responding: platforms are consolidating, streaming capabilities are being added, and FinOps tools help manage costs.
System design questions test your understanding of the full data stack. Practice designing end-to-end pipelines on DataDriven.