ETL vs ELT: The Real Decision in 2026
Two acronyms. Three patterns once you count EtLT, which is what production actually looks like. The argument is no longer 'which one wins' but 'where in the pipeline graph should each transform live.' Pipeline architecture trade-offs surface frequently in the data modeling segment of DE interviews, and the answer interviewers grade on is never a definition.
What this guide actually says
ELT won the warehouse era. ETL is winning the privacy era. The pendulum is moving back toward ingestion-time governance because GDPR, CCPA, HIPAA, and the EU AI Act all demand minimization at the boundary, not after the fact in a downstream model. If your warehouse can't handle your transforms, the answer is rarely 'go back to ETL' — it's usually incremental models, smaller warehouses, partitioning, or moving the heaviest DAG node to Spark. EtLT is the realistic shape of every production pipeline above 1 TB/day. dbt is not ELT — it's the T stage and runs in both patterns. Interviewers don't ask which is better; they give you a scenario and grade where you push transforms.
Three patterns, in one breath each
Same letters, different orders. The order changes where compute lives and what breaks at 3 AM.
Extract, Transform, Load
Pull from sources, reshape and clean on external compute (Spark, Python, a managed service), then load only the finished product into the warehouse. The warehouse never sees raw data.
Extract, Load, Transform
Pull from sources, land raw rows in the warehouse, then transform with SQL inside the warehouse (typically dbt). Raw is preserved, replays are cheap, analysts can read the bronze layer.
Extract, light-t, Load, Big-T
The realistic hybrid. Strip PII, normalize encodings, dedup keys, and validate schema before the warehouse. Then load and run the heavy business modeling inside it. What almost every regulated production pipeline actually looks like.
EtLT: the pattern people forget to name
Almost no real production pipeline above 1 TB/day is pure ELT. The realistic shape is land raw, do a small light-t for governance and primary-key dedup before the warehouse, then run the Big-T inside it.
| Stage | ETL | ELT | EtLT |
|---|---|---|---|
| Where transforms run | External Spark, Python, or a managed ETL service | Inside the warehouse via SQL or dbt | Light transforms in stream/staging, heavy transforms in the warehouse |
| Raw data retention | Often discarded after load | Always kept in a bronze schema | Kept in object storage; raw never enters the warehouse if it's sensitive |
| PII handling | Stripped before load (strong) | Lands raw, masked at query time (weaker) | Stripped or hashed in the light-t before warehouse load (strong, with replays) |
| Replay model | Re-extract from source | Re-run dbt against retained raw | Re-run dbt against retained raw; the light-t is deterministic and rerunnable |
| Cost center | EC2, EMR, or managed ETL service | Warehouse credits dominate | Split between stream/staging compute and warehouse credits |
| Typical orchestrator | Airflow, Step Functions, Informatica scheduler | Airflow or Dagster firing dbt | Airflow/Dagster firing both stages, separate SLAs per stage |
| Schema evolution risk | Silent drops if mappings are static | Forgiving: raw absorbs new columns, dbt models surface them | Best of both: light-t enforces contracts; warehouse adapts at the modeling layer |
Seven dimensions, side by side
The table you want loaded in your head before a pipeline architecture round. Each row is a question an interviewer can pull out and probe on for 5 minutes.
| Dimension | ETL | ELT |
|---|---|---|
| Transformation Timing | Transform before loading. Data is cleaned, joined, and reshaped in a staging area, then loaded into the warehouse in its final form. | Load first, transform later. Raw data lands in the warehouse, then SQL or dbt models reshape it into analytics-ready tables. |
| Storage Requirements | Lower warehouse storage. Only transformed data lives in the destination. Raw stays in the source or is discarded. | Higher warehouse storage. You keep raw alongside transformed tables. Storage is cheap in modern cloud warehouses, so this is usually acceptable. |
| Flexibility | Less flexible. Changing a transformation means re-running from scratch. Adding a new column requires updating the ETL job and a backfill window. | More flexible. Raw is preserved, so you can re-transform any time without re-extracting. New questions can be answered from existing raw. |
| Tooling | Heritage tools: Informatica, Talend, SSIS, custom Python or Spark. Transformation engine is outside the warehouse. | Modern stack: dbt, SQL inside Snowflake/BigQuery/Redshift/Databricks SQL. The warehouse is the transformation engine. |
| Cost Model | Compute lives in your servers or a managed service. Warehouse cost is lower because you only store transformed data. | Compute lives in the warehouse. Storage is higher and credit cost can dominate above a few TB/day. |
| Governance | Strong. Sensitive fields can be removed before they ever reach the warehouse. Easier to satisfy data-residency and minimization rules. | Weaker by default. Raw lands first, governance happens at query time. Row-level security and dynamic masking are required to compensate. |
| Failure shape | Stage-table corruption, ingestion bottlenecks, schema-evolution silent drops. | Warehouse credit spikes, dbt incremental drift, late-arriving facts that break SCD logic. |
Why ETL came first, why ELT took over, why ETL is back
The shift is not linear. Each era was right for its constraints; the current era is rediscovering that ingestion-time governance solves problems pure ELT cannot.
ETL was the only option
Warehouses had limited compute and expensive storage. It made sense to transform data on cheaper application servers before loading only the final result. Informatica and SSIS were built for this model.
Cloud warehouses changed the math
Snowflake, BigQuery, and Redshift offered massive parallel compute and decoupled storage. The warehouse could outrun any external Spark cluster on most SQL shapes. The bottleneck shifted from warehouse capacity to pipeline complexity.
ELT became the default
dbt made SQL-based transformation a first-class workflow with version control, tests, and documentation. Teams loaded raw and built the warehouse model in code. Fivetran + Snowflake + dbt became the canonical modern data stack.
ETL is back, dressed as governance
GDPR, CCPA, HIPAA enforcement and the EU AI Act made ingestion-time data minimization a hard requirement in regulated industries. Reverse ETL added pressure to push transforms back into the source-of-truth pipeline. Warehouse credit lines became the dominant cost at large scale. The pendulum is moving back.
Five forces pushing transforms back toward ingestion
Privacy law won
GDPR Article 5 (data minimization) and HIPAA both treat raw PII landing in a general-purpose warehouse as a privacy violation, regardless of downstream masking. Stripping at ingestion is the only defensible posture in regulated industries.
Warehouse credits dominate cost at large shops
A finance organization burning $4M/yr on Snowflake credits cares whether a particular DAG node should run on Spark on EC2 (predictable spend, spot pricing) or on warehouse XL credits at peak hours.
Reverse ETL pushed transforms back into the source-of-truth pipeline
If Hightouch is shipping warehouse data back into Salesforce and Braze, the business logic producing it needs the same governance and SLAs as a production application — fits the ETL operational model better than ad-hoc dbt models on a daily schedule.
LLM features want clean training data
Raw lakes are a liability for LLM fine-tuning and RAG: you don't want a customer email leaking into a prompt embedding because someone forgot to mask a column. Curated training corpora live closer to the ETL pattern (one place where data was filtered, the lineage is recorded and auditable).
Streaming forced people to think about staging again
Kafka + Flink pipelines naturally do transforms before sink. The moment any real-time path enters the architecture, you're already running ETL on the streaming side. That re-normalizes the pattern in engineers' heads.
Five scenarios you should answer cold
Interviewers don't ask 'ETL or ELT.' They ask scenarios. Internalize the shape so you can produce one for any variant.
Design an ingestion pipeline for 50 source systems with mixed schemas.
Strong: land raw to object storage with an immutable layout (s3://bronze/source=stripe/dt=2026-04-30/...). Run a schema registry (Avro/Confluent or a homegrown Iceberg/Delta contract) so producers can't silently drop columns. Apply a light light-t for PII redaction and primary-key dedup. Load with Fivetran/Airbyte for SaaS sources and a custom Python/Spark job for the weird ones. Warehouse ingest is governed: each source has a contract test that runs before dbt can build silver. This is EtLT, not pure ELT.
How do you backfill a 90-day window after a downstream model changed?
ELT shines here. Re-run dbt in full-refresh mode against the retained raw bronze, but only for the affected DAG subgraph. Use a materialization strategy (incremental with a unique_key, or table for small models) that the new logic understands. Pitfall: incremental drift — old logic had subtle dedup behavior the new doesn't, so the backfill silently produces different totals than the daily incremental does going forward. Mitigate with a parity check that compares full-refresh output to incremental output on a cutover day.
PII shows up in your raw layer. Walk me through remediation.
Order: (1) revoke access to the affected schemas immediately. (2) Identify the column(s) and the source contract that should have flagged them. (3) Run a one-shot scrub job that hashes or nulls the columns, write to a new partition, atomically swap. (4) Add the field to the light-t in the ingestion path so the next batch never lands raw. (5) File a privacy incident if legal requires (EU GDPR Article 33: 72 hours). The right answer also names the architectural fix: this is why EtLT exists.
Your nightly ELT exceeded the daily warehouse credit budget. Diagnose and propose.
Pull query history grouped by warehouse + dbt model. Look for: (a) a new model scanning an unpartitioned bronze table; (b) an incremental model that lost its unique_key and is re-scanning history every run; (c) a CROSS JOIN or fan-out via a poorly-keyed JOIN; (d) a virtual warehouse sized too large (XL when L would queue trivially). Propose: smaller warehouse with auto-suspend at 60s, clustering keys on the bronze table, switch the worst model to a Spark job on object storage if its DAG node is dominated by wide shuffle rather than SQL aggregation.
When would you reach for ETL on a modern stack?
Five honest cases: (1) regulated PII that cannot enter the warehouse raw; (2) very large unstructured workloads where Spark/Beam beats SQL on the shape (image features, log parsing, deeply nested JSON); (3) ML feature pipelines needing deterministic, versioned offline + online parity, which fits a dedicated transformation service better than dbt; (4) cross-warehouse pipelines where you can't trust either warehouse as the canonical engine; (5) cost containment when warehouse credits for a specific DAG node clearly exceed equivalent EC2/EMR cost.
Three scales, real numbers
The crossover where ELT becomes painful. Numbers are list-price ballparks; negotiated discounts move the lines but not the shape of the curve.
| Volume | ETL (Airflow + EMR/EC2) | ELT (warehouse credits) | Crossover note |
|---|---|---|---|
| 100 GB/day | $800-$1,500/mo. EMR cluster too heavy; usually a single r5.2xlarge or Glue with conservative DPUs. | $400-$1,200/mo. Snowflake S warehouse for 30 min/day, dbt Cloud or self-hosted dbt Core. | ELT wins comfortably. Engineering time on ETL plumbing usually exceeds the credit savings. |
| 1 TB/day | $3k-$8k/mo. EMR transient cluster, 10-20 nodes, ~1 hour. | $5k-$15k/mo. M-L warehouse for 1-2 hours/day depending on join shape and clustering. | Roughly even. Choice dictated by team skills (Spark vs SQL) and governance posture. |
| 10 TB/day, hourly transforms | $20k-$50k/mo. EMR or Spark on EKS, persistent or warm-pool clusters with spot. | $50k-$200k/mo. L-XL warehouse running most of the day; cluster-key maintenance and result-cache misses dominate. | ETL/EtLT wins. Above 5 TB/day with hourly cadence, warehouse credits routinely exceed equivalent Spark cost by 2-4x. |
Numbers are list-price ballparks for Snowflake credits and US-East EC2/EMR pricing as of 2026. Negotiated enterprise discounts move the lines, but the shape (ELT cheap below 1 TB/day, expensive above 5 TB/day at hourly cadence) is robust across BigQuery, Redshift, and Databricks SQL too.
What each pattern actually runs on
Naming the right stack signals you've shipped the pattern, not just read about it.
| Tool | Pattern | When to reach for it |
|---|---|---|
| Informatica PowerCenter | ETL | Enterprise on-premise pipelines with heavy governance. |
| Talend Open Studio | ETL | Visual job design with code generation. |
| Microsoft SSIS | ETL | Bundled with SQL Server shops; Visual Studio authoring. |
| Custom PySpark on Airflow | ETL | Imperative transformations where SQL is the wrong primitive. |
| AWS Glue / Azure Data Factory / GCP Dataflow | ETL | Managed cloud-native ETL replacements for legacy boxed tools. |
| Apache Beam (Dataflow, Flink) | ETL | Unified batch + streaming transformation outside the warehouse. |
| dbt Core / dbt Cloud | ELT | Version-controlled SQL models with tests and documentation. |
| Snowflake / BigQuery / Redshift / Databricks SQL | ELT | In-warehouse transformation engines. |
| Fivetran / Airbyte / Stitch | ELT | Managed extract and load into the warehouse. |
| Airflow / Dagster / Prefect | Either | Orchestrate dbt runs or external transformation jobs. |
| Iceberg / Delta Lake on object storage | ELT (lakehouse) | When the warehouse is actually a lakehouse with open table formats. |
Myth vs reality
Myth: ELT is always cheaper than ETL
Reality: under 1 TB/day, yes. Above 5 TB/day with hourly transforms, warehouse credits routinely exceed equivalent EC2/EMR cost by 2-4x. The crossover sneaks up because dbt makes it easy to add expensive models without anyone reviewing the credit line.
Myth: ETL is legacy. Nobody builds it anymore
Reality: every regulated industry (healthcare, finance, EU public sector) is moving back toward ingestion-time governance. The pattern is alive, rebadged as 'data contracts' and 'privacy-first ingestion.' Vendor logos changed; architecture didn't.
Myth: dbt = ELT. Spark = ETL
Reality: dbt is the T stage. It runs inside Snowflake/BigQuery/Redshift (common ELT case) but also against Spark via dbt-spark or dbt-databricks where it's structurally an ETL transformation engine. Naming a tool does not name a pattern.
Myth: Streaming makes ETL/ELT obsolete
Reality: streaming pipelines still pick where transforms execute — in Flink, in Kafka Streams, on the consumer side, or after sink in the warehouse. The taxonomy survives. A Flink job stripping PII before sink is ETL by any honest definition.
Myth: If your warehouse is fast enough, ELT always wins
Reality: warehouses optimize for query throughput on relational shapes. They lose to Spark/Beam on wide-shuffle ML feature pipelines, deeply nested JSON parsing, and unstructured workloads. The right answer depends on the shape of the DAG, not the speed of the warehouse.
Decision matrix
Nine common situations and the pattern that wins each. A starting point; pressure-test against the constraints you actually have.
| Situation | Pick | Reason |
|---|---|---|
| Mixed schemas, under 1 TB/day, all SQL transforms | ELT | Warehouse credits are cheap at this scale; dbt + Fivetran shortens engineer time the most. |
| Heavy PII, regulated (health, EU, fintech) | ETL or EtLT | Article 5 minimization and HIPAA make raw PII landing a violation. Strip before the warehouse. |
| Large unstructured (logs, blobs), Python/Spark transforms | ETL | Wide-shuffle and JSON-flattening shapes lose to Spark on object storage; SQL is the wrong primitive. |
| ML training data prep with offline/online parity | ETL | Feature pipelines need deterministic, versioned transformation services. dbt isn't built for this. |
| Analytics-only, single warehouse, < 5 TB/day | ELT | Canonical case. dbt + Snowflake/BigQuery beats every alternative on time-to-insight. |
| 10 TB/day at hourly cadence, cost-sensitive | EtLT | Push the heaviest DAG node to Spark; let the warehouse handle the modeling. Credit savings dominate. |
| Cross-warehouse pipeline (Snowflake to Databricks, etc.) | ETL | No single warehouse can be the trusted transformation engine; an external service is the only neutral place. |
| Real-time event pipeline with sub-minute SLA | ETL (streaming) | Flink/Kafka Streams transform before sink; warehouse SQL is too slow and too coarse. |
| Greenfield analytics team with 2 engineers | ELT | Operational simplicity wins. Move to EtLT when scale or compliance forces the issue. |
ELT failure modes
Where ELT pipelines actually break.
Warehouse credit spike
A new dbt model joins two huge bronze tables without clustering. The nightly run takes 6 hours instead of 40 minutes and burns through the monthly credit budget by Tuesday.
dbt incremental drift
Someone changed the unique_key on an incremental model without forcing a full refresh. New rows merge correctly; old rows are stale. The drift is silent until a finance reconciliation breaks weeks later.
Late-arriving facts
A payment processor backfills yesterday's transactions at 03:00 UTC today. Your incremental model already ran at 02:30 UTC and shipped wrong numbers to a CFO dashboard. Lookback windows and watermarks aren't optional.
PII leak via raw query
An analyst with bronze read access opens the customer table directly. Nothing is hashed at this layer; row-level security policies are only on silver views. A privacy audit catches it before the regulator does — if you're lucky.
Result-cache invalidation cascade
A schema change at the bronze layer invalidates the result cache for every downstream model. The next refresh costs 10x normal because nothing hits cache.
ETL failure modes
Where ETL pipelines actually break. Different failure surface from ELT.
Stage-table corruption
A Spark job dies mid-write. The stage table has half a partition; the downstream load reads it and inserts a duplicate or partial row. Fix requires idempotent stage writes (write to a tmp prefix, atomic rename) which not every pipeline does.
Ingestion bottleneck
The transformation cluster is sized for steady-state. A holiday traffic spike triples the volume. The Spark cluster can't autoscale fast enough, warehouse load is delayed, downstream SLAs miss.
Schema-evolution silent drops
The source adds a new column. The ETL mapping is static and ignores it. Six months later someone notices product analytics is missing a critical field that's been there the whole time, just not in the warehouse.
Stale logic in code
The Spark transform was written 4 years ago in PySpark 2.4. Upgrading to Spark 3.5 breaks half the UDFs. Nobody owns the code anymore; nobody can confidently rewrite it.
No replay path
The source overwrote yesterday's data. The ETL job already discarded raw. To backfill a corrected calculation you must re-extract from a source that no longer holds the original snapshot.
ETL vs ELT FAQ
What does ETL stand for?+
What does ELT stand for?+
What is EtLT?+
Is ELT always better than ETL?+
Do data engineering interviews ask about ETL vs ELT?+
Does dbt force ELT?+
How do I cost-compare ETL and ELT?+
Where do PII and GDPR fit in this decision?+
Practice the SQL behind ETL and ELT
- 01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
- 02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
- 03
System design is graded on the calls you defend out loud
Ingestion, batch vs streaming, the bronze/silver/gold layers, idempotency, backfill and replay. Sketching the pipeline and naming the failure modes is the signal, not the boxes