ETL vs ELT: The Real Decision in 2026
- ELT won the warehouse era. ETL is winning the privacy era. The pendulum is moving back toward ingestion-time governance because GDPR, CCPA, HIPAA, and the EU AI Act all demand minimization at the boundary, not after the fact in a downstream model.
- If your warehouse cannot handle your transforms, the answer is rarely ‘go back to ETL.’ The answer is usually incremental models, smaller virtual warehouses on cheap shapes, partitioning, or moving the heaviest DAG node to Spark on object storage. Pattern choice is the last lever to pull.
- The argument is no longer ETL vs ELT. It is where to push transforms in the pipeline graph: at the source, at ingestion, in a staging layer, or inside the warehouse. Modern stacks place different transforms in different stages of the same pipeline.
- EtLT is the realistic shape of every production pipeline above 1 TB/day. Land raw, do a small lower-case t (PII redaction, schema normalization, primary-key dedup) before the warehouse, then run a Big-T inside the warehouse for business modeling. Almost no real shop is pure ELT.
- dbt is not ELT. dbt is the T stage. It runs equally well inside an ELT pipeline (most common) or on top of a Spark cluster acting as the transformation engine in a classic ETL flow. The vendor pitch conflates the two.
- Interviewers do not ask which is better. They give you a scenario and grade where you push transforms, why, and what breaks at 3 AM. Memorizing definitions is wasted prep.
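The lower-case t described above is small enough to sketch. This is a toy in-memory version, assuming hypothetical field names (`user_id`, `email`) and a salt you would really pull from a secret store, not source control:

```python
import hashlib

# Hypothetical sketch of the lower-case t in an EtLT pipeline: hash PII,
# normalize column names, and dedup on a primary key before warehouse load.
SALT = b"rotate-me-outside-source-control"  # assumption: real salt lives in a secret store

def hash_pii(value: str) -> str:
    """Deterministic salted hash so joins still work without exposing raw PII."""
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()

def light_t(records: list[dict]) -> list[dict]:
    seen, out = set(), []
    for rec in records:
        rec = {k.strip().lower(): v for k, v in rec.items()}  # schema normalization
        if "email" in rec:
            rec["email"] = hash_pii(rec["email"])             # PII redaction
        pk = rec.get("user_id")
        if pk in seen:                                        # primary-key dedup
            continue
        seen.add(pk)
        out.append(rec)
    return out
```

The point is that this step is deterministic and rerunnable, which is exactly what makes replays against retained raw safe later.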
Three patterns, in one breath each
Same letters, different orders. The order changes everything about where compute lives, how flexible the pipeline is, and what breaks at 3 AM.
Extract, Transform, Load
Extract, Load, Transform
Extract, light-t, Load, Big-T
EtLT: the pattern people forget to name
Almost no real production pipeline above 1 TB/day is pure ELT. The realistic shape is land raw, do a small lower-case t for governance and primary-key dedup before the warehouse, then run the Big-T inside it. This is what regulated industries already do; analytics shops are catching up.
| Stage | ETL | ELT | EtLT |
|---|---|---|---|
| Where transforms run | External Spark, Python, or a managed ETL service | Inside the warehouse via SQL or dbt | Light transforms in stream/staging, heavy transforms in the warehouse |
| Raw data retention | Often discarded after load | Always kept in a bronze schema | Kept in object storage; raw never enters the warehouse if it is sensitive |
| PII handling | Stripped before load (strong) | Lands raw, masked at query time (weaker) | Stripped or hashed in the lower-case t before warehouse load (strong, with replays) |
| Replay model | Re-extract from source | Re-run dbt against retained raw | Re-run dbt against retained raw; the lower-case t is deterministic and rerunnable |
| Cost center | EC2, EMR, or managed ETL service | Warehouse credits dominate | Split between stream/staging compute and warehouse credits |
| Typical orchestrator | Airflow, Step Functions, Informatica scheduler | Airflow or Dagster firing dbt | Airflow / Dagster firing both stages, with separate SLAs per stage |
| Schema evolution risk | Silent drops if mappings are static | Forgiving. Raw absorbs new columns, dbt models surface them. | Best of both. Light-t enforces contracts; warehouse adapts at the modeling layer. |
Seven dimensions, side by side
The table you want loaded in your head before a pipeline architecture round. Each row is a question an interviewer can pull out and probe on for 5 minutes.
| Dimension | ETL | ELT |
|---|---|---|
| Transformation Timing | Transform before loading. Data is cleaned, joined, and reshaped in a staging area, then loaded into the warehouse in its final form. | Load first, transform later. Raw data lands in the warehouse, then SQL or dbt models reshape it into analytics-ready tables. |
| Storage Requirements | Lower warehouse storage. Only transformed data lives in the destination. Raw data stays in the source or is discarded. | Higher warehouse storage. You keep raw data alongside transformed tables. Storage is cheap in modern cloud warehouses, so this is usually acceptable. |
| Flexibility | Less flexible. Changing a transformation means re-running the pipeline from scratch. Adding a new column requires updating the ETL job and a backfill window. | More flexible. Raw data is preserved, so you can re-transform any time without re-extracting. New questions can be answered from existing raw data. |
| Tooling | Heritage tools: Informatica, Talend, SSIS, custom Python or Spark. The transformation engine is outside the warehouse. | Modern stack: dbt, SQL inside Snowflake, BigQuery, Redshift, or Databricks SQL. The warehouse is the transformation engine. |
| Cost Model | Compute lives in your servers or a managed service. Warehouse cost is lower because you only store transformed data. | Compute lives in the warehouse. Storage is higher and credit cost can dominate above a few TB/day. |
| Governance | Strong. Sensitive fields can be removed before they ever reach the warehouse. Easier to satisfy data-residency and minimization rules. | Weaker by default. Raw lands first, governance happens at query time. Row-level security and dynamic masking are required to compensate. |
| Failure shape | Stage-table corruption, ingestion bottlenecks, schema-evolution silent drops. | Warehouse credit spikes, dbt incremental drift, late-arriving facts that break SCD logic. |
Why ETL came first, why ELT took over, why ETL is back
The shift is not linear. Each era was right for its constraints, and the current era is rediscovering that ingestion-time governance solves problems pure ELT cannot.
ETL was the only option
Cloud warehouses changed the math
ELT became the default
ETL is back, dressed as governance
Why ETL is back
Not because ELT is bad. Because the constraints changed. Five forces are pushing transforms back toward ingestion.
Each one alone is enough to justify pulling at least some transforms out of the warehouse. Together they are an architectural pendulum.
- Privacy law won. GDPR Article 5 (data minimization) and HIPAA both treat raw PII landing in a general-purpose warehouse as a privacy violation, regardless of downstream masking. Stripping at ingestion is the only defensible posture in regulated industries.
- Warehouse credits are the dominant cost line at large shops. A finance organization burning $4M/yr on Snowflake credits cares a lot whether a particular DAG node should run on Spark on EC2 (predictable spend, spot pricing) or on warehouse XL credits at peak hours.
- Reverse ETL pushed transforms back into the source-of-truth pipeline. If Hightouch is shipping warehouse data back into Salesforce and Braze, the business logic that produces that data needs the same governance and SLAs as a production application, which fits the ETL operational model better than ad-hoc dbt models on a daily schedule.
- LLM features want clean training data. Raw lakes are a liability for LLM fine-tuning and RAG: you do not want a customer email leaking into a prompt embedding because someone forgot to mask a column. Curated training corpora live closer to the ETL pattern: one place where the data was filtered, with lineage that is recorded and auditable.
- Streaming forced people to think about staging again. Kafka + Flink pipelines naturally do transforms before sink. The moment any real-time path enters the architecture, you are already running ETL on the streaming side. That re-normalizes the pattern in engineers' heads.
“Pick ETL when the cost of governance is higher than the cost of compute. Pick ELT when the cost of compute is lower than the cost of engineering time.”
Five scenarios you should be able to answer cold
Interviewers do not ask ‘ETL or ELT.’ They ask scenarios. Each of these has a strong answer outline below; the goal is to internalize the shape so you can produce one for any variant.
Design an ingestion pipeline for 50 source systems with mixed schemas.
How do you backfill a 90-day window after a downstream model changed?
PII shows up in your raw layer. Walk me through remediation.
Your nightly ELT job exceeded the daily warehouse credit budget. Diagnose and propose.
When would you reach for ETL on a modern stack?
Three scales, real numbers
The crossover where ELT becomes painful. Numbers are list-price ballparks; negotiated discounts move the lines but not the shape of the curve.
| Volume | ETL (Airflow + EMR/EC2) | ELT (warehouse credits) | Crossover note |
|---|---|---|---|
| 100 GB/day | $800-$1,500/mo. EMR cluster too heavy; usually a single r5.2xlarge or Glue with conservative DPUs. | $400-$1,200/mo. Snowflake S warehouse for 30 minutes/day, dbt Cloud or self-hosted dbt Core. | ELT wins comfortably. Engineering time on ETL plumbing usually exceeds the credit savings. |
| 1 TB/day | $3k-$8k/mo. EMR transient cluster, 10-20 nodes, run for ~1 hour. | $5k-$15k/mo. M-L warehouse for 1-2 hours/day depending on join shape and clustering. | Roughly even. Choice is dictated by team skills (Spark vs SQL) and governance posture. |
| 10 TB/day, hourly transforms | $20k-$50k/mo. EMR or Spark on EKS, persistent or warm-pool clusters with spot. | $50k-$200k/mo. L-XL warehouse running most of the day; cluster-key maintenance and result-cache misses dominate. | ETL/EtLT wins. Above 5 TB/day with hourly cadence, warehouse credits routinely exceed equivalent Spark cost by 2-4x. |
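The crossover arithmetic behind the table is simple enough to sketch. Every rate below (credit price, node-hour rate, warehouse size in credits/hour) is an illustrative assumption, not a quote; the shape of the comparison is the point:

```python
# Back-of-envelope crossover check using list-price-style ballparks.
def monthly_warehouse_cost(hours_per_day: float, credits_per_hour: float,
                           price_per_credit: float = 4.0, days: int = 30) -> float:
    """ELT side: warehouse credits dominate."""
    return hours_per_day * credits_per_hour * price_per_credit * days

def monthly_spark_cost(nodes: int, hours_per_day: float,
                       node_hour_rate: float = 0.40, days: int = 30) -> float:
    """ETL side: EMR/EC2 node-hours at spot-ish pricing."""
    return nodes * hours_per_day * node_hour_rate * days

# 10 TB/day, hourly cadence shape: an assumed XL warehouse (~16 credits/hr)
# running ~18 h/day vs an assumed 40-node Spark cluster for the same hours.
elt = monthly_warehouse_cost(hours_per_day=18, credits_per_hour=16)
etl = monthly_spark_cost(nodes=40, hours_per_day=18)
# Under these assumptions the gap is about 4x, consistent with the 2-4x claim.
```

Swap in your negotiated rates; the lines move but the curve keeps its shape.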
What each pattern actually runs on
The tool list is the giveaway in interviews. Naming the right stack signals you have shipped the pattern, not just read about it.
Transformation engine sits outside the warehouse. Heritage tools dominate, with managed cloud variants catching up.
- Informatica PowerCenter for enterprise on-premise pipelines with heavy governance.
- Talend Open Studio for visual job design with code generation.
- Microsoft SSIS bundled inside SQL Server shops with a Visual Studio authoring surface.
- Custom Python and PySpark jobs orchestrated through Airflow when transformations need imperative logic.
- AWS Glue, Azure Data Factory, and GCP Dataflow as managed cloud-native ETL replacements for the legacy boxed tools.
- Apache Beam (via Dataflow or Flink) for unified batch + streaming transformation outside the warehouse.
Warehouse is the transformation engine. The modern data stack is built around this assumption.
- dbt Core or dbt Cloud for version-controlled SQL models with tests and documentation.
- Snowflake, BigQuery, Redshift, and Databricks SQL as the in-warehouse transformation engines.
- Fivetran, Airbyte, and Stitch for managed extract and load into the warehouse.
- Airflow, Dagster, or Prefect to orchestrate dbt runs and downstream dependencies.
- Materialized views and incremental table macros to encode incremental ELT logic that the warehouse executes.
- Iceberg and Delta Lake on object storage when the warehouse is actually a lakehouse with open table formats.
Five things you will hear in a tech-blog comments section
And what an engineer who has actually run both patterns will say back.
The matrix
Nine common situations and the pattern that wins each. Use this as a starting point, then pressure-test it against the constraints you actually have.
When the matrix above does not match your situation exactly, fall back to these six rules.
- Pick ETL when sensitive fields must be stripped before they ever land in the warehouse.
- Pick ETL when the destination has limited compute and cannot absorb transformation cost.
- Pick ELT when transformation logic changes weekly and you want analyst-friendly SQL workflows.
- Pick ELT when storage is cheap and replaying transforms against raw history is valuable.
- Pick EtLT when ingest paths are mixed: light governance/normalization on the way in, heavy modeling once data is staged.
- Pick streaming ETL when the SLA is sub-minute and waiting for a warehouse refresh is unacceptable.
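The fallback rules above can be encoded as a coarse first-pass decision function. The boolean inputs and the precedence order are simplifying assumptions; real decisions weigh these constraints together rather than short-circuiting:

```python
# The fallback rules as a first-pass heuristic, not a final answer.
def pick_pattern(pii_must_not_land: bool, sub_minute_sla: bool,
                 mixed_ingest_paths: bool, destination_compute_limited: bool,
                 logic_changes_weekly: bool) -> str:
    if sub_minute_sla:
        return "streaming ETL"            # waiting on a warehouse refresh is unacceptable
    if pii_must_not_land and mixed_ingest_paths:
        return "EtLT"                     # governance on the way in, modeling once staged
    if pii_must_not_land or destination_compute_limited:
        return "ETL"                      # strip before landing / spare the destination
    if logic_changes_weekly:
        return "ELT"                      # analyst-friendly SQL iteration
    return "ELT"                          # cheap storage + replayable raw is the default tiebreak
```

Its value in an interview is the precedence order, not the code: SLA first, governance second, cost and iteration speed last.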
Failure modes by pattern
Pattern choice does not eliminate incidents; it shifts where they happen. Knowing the failure shape per pattern is what separates a senior interview answer from a junior one.
The warehouse is the engine, so the warehouse is the on-call surface.
- Warehouse credit spike. A new dbt model joins two huge bronze tables without clustering. The nightly run takes 6 hours instead of 40 minutes and burns through the monthly credit budget by Tuesday.
- dbt incremental drift. Someone changed the unique_key on an incremental model without forcing a full refresh. New rows merge correctly; old rows are stale. The drift is silent until a finance reconciliation breaks weeks later.
- Late-arriving facts. A payment processor backfills yesterday's transactions at 03:00 UTC today. Your incremental model already ran at 02:30 UTC and shipped wrong numbers to a CFO dashboard. Lookback windows and watermarks are not optional.
- PII leak via raw query. An analyst with bronze read access opens the customer table directly. Nothing is hashed at this layer; row-level security policies are only on the silver views. A privacy audit catches it before the regulator does, if you are lucky.
- Result-cache invalidation cascade. A schema change at the bronze layer invalidates the result cache for every downstream model. The next refresh costs 10x normal because nothing hits cache.
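The late-arriving-facts item is usually mitigated with a lookback window: instead of loading only rows past the last watermark, re-merge everything inside a trailing window so backfilled facts overwrite stale copies. A toy in-memory version, with illustrative column names and an assumed 3-day window:

```python
from datetime import date, timedelta

LOOKBACK_DAYS = 3  # assumption: sized to the source's worst observed backfill lag

def incremental_merge(target: dict, source_rows: list[dict],
                      high_watermark: date) -> dict:
    """Upsert all source rows inside the lookback window; late rows win."""
    window_start = high_watermark - timedelta(days=LOOKBACK_DAYS)
    for row in source_rows:
        if row["event_date"] >= window_start:   # inside the lookback window
            target[row["txn_id"]] = row         # keyed upsert replaces stale copies
    return target
```

dbt incremental models express the same idea in SQL with an `is_incremental()` filter widened by the window; the Python here is just the shape of the fix.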
External compute and stage tables introduce a different blast radius.
- Stage-table corruption. A Spark job dies mid-write. The stage table has half a partition; the downstream load reads it and inserts a duplicate or a partial row. The fix requires idempotent stage writes (write to a tmp prefix, then atomic rename), which not every pipeline does.
- Ingestion bottleneck. The transformation cluster is sized for steady-state. A holiday traffic spike triples the volume. The Spark cluster cannot autoscale fast enough, the warehouse load is delayed, downstream SLAs miss.
- Schema-evolution silent drops. The source adds a new column. The ETL mapping is static and ignores it. Six months later someone notices product analytics is missing a critical field that has been there the whole time, just not in the warehouse.
- Stale logic in code. The Spark transform was written 4 years ago in PySpark 2.4. Upgrading the cluster to Spark 3.5 breaks half the UDFs. Nobody owns the code anymore; nobody can confidently rewrite it.
- No replay path. The source overwrote yesterday's data. The ETL job already discarded raw. To backfill a corrected calculation you must re-extract from a source that no longer holds the original snapshot.
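The idempotent-stage-write fix from the corruption item above fits in a few lines. This sketch uses a local filesystem as a stand-in for an object-store prefix, where the "rename" would really be a copy-then-delete or a table-format commit:

```python
import os
import tempfile

def write_stage_partition(data: bytes, final_path: str) -> None:
    """Write to a tmp path, then atomically rename so readers never see a partial file."""
    stage_dir = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=stage_dir, prefix="_tmp_")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())          # durable before it becomes visible
        os.replace(tmp_path, final_path)  # atomic on POSIX filesystems
    except BaseException:
        os.unlink(tmp_path)               # a crashed write leaves no partial partition
        raise
```

Because the final path only ever holds a complete file, a retried job can overwrite it safely, which is what makes the stage write idempotent.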
Where ETL/ELT meets dimensional modeling
The pattern choice ripples into how you handle slowly changing dimensions. ELT typically uses dbt snapshots over retained raw; ETL typically computes SCD Type 2 in the staging layer before load. Both work; the failure shapes differ.
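Whichever layer computes it, the core SCD Type 2 move is the same: close the current version and open a new one. A minimal in-memory sketch, with illustrative column names and an assumed open-row sentinel date:

```python
from datetime import date

OPEN = date(9999, 12, 31)   # assumption: sentinel end date marking the current row

def scd2_apply(history: list[dict], incoming: dict, as_of: date) -> list[dict]:
    """Close the open version for this key if the tracked attribute changed, then append."""
    for row in history:
        if row["customer_id"] == incoming["customer_id"] and row["end_date"] == OPEN:
            if row["address"] == incoming["address"]:
                return history                      # no change, nothing to do
            row["end_date"] = as_of                 # close the current version
            break
    history.append({**incoming, "start_date": as_of, "end_date": OPEN})
    return history
```

dbt snapshots generate the SQL equivalent of this merge over retained raw; an ETL staging layer runs the same logic before load. The failure shapes differ, the math does not.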
Frequently asked questions
What does ETL stand for?
What does ELT stand for?
What is EtLT?
Is ELT always better than ETL?
Do data engineering interviews ask about ETL vs ELT?
Does dbt force ELT?
How do I cost-compare ETL and ELT?
Where do PII and GDPR fit in this decision?
Practice the patterns
Four real interview challenges that touch ETL/ELT decisions: PII reconciliation, dedup, large-scale streaming joins, and SCD modeling.
Two versions of the same truth.
Same email, different rows. Spot the repeats.
Billions of clicks. One tiny code. Two very different clocks.
She moved. She upgraded. She became someone new. The record has to keep up.
Practice the SQL behind ETL and ELT
DataDriven covers SQL, Python, data modeling, and pipeline architecture with hands-on challenges. Build the muscle memory interviewers actually grade on.