Pipeline Architecture
ETL and ELT are the two main approaches to moving data from source systems into a warehouse: ETL transforms data before loading it, while ELT loads raw data first and transforms it inside the warehouse. System design is a small slice of DE interview rounds, but pipeline architecture questions in the data modeling segment frequently test this trade-off.
Know the trade-offs, not just the acronyms. SQL is the most-tested skill in DE interviews, and your answer should make clear whether transformations run as SQL inside the warehouse or in a separate layer outside it.
Extract → Transform → Load
Data is pulled from sources, reshaped and cleaned on external servers, then loaded into the warehouse in its final form. The warehouse only stores processed data.
Extract → Load → Transform
Data is pulled from sources and loaded raw into the warehouse. Transformations happen inside the warehouse using SQL. Raw data is preserved for flexibility.
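To make the distinction concrete, here is a minimal sketch in Python. It uses an in-memory SQLite database as a stand-in warehouse, and the table names and toy records are invented for illustration. In the ETL path, filtering and type casting happen in Python before anything is loaded; in the ELT path, the raw rows land first and a SQL statement does the same work inside the database.

```python
import sqlite3

# Toy source records (invented for illustration).
raw_orders = [
    {"id": 1, "amount": "19.99", "status": "shipped"},
    {"id": 2, "amount": "5.00", "status": "cancelled"},
]

# --- ETL: transform outside the warehouse, load only the final form ---
etl_wh = sqlite3.connect(":memory:")  # stand-in for the warehouse
etl_wh.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
clean = [(r["id"], float(r["amount"]))        # cast and filter in Python...
         for r in raw_orders if r["status"] != "cancelled"]
etl_wh.executemany("INSERT INTO orders VALUES (?, ?)", clean)  # ...then load

# --- ELT: load raw first, transform with SQL inside the warehouse ---
elt_wh = sqlite3.connect(":memory:")
elt_wh.execute("CREATE TABLE raw_orders (id INTEGER, amount TEXT, status TEXT)")
elt_wh.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
                   [(r["id"], r["amount"], r["status"]) for r in raw_orders])
elt_wh.execute("""
    CREATE TABLE orders AS
    SELECT id, CAST(amount AS REAL) AS amount
    FROM raw_orders
    WHERE status != 'cancelled'
""")
```

Both paths end with the same `orders` table, but only the ELT warehouse still holds `raw_orders` for later re-use.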
Approach
ETL: Transform before loading. Data is cleaned, joined, and reshaped in a staging area, then loaded into the warehouse in its final form.
ELT: Load first, transform later. Raw data lands in the warehouse, then SQL or dbt models reshape it into analytics-ready tables.

Storage
ETL: Lower warehouse storage. Only transformed data lives in the destination; raw data stays in the source or is discarded.
ELT: Higher warehouse storage. You keep raw data alongside transformed tables. Storage is cheap in modern cloud warehouses, so this is usually acceptable.

Flexibility
ETL: Less flexible. If you need to change a transformation, you re-run the pipeline from scratch; adding a new column means updating the ETL job.
ELT: More flexible. Raw data is preserved, so you can re-transform it at any time without re-extracting from the source. New questions can be answered from existing raw data.

Tooling
ETL: Traditional tools such as Informatica, Talend, SSIS, or custom Python scripts. The transformation engine sits outside the warehouse.
ELT: The modern stack: dbt and SQL-based transforms inside the warehouse. The warehouse itself is the transformation engine.

Compute cost
ETL: Compute cost sits in the ETL layer (your servers or a managed service). Warehouse costs are lower because less data is stored.
ELT: Compute cost sits in the warehouse (you pay for query processing). Warehouse costs are higher, but you avoid maintaining a separate transformation layer.

When to use
ETL: When the warehouse has limited compute power, when data privacy requires filtering sensitive fields before they reach the warehouse, or when transformations are well-defined and stable.
ELT: When using a cloud warehouse with scalable compute, when transformation logic changes frequently, or when analysts need raw data for ad-hoc analysis. This is the default for most modern data teams.
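The flexibility trade-off is easiest to see in code. A sketch, again using in-memory SQLite as a stand-in warehouse with invented table names: because the raw table is preserved, changing the transformation is a matter of re-running SQL, never re-extracting from the source system.

```python
import sqlite3

wh = sqlite3.connect(":memory:")  # stand-in warehouse, raw data already loaded
wh.execute("CREATE TABLE raw_orders (id INTEGER, amount TEXT, status TEXT)")
wh.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
               [(1, "19.99", "shipped"), (2, "5.00", "cancelled")])

def rebuild_orders(conn, transform_sql):
    """Re-derive the analytics table from the preserved raw table."""
    conn.execute("DROP TABLE IF EXISTS orders")
    conn.execute(f"CREATE TABLE orders AS {transform_sql}")

# v1 of the model: exclude cancelled orders.
rebuild_orders(wh, "SELECT id, CAST(amount AS REAL) AS amount "
                   "FROM raw_orders WHERE status != 'cancelled'")

# Requirements change: analysts now want cancelled orders too, flagged.
# Re-run new SQL against the raw table -- no ETL job to redeploy.
rebuild_orders(wh, "SELECT id, CAST(amount AS REAL) AS amount, "
                   "status = 'cancelled' AS is_cancelled FROM raw_orders")
```

In an ETL pipeline, the same change would mean editing the external job and replaying the extraction.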
The shift from ETL to ELT mirrors the shift from on-premise to cloud data warehouses. Understanding the history helps you explain trade-offs in interviews.
The on-premise era: Warehouses had limited compute and expensive storage. It made sense to transform data on cheaper application servers before loading only the clean, final result. Tools like Informatica and SSIS were built for this model.
The cloud era: Cloud warehouses offered massive parallel compute and cheap storage. Suddenly, the warehouse itself could handle transformations faster than external servers. The bottleneck shifted from warehouse capacity to pipeline complexity.
The dbt era: dbt made SQL-based transformation a first-class workflow. Teams started loading raw data into the warehouse and writing version-controlled SQL models to transform it. This gave analysts more flexibility and made pipelines easier to debug.
Pipeline trade-offs surface frequently in the data modeling segment of DE interviews. Interviewers want to hear trade-off reasoning, not textbook definitions.
When should you choose ETL? When data must be filtered for privacy before reaching the warehouse (PII stripping, GDPR compliance). When the warehouse is on-premise with limited compute. When transformations are stable and well-defined.
When should you choose ELT? When using a cloud warehouse with elastic compute. When the analytics team needs flexibility to re-transform data. When you want version-controlled SQL transformations (dbt). This is the more common answer in modern data engineering interviews.
Can you combine the two? Yes. A common pattern: light ETL for data cleansing and PII removal, then ELT for business logic transformations inside the warehouse. Mention this hybrid approach to show practical experience.
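A sketch of that hybrid shape, with invented field names and in-memory SQLite standing in for the warehouse: a light Python step strips PII during extraction, the sanitized raw rows are loaded as-is, and business logic runs as SQL inside the warehouse.

```python
import sqlite3

def strip_pii(record):
    # Light ETL step: sensitive fields never reach the warehouse.
    return {k: v for k, v in record.items() if k not in {"email", "ssn"}}

# Toy source records (invented for illustration).
source_rows = [
    {"id": 1, "email": "a@example.com", "ssn": "123-45-6789", "amount": "10.00"},
    {"id": 2, "email": "b@example.com", "ssn": "987-65-4321", "amount": "3.50"},
]

wh = sqlite3.connect(":memory:")  # stand-in warehouse
wh.execute("CREATE TABLE raw_orders (id INTEGER, amount TEXT)")
for row in source_rows:
    wh.execute("INSERT INTO raw_orders VALUES (:id, :amount)", strip_pii(row))

# ELT step: business logic lives in SQL inside the warehouse,
# where it can be version-controlled as a dbt-style model.
wh.execute("""
    CREATE TABLE order_totals AS
    SELECT COUNT(*) AS n_orders, SUM(CAST(amount AS REAL)) AS revenue
    FROM raw_orders
""")
```

The privacy-sensitive transform happens once, upstream; everything downstream keeps ELT's re-transform flexibility.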