The modern data stack is a cloud-native, ELT-first, modular approach to data architecture. Instead of one monolithic platform, you use specialized tools for ingestion, warehousing, transformation, orchestration, and business intelligence. Each tool does one thing well and connects to the others through standard interfaces.
This page covers the five core components, the evolution from legacy ETL, the current consolidation trend, and how interviewers test your understanding of modern data architecture.
Core Components
Source: DataDriven analysis of 1,042 verified data engineering interview rounds.
Every modern data stack has these five layers. The specific tools vary by organization, but the architecture pattern is consistent. Data flows from left to right: sources into ingestion, ingestion into the warehouse, warehouse through transformation, orchestration coordinating everything, and BI serving the output.
Ingestion tools move data from source systems into a central store. In the modern data stack, ingestion is typically managed by dedicated tools that connect to APIs, databases, SaaS applications, and event streams. They handle schema detection, incremental loading, change data capture (CDC), and error recovery. The key principle: ingestion tools should be configuration-driven, not code-heavy. Define the source, the destination, and the sync schedule. The tool handles the rest.
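To make "incremental loading" concrete, here is a minimal sketch of the cursor-based sync a managed connector performs under the hood. It assumes a source whose rows carry a monotonically increasing `updated_at` field; all names (`Connector`, `cursor_field`) are illustrative, not any vendor's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Connector:
    """Toy incremental sync: only rows newer than the saved cursor are loaded."""
    cursor_field: str = "updated_at"
    state: dict = field(default_factory=dict)  # persisted between sync runs

    def sync(self, source_rows, destination):
        high_water = self.state.get("cursor", 0)
        # Incremental load: skip rows already seen on a previous run.
        new_rows = [r for r in source_rows if r[self.cursor_field] > high_water]
        destination.extend(new_rows)  # append-only load into the destination
        if new_rows:
            self.state["cursor"] = max(r[self.cursor_field] for r in new_rows)
        return len(new_rows)
```

The first sync pulls everything; subsequent syncs pull only rows whose cursor value exceeds the saved high-water mark, which is why re-running a connector does not duplicate data.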
The shift from custom ingestion scripts to managed connectors is one of the defining changes of the modern stack. A data engineer in 2018 spent weeks writing a Python script to pull data from Salesforce, handle pagination, manage OAuth tokens, and deal with rate limits. In 2026, they configure a connector in a UI and it handles those same concerns automatically. This frees engineers to work on transformation and modeling instead of plumbing. The tradeoff: managed connectors are opinionated about schema mapping and sync frequency. When you need custom logic (like filtering events at the source), you may still need code.
The warehouse is the central compute and storage layer. Unlike traditional on-prem databases, cloud warehouses separate storage from compute, enabling independent scaling. Load 10TB of data once, then spin up as many compute clusters as you need for different workloads (reporting, ad-hoc analysis, ML training) without copying the data. This architecture eliminated the capacity planning bottleneck that plagued on-prem warehouses.
Three warehouses dominate the market: Snowflake (multi-cloud, usage-based pricing, strong governance), BigQuery (serverless, auto-scaling, tight Google Cloud integration), and Redshift (AWS-native, RA3 instances for storage-compute separation). Each has different pricing models, concurrency handling, and ecosystem integrations. In interviews, knowing the architectural differences between these three shows you think about platform selection, not just SQL. Databricks with Delta Lake blurs the warehouse/lakehouse boundary and is increasingly common in ML-heavy organizations.
Transformation in the modern stack follows the ELT pattern: load raw data first, then transform it inside the warehouse using SQL. dbt (data build tool) is the standard transformation tool. It lets you write SELECT statements that define transformations, and it handles materialization (table, view, incremental), dependency resolution, testing, and documentation. The 'T in ELT' is where data engineers spend most of their modeling time.
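A toy sketch can show what dbt's core loop looks like: each model is a SELECT statement, `ref('name')` declares a dependency on another model, and the tool materializes models as tables in dependency order. This is a simplification under stated assumptions (sqlite stands in for the warehouse, and only table materialization is shown), not dbt's actual implementation.

```python
import re
import sqlite3

# Each model is a SELECT; {{ ref('other_model') }} declares a dependency.
MODELS = {
    "stg_orders": "SELECT id, amount FROM raw_orders WHERE amount > 0",
    "order_totals": "SELECT COUNT(*) AS n, SUM(amount) AS total "
                    "FROM {{ ref('stg_orders') }}",
}

REF = r"\{\{\s*ref\('(\w+)'\)\s*\}\}"

def build(conn):
    """Materialize every model as a table, parents before children."""
    done, order = set(), []

    def visit(name):
        if name in done:
            return
        for dep in re.findall(REF, MODELS[name]):  # resolve dependencies first
            visit(dep)
        sql = re.sub(REF, r"\1", MODELS[name])      # ref('x') -> table name x
        conn.execute(f"CREATE TABLE {name} AS {sql}")  # table materialization
        done.add(name)
        order.append(name)

    for name in MODELS:
        visit(name)
    return order
```

Running `build` against a warehouse containing `raw_orders` creates `stg_orders` first and `order_totals` second, because the dependency graph is derived from the `ref()` calls rather than declared by hand.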
dbt changed data engineering culture as much as it changed tooling. It brought software engineering practices to analytics: version control, code review, testing, CI/CD, and documentation are now standard in transformation workflows. Before dbt, transformations lived in stored procedures, Airflow operators, or Jupyter notebooks with no testing or version control. The cultural shift toward 'analytics as code' is arguably more impactful than the tool itself. Alternatives to dbt exist (SQLMesh, Dataform, custom Python), but dbt's ecosystem (packages, community, hiring market) makes it the default choice for most teams.
Orchestration tools schedule and coordinate pipeline tasks. They define the DAG (directed acyclic graph) of dependencies: ingest data, then transform, then run quality checks, then update dashboards. When a task fails, the orchestrator handles retries, alerts, and dependency blocking (do not run downstream tasks if upstream failed). Orchestration is the control plane of your data platform.
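The control-plane behavior described above (retries, then blocking downstream tasks when an upstream task ultimately fails) can be sketched in a few lines. This is a hypothetical minimal scheduler for illustration, not any real orchestrator's API.

```python
def run_dag(tasks, deps, max_retries=2):
    """tasks: {name: callable}; deps: {name: [upstream names]}.
    Returns a status per task: success, failed, or skipped."""
    status = {}

    def run(name):
        if name in status:
            return
        for up in deps.get(name, []):
            run(up)  # upstream tasks run first (DAG order)
        if any(status[up] != "success" for up in deps.get(name, [])):
            status[name] = "skipped"  # dependency blocking: upstream failed
            return
        for _attempt in range(max_retries + 1):
            try:
                tasks[name]()
                status[name] = "success"
                return
            except Exception:
                continue  # retry the task up to max_retries times
        status[name] = "failed"

    for name in tasks:
        run(name)
    return status
```

A task that fails transiently succeeds on retry, while a task that keeps failing is marked failed and its downstream tasks are skipped rather than run against missing data.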
Airflow is the most widely deployed orchestrator, but it is showing its age. Its scheduler is pull-based (polls for tasks), its UI is dated, and DAG authoring in Python is verbose. Newer tools (Dagster, Prefect, Mage) offer asset-based orchestration (define what data you want, not just what tasks to run), better local development experience, and built-in observability. In interviews, knowing Airflow is still important because most companies use it. But understanding why newer tools are gaining traction shows you think about the evolution of the space.
BI tools sit at the top of the stack and serve data to business users. They connect to the warehouse, let users build dashboards and reports, and (in the modern stack) push metric definitions down to a semantic layer. The trend is toward lighter, SQL-native BI tools that treat the warehouse as the compute engine instead of importing data into their own cache.
The BI layer is where the modern data stack meets the business. If the BI tool is slow, confusing, or shows stale data, none of the upstream engineering matters. Modern BI tools support live queries against the warehouse (no data extraction), version-controlled dashboards, embedded analytics (dashboards inside your product), and semantic layers (consistent metric definitions). Looker pioneered the 'metrics as code' approach with LookML. Newer tools like Lightdash and Hashboard extend this pattern. Traditional tools like Tableau and Power BI remain dominant by user count but are adding cloud-native features to compete.
The modern data stack did not appear overnight. It evolved over a decade as cloud infrastructure matured, warehouses became elastic, and the data engineering community adopted software engineering practices. Understanding this evolution helps you answer interview questions about why the modern stack exists and where it is heading.
In the legacy era, data lived in on-prem databases with limited storage and compute. Transformations happened before loading (ETL) because warehouse resources were scarce and expensive. Data engineers wrote stored procedures, SSIS packages, or Informatica workflows. Schema changes required DBA approval and multi-week migration cycles. Testing was manual. Version control was rare. If the ETL job failed at 3 AM, the on-call engineer connected to a VPN and restarted it manually.
Organizations moved to cloud warehouses (Redshift, then BigQuery and Snowflake). Storage became cheap. Compute became elastic. The ETL bottleneck shifted: you no longer needed to minimize what you loaded, so the industry moved toward ELT. Load everything raw, then transform inside the warehouse. Airflow replaced cron jobs. Python scripts replaced stored procedures. But many organizations just lifted-and-shifted their legacy patterns into the cloud without rethinking their architecture.
The modern data stack emerged as a modular, best-of-breed architecture. Fivetran for ingestion. Snowflake for storage and compute. dbt for transformation. Airflow for orchestration. Looker for BI. Each component was best-in-class and connected through standard interfaces (SQL, APIs, JDBC). This composability was the core innovation: swap any component without rebuilding the whole stack. The downside: integrating 6+ tools creates its own complexity.
The pendulum is swinging toward consolidation. Running 8 different SaaS tools with 8 different billing models and 8 different integration points is expensive and fragile. Platforms like Databricks and Snowflake are expanding to cover ingestion, transformation, orchestration, and ML in a single platform. dbt is adding semantic layer and metrics capabilities. The trend is not back to monolithic on-prem, but toward fewer, more integrated cloud-native platforms. The 'best-of-breed' era is being challenged by the 'best-of-suite' approach.
ELT over ETL. Load raw data first, transform inside the warehouse. Cloud warehouses have cheap storage and elastic compute, so there is no reason to transform before loading. Loading raw data preserves optionality: if you need a different transformation later, the raw data is already there. ETL permanently discards data that was filtered during the transform step.
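The optionality argument is easiest to see side by side. The sketch below (sqlite standing in for a cloud warehouse, with invented table names) loads every raw event, answers one question with SQL, and then, when a new question arrives later, answers it from the same raw table. Under ETL that filtered to purchases before loading, the click rows would already be gone.

```python
import sqlite3

# ELT: load everything raw, transform inside the warehouse with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (user TEXT, type TEXT, amount REAL)")
conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", [
    ("a", "purchase", 30.0),
    ("a", "click", 0.0),
    ("b", "click", 0.0),
    ("b", "purchase", 12.0),
])

# First transformation: revenue per user.
conn.execute("""CREATE TABLE revenue AS
    SELECT user, SUM(amount) AS total FROM raw_events
    WHERE type = 'purchase' GROUP BY user""")

# Months later, a new question: click counts per user. Because the
# raw table was preserved, this is just another SQL statement; no
# re-extraction from the source is needed.
conn.execute("""CREATE TABLE clicks AS
    SELECT user, COUNT(*) AS n FROM raw_events
    WHERE type = 'click' GROUP BY user""")
```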
Separation of storage and compute. Store data once. Spin up as many compute clusters as you need for different workloads. A reporting cluster, an ad-hoc analysis cluster, and an ML training cluster can all read the same data without copying it. Scale compute independently of storage. This was the architectural innovation that made cloud warehouses fundamentally different from on-prem databases.
SQL as the lingua franca. The modern stack standardized on SQL for transformation, analytics, and increasingly for ML feature engineering. dbt runs SQL. BI tools query SQL. Even streaming tools (ksqlDB, Flink SQL) are adding SQL interfaces. This lowers the barrier to entry: a data analyst who knows SQL can contribute to the transformation layer without learning Python or Scala.
Infrastructure as code. Transformations are version-controlled (dbt in Git). Orchestration DAGs are defined in code (Airflow, Dagster). Infrastructure is provisioned with Terraform or Pulumi. This enables CI/CD for data pipelines: run tests on a pull request, deploy to staging, validate, then promote to production. The manual, click-through approach of legacy tools is replaced by reproducible, auditable code.
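As a concrete example of "run tests on a pull request," here is a sketch of dbt-style data tests a CI job could execute: each check is a query that must return zero rows, and the build fails if any check returns one. The check names and table are illustrative, and sqlite again stands in for the warehouse.

```python
import sqlite3

# Each data test is a query that returns the offending rows;
# an empty result means the test passes.
CHECKS = {
    "orders_id_unique":
        "SELECT id FROM orders GROUP BY id HAVING COUNT(*) > 1",
    "orders_amount_not_null":
        "SELECT 1 FROM orders WHERE amount IS NULL",
}

def run_checks(conn):
    """Return the names of failing checks; CI fails if the list is non-empty."""
    return [name for name, sql in CHECKS.items()
            if conn.execute(sql).fetchone() is not None]
```

Because the checks live in version control next to the transformations, a pull request that breaks a uniqueness or not-null guarantee fails CI before it ever reaches production.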
Modularity with standard interfaces. Each tool in the stack communicates through standard interfaces: SQL, JDBC/ODBC, REST APIs, and file formats (Parquet, Avro). This lets you swap components without rebuilding the entire stack. If you outgrow your ingestion tool, replace it. If you switch warehouses, your dbt models still work (with minor dialect adjustments). The modular approach reduces vendor lock-in, though the current consolidation trend is testing this principle.
These questions appear in system design rounds and general knowledge interviews. Interviewers test whether you understand the architecture holistically, not just individual tools.
How to approach this
Define the modern data stack as cloud-native, ELT-first, and modular. Contrast with legacy: on-prem, ETL (transform before load), monolithic tools. Key differences: storage-compute separation, managed connectors for ingestion, SQL-based transformation in the warehouse, and BI tools that query the warehouse directly. Mention the tradeoff: modularity gives flexibility but adds integration complexity. Show you understand the current consolidation trend where platforms are expanding their scope.
How to approach this
Keep it simple. Managed ingestion tool for the 10 to 15 SaaS sources they probably use (Stripe, HubSpot, Segment). Snowflake or BigQuery as the warehouse (choose based on existing cloud provider). dbt for transformations. Start with dbt Cloud for orchestration instead of deploying Airflow (simpler, less infrastructure). Metabase or Looker for BI. Total: 4 tools, all managed services, no infrastructure to maintain. Explain why you chose managed over self-hosted: a 50-person startup does not have the bandwidth to operate Airflow and a Kubernetes cluster.
How to approach this
ELT: load raw data into the warehouse, then transform using SQL inside the warehouse. ETL: transform data before loading. ELT won because cloud warehouses have cheap storage and elastic compute, so there is no reason to minimize what you load. But ETL is still appropriate when: (1) the source data is too large to load raw (transform to reduce volume at the edge), (2) PII must be stripped before entering the warehouse for compliance, (3) the transformation requires non-SQL logic (Python, ML models) that runs better outside the warehouse.
How to approach this
Four limitations: (1) tool sprawl (8+ tools with different billing, auth, and integration points), (2) cost unpredictability (usage-based pricing on warehouse and ingestion tools can spike), (3) real-time gaps (the modern stack is batch-first; streaming requires a parallel architecture), (4) operational complexity (each tool needs monitoring, upgrades, and incident response). The market is responding: platforms are consolidating, streaming capabilities are being added, and FinOps tools help manage costs.
System design questions test your understanding of the full data stack. Practice designing end-to-end pipelines on DataDriven.