In our 1,042-round corpus, tool-specific questions appear in roughly 22% of rounds. The top five tools show up more often than the next twenty combined: Airflow in 14% of rounds, dbt in 9%, Snowflake in 8%, Spark in 7%, Kafka in 5%. We ranked every category on this page by how often each tool actually comes up in interviews and production job postings. Skip the rest until you need it.
Source: DataDriven analysis of 1,042 verified data engineering interview rounds.
Orchestrators appear in 14% of verified DE interview rounds, almost entirely framed as Airflow; within those rounds, 9% also mention Dagster or Prefect as a comparison. If you know one orchestrator cold, you can handle 95% of questions in this category.
Airflow is the industry standard for workflow orchestration. Pipelines are defined as Python DAGs (directed acyclic graphs). It has a rich UI for monitoring, an extensive operator library for integrating with external systems, and a large community. The main downsides are a complex deployment (scheduler, webserver, metadata database, workers) and a task-centric model that can feel clunky for modern data workflows.
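Airflow's core abstraction, the DAG, can be illustrated without Airflow itself: the scheduler runs tasks in dependency order, which is a topological sort. A stdlib-only sketch of that idea (the task names are hypothetical, and this is not Airflow's API):

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: extract -> {transform, validate} -> load.
# Each key maps to the set of tasks it depends on, mirroring
# Airflow's `upstream >> downstream` wiring.
dag = {
    "transform": {"extract"},
    "validate": {"extract"},
    "load": {"transform", "validate"},
}

# static_order() yields tasks so every dependency runs before its dependents.
order = list(TopologicalSorter(dag).static_order())
print(order)  # "extract" always comes first, "load" always last
```

Airflow layers scheduling, retries, and state tracking on top, but dependency resolution is exactly this.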
Dagster is a modern orchestrator built around the concept of “software-defined assets.” Instead of defining tasks that run in order, you define the data assets you want to exist and Dagster figures out the execution plan. It offers built-in testing, an excellent local development experience, and native support for incremental processing. The job market is smaller than Airflow's but growing quickly.
Prefect is a Python-native orchestrator with a managed cloud option. Pipelines are regular Python functions decorated with @flow and @task. It is simpler to set up than Airflow, with dynamic workflows and built-in retry logic. A good fit for teams that want orchestration without managing infrastructure.
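The built-in retry logic mentioned above amounts to re-invoking a task function when it fails. A stdlib-only sketch of that behavior (this is a hypothetical stand-in, not Prefect's actual decorator):

```python
import functools
import time

def task(retries=3, delay=0.0):
    """Hypothetical @task-style decorator that retries on failure."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == retries:
                        raise          # out of retries: surface the error
                    time.sleep(delay)  # back off before the next attempt
        return wrapper
    return decorate

calls = {"n": 0}

@task(retries=2)
def flaky_extract():
    # Simulate a source that fails twice before succeeding.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "rows"

result = flaky_extract()
print(result)  # "rows" — succeeds on the third attempt
```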
An orchestrator with a notebook-like UI for building pipelines interactively. Each block in a pipeline is editable and testable in the browser. Good for teams transitioning from Jupyter notebooks to production pipelines. Newer and smaller community than the others.
Transformation tools clean, reshape, and model data after it has been ingested. The choice between SQL-first tools (dbt) and code-first tools (Spark) depends on where your data lives and how complex your transformations are.
dbt lets you write SQL SELECT statements and turns them into tables and views in your warehouse. It handles dependency ordering, testing, documentation, and incremental materialization. dbt has become the standard for SQL-based transformation in the warehouse. Almost every modern data team uses it.
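What "turns SELECT statements into tables" means is conceptually simple: wrap the model's query in a CREATE TABLE AS against the warehouse. A toy illustration using sqlite3 (the table and model names are hypothetical, and this is not dbt's API or compiler):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, status TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?)",
                 [(1, "paid"), (2, "refunded"), (3, "paid")])

# A "model" is just a SELECT; materializing it as a table means
# wrapping it in CREATE TABLE AS and running it in the warehouse.
model_sql = "SELECT status, COUNT(*) AS n FROM raw_orders GROUP BY status"
conn.execute(f"CREATE TABLE stg_orders AS {model_sql}")

rows = dict(conn.execute("SELECT status, n FROM stg_orders").fetchall())
print(rows)  # counts per status: paid=2, refunded=1
```

dbt's value is everything around this step: resolving model dependencies, running tests, generating docs, and rewriting the query for incremental runs.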
A distributed compute engine for processing large datasets. Supports Python (PySpark), Scala, SQL, and R. Spark processes data in memory across a cluster of machines, making it fast for large-scale transformations that do not fit in a single warehouse query. Used for raw file processing, complex Python transformations, and ML feature engineering.
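Spark's execution model (partition the data, transform each partition in parallel, combine the results) can be sketched in plain Python without Spark. This is only an illustration of the model, not PySpark's API:

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, n):
    """Split data into n roughly equal chunks, as Spark partitions a dataset."""
    k, m = divmod(len(data), n)
    return [data[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]

def map_partition(chunk):
    # Per-partition work (here: filter evens and square them),
    # the kind of task a Spark executor runs on its slice of the data.
    return [x * x for x in chunk if x % 2 == 0]

data = list(range(10))
with ThreadPoolExecutor(max_workers=4) as pool:
    results = pool.map(map_partition, partition(data, 4))

# "Collect": merge per-partition results back on the driver.
squares = [x for chunk in results for x in chunk]
print(squares)  # [0, 4, 16, 36, 64]
```

The difference in Spark is that the partitions live on different machines and the engine plans shuffles, spills, and fault recovery for you.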
SQLMesh is a SQL transformation framework with built-in change management, automatic column-level lineage, and virtual environments for testing changes without creating new tables. It is newer than dbt with a smaller community, but addresses some of dbt's pain points around CI/CD and environment management.
Where your data lives determines query performance, cost, and what tools can access it. Most teams use a combination of object storage (cheap, durable) and a warehouse (fast queries).
Object storage (Amazon S3 and its equivalents) is cheap, durable, and infinitely scalable file storage. The foundation of every data lake. Store raw data, Parquet files, and backups here. It is not queryable directly without a compute engine (Athena, Spark, Trino) or a table format (Delta Lake, Iceberg).
Snowflake is a fully managed cloud data warehouse with separated storage and compute. It is known for easy scaling, zero-copy data sharing, and excellent support for semi-structured data (JSON, Avro, Parquet). The largest independent data warehouse vendor by market share.
BigQuery is a serverless data warehouse where you pay per query (or reserve slots for predictable pricing). No cluster management. Excellent for ad-hoc analysis and teams that want zero infrastructure overhead, with strong ML integration through BigQuery ML.
Open table formats that add ACID transactions, time travel, and schema evolution to files in object storage. Delta Lake is Databricks' format. Iceberg is the open standard with growing support across engines (Spark, Trino, Snowflake, BigQuery). These formats turn a data lake into a lakehouse.
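The core trick behind both formats is a versioned manifest: each commit writes immutable data files plus a new snapshot listing which files make up the table, so older snapshots stay readable (time travel) and readers never see a half-finished commit. A toy sketch of that idea (this is a simplification, not the real Delta or Iceberg protocol):

```python
class ToyTable:
    """Minimal snapshot-based table, illustrating the lakehouse table-format idea."""
    def __init__(self):
        self.files = {}      # immutable "data files": name -> rows
        self.snapshots = []  # each snapshot lists the files visible at that version

    def commit(self, name, rows):
        # Data files are append-only; a commit publishes a new snapshot
        # that is the previous file list plus the new file.
        self.files[name] = rows
        current = self.snapshots[-1] if self.snapshots else []
        self.snapshots.append(current + [name])

    def read(self, version=None):
        # Readers resolve a snapshot first, then read only its files.
        snap = self.snapshots[version if version is not None else -1]
        return [row for f in snap for row in self.files[f]]

t = ToyTable()
t.commit("part-0.parquet", [{"id": 1}])
t.commit("part-1.parquet", [{"id": 2}])
print(t.read())           # latest version: both rows
print(t.read(version=0))  # time travel: only the first file
```

Deletes and schema evolution work the same way in the real formats: new metadata, never in-place mutation of data files.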
Streaming tools process data as it arrives instead of in batches. Not every team needs streaming, but companies with real-time requirements (fraud detection, live dashboards, event-driven architectures) rely on these tools.
A distributed event streaming platform. Kafka acts as a durable, high-throughput message bus between producers (services that generate events) and consumers (services that process them). It is the backbone of event-driven architectures at companies of all sizes. Confluent offers a managed cloud version.
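Kafka's model is an append-only log per partition, with each consumer group tracking its own offset, so independent consumers read the same events without interfering. A stdlib-only sketch of that offset model (toy and in-memory, not Kafka's client API):

```python
class ToyTopic:
    """Single-partition, in-memory log with per-group offsets, Kafka-style."""
    def __init__(self):
        self.log = []      # the durable, ordered event log
        self.offsets = {}  # consumer group -> next offset to read

    def produce(self, event):
        self.log.append(event)  # producers only ever append

    def consume(self, group, max_events=10):
        start = self.offsets.get(group, 0)
        batch = self.log[start:start + max_events]
        self.offsets[group] = start + len(batch)  # "commit" the new offset
        return batch

topic = ToyTopic()
topic.produce({"user": "a", "action": "click"})
topic.produce({"user": "b", "action": "view"})

print(topic.consume("dashboard"))  # both events
print(topic.consume("dashboard"))  # [] — nothing new for this group
print(topic.consume("fraud"))      # independent group re-reads from offset 0
```

This is why Kafka can fan the same stream out to a dashboard, a fraud system, and a warehouse loader at once: the log is shared, the offsets are not.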
A stream processing framework for real-time computation. Flink processes events with exactly-once semantics, supports complex event processing (windowed aggregations, joins across streams), and handles both batch and streaming workloads. More complex to operate than Kafka but necessary for stateful stream processing.
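The windowed aggregations mentioned above can be illustrated with a tumbling (fixed, non-overlapping) window in plain Python. The event data is hypothetical, and this ignores the hard parts Flink handles (state backends, late events, exactly-once checkpointing):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) over fixed, non-overlapping windows."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = ts - (ts % window_seconds)  # bucket the timestamp
        counts[(window_start, key)] += 1
    return dict(counts)

# (epoch_seconds, event_key) pairs — hypothetical clickstream
events = [(100, "checkout"), (105, "checkout"), (119, "login"), (121, "checkout")]
print(tumbling_window_counts(events, window_seconds=60))
# window [60, 120): 2 checkouts and 1 login; window [120, 180): 1 checkout
```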
Cloud-managed alternatives to self-hosted Kafka. Simpler to set up and operate but with less flexibility and higher per-GB costs at scale. Good for teams that want streaming without the operational burden of managing Kafka clusters.
BI tools are how business users consume the data that pipelines produce. Data engineers need to understand BI tools because their query patterns, refresh schedules, and data model requirements directly affect how you design the warehouse layer.
Looker is a BI platform with a modeling language (LookML) that defines metrics and dimensions once and reuses them across dashboards. Strong for teams that want a single source of metric definitions. Acquired by Google and tightly integrated with BigQuery.
Tableau is the most widely used enterprise BI tool, known for powerful visual analytics, a drag-and-drop interface, and a strong community. It connects to every major database. Acquired by Salesforce. It is heavy on the analyst side; data engineers interact with it through data source optimization and extract scheduling.
Open-source BI tools that run on your own infrastructure. Metabase is simpler and better for non-technical users. Superset (Apache) is more powerful and customizable. Both are good for startups and teams that want BI without licensing costs.
Ingestion tools extract data from source systems (databases, APIs, SaaS tools) and load it into your warehouse or lake. Build vs buy is a key decision here: managed tools save engineering time but cost money per connector.
Managed data connectors that replicate data from SaaS tools, databases, and APIs into your warehouse. Fivetran is fully managed (no infrastructure to run). Airbyte is open source with a cloud-hosted option. Both handle schema changes, incremental loading, and CDC. The main trade-off is cost (Fivetran charges per row synced) vs control (Airbyte gives you the code).
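The incremental loading these connectors do boils down to a cursor (high-watermark) column: each sync pulls only rows whose cursor exceeds the last value seen, then stores the new maximum. A toy sketch of that loop (the row shapes and state format are hypothetical, not either tool's internals):

```python
def incremental_sync(source_rows, state):
    """Pull only rows newer than the stored cursor, then advance the cursor."""
    cursor = state.get("cursor", 0)
    new_rows = [r for r in source_rows if r["updated_at"] > cursor]
    if new_rows:
        state["cursor"] = max(r["updated_at"] for r in new_rows)
    return new_rows

source = [
    {"id": 1, "updated_at": 10},
    {"id": 2, "updated_at": 20},
]
state = {}
print(incremental_sync(source, state))  # first sync: both rows

source.append({"id": 3, "updated_at": 30})
print(incremental_sync(source, state))  # second sync: only the new row
```

The catch, and a common interview follow-up, is that a plain cursor column misses hard deletes; that is one reason CDC (below) exists.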
An open-source change data capture (CDC) platform. Debezium reads database transaction logs (binlog in MySQL, WAL in PostgreSQL) and streams changes to Kafka. This gives you real-time replication of database changes without impacting source database performance. The standard tool for building CDC pipelines.
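Applying a CDC stream on the consumer side is mechanical: each change record carries an operation, a key, and the new row, and replaying them in commit order reconstructs the source table. A toy sketch of that replay (the record shapes are simplified, not Debezium's actual event envelope):

```python
def apply_changes(replica, change_log):
    """Replay ordered change events onto a key -> row replica table."""
    for change in change_log:
        op, key = change["op"], change["key"]
        if op in ("insert", "update"):
            replica[key] = change["row"]  # upsert the latest row image
        elif op == "delete":
            replica.pop(key, None)        # hard deletes are captured too
    return replica

# Hypothetical changes, in the order they appeared in the source's transaction log
log = [
    {"op": "insert", "key": 1, "row": {"email": "a@example.com"}},
    {"op": "update", "key": 1, "row": {"email": "a@new.com"}},
    {"op": "insert", "key": 2, "row": {"email": "b@example.com"}},
    {"op": "delete", "key": 2, "row": None},
]
replica = apply_changes({}, log)
print(replica)  # {1: {'email': 'a@new.com'}}
```

Because the events come from the transaction log rather than queries against the source, the replica stays current without adding load to the production database.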
Tool selection depends on team size, data volume, budget, and existing infrastructure. Here are rules of thumb.
For a small team: Fivetran or Airbyte for ingestion. Snowflake or BigQuery for storage. dbt for transformation. Managed Airflow (MWAA, Astronomer) or Dagster Cloud for orchestration. Metabase for BI. Minimize the infrastructure you manage.
For a mid-size team: same as above, plus Spark for heavy processing, Kafka for event streaming if needed, and a data catalog (Atlan, DataHub) for governance. Consider self-hosting Airflow if you need more customization than managed services offer.
At large scale: Databricks or a custom lakehouse for unified processing. Self-hosted Kafka for streaming. An internal data platform with self-service tools. Investment in monitoring, alerting, and cost management tooling. Data mesh or a domain-oriented architecture.
Tool questions make up 22% of rounds. The underlying skills make up the other 78%. Spend your prep time accordingly.