Scheduler, Webserver, Executor, Metastore
Airflow is the orchestrator that ties data pipelines together. If you are interviewing for a data engineering role, you will face questions about scheduler internals, DAG design patterns, executor tradeoffs, and what happens when tasks fail in production.
TL;DR: Airflow Interview Questions
Apache Airflow is the dominant Python-based workflow orchestrator for data engineering in 2026. Interviews focus on six areas: scheduler internals and DAG parsing, executor types (Local, Celery, Kubernetes), operators and task dependencies, idempotency and retries, XComs and TaskFlow API, and backfills, pools, and trigger rules.
For most roles you need to explain how the scheduler parses DAGs, when to choose CeleryExecutor vs KubernetesExecutor, how to make tasks idempotent, and how to handle dependencies between DAGs. Senior roles add capacity planning, dynamic task mapping (Airflow 2.3+), Datasets-based event-driven scheduling, and integration with Kubernetes pods for isolation.
Airflow Executors: Decision Matrix
One of the most-asked Airflow architecture questions. Know when to pick each executor and why.
| Executor | How tasks run | Best for | Cold start |
|---|---|---|---|
| Local | Subprocesses on the scheduler machine | Small deployments, dev environments | Instant |
| Celery | Distributed across persistent workers via Redis/Rabbit | Stable workloads with predictable resource needs | Fast (workers are warm) |
| Kubernetes | A new pod per task with dynamic resource allocation | Heterogeneous workloads, isolation, autoscaling | Slow (10-60s pod startup) |
| CeleryKubernetes | Hybrid: Celery for default queue, K8s for heavy tasks | Mixed workloads where most are light, some need isolation | Mixed |
Every problem comes from a real interview report. Run code in your browser.
What Interviewers Expect
Airflow questions reveal whether you have operated pipelines in production. The interview tests whether you understand what breaks, how to recover, and how to design DAGs that hold up under load.
Junior candidates should explain the architecture (scheduler, webserver, executor, metastore), write a basic DAG with dependencies, and understand what execution_date means.
Mid-level candidates need to discuss executor tradeoffs, implement idempotent tasks, handle cross-DAG dependencies, and explain backfill strategies.
Senior candidates must design DAG architectures for complex workflows, explain dynamic task generation versus mapped tasks, manage resource contention with pools, and describe CI/CD strategies for deploying DAGs safely.
Core Concepts Interviewers Test
Scheduler, Webserver, Executor, Metastore
The scheduler parses DAG files, creates DagRun and TaskInstance records, and sends tasks to the executor. The webserver serves the UI and reads from the metastore (Postgres or MySQL). The executor runs tasks: LocalExecutor uses multiprocessing, CeleryExecutor uses distributed workers, KubernetesExecutor spins up a pod per task. Interviewers expect you to name all four components and explain what fails when each one goes down.
DAG Authoring and Best Practices
A DAG is a Python file that defines tasks and their dependencies. Top-level code in a DAG file runs every time the scheduler parses it (every few seconds), so it must be fast and side-effect-free. Interviewers test whether you know to avoid database calls, API requests, or heavy imports at the module level.
Operators and Sensors
Operators define what a task does: PythonOperator runs a callable, BashOperator runs a shell command, SqlOperator runs a query. Sensors wait for an external condition (file exists, partition ready, API response). Interviewers care whether you use the right operator for the job and whether you understand reschedule mode vs poke mode for sensors.
XComs
XComs (cross-communications) let tasks pass small amounts of data to downstream tasks via the metastore. They are not designed for large data. Interviewers ask about XCom limitations: size limits (varies by backend), serialization overhead, and the anti-pattern of passing DataFrames through XCom instead of writing to object storage.
Trigger Rules
The default trigger rule is all_success: a task runs only when all upstream tasks succeed. Alternatives include all_failed, one_success, one_failed, none_failed, none_skipped, and always. Interviewers test whether you can design error-handling DAGs using trigger rules, such as running a cleanup task regardless of upstream success or failure.
TaskFlow API
TaskFlow (introduced in Airflow 2.0) uses the @task decorator to define tasks as Python functions. It automatically handles XCom serialization and dependency inference. Interviewers want to see you use TaskFlow for new DAGs and understand how it maps to the traditional operator model under the hood.
Airflow Interview Questions with Guidance
Explain the Airflow scheduler parsing loop. Why is it important that DAG files execute quickly?
The scheduler continuously scans the DAGs folder and imports each Python file to discover DAGs. This happens every scheduler_heartbeat_sec (default 5 seconds). If a DAG file takes 10 seconds to import (because it makes API calls or reads from a database at the top level), the scheduler falls behind. Slow parsing creates a backlog: DAGs are not scheduled on time, tasks pile up, and the entire system degrades. A strong answer mentions min_file_process_interval and dagbag_import_timeout as tuning controls.
Compare CeleryExecutor, KubernetesExecutor, and LocalExecutor. When would you choose each?
LocalExecutor runs tasks as subprocesses on the scheduler machine. Good for small deployments. CeleryExecutor distributes tasks to a pool of persistent workers via a message broker (Redis or RabbitMQ). Good for stable workloads with predictable resource needs. KubernetesExecutor creates a new pod for each task, providing perfect isolation and dynamic scaling. Good for heterogeneous workloads where tasks have different resource requirements. A strong answer mentions CeleryKubernetesExecutor as a hybrid and discusses the cold-start latency tradeoff of KubernetesExecutor.
A DAG runs daily but yesterday's run failed. Today's run is queued. What happens and how do you fix it?
By default, Airflow respects depends_on_past=False, so today's run will execute regardless. If depends_on_past=True on any task, that task waits until the same task in the previous DagRun succeeds. To fix the failed run: clear the failed task instances in the UI or CLI, which re-queues them. If the failure was transient, they will succeed on retry. A strong answer discusses wait_for_downstream, catchup behavior, and the difference between clearing a task (re-runs it) and marking it as success (skips it).
How do you handle dependencies between DAGs in Airflow?
Use TriggerDagRunOperator to start another DAG from a task. Use ExternalTaskSensor to wait for a task in another DAG to complete. Use Datasets (Airflow 2.4+) for event-driven triggering: a producer DAG marks a dataset as updated, and consumer DAGs with that dataset in their schedule automatically trigger. A strong answer notes that ExternalTaskSensor requires matching execution dates and discusses the coupling risk of cross-DAG dependencies.
What is the difference between execution_date and the actual time a DAG runs? Why does this confuse people?
execution_date (now called logical_date) marks the start of the data interval, not when the DAG runs. A daily DAG with execution_date 2026-01-15 runs at the end of that interval: 2026-01-16T00:00. So the DAG for 'today' runs 'tomorrow'. Interviewers ask this because it is the most common source of confusion in Airflow. A strong answer explains data_interval_start, data_interval_end, and how to use them in templates for idempotent queries.
How do you implement idempotent tasks in Airflow? Why is idempotency important?
Idempotent tasks produce the same result regardless of how many times they run. This is critical because Airflow retries failed tasks and operators can re-run tasks manually. Implementation: use MERGE/upsert instead of INSERT, write output to a date-partitioned path and overwrite the partition, or use DELETE + INSERT within a transaction. A strong answer gives a concrete example: a task that writes to S3 at s3://bucket/output/dt=2026-01-15/ and overwrites on re-run, versus one that appends to a single file and creates duplicates.
You need to backfill a DAG for the past 90 days. Describe your approach and potential issues.
Use the CLI: airflow dags backfill with start and end dates. Set max_active_runs and concurrency to limit parallel runs and avoid overwhelming downstream systems. Potential issues: resource contention (90 runs competing for workers), external API rate limits, database locks on upsert targets, and XCom storage growth. A strong answer mentions using pools to limit concurrency for specific resource-intensive tasks and testing with a small date range first.
What are Airflow pools and how do you use them to manage resource contention?
Pools limit the number of concurrent task instances that can run across all DAGs. For example, a 'database_writes' pool with 5 slots caps concurrent writes at 5 tasks to the database, preventing connection pool exhaustion. Tasks are assigned to pools in their operator definition. A strong answer mentions priority_weight for controlling which tasks get pool slots first and the default_pool that all tasks use if no pool is specified.
How do you test Airflow DAGs before deploying them to production?
Three levels of testing. Unit tests: import the DAG file and verify it parses without errors, check task count, dependencies, and default_args. Integration tests: run individual tasks using airflow tasks test with a specific execution date and verify output. End-to-end tests: trigger the full DAG in a staging environment. A strong answer mentions using DAG.test() (Airflow 2.5+), CI/CD integration to catch import errors, and the importance of testing with representative data rather than empty tables.
Explain dynamic task generation in Airflow. What are the tradeoffs?
Dynamic tasks are generated at DAG parse time based on external data (a config file, database query, or API response). For example, generating one task per table in a list. Tradeoffs: the DAG structure is fixed at parse time, so adding a table requires the scheduler to re-parse. If the external source is slow or unavailable, the DAG fails to parse. Dynamic task mapping (Airflow 2.3+) is better because it defers expansion to runtime, allowing the task count to change per DagRun. A strong answer distinguishes parse-time dynamic DAGs from runtime-mapped tasks.
Worked Example: TaskFlow API DAG
from airflow.decorators import dag, task
from datetime import datetime
@dag(
schedule="@daily",
start_date=datetime(2026, 1, 1),
catchup=False,
default_args={"retries": 2, "retry_delay": 300},
)
def daily_sales_pipeline():
@task()
def extract() -> dict:
"""Pull today's sales from the API."""
# Returns small metadata, NOT the full dataset
return {"s3_path": "s3://bucket/raw/sales/2026-01-15.parquet",
"row_count": 42_000}
@task()
def transform(extract_result: dict) -> dict:
"""Clean and deduplicate. Write to staging path."""
path = extract_result["s3_path"]
# ... transformation logic ...
return {"s3_path": "s3://bucket/staging/sales/2026-01-15.parquet",
"row_count": 41_800}
@task()
def load(transform_result: dict) -> None:
"""MERGE into the target table. Idempotent."""
path = transform_result["s3_path"]
# MERGE ON sale_id guarantees re-runs do not create duplicates
# Dependencies inferred from function calls
raw = extract()
staged = transform(raw)
load(staged)
daily_sales_pipeline()Notice that XCom passes only metadata (S3 paths and row counts), not the data itself. Datasets live in object storage. Passing DataFrames through XCom is a common anti-pattern that breaks down at scale.
Common Mistakes in Airflow Interviews
- Putting database calls or API requests in the top-level scope of a DAG file, causing the scheduler to slow down on every parse cycle
- Using XCom to pass large datasets (DataFrames, file contents) instead of writing to object storage and passing the path
- Not setting retries and retry_delay on tasks, causing transient failures to require manual intervention
- Confusing execution_date with the current timestamp, leading to off-by-one-day data processing errors
- Running all tasks in the default pool without limits, overwhelming databases or APIs with concurrent connections
- Ignoring the difference between clearing a failed task (re-runs it) and marking it success (skips it), leading to data gaps
Airflow Interview Questions FAQ
Is Airflow the only orchestrator I should know for interviews?+
Do I need to know Airflow 2.x features specifically?+
Should I deploy Airflow locally to practice?+
How important is Airflow for companies using managed orchestration services?+
What is workflow orchestration and how does Airflow implement it?+
What is an Airflow DAG?+
What are the most common Airflow operators interviewers ask about?+
How do you define task dependencies in Airflow?+
What is the Airflow scheduler and how does it work?+
What is the difference between Airflow and Airflow ETL?+
Practice Airflow Interview Questions
Build the orchestration knowledge interviewers test for. Write DAGs, debug scheduling issues, and learn the architecture.