Data Engineering Interview Prep
Airflow is the orchestrator that ties data pipelines together. If you are interviewing for a data engineering role, you will face questions about scheduler internals, DAG design patterns, executor tradeoffs, and what happens when tasks fail in production.
Covers Airflow 2.x features including TaskFlow API, dynamic task mapping, Datasets, and the KubernetesExecutor.
Airflow questions reveal whether you have actually operated pipelines in production. Anyone can write a simple DAG. The interview tests whether you understand what breaks, how to recover, and how to design DAGs that are reliable by default.
Junior candidates should explain the architecture (scheduler, webserver, executor, metastore), write a basic DAG with dependencies, and understand what execution_date means.
Mid-level candidates need to discuss executor tradeoffs, implement idempotent tasks, handle cross-DAG dependencies, and explain backfill strategies.
Senior candidates must design DAG architectures for complex workflows, explain dynamic task generation versus mapped tasks, manage resource contention with pools, and describe CI/CD strategies for deploying DAGs safely.
The scheduler parses DAG files, creates DagRun and TaskInstance records, and sends tasks to the executor. The webserver serves the UI and reads from the metastore (Postgres or MySQL). The executor runs tasks: LocalExecutor uses multiprocessing, CeleryExecutor uses distributed workers, KubernetesExecutor spins up a pod per task. Interviewers expect you to name all four components and explain what fails when each one goes down.
A DAG is a Python file that defines tasks and their dependencies. Top-level code in a DAG file runs every time the scheduler parses it (every few seconds), so it must be fast and side-effect-free. Interviewers test whether you know to avoid database calls, API requests, or heavy imports at the module level.
Operators define what a task does: PythonOperator runs a callable, BashOperator runs a shell command, SQLExecuteQueryOperator runs a query. Sensors wait for an external condition (file exists, partition ready, API response). Interviewers care whether you use the right operator for the job and whether you understand reschedule mode vs poke mode for sensors.
XComs (cross-communications) let tasks pass small amounts of data to downstream tasks via the metastore. They are not designed for large data. Interviewers ask about XCom limitations: size limits (varies by backend), serialization overhead, and the anti-pattern of passing DataFrames through XCom instead of writing to object storage.
The default trigger rule is all_success: a task runs only when all upstream tasks succeed. Alternatives include all_failed, all_done, one_success, one_failed, none_failed, none_skipped, and always. Interviewers test whether you can design error-handling DAGs using trigger rules, such as running a cleanup task with all_done so it executes regardless of upstream success or failure.
TaskFlow (introduced in Airflow 2.0) uses the @task decorator to define tasks as Python functions. It automatically handles XCom serialization and dependency inference. Interviewers want to see you use TaskFlow for new DAGs and understand how it maps to the traditional operator model under the hood.
Explain the Airflow scheduler parsing loop. Why is it important that DAG files execute quickly?
The scheduler's DAG file processor continuously scans the DAGs folder and imports each Python file to discover DAGs; each file is re-parsed at most every min_file_process_interval (default 30 seconds). If a DAG file takes 10 seconds to import (because it makes API calls or reads from a database at the top level), the parser falls behind. Slow parsing creates a backlog: DAGs are not scheduled on time, tasks pile up, and the entire system degrades. A strong answer mentions min_file_process_interval and dagbag_import_timeout as tuning controls.
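Both tuning controls live in airflow.cfg; a fragment showing them with their Airflow 2.x default values:

```ini
[scheduler]
; Minimum seconds between re-parses of a single DAG file. Raising this
; reduces parse load at the cost of slower pickup of DAG changes.
min_file_process_interval = 30

[core]
; Abort any DAG file import that exceeds this many seconds.
dagbag_import_timeout = 30.0
```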
Compare CeleryExecutor, KubernetesExecutor, and LocalExecutor. When would you choose each?
LocalExecutor runs tasks as subprocesses on the scheduler machine. Good for small deployments. CeleryExecutor distributes tasks to a pool of persistent workers via a message broker (Redis or RabbitMQ). Good for stable workloads with predictable resource needs. KubernetesExecutor creates a new pod for each task, providing perfect isolation and dynamic scaling. Good for heterogeneous workloads where tasks have different resource requirements. A strong answer mentions CeleryKubernetesExecutor as a hybrid and discusses the cold-start latency tradeoff of KubernetesExecutor.
A DAG runs daily but yesterday's run failed. Today's run is queued. What happens and how do you fix it?
By default, Airflow respects depends_on_past=False, so today's run will execute regardless. If depends_on_past=True on any task, that task waits until the same task in the previous DagRun succeeds. To fix the failed run: clear the failed task instances in the UI or CLI, which re-queues them. If the failure was transient, they will succeed on retry. A strong answer discusses wait_for_downstream, catchup behavior, and the difference between clearing a task (re-runs it) and marking it as success (skips it).
How do you handle dependencies between DAGs in Airflow?
Use TriggerDagRunOperator to start another DAG from a task. Use ExternalTaskSensor to wait for a task in another DAG to complete. Use Datasets (Airflow 2.4+) for event-driven triggering: a producer DAG marks a dataset as updated, and consumer DAGs with that dataset in their schedule automatically trigger. A strong answer notes that ExternalTaskSensor requires matching execution dates and discusses the coupling risk of cross-DAG dependencies.
What is the difference between execution_date and the actual time a DAG runs? Why does this confuse people?
execution_date (now called logical_date) marks the start of the data interval, not when the DAG runs. A daily DAG with execution_date 2026-01-15 runs at the end of that interval: 2026-01-16T00:00. This means the DAG for 'today' actually runs 'tomorrow.' Interviewers ask this because it is the most common source of confusion in Airflow. A strong answer explains data_interval_start, data_interval_end, and how to use them in templates for idempotent queries.
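The interval arithmetic can be sketched with the standard library alone (no Airflow needed):

```python
from datetime import datetime, timedelta

# For a daily schedule, the run stamped with logical_date L covers the
# interval [L, L + 1 day) and launches only after that interval closes.
logical_date = datetime(2026, 1, 15)          # a.k.a. execution_date
data_interval_start = logical_date
data_interval_end = logical_date + timedelta(days=1)
earliest_run_time = data_interval_end         # the "Jan 15" run fires Jan 16

# In a templated query, the same values keep re-runs idempotent:
#   WHERE event_time >= '{{ data_interval_start }}'
#     AND event_time <  '{{ data_interval_end }}'
```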
How do you implement idempotent tasks in Airflow? Why is idempotency important?
Idempotent tasks produce the same result regardless of how many times they run. This is critical because Airflow retries failed tasks and on-call engineers re-run tasks manually. Implementation: use MERGE/upsert instead of INSERT, write output to a date-partitioned path and overwrite the partition, or use DELETE + INSERT within a transaction. A strong answer gives a concrete example: a task that writes to S3 at s3://bucket/output/dt=2026-01-15/ and overwrites on re-run, versus one that appends to a single file and creates duplicates.
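A runnable sketch of the upsert idea, with sqlite3 standing in for the warehouse and ON CONFLICT playing the role of MERGE (table and column names are made up):

```python
import sqlite3

# sqlite stands in for the warehouse; ON CONFLICT plays the role of MERGE.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, amount REAL)")


def load_partition(rows):
    # Re-running with the same rows leaves the table unchanged:
    # each sale_id is inserted once and merely updated thereafter.
    conn.executemany(
        "INSERT INTO sales (sale_id, amount) VALUES (?, ?) "
        "ON CONFLICT(sale_id) DO UPDATE SET amount = excluded.amount",
        rows,
    )
    conn.commit()


rows = [(1, 9.99), (2, 24.50)]
load_partition(rows)
load_partition(rows)  # simulate an Airflow retry: no duplicates appear

count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(count)  # 2, not 4
```

A plain INSERT in load_partition would double the row count on retry, which is exactly the failure mode the interviewer is probing for.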
You need to backfill a DAG for the past 90 days. Describe your approach and potential issues.
Use the CLI: airflow dags backfill with start and end dates. Set max_active_runs and concurrency to limit parallel runs and avoid overwhelming downstream systems. Potential issues: resource contention (90 runs competing for workers), external API rate limits, database locks on upsert targets, and XCom storage growth. A strong answer mentions using pools to limit concurrency for specific resource-intensive tasks and testing with a small date range first.
What are Airflow pools and how do you use them to manage resource contention?
Pools limit the number of concurrent task instances that can run across all DAGs. For example, a 'database_writes' pool with 5 slots caps concurrent database-write tasks at 5, preventing connection pool exhaustion. Tasks are assigned to pools in their operator definition. A strong answer mentions priority_weight for controlling which tasks get pool slots first and the default_pool that all tasks use if no pool is specified.
How do you test Airflow DAGs before deploying them to production?
Three levels of testing. Unit tests: import the DAG file and verify it parses without errors, check task count, dependencies, and default_args. Integration tests: run individual tasks using airflow tasks test with a specific execution date and verify output. End-to-end tests: trigger the full DAG in a staging environment. A strong answer mentions using DAG.test() (Airflow 2.5+), CI/CD integration to catch import errors, and the importance of testing with representative data rather than empty tables.
Explain dynamic task generation in Airflow. What are the tradeoffs?
Dynamic tasks are generated at DAG parse time based on external data (a config file, database query, or API response). For example, generating one task per table in a list. Tradeoffs: the DAG structure is fixed at parse time, so adding a table requires the scheduler to re-parse. If the external source is slow or unavailable, the DAG fails to parse. Dynamic task mapping (Airflow 2.3+) is better because it defers expansion to runtime, allowing the task count to change per DagRun. A strong answer distinguishes parse-time dynamic DAGs from runtime-mapped tasks.
This shows the modern Airflow 2.x pattern for writing DAGs. Interviewers expect you to use TaskFlow for new code.
```python
from airflow.decorators import dag, task
from datetime import datetime, timedelta


@dag(
    schedule="@daily",
    start_date=datetime(2026, 1, 1),
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
)
def daily_sales_pipeline():
    @task()
    def extract() -> dict:
        """Pull today's sales from the API."""
        # Returns small metadata, NOT the full dataset
        return {"s3_path": "s3://bucket/raw/sales/2026-01-15.parquet",
                "row_count": 42_000}

    @task()
    def transform(extract_result: dict) -> dict:
        """Clean and deduplicate. Write to staging path."""
        path = extract_result["s3_path"]
        # ... transformation logic ...
        return {"s3_path": "s3://bucket/staging/sales/2026-01-15.parquet",
                "row_count": 41_800}

    @task()
    def load(transform_result: dict) -> None:
        """MERGE into the target table. Idempotent."""
        path = transform_result["s3_path"]
        # MERGE ON sale_id guarantees re-runs do not create duplicates
        # ... load logic ...

    # Dependencies inferred from function calls
    raw = extract()
    staged = transform(raw)
    load(staged)


daily_sales_pipeline()
```

Notice that XCom passes only metadata (S3 paths and row counts), not actual data. The actual datasets live in object storage. This is the pattern interviewers want to see. Passing DataFrames through XCom is a common anti-pattern that fails at scale.
Putting database calls or API requests in the top-level scope of a DAG file, causing the scheduler to slow down on every parse cycle
Using XCom to pass large datasets (DataFrames, file contents) instead of writing to object storage and passing the path
Not setting retries and retry_delay on tasks, causing transient failures to require manual intervention
Confusing execution_date with the current timestamp, leading to off-by-one-day data processing errors
Running all tasks in the default pool without limits, overwhelming databases or APIs with concurrent connections
Ignoring the difference between clearing a failed task (re-runs it) and marking it success (skips it), leading to data gaps
Build the orchestration knowledge that interviewers test for. Write DAGs, debug scheduling issues, and understand the architecture deeply.