Data Engineering Interview Prep

Apache Airflow Interview Questions for Data Engineers (2026)

Airflow is the orchestrator that ties data pipelines together. If you are interviewing for a data engineering role, you will face questions about scheduler internals, DAG design patterns, executor tradeoffs, and what happens when tasks fail in production.

This guide covers Airflow 2.x features including the TaskFlow API, dynamic task mapping, Datasets, and the KubernetesExecutor.

What Interviewers Expect

Airflow questions reveal whether you have actually operated pipelines in production. Anyone can write a simple DAG. The interview tests whether you understand what breaks, how to recover, and how to design DAGs that are reliable by default.

Junior candidates should explain the architecture (scheduler, webserver, executor, metastore), write a basic DAG with dependencies, and understand what execution_date means.

Mid-level candidates need to discuss executor tradeoffs, implement idempotent tasks, handle cross-DAG dependencies, and explain backfill strategies.

Senior candidates must design DAG architectures for complex workflows, explain dynamic task generation versus mapped tasks, manage resource contention with pools, and describe CI/CD strategies for deploying DAGs safely.

Core Concepts Interviewers Test

Scheduler, Webserver, Executor, Metastore

The scheduler parses DAG files, creates DagRun and TaskInstance records, and sends tasks to the executor. The webserver serves the UI and reads from the metastore (Postgres or MySQL). The executor runs tasks: LocalExecutor uses multiprocessing, CeleryExecutor uses distributed workers, KubernetesExecutor spins up a pod per task. Interviewers expect you to name all four components and explain what fails when each one goes down.

DAG Authoring and Best Practices

A DAG is a Python file that defines tasks and their dependencies. Top-level code in a DAG file runs every time the scheduler parses it (every few seconds), so it must be fast and side-effect-free. Interviewers test whether you know to avoid database calls, API requests, or heavy imports at the module level.

Operators and Sensors

Operators define what a task does: PythonOperator runs a callable, BashOperator runs a shell command, and SQL operators such as SQLExecuteQueryOperator run a query. Sensors wait for an external condition (file exists, partition ready, API response). Interviewers care whether you use the right operator for the job and whether you understand reschedule mode vs poke mode for sensors.

XComs

XComs (cross-communications) let tasks pass small amounts of data to downstream tasks via the metastore. They are not designed for large data. Interviewers ask about XCom limitations: size limits that vary by metadata backend, serialization overhead, and the anti-pattern of passing DataFrames through XCom instead of writing to object storage.

Trigger Rules

The default trigger rule is all_success: a task runs only when all upstream tasks succeed. Alternatives include all_failed, one_success, one_failed, none_failed, none_skipped, and always. Interviewers test whether you can design error-handling DAGs using trigger rules, such as running a cleanup task regardless of upstream success or failure.

TaskFlow API

TaskFlow (introduced in Airflow 2.0) uses the @task decorator to define tasks as Python functions. It automatically handles XCom serialization and dependency inference. Interviewers want to see you use TaskFlow for new DAGs and understand how it maps to the traditional operator model under the hood.

Airflow Interview Questions with Guidance

Q1

Explain the Airflow scheduler parsing loop. Why is it important that DAG files execute quickly?

A strong answer includes:

The scheduler's DAG file processor continuously scans the DAGs folder and imports each Python file to discover DAGs, re-parsing each file roughly every min_file_process_interval (default 30 seconds). If a DAG file takes 10 seconds to import (because it makes API calls or reads from a database at the top level), the parsing loop falls behind. Slow parsing creates a backlog: DAGs are not scheduled on time, tasks pile up, and the entire system degrades. A strong answer mentions min_file_process_interval and dagbag_import_timeout as tuning controls.

Q2

Compare CeleryExecutor, KubernetesExecutor, and LocalExecutor. When would you choose each?

A strong answer includes:

LocalExecutor runs tasks as subprocesses on the scheduler machine. Good for small deployments. CeleryExecutor distributes tasks to a pool of persistent workers via a message broker (Redis or RabbitMQ). Good for stable workloads with predictable resource needs. KubernetesExecutor creates a new pod for each task, providing perfect isolation and dynamic scaling. Good for heterogeneous workloads where tasks have different resource requirements. A strong answer mentions CeleryKubernetesExecutor as a hybrid and discusses the cold-start latency tradeoff of KubernetesExecutor.

Q3

A DAG runs daily but yesterday's run failed. Today's run is queued. What happens and how do you fix it?

A strong answer includes:

By default, tasks have depends_on_past=False, so today's run will execute regardless. If depends_on_past=True on any task, that task waits until the same task in the previous DagRun succeeds. To fix the failed run: clear the failed task instances in the UI or CLI, which re-queues them. If the failure was transient, they will succeed on retry. A strong answer discusses wait_for_downstream, catchup behavior, and the difference between clearing a task (re-runs it) and marking it as success (skips it).

Q4

How do you handle dependencies between DAGs in Airflow?

A strong answer includes:

Use TriggerDagRunOperator to start another DAG from a task. Use ExternalTaskSensor to wait for a task in another DAG to complete. Use Datasets (Airflow 2.4+) for event-driven triggering: a producer DAG marks a dataset as updated, and consumer DAGs with that dataset in their schedule automatically trigger. A strong answer notes that ExternalTaskSensor requires matching execution dates and discusses the coupling risk of cross-DAG dependencies.

Q5

What is the difference between execution_date and the actual time a DAG runs? Why does this confuse people?

A strong answer includes:

execution_date (now called logical_date) marks the start of the data interval, not when the DAG runs. A daily DAG with execution_date 2026-01-15 runs at the end of that interval: 2026-01-16T00:00. This means the DAG for 'today' actually runs 'tomorrow.' Interviewers ask this because it is the most common source of confusion in Airflow. A strong answer explains data_interval_start, data_interval_end, and how to use them in templates for idempotent queries.
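The interval arithmetic can be shown with plain datetimes (no Airflow required):

```python
from datetime import datetime, timedelta

# For a daily schedule, the run labeled with logical_date L covers the
# data interval [L, L + 1 day) and fires when that interval closes.
logical_date = datetime(2026, 1, 15)
interval = timedelta(days=1)

data_interval_start = logical_date
data_interval_end = logical_date + interval
actual_run_time = data_interval_end  # the scheduler fires the run here

print(actual_run_time)  # 2026-01-16 00:00:00
```

So a query templated with {{ data_interval_start }} and {{ data_interval_end }} always selects exactly the 24 hours the run is responsible for, which is what makes backfills and re-runs safe.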

Q6

How do you implement idempotent tasks in Airflow? Why is idempotency important?

A strong answer includes:

Idempotent tasks produce the same result regardless of how many times they run. This is critical because Airflow retries failed tasks and on-call engineers re-run tasks manually. Implementation: use MERGE/upsert instead of INSERT, write output to a date-partitioned path and overwrite the partition, or use DELETE + INSERT within a transaction. A strong answer gives a concrete example: a task that writes to S3 at s3://bucket/output/dt=2026-01-15/ and overwrites on re-run, versus one that appends to a single file and creates duplicates.
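The MERGE-on-key idea in miniature, simulated with a dict standing in for the target table (no database needed):

```python
# Target "table" keyed by primary key sale_id
target = {}

def load_partition(rows):
    """Upsert each row by primary key; re-runs converge to the same state."""
    for row in rows:
        target[row["sale_id"]] = row  # MERGE semantics: insert or overwrite

batch = [
    {"sale_id": 1, "amount": 100},
    {"sale_id": 2, "amount": 250},
]

load_partition(batch)  # first run
load_partition(batch)  # retry after a transient failure

print(len(target))  # 2 -- no duplicates, unlike an append-only INSERT
```

An append-only version (`results.extend(batch)`) would hold 4 rows after the retry; that is exactly the duplicate bug that idempotent writes prevent.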

Q7

You need to backfill a DAG for the past 90 days. Describe your approach and potential issues.

A strong answer includes:

Use the CLI: airflow dags backfill with start and end dates. Set max_active_runs and concurrency to limit parallel runs and avoid overwhelming downstream systems. Potential issues: resource contention (90 runs competing for workers), external API rate limits, database locks on upsert targets, and XCom storage growth. A strong answer mentions using pools to limit concurrency for specific resource-intensive tasks and testing with a small date range first.

Q8

What are Airflow pools and how do you use them to manage resource contention?

A strong answer includes:

Pools limit the number of concurrent task instances that can run across all DAGs. For example, a 'database_writes' pool with 5 slots caps concurrent database-writing tasks at five, preventing connection pool exhaustion. Tasks are assigned to pools in their operator definition. A strong answer mentions priority_weight for controlling which tasks get pool slots first and the default_pool that all tasks use if no pool is specified.

Q9

How do you test Airflow DAGs before deploying them to production?

A strong answer includes:

Three levels of testing. Unit tests: import the DAG file and verify it parses without errors, check task count, dependencies, and default_args. Integration tests: run individual tasks using airflow tasks test with a specific execution date and verify output. End-to-end tests: trigger the full DAG in a staging environment. A strong answer mentions using dag.test() (Airflow 2.5+), CI/CD integration to catch import errors, and the importance of testing with representative data rather than empty tables.

Q10

Explain dynamic task generation in Airflow. What are the tradeoffs?

A strong answer includes:

Dynamic tasks are generated at DAG parse time based on external data (a config file, database query, or API response). For example, generating one task per table in a list. Tradeoffs: the DAG structure is fixed at parse time, so adding a table requires the scheduler to re-parse. If the external source is slow or unavailable, the DAG fails to parse. Dynamic task mapping (Airflow 2.3+) is better because it defers expansion to runtime, allowing the task count to change per DagRun. A strong answer distinguishes parse-time dynamic DAGs from runtime-mapped tasks.

Worked Example: TaskFlow API DAG

This shows the modern Airflow 2.x pattern for writing DAGs. Interviewers expect you to use TaskFlow for new code.

from airflow.decorators import dag, task
from datetime import datetime, timedelta

@dag(
    schedule="@daily",
    start_date=datetime(2026, 1, 1),
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
)
def daily_sales_pipeline():

    @task()
    def extract() -> dict:
        """Pull today's sales from the API."""
        # Returns small metadata, NOT the full dataset
        return {"s3_path": "s3://bucket/raw/sales/2026-01-15.parquet",
                "row_count": 42_000}

    @task()
    def transform(extract_result: dict) -> dict:
        """Clean and deduplicate. Write to staging path."""
        path = extract_result["s3_path"]
        # ... transformation logic ...
        return {"s3_path": "s3://bucket/staging/sales/2026-01-15.parquet",
                "row_count": 41_800}

    @task()
    def load(transform_result: dict) -> None:
        """MERGE into the target table. Idempotent."""
        path = transform_result["s3_path"]
        # MERGE ON sale_id guarantees re-runs do not create duplicates

    # Dependencies inferred from function calls
    raw = extract()
    staged = transform(raw)
    load(staged)

daily_sales_pipeline()

Notice that XCom passes only metadata (S3 paths and row counts), not actual data. The actual datasets live in object storage. This is the pattern interviewers want to see. Passing DataFrames through XCom is a common anti-pattern that fails at scale.

Common Mistakes in Airflow Interviews

Putting database calls or API requests in the top-level scope of a DAG file, causing the scheduler to slow down on every parse cycle

Using XCom to pass large datasets (DataFrames, file contents) instead of writing to object storage and passing the path

Not setting retries and retry_delay on tasks, causing transient failures to require manual intervention

Confusing execution_date with the current timestamp, leading to off-by-one-day data processing errors

Running all tasks in the default pool without limits, overwhelming databases or APIs with concurrent connections

Ignoring the difference between clearing a failed task (re-runs it) and marking it success (skips it), leading to data gaps

Airflow Interview Questions FAQ

Is Airflow the only orchestrator I should know for interviews?
Airflow is the most commonly tested orchestrator, but some companies use Dagster, Prefect, or Mage. If the job description mentions a specific tool, study it. Otherwise, Airflow knowledge transfers well: the concepts of DAGs, task dependencies, retries, and idempotency are universal across orchestrators.
Do I need to know Airflow 2.x features specifically?
Yes. TaskFlow API, dynamic task mapping, Datasets, and the KubernetesExecutor are Airflow 2.x features that interviewers expect you to know. If you only know Airflow 1.x patterns (SubDagOperator, PythonOperator with provide_context), update your knowledge.
Should I deploy Airflow locally to practice?
Yes. Use the official Docker Compose setup (docker-compose.yaml from the Airflow docs) or Astro CLI from Astronomer. Write at least 5 DAGs that cover sensors, branching, dynamic tasks, and cross-DAG dependencies. Interview confidence comes from having debugged real scheduling issues.
How important is Airflow for companies using managed orchestration services?
Many companies use managed Airflow: AWS MWAA, Google Cloud Composer, or Astronomer. The core Airflow concepts are identical. Managed services handle infrastructure (scheduler scaling, metadata DB maintenance) but you still write DAGs. Interviewers may ask what you gain from managed services and what limitations they introduce (version pinning, custom operator restrictions).

Practice Airflow Interview Questions

Build the orchestration knowledge that interviewers test for. Write DAGs, debug scheduling issues, and understand the architecture deeply.