Data Engineering Interview Prep
Airflow is the orchestrator that ties data pipelines together. If you are interviewing for a data engineering role, you will face questions about scheduler internals, DAG design patterns, executor tradeoffs, and what happens when tasks fail in production.
Covers Airflow 2.x features including TaskFlow API, dynamic task mapping, Datasets, and the KubernetesExecutor.
Airflow questions reveal whether you have actually operated pipelines in production. Anyone can write a simple DAG. The interview tests whether you understand what breaks, how to recover, and how to design DAGs that are reliable by default.
Junior candidates should explain the architecture (scheduler, webserver, executor, metastore), write a basic DAG with dependencies, and understand what execution_date means.
Mid-level candidates need to discuss executor tradeoffs, implement idempotent tasks, handle cross-DAG dependencies, and explain backfill strategies.
Senior candidates must design DAG architectures for complex workflows, explain dynamic task generation versus mapped tasks, manage resource contention with pools, and describe CI/CD strategies for deploying DAGs safely.
The scheduler parses DAG files, creates DagRun and TaskInstance records, and sends tasks to the executor. The webserver serves the UI and reads from the metastore (Postgres or MySQL). The executor runs tasks: LocalExecutor uses multiprocessing, CeleryExecutor uses distributed workers, KubernetesExecutor spins up a pod per task. Interviewers expect you to name all four components and explain what fails when each one goes down.
A DAG is a Python file that defines tasks and their dependencies. Top-level code in a DAG file runs every time the scheduler parses it (every few seconds), so it must be fast and side-effect-free. Interviewers test whether you know to avoid database calls, API requests, or heavy imports at the module level.
Operators define what a task does: PythonOperator runs a callable, BashOperator runs a shell command, SQLExecuteQueryOperator runs a query. Sensors wait for an external condition (file exists, partition ready, API response). Interviewers care whether you use the right operator for the job and whether you understand reschedule mode vs poke mode for sensors.
XComs (cross-communications) let tasks pass small amounts of data to downstream tasks via the metastore. They are not designed for large data. Interviewers ask about XCom limitations: size limits (varies by backend), serialization overhead, and the anti-pattern of passing DataFrames through XCom instead of writing to object storage.
The default trigger rule is all_success: a task runs only when all upstream tasks succeed. Alternatives include all_failed, all_done, one_success, one_failed, none_failed, none_skipped, and always. Interviewers test whether you can design error-handling DAGs using trigger rules, such as running a cleanup task with all_done so it executes regardless of upstream success or failure.
TaskFlow (introduced in Airflow 2.0) uses the @task decorator to define tasks as Python functions. It automatically handles XCom serialization and dependency inference. Interviewers want to see you use TaskFlow for new DAGs and understand how it maps to the traditional operator model under the hood.
Explain the Airflow scheduler parsing loop. Why is it important that DAG files execute quickly?
The scheduler's DAG file processor continuously scans the DAGs folder and imports each Python file to discover DAGs; each file is re-parsed at most every min_file_process_interval (default 30 seconds). If a DAG file takes 10 seconds to import (because it makes API calls or reads from a database at the top level), the parser falls behind. Slow parsing creates a backlog: DAGs are not scheduled on time, tasks pile up, and the entire system degrades. A strong answer mentions min_file_process_interval and dagbag_import_timeout as tuning controls.
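Both tuning controls live in airflow.cfg; a fragment showing them with their Airflow 2.x default values:

```ini
[scheduler]
; Minimum seconds between re-parses of a single DAG file. Raising this
; reduces parse load at the cost of slower pickup of DAG changes.
min_file_process_interval = 30

[core]
; Abort any DAG file import that exceeds this many seconds.
dagbag_import_timeout = 30.0
```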
Compare CeleryExecutor, KubernetesExecutor, and LocalExecutor. When would you choose each?
LocalExecutor runs tasks as subprocesses on the scheduler machine. Good for small deployments. CeleryExecutor distributes tasks to a pool of persistent workers via a message broker (Redis or RabbitMQ). Good for stable workloads with predictable resource needs. KubernetesExecutor creates a new pod for each task, providing perfect isolation and dynamic scaling. Good for heterogeneous workloads where tasks have different resource requirements. A strong answer mentions CeleryKubernetesExecutor as a hybrid and discusses the cold-start latency tradeoff of KubernetesExecutor.
A DAG runs daily but yesterday's run failed. Today's run is queued. What happens and how do you fix it?
By default, Airflow respects depends_on_past=False, so today's run will execute regardless. If depends_on_past=True on any task, that task waits until the same task in the previous DagRun succeeds. To fix the failed run: clear the failed task instances in the UI or CLI, which re-queues them. If the failure was transient, they will succeed on retry. A strong answer discusses wait_for_downstream, catchup behavior, and the difference between clearing a task (re-runs it) and marking it as success (skips it).
How do you handle dependencies between DAGs in Airflow?
Use TriggerDagRunOperator to start another DAG from a task. Use ExternalTaskSensor to wait for a task in another DAG to complete. Use Datasets (Airflow 2.4+) for event-driven triggering: a producer DAG marks a dataset as updated, and consumer DAGs with that dataset in their schedule automatically trigger. A strong answer notes that ExternalTaskSensor requires matching execution dates and discusses the coupling risk of cross-DAG dependencies.
What is the difference between execution_date and the actual time a DAG runs? Why does this confuse people?
execution_date (now called logical_date) marks the start of the data interval, not when the DAG runs. A daily DAG with execution_date 2026-01-15 runs at the end of that interval: 2026-01-16T00:00. This means the DAG for 'today' actually runs 'tomorrow.' Interviewers ask this because it is the most common source of confusion in Airflow. A strong answer explains data_interval_start, data_interval_end, and how to use them in templates for idempotent queries.
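The interval arithmetic can be sketched with the standard library alone (no Airflow needed):

```python
from datetime import datetime, timedelta

# For a daily schedule, the run stamped with logical_date L covers the
# interval [L, L + 1 day) and launches only after that interval closes.
logical_date = datetime(2026, 1, 15)          # a.k.a. execution_date
data_interval_start = logical_date
data_interval_end = logical_date + timedelta(days=1)
earliest_run_time = data_interval_end         # the "Jan 15" run fires Jan 16

# In a templated query, the same values keep re-runs idempotent:
#   WHERE event_time >= '{{ data_interval_start }}'
#     AND event_time <  '{{ data_interval_end }}'
```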
How do you implement idempotent tasks in Airflow? Why is idempotency important?
Idempotent tasks produce the same result regardless of how many times they run. This is critical because Airflow retries failed tasks and on-call engineers re-run tasks manually. Implementation: use MERGE/upsert instead of INSERT, write output to a date-partitioned path and overwrite the partition, or use DELETE + INSERT within a transaction. A strong answer gives a concrete example: a task that writes to S3 at s3://bucket/output/dt=2026-01-15/ and overwrites on re-run, versus one that appends to a single file and creates duplicates.
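A runnable sketch of the upsert idea, with sqlite3 standing in for the warehouse and ON CONFLICT playing the role of MERGE (table and column names are made up):

```python
import sqlite3

# sqlite stands in for the warehouse; ON CONFLICT plays the role of MERGE.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, amount REAL)")


def load_partition(rows):
    # Re-running with the same rows leaves the table unchanged:
    # each sale_id is inserted once and merely updated thereafter.
    conn.executemany(
        "INSERT INTO sales (sale_id, amount) VALUES (?, ?) "
        "ON CONFLICT(sale_id) DO UPDATE SET amount = excluded.amount",
        rows,
    )
    conn.commit()


rows = [(1, 9.99), (2, 24.50)]
load_partition(rows)
load_partition(rows)  # simulate an Airflow retry: no duplicates appear

count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(count)  # 2, not 4
```

A plain INSERT in load_partition would double the row count on retry, which is exactly the failure mode the interviewer is probing for.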
You need to backfill a DAG for the past 90 days. Describe your approach and potential issues.
Use the CLI: airflow dags backfill with start and end dates. Set max_active_runs and concurrency to limit parallel runs and avoid overwhelming downstream systems. Potential issues: resource contention (90 runs competing for workers), external API rate limits, database locks on upsert targets, and XCom storage growth. A strong answer mentions using pools to limit concurrency for specific resource-intensive tasks and testing with a small date range first.
What are Airflow pools and how do you use them to manage resource contention?
Pools limit the number of concurrent task instances that can run across all DAGs. For example, a 'database_writes' pool with 5 slots caps concurrent database-write tasks at 5, preventing connection pool exhaustion. Tasks are assigned to pools in their operator definition. A strong answer mentions priority_weight for controlling which tasks get pool slots first and the default_pool that all tasks use if no pool is specified.
How do you test Airflow DAGs before deploying them to production?
Three levels of testing. Unit tests: import the DAG file and verify it parses without errors, check task count, dependencies, and default_args. Integration tests: run individual tasks using airflow tasks test with a specific execution date and verify output. End-to-end tests: trigger the full DAG in a staging environment. A strong answer mentions using DAG.test() (Airflow 2.5+), CI/CD integration to catch import errors, and the importance of testing with representative data rather than empty tables.
Explain dynamic task generation in Airflow. What are the tradeoffs?
Dynamic tasks are generated at DAG parse time based on external data (a config file, database query, or API response). For example, generating one task per table in a list. Tradeoffs: the DAG structure is fixed at parse time, so adding a table requires the scheduler to re-parse. If the external source is slow or unavailable, the DAG fails to parse. Dynamic task mapping (Airflow 2.3+) is better because it defers expansion to runtime, allowing the task count to change per DagRun. A strong answer distinguishes parse-time dynamic DAGs from runtime-mapped tasks.
This shows the modern Airflow 2.x pattern for writing DAGs. Interviewers expect you to use TaskFlow for new code.
```python
from airflow.decorators import dag, task
from datetime import datetime, timedelta


@dag(
    schedule="@daily",
    start_date=datetime(2026, 1, 1),
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
)
def daily_sales_pipeline():
    @task()
    def extract() -> dict:
        """Pull today's sales from the API."""
        # Returns small metadata, NOT the full dataset
        return {"s3_path": "s3://bucket/raw/sales/2026-01-15.parquet",
                "row_count": 42_000}

    @task()
    def transform(extract_result: dict) -> dict:
        """Clean and deduplicate. Write to staging path."""
        path = extract_result["s3_path"]
        # ... transformation logic ...
        return {"s3_path": "s3://bucket/staging/sales/2026-01-15.parquet",
                "row_count": 41_800}

    @task()
    def load(transform_result: dict) -> None:
        """MERGE into the target table. Idempotent."""
        path = transform_result["s3_path"]
        # MERGE ON sale_id guarantees re-runs do not create duplicates
        # ... load logic ...

    # Dependencies inferred from function calls
    raw = extract()
    staged = transform(raw)
    load(staged)


daily_sales_pipeline()
```

Notice that XCom passes only metadata (S3 paths and row counts), not actual data. The actual datasets live in object storage. This is the pattern interviewers want to see. Passing DataFrames through XCom is a common anti-pattern that fails at scale.
Putting database calls or API requests in the top-level scope of a DAG file, causing the scheduler to slow down on every parse cycle
Using XCom to pass large datasets (DataFrames, file contents) instead of writing to object storage and passing the path
Not setting retries and retry_delay on tasks, causing transient failures to require manual intervention
Confusing execution_date with the current timestamp, leading to off-by-one-day data processing errors
Running all tasks in the default pool without limits, overwhelming databases or APIs with concurrent connections
Ignoring the difference between clearing a failed task (re-runs it) and marking it success (skips it), leading to data gaps
Build the orchestration knowledge that interviewers test for. Write DAGs, debug scheduling issues, and understand the architecture deeply.