A DAG is where your pipeline's topology meets the scheduler's clock. It sits at the orchestration layer: above the task workers, below the metadata database, beside the executor. Everything you build in Airflow, from a SparkSubmitOperator to a sensor waiting on S3, lives inside a DAG and inherits its contract about idempotency, ordering, and retries. This reference treats the DAG as a system-design artifact first and a Python file second.
[Stats panel: pipeline topics in DE rounds, pipeline challenges, L6 staff rounds, rounds analyzed. Source: DataDriven analysis of 1,042 verified data engineering interview rounds.]
Think of these as the interfaces between your DAG and the rest of the orchestration layer. Get them wrong and your pipeline fights the scheduler instead of using it.
DAG stands for Directed Acyclic Graph. Directed means each edge has a direction (task A runs before task B). Acyclic means there are no cycles (you cannot create a loop where task A depends on task B which depends on task A). Graph means a collection of nodes (tasks) connected by edges (dependencies). In Airflow, a DAG is a Python file that defines a set of tasks and the order they should run. The DAG itself does not do any work. It describes what work should happen and in what order. Think of it as a blueprint, not a factory.
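The directed-acyclic ordering can be sketched without Airflow at all. Python's stdlib graphlib resolves the same structure; the extract/transform/load graph below is a hypothetical example, not Airflow API:

```python
from graphlib import TopologicalSorter, CycleError

# Directed edges: each key lists the tasks it depends on.
graph = {
    "transform": {"extract"},   # transform depends on extract
    "load": {"transform"},      # load depends on transform
}
order = list(TopologicalSorter(graph).static_order())
print(order)  # extract comes before transform, which comes before load

# Adding a back-edge creates a cycle, which a DAG forbids:
graph["extract"] = {"load"}
try:
    list(TopologicalSorter(graph).static_order())
except CycleError:
    print("cycle detected -- no longer a valid DAG")
```

Airflow's scheduler performs the same kind of resolution: it only queues a task once everything upstream of it has finished.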
Airflow's scheduler continuously scans a folder (the dags_folder) and imports every Python file it finds. Each file can define one or more DAGs. When the scheduler imports a DAG file, it builds an internal representation of the tasks and dependencies. This happens every few seconds (controlled by min_file_process_interval). The scheduler then creates DagRuns based on the DAG's schedule (daily, hourly, cron expression, or dataset-triggered). Each DagRun creates TaskInstances for every task in the DAG. The scheduler puts ready TaskInstances into a queue, and the executor runs them.
A DagRun is a single execution of a DAG. If your DAG runs daily, you get one DagRun per day. Each DagRun has a logical_date (formerly execution_date) that represents the start of the data interval it covers. A daily DAG with logical_date 2026-01-15 runs at the end of that interval (2026-01-16T00:00) and processes data from 2026-01-15. Each DagRun creates TaskInstances for every task in the DAG. A TaskInstance represents one specific task running in one specific DagRun. TaskInstances have states: queued, running, success, failed, skipped, up_for_retry.
This is the single most important concept for writing production Airflow DAGs. Code at the top level of a DAG file runs every time the scheduler parses the file. This happens every few seconds. If your top-level code makes a database call, an API request, or a slow import, the scheduler slows down for every single DAG it manages. The rule is simple: the top level of a DAG file should only define the DAG object and its tasks. All actual work happens inside task callables, which only run when the task executes, not when the file is parsed.
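The parse-time versus execute-time split can be seen in a plain-Python sketch (no Airflow required; `parse_log` and `transform` are illustrative names). Module-level statements run on every scheduler parse; the body of a callable runs only when the task executes:

```python
parse_log = []

# -- top level: the scheduler re-runs this on every parse, every few seconds --
parse_log.append("parsed")          # cheap: fine at top level
# requests.get(...)                 # expensive: never do this at top level

def transform():
    # -- task callable: runs only when the TaskInstance actually executes --
    import json  # lazy import keeps heavy dependencies out of parse time
    return json.dumps({"rows": 42})
```

In a real DAG file the top level would contain only the DAG object, operator instantiations, and dependency declarations.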
The building blocks of every Airflow DAG.
Operators define what a task does. PythonOperator runs a Python callable. BashOperator runs a shell command. SQLExecuteQueryOperator (from the common SQL provider) runs a SQL query. Each operator type handles a specific kind of work. There are also transfer operators (move data between systems), sensor operators (wait for conditions), and provider operators (interact with external services like AWS, GCP, or Slack). You rarely need to write custom operators: PythonOperator covers roughly 80% of use cases.
PythonOperator(task_id='transform', python_callable=transform_data, op_kwargs={'date': '{{ ds }}'})

A task is an instance of an operator within a DAG. Tasks are connected by dependencies that define execution order. Use >> (the bitshift operator) to set dependencies: extract >> transform >> load means extract runs first, then transform, then load. You can also fan out (extract >> [transform_a, transform_b]) or fan in ([transform_a, transform_b] >> load). Dependencies only control order, not data passing. Tasks do not automatically receive output from upstream tasks.
extract >> [transform_users, transform_events] >> load_warehouse

Sensors are special operators that wait for an external condition to be true. FileSensor waits for a file to appear. ExternalTaskSensor waits for a task in another DAG to complete. HttpSensor waits for an HTTP endpoint to return a success status. Sensors have two modes: poke (checks repeatedly, holding a worker slot) and reschedule (releases the worker between checks). Use reschedule mode for long waits to avoid blocking worker slots. A sensor that pokes every 30 seconds for 6 hours ties up a worker slot the entire time.
FileSensor(task_id='wait_for_file', filepath='/data/input/{{ ds }}.csv', mode='reschedule', poke_interval=300)

XComs let tasks pass small pieces of data to downstream tasks via the Airflow metastore (database). A task pushes a value with xcom_push, and a downstream task pulls it with xcom_pull. The TaskFlow API (@task decorator) handles XCom automatically: the return value of a decorated function is pushed, and calling the function in another task pulls it. XComs are not for large data. They store values in the metastore database. Passing a DataFrame through XCom is an anti-pattern. Instead, write data to object storage (S3, GCS) and pass the file path as an XCom.
ti.xcom_push(key='row_count', value=42) / ti.xcom_pull(task_ids='extract', key='row_count')

The default trigger rule is all_success: a task runs only when all upstream tasks succeed. This is correct for most cases, but sometimes you need different behavior. all_failed: run only if all upstreams failed (useful for error notification tasks). one_success: run if any upstream succeeded (useful for branching). none_failed: run if no upstream failed (allows skipped upstreams). always: run regardless of upstream state (useful for cleanup tasks). Understanding trigger rules is essential for building DAGs with error handling and branching logic.
PythonOperator(task_id='cleanup', python_callable=cleanup, trigger_rule='all_done')

Rules that separate production-quality DAGs from tutorial-quality DAGs.
Every task should produce the same result regardless of how many times it runs. This is critical because Airflow retries failed tasks and operators can manually re-run tasks. If your task inserts rows into a database, a retry creates duplicates. Instead, use MERGE/upsert, or DELETE + INSERT within a transaction, or write to a date-partitioned path and overwrite the partition. Idempotent tasks are safe to retry, safe to backfill, and safe to re-run when debugging.
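The DELETE + INSERT pattern can be sketched with stdlib sqlite3 (the `sales` table and partition column are hypothetical): replacing the day's partition inside one transaction means a retry produces the same rows, not duplicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (ds TEXT, amount REAL)")

def load_partition(conn, ds, rows):
    # Idempotent load: replace the day's partition atomically,
    # so a retry or backfill never duplicates rows.
    with conn:  # transaction: commits on success, rolls back on error
        conn.execute("DELETE FROM sales WHERE ds = ?", (ds,))
        conn.executemany(
            "INSERT INTO sales (ds, amount) VALUES (?, ?)",
            [(ds, amt) for amt in rows],
        )

load_partition(conn, "2026-01-15", [10.0, 20.0])
load_partition(conn, "2026-01-15", [10.0, 20.0])  # simulated retry
count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(count)  # still 2, not 4
```

A plain INSERT without the DELETE would leave 4 rows after the retry, which is exactly the duplicate bug described above.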
Every production task should have retries configured. Transient failures (network timeouts, API rate limits, temporary database locks) are normal. A task with retries=3 and retry_delay=timedelta(minutes=5) handles most transient failures automatically without human intervention. For tasks that call external APIs, use exponential backoff: retry_delay doubles with each attempt. Set retry_exponential_backoff=True and max_retry_delay to cap it.
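The resulting delay schedule is easy to compute by hand. A sketch of the doubling-with-cap arithmetic (Airflow's actual backoff formula adds jitter and implementation details; the values here are illustrative):

```python
from datetime import timedelta

def backoff_delays(base, max_delay, retries):
    """Delay before each retry attempt: base * 2**n, capped at max_delay."""
    return [min(base * (2 ** n), max_delay) for n in range(retries)]

delays = backoff_delays(timedelta(minutes=5), timedelta(minutes=30), 4)
print([d.total_seconds() / 60 for d in delays])  # [5.0, 10.0, 20.0, 30.0]
```

The cap matters: without max_retry_delay, a handful of retries can push the next attempt hours into the future.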
The scheduler parses DAG files every few seconds. If your DAG file makes API calls, reads from a database, or imports heavy libraries at the top level, you slow down the scheduler for every DAG. Move all work into task callables. Use lazy imports (import inside the function, not at the top of the file) for heavy libraries like pandas or boto3. A DAG file should define tasks and dependencies and nothing else.
The TaskFlow API (@task decorator, introduced in Airflow 2.0) simplifies DAG authoring. Instead of instantiating PythonOperator and passing callables, you decorate Python functions with @task. Return values automatically become XComs. Dependencies are inferred from function calls. The result is cleaner, more Pythonic code that is easier to read and maintain. Use TaskFlow for all new DAGs unless you have a specific reason to use the classic operator style.
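A real TaskFlow DAG needs Airflow installed; the data-flow idea can be mocked in plain Python (a conceptual sketch, not the real @task API). Return values travel between functions the way XComs travel between decorated tasks, and the call graph implies the dependencies:

```python
# In real TaskFlow code each function would carry @task, and the calls
# below would build extract >> transform >> load automatically.
def extract():
    return [{"id": 1}, {"id": 2}]

def transform(rows):
    return [{"id": r["id"], "ok": True} for r in rows]

def load(rows):
    return len(rows)

# Calling the functions wires the dependency chain and passes the "XComs".
loaded = load(transform(extract()))
print(loaded)  # 2
```

With the classic operator style, the same pipeline needs explicit xcom_pull calls and explicit >> dependencies; TaskFlow infers both from ordinary function calls.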
Pools limit concurrent task instances across all DAGs. If your data warehouse allows 10 concurrent connections, create a pool with 10 slots and assign all warehouse-writing tasks to that pool. This prevents 50 tasks from trying to write simultaneously and crashing the database. Pools are global: they apply across all DAGs and all DagRuns. Use priority_weight to control which tasks get pool slots first.
A DAG with 200 tasks and complex branching logic is hard to debug, slow to parse, and difficult for anyone else on the team to understand. If your DAG is growing beyond 30 to 40 tasks, consider breaking it into multiple DAGs connected by Datasets or TriggerDagRunOperator. Smaller, focused DAGs are easier to test, faster to troubleshoot, and more maintainable over time.
Questions that test your understanding of DAGs beyond the basics.
The logical_date marks the start of the data interval, not when the DAG runs. A daily DAG scheduled for 2026-01-15 actually runs on 2026-01-16 at midnight (at the end of the interval). This matters because SQL queries parameterized with the logical_date process the correct day's data. Getting this wrong is the most common source of off-by-one-day bugs in Airflow pipelines. Use data_interval_start and data_interval_end in templates for clarity. The interviewer wants to see you have hit this in production and understand the implications.
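The off-by-one relationship is plain interval arithmetic; a stdlib sketch for a daily schedule:

```python
from datetime import datetime, timedelta

logical_date = datetime(2026, 1, 15)           # == data_interval_start
data_interval_end = logical_date + timedelta(days=1)

# The run fires at the END of the interval it covers:
run_after = data_interval_end                  # 2026-01-16T00:00
print(logical_date.date(), "->", run_after)
```

A query templated with the logical_date therefore selects 2026-01-15's data even though the wall clock says 2026-01-16.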
By default (depends_on_past=False), today's run executes regardless of yesterday's failure. If depends_on_past=True on any task, that task waits for its previous DagRun to succeed. To fix the failure: check the logs for the failed task, identify the root cause, fix the issue, and clear the failed task instances (which re-queues them). The interviewer checks whether you know the difference between clearing a task (re-runs it) and marking it as success (skips it), and whether you would clear just the failed task or the entire DagRun.
Set retries and retry_delay on each task for transient failures. Use on_failure_callback at the DAG level to send alerts (Slack, PagerDuty, email) when any task fails after exhausting retries. Add a cleanup task with trigger_rule='all_done' that runs regardless of upstream success or failure. For critical tasks, use a downstream sensor that verifies the expected output exists. The interviewer looks for layered error handling: retries for transient issues, alerts for persistent failures, and cleanup for any outcome.
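A failure callback receives the task context dict from Airflow; "task_instance" is a standard key on it. A minimal sketch, where `send_alert` is a placeholder for a real Slack/PagerDuty/email client and the context is simulated for illustration:

```python
from types import SimpleNamespace

sent = []

def send_alert(message):
    # Placeholder for a Slack/PagerDuty/email client call.
    sent.append(message)

def notify_failure(context):
    # Airflow passes a context dict; "task_instance" is a standard key.
    ti = context["task_instance"]
    msg = f"Task {ti.task_id} in DAG {ti.dag_id} failed on try {ti.try_number}"
    send_alert(msg)
    return msg

# Wiring (in a real DAG): DAG(..., on_failure_callback=notify_failure)
# Simulated context object for illustration:
ctx = {"task_instance": SimpleNamespace(task_id="load", dag_id="etl", try_number=3)}
notify_failure(ctx)
print(sent[0])
```

Setting the callback at the DAG level (rather than per task) gives one alerting path for every task in the pipeline.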
Three extraction tasks (one per source) that run in parallel. Each extraction task writes its output to a staging location (S3 path or staging table). A sensor or trigger rule ensures the join task only runs when all three extractions succeed. The join task reads from the staging locations and produces the final output. Use all_success trigger rule on the join task. Discuss what happens when one source is delayed: use a sensor with a timeout and a timeout callback that alerts the team. The interviewer tests whether you think about real-world complexity: sources that arrive late, sources that arrive with different schemas, and sources that arrive with overlapping data.
Pipeline architecture questions reward candidates who can diagram a DAG on a whiteboard in thirty seconds. Build that reflex here.