
Airflow DAG: Complete Reference

A DAG is where your pipeline's topology meets the scheduler's clock. It sits at the orchestration layer: above the task workers, below the metadata database, beside the executor. Everything you build in Airflow, from a SparkSubmitOperator to a sensor waiting on S3, lives inside a DAG and inherits its contract about idempotency, ordering, and retries. This reference treats the DAG as a system-design artifact first and a Python file second.

Pipeline topics in DE rounds: 18%
Pipeline challenges: 120
L6 staff rounds: 17%
Rounds analyzed: 1,042

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

Core DAG Concepts

Think of these as the interfaces between your DAG and the rest of the orchestration layer. Get them wrong and your pipeline fights the scheduler instead of using it.

What Is a DAG?

DAG stands for Directed Acyclic Graph. Directed means each edge has a direction (task A runs before task B). Acyclic means there are no cycles (you cannot create a loop where task A depends on task B which depends on task A). Graph means a collection of nodes (tasks) connected by edges (dependencies). In Airflow, a DAG is a Python file that defines a set of tasks and the order they should run. The DAG itself does not do any work. It describes what work should happen and in what order. Think of it as a blueprint, not a factory.

How Airflow Uses DAGs

Airflow's scheduler continuously scans a folder (the dags_folder) and imports every Python file it finds. Each file can define one or more DAGs. When the scheduler imports a DAG file, it builds an internal representation of the tasks and dependencies. This happens every few seconds (controlled by min_file_process_interval). The scheduler then creates DagRuns based on the DAG's schedule (daily, hourly, cron expression, or dataset-triggered). Each DagRun creates TaskInstances for every task in the DAG. The scheduler puts ready TaskInstances into a queue, and the executor runs them.

DAG Runs and Task Instances

A DagRun is a single execution of a DAG. If your DAG runs daily, you get one DagRun per day. Each DagRun has a logical_date (formerly execution_date) that represents the start of the data interval it covers. A daily DAG with logical_date 2026-01-15 runs at the end of that interval (2026-01-16T00:00) and processes data from 2026-01-15. Each DagRun creates TaskInstances for every task in the DAG. A TaskInstance represents one specific task running in one specific DagRun. TaskInstances have states: queued, running, success, failed, skipped, up_for_retry.
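The relationship between the logical_date and the actual run time can be sketched in plain Python (a daily schedule is assumed):

```python
from datetime import datetime, timedelta

# For a daily DAG, the logical_date marks the START of the data interval.
logical_date = datetime(2026, 1, 15)
data_interval_start = logical_date
data_interval_end = logical_date + timedelta(days=1)

# The scheduler launches the run at the END of the interval it covers:
actual_run_time = data_interval_end  # 2026-01-16T00:00

# Queries templated with the logical_date therefore process the day
# BEFORE the wall-clock time at which the run starts.
```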

The Parse-Time Trap

This is the single most important concept for writing production Airflow DAGs. Code at the top level of a DAG file runs every time the scheduler parses the file. This happens every few seconds. If your top-level code makes a database call, an API request, or a slow import, the scheduler slows down for every single DAG it manages. The rule is simple: the top level of a DAG file should only define the DAG object and its tasks. All actual work happens inside task callables, which only run when the task executes, not when the file is parsed.
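A sketch of the trap and the fix; `query_warehouse` and the commented-out code are hypothetical, used only to illustrate the bad pattern:

```python
# BAD: module-level work runs on EVERY scheduler parse, every few seconds:
#
#   import pandas as pd                                  # heavy import at parse time
#   rows = query_warehouse("SELECT team FROM owners")    # network call at parse time
#
# GOOD: the top level only defines structure; work is deferred into the callable.

def transform(**context):
    # Heavy imports and I/O belong here: this body runs only when the
    # task executes, never when the scheduler parses the file.
    import json  # stand-in for a heavy library such as pandas

    payload = context.get("payload", '{"rows": 0}')
    return json.loads(payload)["rows"]
```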

DAG Structure: Components and Patterns

The building blocks of every Airflow DAG.

Operators

Operators define what a task does. PythonOperator runs a Python callable. BashOperator runs a shell command. SQL operators such as SQLExecuteQueryOperator run queries against a database. Each operator type handles a specific kind of work. There are also transfer operators (move data between systems), sensor operators (wait for conditions), and provider operators (interact with external services like AWS, GCP, or Slack). You rarely need to write custom operators: PythonOperator covers the large majority of use cases.

PythonOperator(task_id='transform', python_callable=transform_data, op_kwargs={'date': '{{ ds }}'})

Tasks and Task Dependencies

A task is an instance of an operator within a DAG. Tasks are connected by dependencies that define execution order. Use >> (bitshift operator) to set dependencies: extract >> transform >> load means extract runs first, then transform, then load. You can also fan out (extract >> [transform_a, transform_b]) or fan in ([transform_a, transform_b] >> load). Dependencies only control order, not data passing. Tasks do not automatically receive output from upstream tasks.

extract >> [transform_users, transform_events] >> load_warehouse

Sensors

Sensors are special operators that wait for an external condition to be true. FileSensor waits for a file to appear. ExternalTaskSensor waits for a task in another DAG to complete. HttpSensor waits for an HTTP endpoint to return a success status. Sensors have two modes: poke (checks repeatedly, holding a worker slot) and reschedule (releases the worker between checks). Use reschedule mode for long waits to avoid blocking worker slots. A sensor that pokes every 30 seconds for 6 hours ties up a worker slot the entire time.

FileSensor(task_id='wait_for_file', filepath='/data/input/{{ ds }}.csv', mode='reschedule', poke_interval=300)

XComs (Cross-Communications)

XComs let tasks pass small pieces of data to downstream tasks via the Airflow metastore (database). A task pushes a value with xcom_push, and a downstream task pulls it with xcom_pull. The TaskFlow API (@task decorator) handles XCom automatically: the return value of a decorated function is pushed, and calling the function in another task pulls it. XComs are not for large data. They store values in the metastore database. Passing a DataFrame through XCom is an anti-pattern. Instead, write data to object storage (S3, GCS) and pass the file path as an XCom.

ti.xcom_push(key='row_count', value=42) / ti.xcom_pull(task_ids='extract', key='row_count')

Trigger Rules

The default trigger rule is all_success: a task runs only when all upstream tasks succeed. This is correct for most cases, but sometimes you need different behavior. all_failed: run only if all upstreams failed (useful for error notification tasks). one_success: run if any upstream succeeded (useful for branching). none_failed: run if no upstream failed (allows skipped upstreams). all_done: run once every upstream has finished, regardless of outcome (useful for cleanup tasks). always: run without waiting on upstream state at all. Understanding trigger rules is essential for building DAGs with error handling and branching logic.

PythonOperator(task_id='cleanup', python_callable=cleanup, trigger_rule='all_done')

DAG Best Practices

Rules that separate production-quality DAGs from tutorial-quality DAGs.

Idempotency

Every task should produce the same result regardless of how many times it runs. This is critical because Airflow retries failed tasks and engineers manually re-run or backfill them. If your task inserts rows into a database, a retry creates duplicates. Instead, use MERGE/upsert, or DELETE + INSERT within a transaction, or write to a date-partitioned path and overwrite the partition. Idempotent tasks are safe to retry, safe to backfill, and safe to re-run when debugging.
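The DELETE + INSERT pattern can be sketched with SQLite standing in for the warehouse; the table and column names are illustrative:

```python
import sqlite3


def load_partition(conn, ds, user_ids):
    """Idempotent load: DELETE + INSERT for one date partition in one transaction.

    Re-running for the same ds replaces the partition instead of duplicating it.
    """
    with conn:  # sqlite3 context manager: commit on success, rollback on error
        conn.execute("DELETE FROM events WHERE ds = ?", (ds,))
        conn.executemany(
            "INSERT INTO events (ds, user_id) VALUES (?, ?)",
            [(ds, uid) for uid in user_ids],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ds TEXT, user_id INTEGER)")

load_partition(conn, "2026-01-15", [1, 2, 3])
load_partition(conn, "2026-01-15", [1, 2, 3])  # a retry: no duplicates

count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
# count is 3, not 6
```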

Retries and Retry Delays

Every production task should have retries configured. Transient failures (network timeouts, API rate limits, temporary database locks) are normal. A task with retries=3 and retry_delay=timedelta(minutes=5) handles most transient failures automatically without human intervention. For tasks that call external APIs, use exponential backoff: retry_delay doubles with each attempt. Set retry_exponential_backoff=True and max_retry_delay to cap it.
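A typical retry policy, expressed as a plain default_args dict to pass to the DAG constructor; the values are illustrative:

```python
from datetime import timedelta

# Shared retry policy for all tasks in a DAG; pass as default_args=RETRY_DEFAULTS.
RETRY_DEFAULTS = {
    "retries": 3,                              # up to three retries per task
    "retry_delay": timedelta(minutes=5),       # delay before the first retry
    "retry_exponential_backoff": True,         # roughly 5m, 10m, 20m, ...
    "max_retry_delay": timedelta(minutes=30),  # cap on the growing delay
}
```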

Keep DAG Files Fast and Side-Effect-Free

The scheduler parses DAG files every few seconds. If your DAG file makes API calls, reads from a database, or imports heavy libraries at the top level, you slow down the scheduler for every DAG. Move all work into task callables. Use lazy imports (import inside the function, not at the top of the file) for heavy libraries like pandas or boto3. A DAG file should define tasks and dependencies and nothing else.

Use TaskFlow API for New DAGs

The TaskFlow API (@task decorator, introduced in Airflow 2.0) simplifies DAG authoring. Instead of instantiating PythonOperator and passing callables, you decorate Python functions with @task. Return values automatically become XComs. Dependencies are inferred from function calls. The result is cleaner, more Pythonic code that is easier to read and maintain. Use TaskFlow for all new DAGs unless you have a specific reason to use the classic operator style.

Pools for Resource Management

Pools limit concurrent task instances across all DAGs. If your data warehouse allows 10 concurrent connections, create a pool with 10 slots and assign all warehouse-writing tasks to that pool. This prevents 50 tasks from trying to write simultaneously and crashing the database. Pools are global: they apply across all DAGs and all DagRuns. Use priority_weight to control which tasks get pool slots first.

Avoid Overly Complex DAGs

A DAG with 200 tasks and complex branching logic is hard to debug, slow to parse, and difficult for anyone else on the team to understand. If your DAG is growing beyond 30 to 40 tasks, consider breaking it into multiple DAGs connected by Datasets or TriggerDagRunOperator. Smaller, focused DAGs are easier to test, faster to troubleshoot, and more maintainable over time.

4 Airflow DAG Interview Questions

Questions that test your understanding of DAGs beyond the basics.


Explain the difference between logical_date (execution_date) and the actual time a DAG runs. Why does this distinction matter?

The logical_date marks the start of the data interval, not when the DAG runs. A daily DAG scheduled for 2026-01-15 actually runs on 2026-01-16 at midnight (at the end of the interval). This matters because SQL queries parameterized with the logical_date process the correct day's data. Getting this wrong is the most common source of off-by-one-day bugs in Airflow pipelines. Use data_interval_start and data_interval_end in templates for clarity. The interviewer wants to see you have hit this in production and understand the implications.


Your DAG runs daily, but yesterday's run failed and today's run is queued. Walk through what happens and how you fix it.

By default (depends_on_past=False), today's run executes regardless of yesterday's failure. If depends_on_past=True on any task, that task waits for its previous DagRun to succeed. To fix the failure: check the logs for the failed task, identify the root cause, fix the issue, and clear the failed task instances (which re-queues them). The interviewer checks whether you know the difference between clearing a task (re-runs it) and marking it as success (skips it), and whether you would clear just the failed task or the entire DagRun.


How would you design a DAG that handles failures gracefully, including sending an alert and running a cleanup task?

Set retries and retry_delay on each task for transient failures. Use on_failure_callback at the DAG level to send alerts (Slack, PagerDuty, email) when any task fails after exhausting retries. Add a cleanup task with trigger_rule='all_done' that runs regardless of upstream success or failure. For critical tasks, use a downstream sensor that verifies the expected output exists. The interviewer looks for layered error handling: retries for transient issues, alerts for persistent failures, and cleanup for any outcome.


You need to process data from 3 different sources, each arriving at different times, and then join them in a final transformation. How do you design this DAG?

Three extraction tasks (one per source) that run in parallel. Each extraction task writes its output to a staging location (S3 path or staging table). A sensor or trigger rule ensures the join task only runs when all three extractions succeed. The join task reads from the staging locations and produces the final output. Use all_success trigger rule on the join task. Discuss what happens when one source is delayed: use a sensor with a timeout and a timeout callback that alerts the team. The interviewer tests whether you think about real-world complexity: sources that arrive late, sources that arrive with different schemas, and sources that arrive with overlapping data.

Airflow DAG FAQ

What is the difference between a DAG and a pipeline?
A DAG is a specific data structure: a directed acyclic graph that defines tasks and their dependencies. A pipeline is a broader concept: a series of steps that move data from source to destination. In Airflow, a DAG is how you define a pipeline. But pipelines can exist without Airflow: a cron job that runs a shell script is a pipeline without a DAG. When someone says 'data pipeline,' they usually mean the full system. When they say 'DAG,' they usually mean the Airflow definition.
How many tasks should a single DAG have?
There is no hard limit, but practical guidelines help. A DAG with 5 to 30 tasks is typical and manageable. Beyond 40 to 50 tasks, parsing time increases, the Airflow UI becomes cluttered, and debugging failures gets harder. If your DAG is growing large, consider splitting it into multiple DAGs connected by Datasets or TriggerDagRunOperator. Each DAG should represent a coherent unit of work with a clear purpose.
Should I use the TaskFlow API or the classic operator style?
Use TaskFlow for new DAGs. It produces cleaner code, handles XCom automatically, and infers dependencies from function calls. Use the classic operator style when you need operators that TaskFlow does not wrap well (some provider operators, sensors with specific configurations) or when you are maintaining existing DAGs that use the classic style. Mixing both in the same DAG works fine.
How do I test Airflow DAGs?
Three levels. First: import the DAG file in a test and verify it parses without errors, has the expected task count, and has correct dependencies. Second: run individual tasks with airflow tasks test to verify they produce expected output. Third: trigger the full DAG in a staging environment with test data. For CI/CD, the first level catches most problems (import errors, missing variables, dependency cycles). DAG.test() in Airflow 2.5+ simplifies end-to-end testing without needing a running scheduler.

Orchestrate the Pipeline. Pass the Interview.

Pipeline architecture questions reward candidates who can diagram a DAG on a whiteboard in thirty seconds. Build that reflex here.
