A DAG is where your pipeline's topology meets the scheduler's clock. It sits at the orchestration layer: above the task workers, below the metadata database, beside the executor. Everything you build in Airflow, from a SparkSubmitOperator to a sensor waiting on S3, lives inside a DAG and inherits its contract about idempotency, ordering, and retries. This reference treats the DAG as a system-design artifact first and a Python file second.
[Stats panel: pipeline topics in DE rounds, pipeline challenges, L6 staff rounds, rounds analyzed. Source: DataDriven analysis of 1,042 verified data engineering interview rounds.]
Think of these as the interfaces between your DAG and the rest of the orchestration layer. Get them wrong and your pipeline fights the scheduler instead of using it.
DAG stands for Directed Acyclic Graph. Directed means each edge has a direction (task A runs before task B). Acyclic means there are no cycles (you cannot create a loop where task A depends on task B which depends on task A). Graph means a collection of nodes (tasks) connected by edges (dependencies). In Airflow, a DAG is a Python file that defines a set of tasks and the order they should run. The DAG itself does not do any work. It describes what work should happen and in what order. Think of it as a blueprint, not a factory.
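The directed-acyclic ordering can be sketched without Airflow at all. Python's stdlib graphlib resolves the same structure; the extract/transform/load graph below is a hypothetical example, not Airflow API:

```python
from graphlib import TopologicalSorter, CycleError

# Directed edges: each key lists the tasks it depends on.
graph = {
    "transform": {"extract"},   # transform depends on extract
    "load": {"transform"},      # load depends on transform
}
order = list(TopologicalSorter(graph).static_order())
print(order)  # extract comes before transform, which comes before load

# Adding a back-edge creates a cycle, which a DAG forbids:
graph["extract"] = {"load"}
try:
    list(TopologicalSorter(graph).static_order())
except CycleError:
    print("cycle detected -- no longer a valid DAG")
```

Airflow's scheduler performs the same kind of resolution: it only queues a task once everything upstream of it has finished.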
Airflow's scheduler continuously scans a folder (the dags_folder) and imports every Python file it finds. Each file can define one or more DAGs. When the scheduler imports a DAG file, it builds an internal representation of the tasks and dependencies. This happens every few seconds (controlled by min_file_process_interval). The scheduler then creates DagRuns based on the DAG's schedule (daily, hourly, cron expression, or dataset-triggered). Each DagRun creates TaskInstances for every task in the DAG. The scheduler puts ready TaskInstances into a queue, and the executor runs them.
A DagRun is a single execution of a DAG. If your DAG runs daily, you get one DagRun per day. Each DagRun has a logical_date (formerly execution_date) that represents the start of the data interval it covers. A daily DAG with logical_date 2026-01-15 runs at the end of that interval (2026-01-16T00:00) and processes data from 2026-01-15. Each DagRun creates TaskInstances for every task in the DAG. A TaskInstance represents one specific task running in one specific DagRun. TaskInstances have states: queued, running, success, failed, skipped, up_for_retry.
This is the single most important concept for writing production Airflow DAGs. Code at the top level of a DAG file runs every time the scheduler parses the file. This happens every few seconds. If your top-level code makes a database call, an API request, or a slow import, the scheduler slows down for every single DAG it manages. The rule is simple: the top level of a DAG file should only define the DAG object and its tasks. All actual work happens inside task callables, which only run when the task executes, not when the file is parsed.
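The parse-time versus execute-time split can be seen in a plain-Python sketch (no Airflow required; `parse_log` and `transform` are illustrative names). Module-level statements run on every scheduler parse; the body of a callable runs only when the task executes:

```python
parse_log = []

# -- top level: the scheduler re-runs this on every parse, every few seconds --
parse_log.append("parsed")          # cheap: fine at top level
# requests.get(...)                 # expensive: never do this at top level

def transform():
    # -- task callable: runs only when the TaskInstance actually executes --
    import json  # lazy import keeps heavy dependencies out of parse time
    return json.dumps({"rows": 42})
```

In a real DAG file the top level would contain only the DAG object, operator instantiations, and dependency declarations.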
The building blocks of every Airflow DAG.
Operators define what a task does. PythonOperator runs a Python callable. BashOperator runs a shell command. SQLExecuteQueryOperator (from the common SQL provider) runs a SQL query. Each operator type handles a specific kind of work. There are also transfer operators (move data between systems), sensor operators (wait for conditions), and provider operators (interact with external services like AWS, GCP, or Slack). You rarely need to write custom operators: PythonOperator covers roughly 80% of use cases.
PythonOperator(task_id='transform', python_callable=transform_data, op_kwargs={'date': '{{ ds }}'})

A task is an instance of an operator within a DAG. Tasks are connected by dependencies that define execution order. Use >> (the bitshift operator) to set dependencies: extract >> transform >> load means extract runs first, then transform, then load. You can also fan out (extract >> [transform_a, transform_b]) or fan in ([transform_a, transform_b] >> load). Dependencies only control order, not data passing. Tasks do not automatically receive output from upstream tasks.
extract >> [transform_users, transform_events] >> load_warehouse

Sensors are special operators that wait for an external condition to be true. FileSensor waits for a file to appear. ExternalTaskSensor waits for a task in another DAG to complete. HttpSensor waits for an HTTP endpoint to return a success status. Sensors have two modes: poke (checks repeatedly, holding a worker slot) and reschedule (releases the worker between checks). Use reschedule mode for long waits to avoid blocking worker slots. A sensor that pokes every 30 seconds for 6 hours ties up a worker slot the entire time.
FileSensor(task_id='wait_for_file', filepath='/data/input/{{ ds }}.csv', mode='reschedule', poke_interval=300)

XComs let tasks pass small pieces of data to downstream tasks via the Airflow metastore (database). A task pushes a value with xcom_push, and a downstream task pulls it with xcom_pull. The TaskFlow API (@task decorator) handles XCom automatically: the return value of a decorated function is pushed, and calling the function in another task pulls it. XComs are not for large data. They store values in the metastore database. Passing a DataFrame through XCom is an anti-pattern. Instead, write data to object storage (S3, GCS) and pass the file path as an XCom.
ti.xcom_push(key='row_count', value=42) / ti.xcom_pull(task_ids='extract', key='row_count')

The default trigger rule is all_success: a task runs only when all upstream tasks succeed. This is correct for most cases, but sometimes you need different behavior. all_failed: run only if all upstreams failed (useful for error notification tasks). one_success: run if any upstream succeeded (useful for branching). none_failed: run if no upstream failed (allows skipped upstreams). always: run regardless of upstream state (useful for cleanup tasks). Understanding trigger rules is essential for building DAGs with error handling and branching logic.
PythonOperator(task_id='cleanup', python_callable=cleanup, trigger_rule='all_done')

Rules that separate production-quality DAGs from tutorial-quality DAGs.
Every task should produce the same result regardless of how many times it runs. This is critical because Airflow retries failed tasks and operators can manually re-run tasks. If your task inserts rows into a database, a retry creates duplicates. Instead, use MERGE/upsert, or DELETE + INSERT within a transaction, or write to a date-partitioned path and overwrite the partition. Idempotent tasks are safe to retry, safe to backfill, and safe to re-run when debugging.
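The DELETE + INSERT pattern can be sketched with stdlib sqlite3 (the `sales` table and partition column are hypothetical): replacing the day's partition inside one transaction means a retry produces the same rows, not duplicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (ds TEXT, amount REAL)")

def load_partition(conn, ds, rows):
    # Idempotent load: replace the day's partition atomically,
    # so a retry or backfill never duplicates rows.
    with conn:  # transaction: commits on success, rolls back on error
        conn.execute("DELETE FROM sales WHERE ds = ?", (ds,))
        conn.executemany(
            "INSERT INTO sales (ds, amount) VALUES (?, ?)",
            [(ds, amt) for amt in rows],
        )

load_partition(conn, "2026-01-15", [10.0, 20.0])
load_partition(conn, "2026-01-15", [10.0, 20.0])  # simulated retry
count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(count)  # still 2, not 4
```

A plain INSERT without the DELETE would leave 4 rows after the retry, which is exactly the duplicate bug described above.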
Every production task should have retries configured. Transient failures (network timeouts, API rate limits, temporary database locks) are normal. A task with retries=3 and retry_delay=timedelta(minutes=5) handles most transient failures automatically without human intervention. For tasks that call external APIs, use exponential backoff: retry_delay doubles with each attempt. Set retry_exponential_backoff=True and max_retry_delay to cap it.
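The resulting delay schedule is easy to compute by hand. A sketch of the doubling-with-cap arithmetic (Airflow's actual backoff formula adds jitter and implementation details; the values here are illustrative):

```python
from datetime import timedelta

def backoff_delays(base, max_delay, retries):
    """Delay before each retry attempt: base * 2**n, capped at max_delay."""
    return [min(base * (2 ** n), max_delay) for n in range(retries)]

delays = backoff_delays(timedelta(minutes=5), timedelta(minutes=30), 4)
print([d.total_seconds() / 60 for d in delays])  # [5.0, 10.0, 20.0, 30.0]
```

The cap matters: without max_retry_delay, a handful of retries can push the next attempt hours into the future.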
The scheduler parses DAG files every few seconds. If your DAG file makes API calls, reads from a database, or imports heavy libraries at the top level, you slow down the scheduler for every DAG. Move all work into task callables. Use lazy imports (import inside the function, not at the top of the file) for heavy libraries like pandas or boto3. A DAG file should define tasks and dependencies and nothing else.
The TaskFlow API (@task decorator, introduced in Airflow 2.0) simplifies DAG authoring. Instead of instantiating PythonOperator and passing callables, you decorate Python functions with @task. Return values automatically become XComs. Dependencies are inferred from function calls. The result is cleaner, more Pythonic code that is easier to read and maintain. Use TaskFlow for all new DAGs unless you have a specific reason to use the classic operator style.
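A real TaskFlow DAG needs Airflow installed; the data-flow idea can be mocked in plain Python (a conceptual sketch, not the real @task API). Return values travel between functions the way XComs travel between decorated tasks, and the call graph implies the dependencies:

```python
# In real TaskFlow code each function would carry @task, and the calls
# below would build extract >> transform >> load automatically.
def extract():
    return [{"id": 1}, {"id": 2}]

def transform(rows):
    return [{"id": r["id"], "ok": True} for r in rows]

def load(rows):
    return len(rows)

# Calling the functions wires the dependency chain and passes the "XComs".
loaded = load(transform(extract()))
print(loaded)  # 2
```

With the classic operator style, the same pipeline needs explicit xcom_pull calls and explicit >> dependencies; TaskFlow infers both from ordinary function calls.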
Pools limit concurrent task instances across all DAGs. If your data warehouse allows 10 concurrent connections, create a pool with 10 slots and assign all warehouse-writing tasks to that pool. This prevents 50 tasks from trying to write simultaneously and crashing the database. Pools are global: they apply across all DAGs and all DagRuns. Use priority_weight to control which tasks get pool slots first.
A DAG with 200 tasks and complex branching logic is hard to debug, slow to parse, and difficult for anyone else on the team to understand. If your DAG is growing beyond 30 to 40 tasks, consider breaking it into multiple DAGs connected by Datasets or TriggerDagRunOperator. Smaller, focused DAGs are easier to test, faster to troubleshoot, and more maintainable over time.
Questions that test your understanding of DAGs beyond the basics.
The logical_date marks the start of the data interval, not when the DAG runs. A daily DAG scheduled for 2026-01-15 actually runs on 2026-01-16 at midnight (at the end of the interval). This matters because SQL queries parameterized with the logical_date process the correct day's data. Getting this wrong is the most common source of off-by-one-day bugs in Airflow pipelines. Use data_interval_start and data_interval_end in templates for clarity. The interviewer wants to see you have hit this in production and understand the implications.
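The off-by-one relationship is plain interval arithmetic; a stdlib sketch for a daily schedule:

```python
from datetime import datetime, timedelta

logical_date = datetime(2026, 1, 15)           # == data_interval_start
data_interval_end = logical_date + timedelta(days=1)

# The run fires at the END of the interval it covers:
run_after = data_interval_end                  # 2026-01-16T00:00
print(logical_date.date(), "->", run_after)
```

A query templated with the logical_date therefore selects 2026-01-15's data even though the wall clock says 2026-01-16.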
By default (depends_on_past=False), today's run executes regardless of yesterday's failure. If depends_on_past=True on any task, that task waits for its previous DagRun to succeed. To fix the failure: check the logs for the failed task, identify the root cause, fix the issue, and clear the failed task instances (which re-queues them). The interviewer checks whether you know the difference between clearing a task (re-runs it) and marking it as success (skips it), and whether you would clear just the failed task or the entire DagRun.
Set retries and retry_delay on each task for transient failures. Use on_failure_callback at the DAG level to send alerts (Slack, PagerDuty, email) when any task fails after exhausting retries. Add a cleanup task with trigger_rule='all_done' that runs regardless of upstream success or failure. For critical tasks, use a downstream sensor that verifies the expected output exists. The interviewer looks for layered error handling: retries for transient issues, alerts for persistent failures, and cleanup for any outcome.
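A failure callback receives the task context dict from Airflow; "task_instance" is a standard key on it. A minimal sketch, where `send_alert` is a placeholder for a real Slack/PagerDuty/email client and the context is simulated for illustration:

```python
from types import SimpleNamespace

sent = []

def send_alert(message):
    # Placeholder for a Slack/PagerDuty/email client call.
    sent.append(message)

def notify_failure(context):
    # Airflow passes a context dict; "task_instance" is a standard key.
    ti = context["task_instance"]
    msg = f"Task {ti.task_id} in DAG {ti.dag_id} failed on try {ti.try_number}"
    send_alert(msg)
    return msg

# Wiring (in a real DAG): DAG(..., on_failure_callback=notify_failure)
# Simulated context object for illustration:
ctx = {"task_instance": SimpleNamespace(task_id="load", dag_id="etl", try_number=3)}
notify_failure(ctx)
print(sent[0])
```

Setting the callback at the DAG level (rather than per task) gives one alerting path for every task in the pipeline.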
Three extraction tasks (one per source) that run in parallel. Each extraction task writes its output to a staging location (S3 path or staging table). A sensor or trigger rule ensures the join task only runs when all three extractions succeed. The join task reads from the staging locations and produces the final output. Use all_success trigger rule on the join task. Discuss what happens when one source is delayed: use a sensor with a timeout and a timeout callback that alerts the team. The interviewer tests whether you think about real-world complexity: sources that arrive late, sources that arrive with different schemas, and sources that arrive with overlapping data.
Pipeline architecture questions reward candidates who can diagram a DAG on a whiteboard in thirty seconds. Build that reflex here.