Orchestration and Dependencies: Intermediate

A logistics platform at growth stage ran twenty-eight DAGs across three teams. Each DAG individually was clean. The combined system was a mess. The shipments DAG ran hourly, the inventory DAG ran every six hours, and the customer-facing analytics DAG ran daily. The three needed to share a single fact table, and that table was being recomputed three different ways at three different cadences. Some mornings the analytics DAG ran before the inventory DAG and produced numbers that disagreed with the inventory team's dashboard. The fix was not a bigger DAG. The fix was a more careful vocabulary: the difference between a task, a DAG, and a run, and the difference between a sensor that polls and an asset trigger that fires. This lesson is about the second-order vocabulary that turns a working DAG into a system of DAGs that cooperate.

Schedules: Cron, Interval, Event

Daily Life
Interviews

Choose between cron, interval, and event-triggered schedules based on the data arrival pattern and consumer freshness expectation.

Every DAG has a schedule. The schedule decides when a run starts. Three forms cover almost every production case: cron expressions for repeated wall-clock times, intervals for relative cadences (every fifteen minutes, every hour), and event triggers for runs that start when something external happens. A modern orchestrator supports all three, and the right choice depends on the source's behavior, not on engineer preference.

Cron Expressions

A cron expression is five fields that name a recurring time. The fields are minute, hour, day of month, month, and day of week. The expression `0 2 * * *` means 'minute zero of hour two of every day of every month, regardless of weekday', which is 2am every night. Cron expressions are dense and easy to misread; production teams paste them through validators and add comments next to every one.
Cron ExpressionMeaningTypical Use
0 2 * * *2:00 AM every dayDaily ETL run after most app traffic has passed
*/15 * * * *Every 15 minutes, top of the quarter hourNear-real-time pull from a rate-limited API
0 */6 * * *Every 6 hours starting at midnightRefresh of slow-changing reference data
0 9 * * 19:00 AM every MondayWeekly executive report generation
0 0 1 * *Midnight on the first day of every monthMonthly close, billing reconciliation

Interval Schedules

An interval schedule says 'run every N units' without anchoring to a wall-clock time. The first run happens when the DAG is deployed; subsequent runs happen every N units after the previous one. Intervals are the right schedule when the absolute time of a run does not matter, only the gap between runs. A pipeline that polls a Kafka offset every five minutes does not care whether the run starts at 2:00 or 2:03. The interval matters; the wall-clock alignment does not.

Event Triggers

Event triggers start a DAG when something external happens. The external thing can be a file landing in S3, a message arriving in a queue, an upstream DAG completing, a webhook firing from a third-party API. The schedule is conceptually 'when X occurs', not 'every Y minutes'. Event triggers are the most precise schedule type: the work runs exactly when there is work to do. They also introduce coupling to the external system's reliability, which is why they are usually paired with a fallback interval schedule that catches missed events.
1# Three schedule types in Airflow
2
3# Cron-based
4with DAG('daily_etl', schedule='0 2 * * *', ...):
5 pass
6
7# Interval-based (every 15 minutes)
8with DAG('frequent_poll', schedule=timedelta(minutes=15), ...):
9 pass
10
11# Event-based (Dataset trigger fires when an upstream asset updates)
12orders_dataset = Dataset('snowflake://prod/raw.orders')
13with DAG('downstream_mart', schedule=[orders_dataset], ...):
14 pass

Picking the Right Schedule

Cron Wins When
  • The work must happen at a specific wall-clock time
  • Downstream consumers expect data to be available at a fixed time
  • Compliance or business rules require alignment with a calendar moment
  • Coordination with other systems uses the same calendar (close at month end)
Interval or Event Wins When
  • The work depends on data arrival, not on the clock
  • The source is event-driven (Kafka, webhooks, S3 notifications)
  • The cadence matters more than the alignment
  • Wall-clock alignment would force unnecessary delays or rushes

Schedule Drift and the Catchup Setting

What happens when the orchestrator was offline for six hours and the schedule said the DAG should have run six times? Two answers exist, and the one chosen has consequences. Catchup mode runs every missed schedule sequentially when the orchestrator comes back. No-catchup mode runs only the most recent missed schedule and skips the rest. Catchup is correct when every run produces a different output that downstream consumers need (a daily partition that must exist for every day). No-catchup is correct when the runs are idempotent reflections of current state and only the latest matters. Picking the wrong one causes either gaps in historical data or a thundering herd of catch-up runs that overwhelm the warehouse.
Schedule decisions to make explicit at deploy time:
  • Catchup or no-catchup: do missed runs matter?
  • Timezone: most orchestrators default to UTC; production teams often align with business hours
  • Maximum concurrent runs: a slow run should not stack two on top of each other
  • Backfill window: how far back can the schedule be re-run on demand?
CronIntervalEvent
Cron
Wall-clock alignment
Five-field expression naming a recurring time. Right when the consumer expects data at a fixed business moment (6am, end of month, Monday morning).
Interval
Relative cadence
Run every N units. Right when the gap between runs matters more than alignment with a clock. Common for polling sources and micro-batch consumers.
Event
Arrival-driven
Fire when an external condition becomes true: file lands, asset updates, webhook fires. Most precise schedule type; couples reliability to the external system.
TIP
When picking a schedule, name the consumer first. The freshness expectation of the consumer is what should decide the cadence, not the convenience of the producer.
check
Three schedule forms cover almost every production case: cron, interval, event-trigger.
alert
Catchup and no-catchup are not interchangeable; each has a correct case and a wrong one.
query
Event triggers are the most precise schedule type but also the one most coupled to external reliability.

Task vs DAG vs Run for Retries

Daily Life
Interviews

Distinguish task, DAG, run, and task instance, and apply the distinction to retry decisions.

Three words appear in every conversation about orchestration: task, DAG, and run. New engineers use the three interchangeably, and most of the time the imprecision does not bite. It bites hard when retry semantics are at stake, because the orchestrator retries at exactly one of those three levels and the answer matters. The vocabulary below is precise on purpose.

The Three Words

TermDefinitionExample
TaskA unit of work declared in the DAG (extract_orders)One Python operator with retry policy 3
DAGThe full graph of tasks and dependencies, declared oncedaily_orders_by_region, with extract -> clean -> aggregate
RunOne execution of the DAG triggered by the scheduleThe 2026-04-25 run of daily_orders_by_region
Task instanceOne task within one runextract_orders during the 2026-04-25 run

Retries Happen at the Task Instance Level

When an orchestrator retries a failure, it retries a task instance. It does not retry the task in the abstract, because the task is a definition, not a running thing. It does not retry the DAG, because the DAG is the structure of work, not a running thing either. It retries this task in this run on this date with these inputs. That precision is what makes idempotency possible: the retry has the same inputs as the failure, so a correct task produces the same outputs.
1# Retry semantics tied to the task instance, not the DAG or the task definition
2# When extract_orders fails on the 2026-04-25 run:
3# - The task instance (extract_orders, 2026-04-25) is retried
4# - retries=3 means three more attempts of THIS task on THIS date
5# - Other dates' runs of extract_orders are unaffected
6# - The DAG itself is not 'retried'; only the failed task instance is
7
8extract = PythonOperator(
9 task_id='extract_orders',
10 python_callable=extract_orders_for_date,
11 op_kwargs={'run_date': '{{ ds }}'}, # the run's logical date
12 retries=3,
13 retry_delay=timedelta(minutes=2),
14)

Why Run-Level Retries Are a Mistake

A DAG run that fails halfway has produced partial output. The first instinct of an inexperienced operator is to mark the whole run as failed and re-run it. That instinct is sometimes wrong. If the first three of five tasks succeeded, re-running the entire run repeats those three successes, which costs compute and can produce unexpected results if any of those tasks is not perfectly idempotent. The orchestrator's retry mechanism, by default, only retries the failed task and resumes from there. Marking the whole run for retry overrides that default and is occasionally the right call, but it should be a deliberate choice, not a reflex.
Task-Instance Retry (Default)
  • Only the failed task is rerun
  • Successful tasks keep their state
  • Cheap: only re-pays for the failed work
  • Safe even when upstream tasks are not perfectly idempotent
DAG-Run Retry (Manual)
  • Every task in the run is rerun
  • Successful tasks lose their state and rerun
  • Expensive: re-pays for everything
  • Required only when upstream task outputs are corrupted

Logical Date vs Execution Date

A run carries a logical date (the date the run represents) and an execution date (the date the run actually happened). For a daily DAG that runs at 2am to process the previous day's data, the run on 2026-04-26 at 2am represents 2026-04-25. The logical date is 2026-04-25; the execution date is 2026-04-26. Tasks should reference the logical date, not the execution date, because the logical date is what the data partition is keyed on. Confusing the two is the second most common bug after cycles in DAGs, and it always shows up first in backfills.
When task, DAG, and run get conflated, the bugs that follow:
  • An operator marks the whole run as failed when only one task failed, wasting compute
  • Retry policy is set on the DAG when it should be per-task, applying wrong policy to wrong work
  • A backfill is mistaken for a fresh run, so logical date is wrong and partitions are misaligned
  • Schema changes deployed to a task definition are confused with changes to a single task instance
1# Show the difference between task instances of the same task across runs
2# Each call represents one task instance: a task on a specific run date
3
4class TaskInstance:
5 def __init__(self, task_id, run_date, max_retries=3):
6 self.task_id = task_id
7 self.run_date = run_date
8 self.max_retries = max_retries
9 self.attempts = 0
10 self.status = 'pending'
11
12 def attempt(self, will_fail):
13 self.attempts += 1
14 if will_fail and self.attempts <= self.max_retries:
15 self.status = 'failed_will_retry'
16 print(f' attempt {self.attempts} of {self.task_id} for {self.run_date}: failed, retry')
17 elif will_fail:
18 self.status = 'failed_terminal'
19 print(f' attempt {self.attempts} of {self.task_id} for {self.run_date}: failed, no retries left')
20 else:
21 self.status = 'success'
22 print(f' attempt {self.attempts} of {self.task_id} for {self.run_date}: success')
23
24# Same task, different run dates: each is a separate task instance with its own retry budget
25ti_apr24 = TaskInstance('extract_orders', '2026-04-24')
26ti_apr25 = TaskInstance('extract_orders', '2026-04-25')
27
28print('Apr 24 run:')
29ti_apr24.attempt(will_fail=False)
30print('Apr 25 run:')
31ti_apr25.attempt(will_fail=True)
32ti_apr25.attempt(will_fail=True)
33ti_apr25.attempt(will_fail=False)
34
35print(f'\nApr 24 final status: {ti_apr24.status}')
36print(f'Apr 25 final status: {ti_apr25.status}')
37print('Note: Apr 24 retry budget is independent of Apr 25 retry budget.')
38
TaskDAGRunTask Instance
Task
The definition of work
Declared once in the DAG file. Names the operation, its retry policy, and its inputs. Lives in version control.
DAG
The structure of work
The directed acyclic graph of tasks and edges. Versioned together with the task definitions. The unit of deployment.
Run
An execution of the DAG
Triggered by the schedule. Carries a logical date. The thing that succeeds, fails, or gets backfilled.
Task Instance
One task in one run
The smallest addressable unit. Retries operate here. Logs are stored here. State is tracked here.
Do
  • Set retry policy at the task level, not the DAG level
  • Reference the logical date inside tasks; it is the date the data partition is keyed on
  • Re-run only the failed task instance when possible; preserve successful state
Don't
  • Re-run the whole DAG when a single task failed and other tasks finished cleanly
  • Use execution date when logical date is what the data uses (backfills break)
  • Apply one retry policy to every task in the DAG; tighten or loosen per task as needed
check
Retries operate on task instances, the smallest addressable unit; the DAG and the run are not retried.
alert
Re-running an entire run repeats successful tasks and is rarely the right response to a single failure.
query
Logical date is what the data partition is keyed on; execution date is when the run actually fired.
extract
extract
transform
transform
load
load
Storage
warehouse

An orchestration DAG: tasks are nodes, dependencies are edges, and there are no cycles. The orchestrator runs each task only after its upstream finishes, retries failures, and backfills - things cron cannot do.

Cross-DAG Dependencies

Daily Life
Interviews

Choose between sensors, asset triggers, and time offsets for cross-DAG dependencies based on coupling and freshness needs.

Real production environments do not run one giant DAG. They run dozens of smaller DAGs, owned by different teams, on different cadences. Some of those DAGs depend on each other. The marketing analytics DAG reads tables produced by the orders DAG; the ML feature DAG reads tables produced by the events DAG. The dependency edge crosses a DAG boundary. Modeling that edge correctly is the difference between a system that scales across teams and one that breaks every time someone changes a schedule.

Three Ways to Express a Cross-DAG Edge

MechanismHow It WorksTrade-off
Time offsetSchedule the downstream DAG to run after the upstream is expected to finishBrittle: ignores the upstream's actual state, breaks when upstream runs slow
SensorA polling task in the downstream DAG waits for the upstream's run to succeedReliable but adds polling latency; ties downstream to upstream run state
Asset triggerDownstream DAG runs when the asset (table, file) the upstream produces is updatedDecouples DAGs cleanly; depends on the orchestrator supporting asset-aware scheduling

Why Time Offsets Fail

Scheduling the downstream DAG at 3am because the upstream usually finishes by 2:45am is the cross-DAG version of the cron chain failure from the beginner tier. The bug is identical: when the upstream runs slow, the downstream still starts on schedule and reads stale state. Cross-DAG time offsets are the most common architectural anti-pattern in mature production environments because they accumulate one DAG at a time and look harmless until the day they fail. The fix is the same as inside a single DAG: encode the dependency, not the time.

Sensors: The Old Way

A sensor is a task that waits for an external condition to become true. The downstream DAG starts on its own schedule, but its first task is a sensor that polls the upstream DAG's run state every minute. The sensor task succeeds when the upstream's most recent run for the same logical date succeeded. Until then, the sensor sits in a poking state. Sensors are the reliable but slightly slow way to model cross-DAG dependencies. Polling adds minutes of lag, and a poorly tuned sensor can hold a worker slot indefinitely. Modern sensors use a deferrable mode that releases the slot while waiting, but the polling lag remains.
1# Airflow ExternalTaskSensor: wait for upstream DAG's task to succeed for the same logical date
2from airflow.sensors.external_task import ExternalTaskSensor
3
4wait_for_orders = ExternalTaskSensor(
5 task_id='wait_for_orders_dag',
6 external_dag_id='daily_orders_by_region',
7 external_task_id='aggregate_orders',
8 execution_delta=timedelta(0), # same logical date
9 timeout=60 * 60 * 2, # give up after 2 hours
10 poke_interval=60, # check every minute
11 mode='reschedule', # release the worker slot between pokes
12)

Asset Triggers: The Modern Way

Asset-based scheduling reframes the dependency. The downstream DAG does not depend on the upstream DAG running. It depends on the data being fresh. When the upstream produces (or refreshes) the asset (a Snowflake table, an S3 prefix, a feature in a feature store), the orchestrator marks the asset as updated. Any DAG that schedules on that asset triggers automatically. Airflow ships Datasets for this; Dagster builds the entire orchestration model around software-defined assets; Prefect supports it through deployments and triggers. The shift is from 'wait for the producer' to 'trigger when the data is fresh', and it decouples the producer and consumer cleanly.
1# Asset-based cross-DAG trigger (Airflow Datasets)
2from airflow.datasets import Dataset
3
4orders_table = Dataset('snowflake://prod/mart/orders_by_region')
5
6# Upstream DAG declares it produces the dataset
7with DAG('daily_orders_by_region', schedule='0 2 * * *', ...) as dag:
8 aggregate = PythonOperator(
9 task_id='aggregate_orders',
10 python_callable=aggregate_to_region,
11 outlets=[orders_table], # this task updates the dataset
12 )
13
14# Downstream DAG schedules ON the dataset, not on a clock
15with DAG('marketing_attribution', schedule=[orders_table], ...) as dag:
16 attribution = PythonOperator(task_id='compute_attribution', ...)

The Freshness Contract

Cross-DAG dependencies require a contract between producer and consumer about freshness. The producer commits to delivering the asset by a stated time (or with a stated cadence). The consumer commits to relying only on the contract, not on the producer's internal scheduling. The contract is what allows the producer to refactor (split a DAG, change cadence, switch tools) without coordinating with every downstream consumer. Without the contract, every change to a producer DAG is a breaking change to N downstream DAGs, and refactoring becomes a months-long project.
Time-Offset Cross-DAG
  • Downstream starts at a fixed wall-clock time
  • Slow upstream produces stale downstream
  • Schedule changes upstream require schedule changes downstream
  • Cross-team coordination needed for any cadence shift
Asset-Triggered Cross-DAG
  • Downstream starts when the asset is fresh
  • Slow upstream delays downstream rather than corrupting it
  • Schedule changes upstream are transparent to downstream
  • The contract is the freshness guarantee, not the schedule
Signs that cross-DAG dependencies are hand-wired through time:
  • Downstream DAG schedule comments explain 'runs after the orders DAG, which usually finishes by 2:45'
  • Slack messages exist asking the orders team to delay a deploy because it would break attribution
  • Failed runs always blame 'the downstream ran before the upstream finished'
  • A schedule change in one DAG triggers a chain of schedule changes in others

When Sensors Are Still the Right Answer

Asset triggers are the modern default, but sensors still earn their keep in two cases. First, when the producer is outside the orchestrator (a third-party SaaS that drops a file in S3), an asset trigger may not be wired up; a sensor that polls for the file is the only option. Second, when the producer DAG is owned by another company or another orchestrator, sensors bridge the gap. Sensors are not legacy; they are the right tool for the boundary between the orchestrator and the world outside it.
TIP
When designing a new cross-DAG edge, default to an asset trigger. Reach for a sensor only when the producer is outside the orchestrator's control. Never reach for a time offset; the failure mode is too well-known.

Sensors and External Triggers

Daily Life
Interviews

Place sensors and external triggers correctly at the boundary of the orchestrator, with timeouts and the right execution mode.

A pipeline often has to wait for something outside its control. A vendor SFTP drops a file at an irregular time. A REST API publishes a daily endpoint that becomes available between 1am and 4am. A Kafka topic accumulates messages, and a downstream batch process should kick off when the offset crosses a threshold. Sensors are the orchestration primitive that turns 'wait for the world' into a task the DAG can schedule around. Knowing the families of sensors and their trade-offs is part of the intermediate vocabulary.

What a Sensor Is, Precisely

A sensor is a task that periodically checks an external condition and succeeds when the condition becomes true. The condition can be 'a file exists in this S3 prefix', 'this API endpoint returns 200', 'this Kafka offset is at least N', 'this upstream task succeeded'. The sensor blocks the DAG until its condition is met or its timeout fires. Most modern sensors use a deferrable execution model that releases the worker slot while waiting, so a sensor that pokes for two hours does not waste a slot for two hours.

Common Sensor Families

SensorWhat It Waits ForTypical Use
FileSensorA file exists at a path or prefixSFTP file drop from a partner, batch export from an external system
S3KeySensor / GCSObjectSensorAn object exists in cloud storageVendor data drop, batch export from another team's pipeline
HTTPSensorAn HTTP endpoint returns a success statusWait for an external API to publish a daily endpoint
ExternalTaskSensorAnother DAG's task succeeded for the same logical dateCross-DAG dependency without an asset trigger
PythonSensorArbitrary Python predicate returns TrueCustom condition (Kafka offset, queue depth, third-party SDK)

Poking vs Reschedule vs Deferrable

Sensors come in three execution modes. Poking holds a worker slot and re-evaluates on a fixed interval. Reschedule releases the slot between pokes and asks the scheduler to run the sensor again later. Deferrable hands off to an asynchronous trigger system that watches the condition and resumes the task when it fires. Deferrable is the modern default because it scales to thousands of concurrent waits without burning worker capacity. Poking is the legacy default and is acceptable for short waits with few concurrent sensors. Reschedule is the middle ground.
Poking Mode
  • Worker slot held continuously while waiting
  • Re-evaluation cost is small but accumulates
  • Hundreds of waiting sensors exhaust the worker pool
  • Simpler to debug; the slot is right there
Deferrable Mode
  • Worker slot released; trigger watches asynchronously
  • Scales to thousands of concurrent waits
  • Resumes on the trigger event with low latency
  • Requires a triggerer process running alongside the scheduler

External Triggers (Webhooks and Events)

An external trigger is the inverse of a sensor. Instead of polling for a condition, the orchestrator exposes an endpoint that an external system can call to start a DAG. Modern orchestrators ship with a REST API that accepts trigger requests, and webhook integrations route events from systems like Stripe, GitHub, or a SaaS app into DAG runs. External triggers are the right answer when the external system can call the orchestrator; sensors are the right answer when only the orchestrator can call the external system.
1# A Kafka-offset PythonSensor in deferrable mode
2from airflow.sensors.python import PythonSensor
3
4def kafka_offset_at_least(target_offset: int) -> bool:
5 from kafka import KafkaConsumer
6 consumer = KafkaConsumer(bootstrap_servers='kafka:9092', group_id='orchestrator')
7 partitions = consumer.partitions_for_topic('events')
8 end_offsets = consumer.end_offsets(
9 [TopicPartition('events', p) for p in partitions]
10 )
11 return min(end_offsets.values()) >= target_offset
12
13wait_for_offset = PythonSensor(
14 task_id='wait_for_minimum_offset',
15 python_callable=kafka_offset_at_least,
16 op_kwargs={'target_offset': 1_000_000},
17 poke_interval=30,
18 timeout=60 * 30,
19 mode='reschedule',
20)

Failure Modes Specific to Sensors

Sensor anti-patterns that show up in production:
  • Sensor with no timeout that pokes forever when the upstream is permanently broken
  • Hundreds of poking sensors burning the entire worker pool on a slow morning
  • Sensor that polls a third-party API and trips its rate limit
  • Sensor whose success condition is a side effect (file exists) without a content check

When to Skip the Sensor Entirely

Sometimes the right design is no sensor. If the external system can call the orchestrator's API or send a webhook, an external trigger replaces the sensor. If the upstream is another DAG in the same orchestrator, an asset trigger replaces the sensor. Sensors earn their place at the boundary between the orchestrator and the outside world; everywhere else, alternatives are usually cleaner.

Every sensor should have a timeout. A sensor without a timeout is a worker slot that may never come back.

1# Simulate a deferrable sensor: it does not hold a worker slot while waiting
2# Compare with a poking sensor, which does
3
4import time
5
6class PokingSensor:
7 def __init__(self, name):
8 self.name = name
9 self.slot_held_seconds = 0
10
11 def wait(self, condition_at_seconds):
12 for t in range(condition_at_seconds + 1):
13 self.slot_held_seconds += 1
14 # In real Airflow, a worker slot is held the entire time
15 return self.slot_held_seconds
16
17class DeferrableSensor:
18 def __init__(self, name):
19 self.name = name
20 self.slot_held_seconds = 0
21
22 def wait(self, condition_at_seconds):
23 # Trigger watches asynchronously; worker slot is released
24 # Slot is held only at start (1s) and at resume (1s)
25 self.slot_held_seconds = 2
26 return self.slot_held_seconds
27
28p = PokingSensor('poke_for_file')
29d = DeferrableSensor('defer_for_file')
30
31for sensor in (p, d):
32 sensor.wait(condition_at_seconds=600) # condition true after 10 minutes
33 print(f'{sensor.name}: held worker slot for {sensor.slot_held_seconds}s')
34
35print('\nPoking burns slots proportional to wait time. Deferrable scales.')
36
TIP
Use deferrable sensors when the orchestrator supports them, set a timeout that reflects the longest reasonable wait, and prefer external triggers when the upstream system can call the orchestrator.

Three Cadences, One Output

Daily Life
Interviews

Design a multi-cadence ingestion pattern with three upstream DAGs and one downstream DAG that joins on freshness rather than time offsets.

A real orchestration design rarely has one schedule. The example below builds a single downstream table that is fed by three sources, each on its own cadence. The shape is common in production: a daily executive table that combines streaming events, hourly Stripe data, and once-a-day Salesforce CRM. Designing this correctly requires every concept from the previous sections.

The Sources and Their Cadences

SourceCadenceWhy
Mobile events (Kafka)Continuous; micro-batched into S3 every 5 minutesVolume is too high for hourly; downstream wants sub-hour freshness
Stripe payments (REST API)Every 15 minutesRate limit allows it; finance wants near-real-time revenue
Salesforce CRM (REST API)Once daily at 1amCRM data changes slowly; daily is more than enough

The Downstream Table

The downstream is mart.daily_revenue_by_account, a single table that the executive dashboard reads. One row per account per day, with columns for total events, total revenue, and account-level metadata from CRM. The freshness bar is 'available by 6am Pacific each morning'. Three sources at three cadences must produce this one table on time, every day, without time-offset hand-wiring.

Three DAGs, Three Cadences

The first design instinct is to put everything in one DAG that runs daily. That instinct is wrong. The hourly Stripe pull and the five-minute Kafka micro-batch should not run on a daily cadence; they should run on their natural cadences and feed durable raw tables that the daily DAG reads. The right shape is three DAGs at three cadences, one daily DAG that joins the latest of each, and asset triggers that connect them.
1DAG : kafka_events_to_raw schedule : every 5 minutes micro_batch_consume -> land_to_s3 -> register_partition outlets : raw.events DAG : stripe_payments_to_raw schedule : every 15 minutes fetch_stripe_pages -> land_to_snowflake outlets : raw.payments DAG : salesforce_to_raw schedule : 0 1 * * * fetch_accounts_full -> land_to_snowflake outlets : raw.accounts DAG : daily_revenue_by_account schedule : 0 5 * * * read raw.events, raw.payments, raw.accounts
2
3
4
5JOIN + aggregate publish mart.daily_revenue_by_account

How the Cadences Cooperate

Each upstream DAG runs on its own schedule and produces an asset (raw.events, raw.payments, raw.accounts). The downstream DAG runs at 5am, after the latest natural runs of all three upstreams have completed for the previous day. It does not depend on the upstream DAGs running at any specific time; it depends on the data being fresh, which the upstream cadences guarantee. The 5am clock-based schedule is the right answer here precisely because the freshness bar (6am Pacific) is a wall-clock commitment, and the wall-clock anchor lives at the seam between the upstream cadences and the downstream commitment.

What Each Task Reads

1INSERT INTO mart.daily_revenue_by_account
2SELECT
3 a.account_id,
4 a.account_name,
5 a.region,
6 DATE(: run_date) AS revenue_date,
7 COUNT(DISTINCT e.event_id) AS event_count,
8 COALESCE(SUM(p.amount_cents) / 100.0, 0) AS revenue_usd
9FROM raw.accounts a
10LEFT JOIN raw.events e
11 ON e.account_id = a.account_id AND DATE(e.event_timestamp) = : run_date
12LEFT JOIN raw.payments p
13 ON p.account_id = a.account_id AND DATE(p.created_at) = : run_date
14WHERE a.is_active = TRUE
15GROUP BY 1, 2, 3, 4 ;

Failure Behavior at Each Cadence

DAGIf It FailsRecovery
kafka_events_to_raw (5 min)Next run picks up from the same Kafka offsetAt-least-once consumption with idempotent S3 writes
stripe_payments_to_raw (15 min)Next run extends the window to cover both batchesHigh-water mark only advances on success
salesforce_to_raw (daily)Alert at 1:30am; on-call investigatesRetries with backoff; manual rerun if all retries fail
daily_revenue_by_account (5am)Sensor or asset check on raw tables; runs when all three are freshAlert if not done by 5:45am to give 15 minutes before SLA
1/* Simulate the join the daily DAG performs at 5am */
2/* Three small CTEs stand in for the three raw tables */
3/* The query produces one row per account per day */
4WITH raw_events AS (
5 SELECT
6 'a1' AS account_id,
7 'evt_1' AS event_id,
8 DATE('2026-04-25') AS event_date
9
10 UNION ALL
11
12 SELECT
13 'a1',
14 'evt_2',
15 DATE('2026-04-25')
16
17 UNION ALL
18
19 SELECT
20 'a2',
21 'evt_3',
22 DATE('2026-04-25')
23
24 UNION ALL
25
26 SELECT
27 'a1',
28 'evt_4',
29 DATE('2026-04-25')
30),
31raw_payments AS (
32 SELECT
33 'a1' AS account_id,
34 1500 AS amount_cents,
35 DATE('2026-04-25') AS pay_date
36
37 UNION ALL
38
39 SELECT
40 'a2',
41 2200,
42 DATE('2026-04-25')
43),
44raw_accounts AS (
45 SELECT
46 'a1' AS account_id,
47 'Acme' AS account_name,
48 'US' AS region
49
50 UNION ALL
51
52 SELECT
53 'a2',
54 'Globex',
55 'US'
56
57 UNION ALL
58
59 SELECT
60 'a3',
61 'Initech',
62 'EU'
63)
64
65SELECT
66 a.account_id,
67 a.account_name,
68 a.region,
69 COUNT(DISTINCT e.event_id) AS event_count,
70 COALESCE(
71 SUM(p.amount_cents) / 100,
72 0
73 ) AS revenue_usd
74FROM raw_accounts AS a
75LEFT JOIN raw_events AS e
76 ON e.account_id = a.account_id
77LEFT JOIN raw_payments AS p
78 ON p.account_id = a.account_id
79GROUP BY 1, 2, 3
80ORDER BY 1
One Big Daily DAG (Wrong)
  • Kafka events micro-batch on a daily cadence (16 hours of staleness)
  • Stripe runs once a day; finance has stale revenue for hours
  • A single failure halts unrelated upstream work
  • Coordinating cadences requires DAG-internal scheduling tricks
Three Cadence DAGs + Daily Mart (Right)
  • Each source runs on its natural cadence
  • Stripe and events stay fresh continuously for other consumers too
  • Failures are isolated to the affected cadence
  • Daily mart is a small DAG that joins the latest, no exotic scheduling
Lessons from the worked example:
  • Cadences belong to sources; the downstream join consumes the latest of each
  • Asset triggers (or sensors) at the seam replace time-offset assumptions
  • Wall-clock SLAs anchor the downstream DAG's schedule, not the upstreams' schedules
  • Each upstream DAG's failure stays isolated; downstream waits or fails fast with a clear error
TIP
When sources have different natural cadences, do not force them into one DAG. Three small DAGs and one downstream join cooperate cleanly through asset triggers and beat any single mega-DAG.
PUTTING IT ALL TOGETHER

> A logistics company has six teams, each running their own DAGs. The shipments DAG runs hourly, the inventory DAG runs every six hours, the analytics DAG runs daily. The analytics DAG has been time-staggered to start after the shipments DAG 'usually' finishes. It worked for a year, but a Black Friday spike caused shipments to run for three hours and analytics produced wrong numbers. The CTO asks: 'How is this rebuilt so it never breaks again?'

First, name the schedule type for every DAG. Shipments and inventory are interval-based (their cadence is what matters). Analytics is event-driven (it cares about freshness of its inputs).
Replace the analytics DAG's time-offset start with an asset trigger that fires when the shipments and inventory raw tables are fresh for the run date. Combined with a 5am wall-clock fallback, the SLA is preserved even on slow upstream days.
Apply the task vs DAG vs run vocabulary precisely: when shipments fails for the 2026-04-25 run, only that task instance is retried, and the analytics DAG either waits or fails fast. The DAG itself is not rerun, and unaffected days are not touched.
Add a freshness contract between teams: shipments commits to having raw.shipments updated by 4am Pacific, inventory by 4:30am. The analytics team relies on the contract, not on the producer's schedule. Schedule changes upstream become transparent.
From Lesson 2, recognize that this architecture mixes batch (daily analytics, daily Salesforce) with near-streaming (Kafka events, 15-minute Stripe). Each cadence belongs in its own DAG so the freshness tier of each is preserved.
KEY TAKEAWAYS
Three schedule forms cover production: cron for wall-clock alignment, interval for relative cadence, event triggers for arrival-driven runs.
Task, DAG, and run are different things: retries operate on task instances, the DAG is a structure, and the run is one execution. Conflating them produces the wrong fix.
Cross-DAG dependencies belong on assets, not on time: asset triggers decouple producer and consumer through freshness; sensors fill the gap when the producer is outside the orchestrator.
Sensors should be deferrable and bounded: deferrable mode releases the worker slot, and a timeout prevents a slot from waiting forever.
Multi-cadence systems need multi-cadence DAGs: do not force three different freshness tiers into one schedule. Let each source run on its natural cadence and join on freshness in a small downstream DAG.

Schedules trigger DAGs, runs are the work, and dependencies cross every boundary

Category
Pipeline Architecture
Difficulty
intermediate
Duration
30 minutes
Challenges
0 hands-on challenges

Topics covered: Schedules: Cron, Interval, Event, Task vs DAG vs Run for Retries, Cross-DAG Dependencies, Sensors and External Triggers, Three Cadences, One Output

Lesson Sections

  1. Schedules: Cron, Interval, Event (concepts: paDagOrchestration)

    Every DAG has a schedule. The schedule decides when a run starts. Three forms cover almost every production case: cron expressions for repeated wall-clock times, intervals for relative cadences (every fifteen minutes, every hour), and event triggers for runs that start when something external happens. A modern orchestrator supports all three, and the right choice depends on the source's behavior, not on engineer preference. Cron Expressions A cron expression is five fields that name a recurring

  2. Task vs DAG vs Run for Retries (concepts: paRetryHandling)

    Three words appear in every conversation about orchestration: task, DAG, and run. New engineers use the three interchangeably, and most of the time the imprecision does not bite. It bites hard when retry semantics are at stake, because the orchestrator retries at exactly one of those three levels and the answer matters. The vocabulary below is precise on purpose. The Three Words Retries Happen at the Task Instance Level When an orchestrator retries a failure, it retries a task instance. It does

  3. Cross-DAG Dependencies (concepts: paDependencyMgmt)

    Real production environments do not run one giant DAG. They run dozens of smaller DAGs, owned by different teams, on different cadences. Some of those DAGs depend on each other. The marketing analytics DAG reads tables produced by the orders DAG; the ML feature DAG reads tables produced by the events DAG. The dependency edge crosses a DAG boundary. Modeling that edge correctly is the difference between a system that scales across teams and one that breaks every time someone changes a schedule. T

  4. Sensors and External Triggers (concepts: paDagOrchestration)

    A pipeline often has to wait for something outside its control. A vendor SFTP drops a file at an irregular time. A REST API publishes a daily endpoint that becomes available between 1am and 4am. A Kafka topic accumulates messages, and a downstream batch process should kick off when the offset crosses a threshold. Sensors are the orchestration primitive that turns 'wait for the world' into a task the DAG can schedule around. Knowing the families of sensors and their trade-offs is part of the inte

  5. Three Cadences, One Output (concepts: paDagOrchestration)

    A real orchestration design rarely has one schedule. The example below builds a single downstream table that is fed by three sources, each on its own cadence. The shape is common in production: a daily executive table that combines streaming events, hourly Stripe data, and once-a-day Salesforce CRM. Designing this correctly requires every concept from the previous sections. The Sources and Their Cadences The Downstream Table The downstream is mart.daily_revenue_by_account, a single table that th