Orchestration and Dependencies: Intermediate

A logistics platform at growth stage ran twenty-eight DAGs across three teams. Each DAG individually was clean. The combined system was a mess. The shipments DAG ran hourly, the inventory DAG ran every six hours, and the customer-facing analytics DAG ran daily. The three needed to share a single fact table, and that table was being recomputed three different ways at three different cadences. Some mornings the analytics DAG ran before the inventory DAG and produced numbers that disagreed with the inventory team's dashboard. The fix was not a bigger DAG. The fix was a more careful vocabulary: the difference between a task, a DAG, and a run, and the difference between a sensor that polls and an asset trigger that fires. This lesson is about the second-order vocabulary that turns a working DAG into a system of DAGs that cooperate.

What you will be able to do

Recognize the three schedule types (cron, interval, event-triggered) and when each fits

Distinguish task, DAG, and run as separate concepts and apply the distinction to retry semantics

Design cross-DAG dependencies using sensors, asset triggers, and freshness contracts

Schedules: Cron, Interval, Event

Daily Life

Interviews

Choose between cron, interval, and event-triggered schedules based on the data arrival pattern and consumer freshness expectation.

Every DAG has a schedule. The schedule decides when a run starts. Three forms cover almost every production case: cron expressions for repeated wall-clock times, intervals for relative cadences (every fifteen minutes, every hour), and event triggers for runs that start when something external happens. A modern orchestrator supports all three, and the right choice depends on the source's behavior, not on engineer preference.

Cron Expressions

A cron expression is five fields that name a recurring time. The fields are minute, hour, day of month, month, and day of week. The expression `0 2 * * *` means 'minute zero of hour two of every day of every month, regardless of weekday', which is 2am every night. Cron expressions are dense and easy to misread; production teams paste them through validators and add comments next to every one.

Cron Expression	Meaning	Typical Use
0 2 * * *	2:00 AM every day	Daily ETL run after most app traffic has passed
/15 * * *	Every 15 minutes, top of the quarter hour	Near-real-time pull from a rate-limited API
0 /6 * *	Every 6 hours starting at midnight	Refresh of slow-changing reference data
0 9 * * 1	9:00 AM every Monday	Weekly executive report generation
0 0 1 * *	Midnight on the first day of every month	Monthly close, billing reconciliation

Interval Schedules

An interval schedule says 'run every N units' without anchoring to a wall-clock time. The first run happens when the DAG is deployed; subsequent runs happen every N units after the previous one. Intervals are the right schedule when the absolute time of a run does not matter, only the gap between runs. A pipeline that polls a Kafka offset every five minutes does not care whether the run starts at 2:00 or 2:03. The interval matters; the wall-clock alignment does not.

Event Triggers

Event triggers start a DAG when something external happens. The external thing can be a file landing in S3, a message arriving in a queue, an upstream DAG completing, a webhook firing from a third-party API. The schedule is conceptually 'when X occurs', not 'every Y minutes'. Event triggers are the most precise schedule type: the work runs exactly when there is work to do. They also introduce coupling to the external system's reliability, which is why they are usually paired with a fallback interval schedule that catches missed events.

	# Three schedule types in Airflow

	# Cron-based
	with DAG('daily_etl', schedule='0 2 * * *', ...):
	pass

	# Interval-based (every 15 minutes)
	with DAG('frequent_poll', schedule=timedelta(minutes=15), ...):
	pass

	# Event-based (Dataset trigger fires when an upstream asset updates)
	orders_dataset = Dataset('snowflake://prod/raw.orders')
	with DAG('downstream_mart', schedule=[orders_dataset], ...):
	pass

Picking the Right Schedule

•Cron Wins When

The work must happen at a specific wall-clock time
Downstream consumers expect data to be available at a fixed time
Compliance or business rules require alignment with a calendar moment
Coordination with other systems uses the same calendar (close at month end)

✓Interval or Event Wins When

The work depends on data arrival, not on the clock
The source is event-driven (Kafka, webhooks, S3 notifications)
The cadence matters more than the alignment
Wall-clock alignment would force unnecessary delays or rushes

Schedule Drift and the Catchup Setting

What happens when the orchestrator was offline for six hours and the schedule said the DAG should have run six times? Two answers exist, and the one chosen has consequences. Catchup mode runs every missed schedule sequentially when the orchestrator comes back. No-catchup mode runs only the most recent missed schedule and skips the rest. Catchup is correct when every run produces a different output that downstream consumers need (a daily partition that must exist for every day). No-catchup is correct when the runs are idempotent reflections of current state and only the latest matters. Picking the wrong one causes either gaps in historical data or a thundering herd of catch-up runs that overwhelm the warehouse.

Schedule decisions to make explicit at deploy time:

▸Catchup or no-catchup: do missed runs matter?
▸Timezone: most orchestrators default to UTC; production teams often align with business hours
▸Maximum concurrent runs: a slow run should not stack two on top of each other
▸Backfill window: how far back can the schedule be re-run on demand?

CronIntervalEvent

Cron

Wall-clock alignment

Five-field expression naming a recurring time. Right when the consumer expects data at a fixed business moment (6am, end of month, Monday morning).

Interval

Relative cadence

Run every N units. Right when the gap between runs matters more than alignment with a clock. Common for polling sources and micro-batch consumers.

Event

Arrival-driven

Fire when an external condition becomes true: file lands, asset updates, webhook fires. Most precise schedule type; couples reliability to the external system.

TIP

When picking a schedule, name the consumer first. The freshness expectation of the consumer is what should decide the cadence, not the convenience of the producer.

Three schedule forms cover almost every production case: cron, interval, event-trigger.

Catchup and no-catchup are not interchangeable; each has a correct case and a wrong one.

Event triggers are the most precise schedule type but also the one most coupled to external reliability.

Task vs DAG vs Run for Retries

Daily Life

Interviews

Distinguish task, DAG, run, and task instance, and apply the distinction to retry decisions.

Three words appear in every conversation about orchestration: task, DAG, and run. New engineers use the three interchangeably, and most of the time the imprecision does not bite. It bites hard when retry semantics are at stake, because the orchestrator retries at exactly one of those three levels and the answer matters. The vocabulary below is precise on purpose.

The Three Words

Term	Definition	Example
Task	A unit of work declared in the DAG (extract_orders)	One Python operator with retry policy 3
DAG	The full graph of tasks and dependencies, declared once	daily_orders_by_region, with extract -> clean -> aggregate
Run	One execution of the DAG triggered by the schedule	The 2026-04-25 run of daily_orders_by_region
Task instance	One task within one run	extract_orders during the 2026-04-25 run

Retries Happen at the Task Instance Level

When an orchestrator retries a failure, it retries a task instance. It does not retry the task in the abstract, because the task is a definition, not a running thing. It does not retry the DAG, because the DAG is the structure of work, not a running thing either. It retries this task in this run on this date with these inputs. That precision is what makes idempotency possible: the retry has the same inputs as the failure, so a correct task produces the same outputs.

	# Retry semantics tied to the task instance, not the DAG or the task definition
	# When extract_orders fails on the 2026-04-25 run:
	# - The task instance (extract_orders, 2026-04-25) is retried
	# - retries=3 means three more attempts of THIS task on THIS date
	# - Other dates' runs of extract_orders are unaffected
	# - The DAG itself is not 'retried'; only the failed task instance is

	extract = PythonOperator(
	task_id='extract_orders',
	python_callable=extract_orders_for_date,
	op_kwargs={'run_date': '{{ ds }}'}, # the run's logical date
	retries=3,
	retry_delay=timedelta(minutes=2),
	)

Why Run-Level Retries Are a Mistake

A DAG run that fails halfway has produced partial output. The first instinct of an inexperienced operator is to mark the whole run as failed and re-run it. That instinct is sometimes wrong. If the first three of five tasks succeeded, re-running the entire run repeats those three successes, which costs compute and can produce unexpected results if any of those tasks is not perfectly idempotent. The orchestrator's retry mechanism, by default, only retries the failed task and resumes from there. Marking the whole run for retry overrides that default and is occasionally the right call, but it should be a deliberate choice, not a reflex.

✓Task-Instance Retry (Default)

Only the failed task is rerun
Successful tasks keep their state
Cheap: only re-pays for the failed work
Safe even when upstream tasks are not perfectly idempotent

•DAG-Run Retry (Manual)

Every task in the run is rerun
Successful tasks lose their state and rerun
Expensive: re-pays for everything
Required only when upstream task outputs are corrupted

Logical Date vs Execution Date

A run carries a logical date (the date the run represents) and an execution date (the date the run actually happened). For a daily DAG that runs at 2am to process the previous day's data, the run on 2026-04-26 at 2am represents 2026-04-25. The logical date is 2026-04-25; the execution date is 2026-04-26. Tasks should reference the logical date, not the execution date, because the logical date is what the data partition is keyed on. Confusing the two is the second most common bug after cycles in DAGs, and it always shows up first in backfills.

When task, DAG, and run get conflated, the bugs that follow:

▸An operator marks the whole run as failed when only one task failed, wasting compute
▸Retry policy is set on the DAG when it should be per-task, applying wrong policy to wrong work
▸A backfill is mistaken for a fresh run, so logical date is wrong and partitions are misaligned
▸Schema changes deployed to a task definition are confused with changes to a single task instance

	# Show the difference between task instances of the same task across runs
	# Each call represents one task instance: a task on a specific run date

	class TaskInstance:
	def __init__(self, task_id, run_date, max_retries=3):
	self.task_id = task_id
	self.run_date = run_date
	self.max_retries = max_retries
	self.attempts = 0
	self.status = 'pending'

	def attempt(self, will_fail):
	self.attempts += 1
	if will_fail and self.attempts <= self.max_retries:
	self.status = 'failed_will_retry'
	print(f' attempt {self.attempts} of {self.task_id} for {self.run_date}: failed, retry')
	elif will_fail:
	self.status = 'failed_terminal'
	print(f' attempt {self.attempts} of {self.task_id} for {self.run_date}: failed, no retries left')
	else:
	self.status = 'success'
	print(f' attempt {self.attempts} of {self.task_id} for {self.run_date}: success')

	# Same task, different run dates: each is a separate task instance with its own retry budget
	ti_apr24 = TaskInstance('extract_orders', '2026-04-24')
	ti_apr25 = TaskInstance('extract_orders', '2026-04-25')

	print('Apr 24 run:')
	ti_apr24.attempt(will_fail=False)
	print('Apr 25 run:')
	ti_apr25.attempt(will_fail=True)
	ti_apr25.attempt(will_fail=True)
	ti_apr25.attempt(will_fail=False)

	print(f'\nApr 24 final status: {ti_apr24.status}')
	print(f'Apr 25 final status: {ti_apr25.status}')
	print('Note: Apr 24 retry budget is independent of Apr 25 retry budget.')

TaskDAGRunTask Instance

Task

The definition of work

Declared once in the DAG file. Names the operation, its retry policy, and its inputs. Lives in version control.

DAG

The structure of work

The directed acyclic graph of tasks and edges. Versioned together with the task definitions. The unit of deployment.

Run

An execution of the DAG

Triggered by the schedule. Carries a logical date. The thing that succeeds, fails, or gets backfilled.

Task Instance

One task in one run

The smallest addressable unit. Retries operate here. Logs are stored here. State is tracked here.

✓Do

Set retry policy at the task level, not the DAG level
Reference the logical date inside tasks; it is the date the data partition is keyed on
Re-run only the failed task instance when possible; preserve successful state

✗Don't

Re-run the whole DAG when a single task failed and other tasks finished cleanly
Use execution date when logical date is what the data uses (backfills break)
Apply one retry policy to every task in the DAG; tighten or loosen per task as needed

Retries operate on task instances, the smallest addressable unit; the DAG and the run are not retried.

Re-running an entire run repeats successful tasks and is rarely the right response to a single failure.

Logical date is what the data partition is keyed on; execution date is when the run actually fired.

extract

transform

load

Storage

warehouse

An orchestration DAG: tasks are nodes, dependencies are edges, and there are no cycles. The orchestrator runs each task only after its upstream finishes, retries failures, and backfills - things cron cannot do.

Cross-DAG Dependencies

Daily Life

Interviews

Choose between sensors, asset triggers, and time offsets for cross-DAG dependencies based on coupling and freshness needs.

Real production environments do not run one giant DAG. They run dozens of smaller DAGs, owned by different teams, on different cadences. Some of those DAGs depend on each other. The marketing analytics DAG reads tables produced by the orders DAG; the ML feature DAG reads tables produced by the events DAG. The dependency edge crosses a DAG boundary. Modeling that edge correctly is the difference between a system that scales across teams and one that breaks every time someone changes a schedule.

Three Ways to Express a Cross-DAG Edge

Mechanism	How It Works	Trade-off
Time offset	Schedule the downstream DAG to run after the upstream is expected to finish	Brittle: ignores the upstream's actual state, breaks when upstream runs slow
Sensor	A polling task in the downstream DAG waits for the upstream's run to succeed	Reliable but adds polling latency; ties downstream to upstream run state
Asset trigger	Downstream DAG runs when the asset (table, file) the upstream produces is updated	Decouples DAGs cleanly; depends on the orchestrator supporting asset-aware scheduling

Why Time Offsets Fail

Scheduling the downstream DAG at 3am because the upstream usually finishes by 2:45am is the cross-DAG version of the cron chain failure from the beginner tier. The bug is identical: when the upstream runs slow, the downstream still starts on schedule and reads stale state. Cross-DAG time offsets are the most common architectural anti-pattern in mature production environments because they accumulate one DAG at a time and look harmless until the day they fail. The fix is the same as inside a single DAG: encode the dependency, not the time.

Sensors: The Old Way

A sensor is a task that waits for an external condition to become true. The downstream DAG starts on its own schedule, but its first task is a sensor that polls the upstream DAG's run state every minute. The sensor task succeeds when the upstream's most recent run for the same logical date succeeded. Until then, the sensor sits in a poking state. Sensors are the reliable but slightly slow way to model cross-DAG dependencies. Polling adds minutes of lag, and a poorly tuned sensor can hold a worker slot indefinitely. Modern sensors use a deferrable mode that releases the slot while waiting, but the polling lag remains.

	# Airflow ExternalTaskSensor: wait for upstream DAG's task to succeed for the same logical date
	from airflow.sensors.external_task import ExternalTaskSensor

	wait_for_orders = ExternalTaskSensor(
	task_id='wait_for_orders_dag',
	external_dag_id='daily_orders_by_region',
	external_task_id='aggregate_orders',
	execution_delta=timedelta(0), # same logical date
	timeout=60 * 60 * 2, # give up after 2 hours
	poke_interval=60, # check every minute
	mode='reschedule', # release the worker slot between pokes
	)

Asset Triggers: The Modern Way

Asset-based scheduling reframes the dependency. The downstream DAG does not depend on the upstream DAG running. It depends on the data being fresh. When the upstream produces (or refreshes) the asset (a Snowflake table, an S3 prefix, a feature in a feature store), the orchestrator marks the asset as updated. Any DAG that schedules on that asset triggers automatically. Airflow ships Datasets for this; Dagster builds the entire orchestration model around software-defined assets; Prefect supports it through deployments and triggers. The shift is from 'wait for the producer' to 'trigger when the data is fresh', and it decouples the producer and consumer cleanly.

	# Asset-based cross-DAG trigger (Airflow Datasets)
	from airflow.datasets import Dataset

	orders_table = Dataset('snowflake://prod/mart/orders_by_region')

	# Upstream DAG declares it produces the dataset
	with DAG('daily_orders_by_region', schedule='0 2 * * *', ...) as dag:
	aggregate = PythonOperator(
	task_id='aggregate_orders',
	python_callable=aggregate_to_region,
	outlets=[orders_table], # this task updates the dataset
	)

	# Downstream DAG schedules ON the dataset, not on a clock
	with DAG('marketing_attribution', schedule=[orders_table], ...) as dag:
	attribution = PythonOperator(task_id='compute_attribution', ...)

The Freshness Contract

Cross-DAG dependencies require a contract between producer and consumer about freshness. The producer commits to delivering the asset by a stated time (or with a stated cadence). The consumer commits to relying only on the contract, not on the producer's internal scheduling. The contract is what allows the producer to refactor (split a DAG, change cadence, switch tools) without coordinating with every downstream consumer. Without the contract, every change to a producer DAG is a breaking change to N downstream DAGs, and refactoring becomes a months-long project.

•Time-Offset Cross-DAG

Downstream starts at a fixed wall-clock time
Slow upstream produces stale downstream
Schedule changes upstream require schedule changes downstream
Cross-team coordination needed for any cadence shift

✓Asset-Triggered Cross-DAG

Downstream starts when the asset is fresh
Slow upstream delays downstream rather than corrupting it
Schedule changes upstream are transparent to downstream
The contract is the freshness guarantee, not the schedule

Signs that cross-DAG dependencies are hand-wired through time:

▸Downstream DAG schedule comments explain 'runs after the orders DAG, which usually finishes by 2:45'
▸Slack messages exist asking the orders team to delay a deploy because it would break attribution
▸Failed runs always blame 'the downstream ran before the upstream finished'
▸A schedule change in one DAG triggers a chain of schedule changes in others

When Sensors Are Still the Right Answer

Asset triggers are the modern default, but sensors still earn their keep in two cases. First, when the producer is outside the orchestrator (a third-party SaaS that drops a file in S3), an asset trigger may not be wired up; a sensor that polls for the file is the only option. Second, when the producer DAG is owned by another company or another orchestrator, sensors bridge the gap. Sensors are not legacy; they are the right tool for the boundary between the orchestrator and the world outside it.

TIP

When designing a new cross-DAG edge, default to an asset trigger. Reach for a sensor only when the producer is outside the orchestrator's control. Never reach for a time offset; the failure mode is too well-known.

Sensors and External Triggers

Daily Life

Interviews

Place sensors and external triggers correctly at the boundary of the orchestrator, with timeouts and the right execution mode.

A pipeline often has to wait for something outside its control. A vendor SFTP drops a file at an irregular time. A REST API publishes a daily endpoint that becomes available between 1am and 4am. A Kafka topic accumulates messages, and a downstream batch process should kick off when the offset crosses a threshold. Sensors are the orchestration primitive that turns 'wait for the world' into a task the DAG can schedule around. Knowing the families of sensors and their trade-offs is part of the intermediate vocabulary.

What a Sensor Is, Precisely

A sensor is a task that periodically checks an external condition and succeeds when the condition becomes true. The condition can be 'a file exists in this S3 prefix', 'this API endpoint returns 200', 'this Kafka offset is at least N', 'this upstream task succeeded'. The sensor blocks the DAG until its condition is met or its timeout fires. Most modern sensors use a deferrable execution model that releases the worker slot while waiting, so a sensor that pokes for two hours does not waste a slot for two hours.

Common Sensor Families

Sensor	What It Waits For	Typical Use
FileSensor	A file exists at a path or prefix	SFTP file drop from a partner, batch export from an external system
S3KeySensor / GCSObjectSensor	An object exists in cloud storage	Vendor data drop, batch export from another team's pipeline
HTTPSensor	An HTTP endpoint returns a success status	Wait for an external API to publish a daily endpoint
ExternalTaskSensor	Another DAG's task succeeded for the same logical date	Cross-DAG dependency without an asset trigger
PythonSensor	Arbitrary Python predicate returns True	Custom condition (Kafka offset, queue depth, third-party SDK)

Poking vs Reschedule vs Deferrable

Sensors come in three execution modes. Poking holds a worker slot and re-evaluates on a fixed interval. Reschedule releases the slot between pokes and asks the scheduler to run the sensor again later. Deferrable hands off to an asynchronous trigger system that watches the condition and resumes the task when it fires. Deferrable is the modern default because it scales to thousands of concurrent waits without burning worker capacity. Poking is the legacy default and is acceptable for short waits with few concurrent sensors. Reschedule is the middle ground.

•Poking Mode

Worker slot held continuously while waiting
Re-evaluation cost is small but accumulates
Hundreds of waiting sensors exhaust the worker pool
Simpler to debug; the slot is right there

✓Deferrable Mode

Worker slot released; trigger watches asynchronously
Scales to thousands of concurrent waits
Resumes on the trigger event with low latency
Requires a triggerer process running alongside the scheduler

External Triggers (Webhooks and Events)

An external trigger is the inverse of a sensor. Instead of polling for a condition, the orchestrator exposes an endpoint that an external system can call to start a DAG. Modern orchestrators ship with a REST API that accepts trigger requests, and webhook integrations route events from systems like Stripe, GitHub, or a SaaS app into DAG runs. External triggers are the right answer when the external system can call the orchestrator; sensors are the right answer when only the orchestrator can call the external system.

	# A Kafka-offset PythonSensor in deferrable mode
	from airflow.sensors.python import PythonSensor

	def kafka_offset_at_least(target_offset: int) -> bool:
	from kafka import KafkaConsumer
	consumer = KafkaConsumer(bootstrap_servers='kafka:9092', group_id='orchestrator')
	partitions = consumer.partitions_for_topic('events')
	end_offsets = consumer.end_offsets(
	[TopicPartition('events', p) for p in partitions]
	)
	return min(end_offsets.values()) >= target_offset

	wait_for_offset = PythonSensor(
	task_id='wait_for_minimum_offset',
	python_callable=kafka_offset_at_least,
	op_kwargs={'target_offset': 1_000_000},
	poke_interval=30,
	timeout=60 * 30,
	mode='reschedule',
	)

Failure Modes Specific to Sensors

Sensor anti-patterns that show up in production:

▸Sensor with no timeout that pokes forever when the upstream is permanently broken
▸Hundreds of poking sensors burning the entire worker pool on a slow morning
▸Sensor that polls a third-party API and trips its rate limit
▸Sensor whose success condition is a side effect (file exists) without a content check

When to Skip the Sensor Entirely

Sometimes the right design is no sensor. If the external system can call the orchestrator's API or send a webhook, an external trigger replaces the sensor. If the upstream is another DAG in the same orchestrator, an asset trigger replaces the sensor. Sensors earn their place at the boundary between the orchestrator and the outside world; everywhere else, alternatives are usually cleaner.

Every sensor should have a timeout. A sensor without a timeout is a worker slot that may never come back.

	# Simulate a deferrable sensor: it does not hold a worker slot while waiting
	# Compare with a poking sensor, which does

	import time

	class PokingSensor:
	def __init__(self, name):
	self.name = name
	self.slot_held_seconds = 0

	def wait(self, condition_at_seconds):
	for t in range(condition_at_seconds + 1):
	self.slot_held_seconds += 1
	# In real Airflow, a worker slot is held the entire time
	return self.slot_held_seconds

	class DeferrableSensor:
	def __init__(self, name):
	self.name = name
	self.slot_held_seconds = 0

	def wait(self, condition_at_seconds):
	# Trigger watches asynchronously; worker slot is released
	# Slot is held only at start (1s) and at resume (1s)
	self.slot_held_seconds = 2
	return self.slot_held_seconds

	p = PokingSensor('poke_for_file')
	d = DeferrableSensor('defer_for_file')

	for sensor in (p, d):
	sensor.wait(condition_at_seconds=600) # condition true after 10 minutes
	print(f'{sensor.name}: held worker slot for {sensor.slot_held_seconds}s')

	print('\nPoking burns slots proportional to wait time. Deferrable scales.')

TIP

Use deferrable sensors when the orchestrator supports them, set a timeout that reflects the longest reasonable wait, and prefer external triggers when the upstream system can call the orchestrator.

Three Cadences, One Output

Daily Life

Interviews

Design a multi-cadence ingestion pattern with three upstream DAGs and one downstream DAG that joins on freshness rather than time offsets.

A real orchestration design rarely has one schedule. The example below builds a single downstream table that is fed by three sources, each on its own cadence. The shape is common in production: a daily executive table that combines streaming events, hourly Stripe data, and once-a-day Salesforce CRM. Designing this correctly requires every concept from the previous sections.

The Sources and Their Cadences

Source	Cadence	Why
Mobile events (Kafka)	Continuous; micro-batched into S3 every 5 minutes	Volume is too high for hourly; downstream wants sub-hour freshness
Stripe payments (REST API)	Every 15 minutes	Rate limit allows it; finance wants near-real-time revenue
Salesforce CRM (REST API)	Once daily at 1am	CRM data changes slowly; daily is more than enough

The Downstream Table

The downstream is mart.daily_revenue_by_account, a single table that the executive dashboard reads. One row per account per day, with columns for total events, total revenue, and account-level metadata from CRM. The freshness bar is 'available by 6am Pacific each morning'. Three sources at three cadences must produce this one table on time, every day, without time-offset hand-wiring.

Three DAGs, Three Cadences

The first design instinct is to put everything in one DAG that runs daily. That instinct is wrong. The hourly Stripe pull and the five-minute Kafka micro-batch should not run on a daily cadence; they should run on their natural cadences and feed durable raw tables that the daily DAG reads. The right shape is three DAGs at three cadences, one daily DAG that joins the latest of each, and asset triggers that connect them.

	DAG : kafka_events_to_raw schedule : every 5 minutes micro_batch_consume -> land_to_s3 -> register_partition outlets : raw.events DAG : stripe_payments_to_raw schedule : every 15 minutes fetch_stripe_pages -> land_to_snowflake outlets : raw.payments DAG : salesforce_to_raw schedule : 0 1 * * * fetch_accounts_full -> land_to_snowflake outlets : raw.accounts DAG : daily_revenue_by_account schedule : 0 5 * * * read raw.events, raw.payments, raw.accounts



	JOIN + aggregate publish mart.daily_revenue_by_account

How the Cadences Cooperate

Each upstream DAG runs on its own schedule and produces an asset (raw.events, raw.payments, raw.accounts). The downstream DAG runs at 5am, after the latest natural runs of all three upstreams have completed for the previous day. It does not depend on the upstream DAGs running at any specific time; it depends on the data being fresh, which the upstream cadences guarantee. The 5am clock-based schedule is the right answer here precisely because the freshness bar (6am Pacific) is a wall-clock commitment, and the wall-clock anchor lives at the seam between the upstream cadences and the downstream commitment.

What Each Task Reads

	INSERT INTO mart.daily_revenue_by_account
	SELECT
	a.account_id,
	a.account_name,
	a.region,
	DATE(: run_date) AS revenue_date,
	COUNT(DISTINCT e.event_id) AS event_count,
	COALESCE(SUM(p.amount_cents) / 100.0, 0) AS revenue_usd
	FROM raw.accounts a
	LEFT JOIN raw.events e
	ON e.account_id = a.account_id AND DATE(e.event_timestamp) = : run_date
	LEFT JOIN raw.payments p
	ON p.account_id = a.account_id AND DATE(p.created_at) = : run_date
	WHERE a.is_active = TRUE
	GROUP BY 1, 2, 3, 4 ;

Failure Behavior at Each Cadence

DAG	If It Fails	Recovery
kafka_events_to_raw (5 min)	Next run picks up from the same Kafka offset	At-least-once consumption with idempotent S3 writes
stripe_payments_to_raw (15 min)	Next run extends the window to cover both batches	High-water mark only advances on success
salesforce_to_raw (daily)	Alert at 1:30am; on-call investigates	Retries with backoff; manual rerun if all retries fail
daily_revenue_by_account (5am)	Sensor or asset check on raw tables; runs when all three are fresh	Alert if not done by 5:45am to give 15 minutes before SLA

	/* Simulate the join the daily DAG performs at 5am */
	/* Three small CTEs stand in for the three raw tables */
	/* The query produces one row per account per day */
	WITH raw_events AS (
	SELECT
	'a1' AS account_id,
	'evt_1' AS event_id,
	DATE('2026-04-25') AS event_date

	UNION ALL

	SELECT
	'a1',
	'evt_2',
	DATE('2026-04-25')

	UNION ALL

	SELECT
	'a2',
	'evt_3',
	DATE('2026-04-25')

	UNION ALL

	SELECT
	'a1',
	'evt_4',
	DATE('2026-04-25')
	),
	raw_payments AS (
	SELECT
	'a1' AS account_id,
	1500 AS amount_cents,
	DATE('2026-04-25') AS pay_date

	UNION ALL

	SELECT
	'a2',
	2200,
	DATE('2026-04-25')
	),
	raw_accounts AS (
	SELECT
	'a1' AS account_id,
	'Acme' AS account_name,
	'US' AS region

	UNION ALL

	SELECT
	'a2',
	'Globex',
	'US'

	UNION ALL

	SELECT
	'a3',
	'Initech',
	'EU'
	)

	SELECT
	a.account_id,
	a.account_name,
	a.region,
	COUNT(DISTINCT e.event_id) AS event_count,
	COALESCE(
	SUM(p.amount_cents) / 100,
	0
	) AS revenue_usd
	FROM raw_accounts AS a
	LEFT JOIN raw_events AS e
	ON e.account_id = a.account_id
	LEFT JOIN raw_payments AS p
	ON p.account_id = a.account_id
	GROUP BY 1, 2, 3
	ORDER BY 1

•One Big Daily DAG (Wrong)

Kafka events micro-batch on a daily cadence (16 hours of staleness)
Stripe runs once a day; finance has stale revenue for hours
A single failure halts unrelated upstream work
Coordinating cadences requires DAG-internal scheduling tricks

✓Three Cadence DAGs + Daily Mart (Right)

Each source runs on its natural cadence
Stripe and events stay fresh continuously for other consumers too
Failures are isolated to the affected cadence
Daily mart is a small DAG that joins the latest, no exotic scheduling

Lessons from the worked example:

▸Cadences belong to sources; the downstream join consumes the latest of each
▸Asset triggers (or sensors) at the seam replace time-offset assumptions
▸Wall-clock SLAs anchor the downstream DAG's schedule, not the upstreams' schedules
▸Each upstream DAG's failure stays isolated; downstream waits or fails fast with a clear error

TIP

When sources have different natural cadences, do not force them into one DAG. Three small DAGs and one downstream join cooperate cleanly through asset triggers and beat any single mega-DAG.

❯❯❯PUTTING IT ALL TOGETHER

> A logistics company has six teams, each running their own DAGs. The shipments DAG runs hourly, the inventory DAG runs every six hours, the analytics DAG runs daily. The analytics DAG has been time-staggered to start after the shipments DAG 'usually' finishes. It worked for a year, but a Black Friday spike caused shipments to run for three hours and analytics produced wrong numbers. The CTO asks: 'How is this rebuilt so it never breaks again?'

First, name the schedule type for every DAG. Shipments and inventory are interval-based (their cadence is what matters). Analytics is event-driven (it cares about freshness of its inputs).

Replace the analytics DAG's time-offset start with an asset trigger that fires when the shipments and inventory raw tables are fresh for the run date. Combined with a 5am wall-clock fallback, the SLA is preserved even on slow upstream days.

Apply the task vs DAG vs run vocabulary precisely: when shipments fails for the 2026-04-25 run, only that task instance is retried, and the analytics DAG either waits or fails fast. The DAG itself is not rerun, and unaffected days are not touched.

Add a freshness contract between teams: shipments commits to having raw.shipments updated by 4am Pacific, inventory by 4:30am. The analytics team relies on the contract, not on the producer's schedule. Schedule changes upstream become transparent.

From Lesson 2, recognize that this architecture mixes batch (daily analytics, daily Salesforce) with near-streaming (Kafka events, 15-minute Stripe). Each cadence belongs in its own DAG so the freshness tier of each is preserved.

KEY TAKEAWAYS

Three schedule forms cover production: cron for wall-clock alignment, interval for relative cadence, event triggers for arrival-driven runs.

Task, DAG, and run are different things: retries operate on task instances, the DAG is a structure, and the run is one execution. Conflating them produces the wrong fix.

Cross-DAG dependencies belong on assets, not on time: asset triggers decouple producer and consumer through freshness; sensors fill the gap when the producer is outside the orchestrator.

Sensors should be deferrable and bounded: deferrable mode releases the worker slot, and a timeout prevents a slot from waiting forever.

Multi-cadence systems need multi-cadence DAGs: do not force three different freshness tiers into one schedule. Let each source run on its natural cadence and join on freshness in a small downstream DAG.

Schedules trigger DAGs, runs are the work, and dependencies cross every boundary

Category: Pipeline Architecture
Difficulty: intermediate
Duration: 30 minutes
Challenges: 0 hands-on challenges

Topics covered: Schedules: Cron, Interval, Event, Task vs DAG vs Run for Retries, Cross-DAG Dependencies, Sensors and External Triggers, Three Cadences, One Output

Lesson Sections

Schedules: Cron, Interval, Event (concepts: paDagOrchestration)
Every DAG has a schedule. The schedule decides when a run starts. Three forms cover almost every production case: cron expressions for repeated wall-clock times, intervals for relative cadences (every fifteen minutes, every hour), and event triggers for runs that start when something external happens. A modern orchestrator supports all three, and the right choice depends on the source's behavior, not on engineer preference. Cron Expressions A cron expression is five fields that name a recurring
Task vs DAG vs Run for Retries (concepts: paRetryHandling)
Three words appear in every conversation about orchestration: task, DAG, and run. New engineers use the three interchangeably, and most of the time the imprecision does not bite. It bites hard when retry semantics are at stake, because the orchestrator retries at exactly one of those three levels and the answer matters. The vocabulary below is precise on purpose. The Three Words Retries Happen at the Task Instance Level When an orchestrator retries a failure, it retries a task instance. It does
Cross-DAG Dependencies (concepts: paDependencyMgmt)
Real production environments do not run one giant DAG. They run dozens of smaller DAGs, owned by different teams, on different cadences. Some of those DAGs depend on each other. The marketing analytics DAG reads tables produced by the orders DAG; the ML feature DAG reads tables produced by the events DAG. The dependency edge crosses a DAG boundary. Modeling that edge correctly is the difference between a system that scales across teams and one that breaks every time someone changes a schedule. T
Sensors and External Triggers (concepts: paDagOrchestration)
A pipeline often has to wait for something outside its control. A vendor SFTP drops a file at an irregular time. A REST API publishes a daily endpoint that becomes available between 1am and 4am. A Kafka topic accumulates messages, and a downstream batch process should kick off when the offset crosses a threshold. Sensors are the orchestration primitive that turns 'wait for the world' into a task the DAG can schedule around. Knowing the families of sensors and their trade-offs is part of the inte
Three Cadences, One Output (concepts: paDagOrchestration)
A real orchestration design rarely has one schedule. The example below builds a single downstream table that is fed by three sources, each on its own cadence. The shape is common in production: a daily executive table that combines streaming events, hourly Stripe data, and once-a-day Salesforce CRM. Designing this correctly requires every concept from the previous sections. The Sources and Their Cadences The Downstream Table The downstream is mart.daily_revenue_by_account, a single table that th