A logistics platform at growth stage ran twenty-eight DAGs across three teams. Each DAG individually was clean. The combined system was a mess. The shipments DAG ran hourly, the inventory DAG ran every six hours, and the customer-facing analytics DAG ran daily. The three needed to share a single fact table, and that table was being recomputed three different ways at three different cadences. Some mornings the analytics DAG ran before the inventory DAG and produced numbers that disagreed with the inventory team's dashboard. The fix was not a bigger DAG. The fix was a more careful vocabulary: the difference between a task, a DAG, and a run, and the difference between a sensor that polls and an asset trigger that fires. This lesson is about the second-order vocabulary that turns a working DAG into a system of DAGs that cooperate.
Schedules: Cron, Interval, Event
Daily Life
Interviews
Choose between cron, interval, and event-triggered schedules based on the data arrival pattern and consumer freshness expectation.
Every DAG has a schedule. The schedule decides when a run starts. Three forms cover almost every production case: cron expressions for repeated wall-clock times, intervals for relative cadences (every fifteen minutes, every hour), and event triggers for runs that start when something external happens. A modern orchestrator supports all three, and the right choice depends on the source's behavior, not on engineer preference.
Cron Expressions
A cron expression is five fields that name a recurring time. The fields are minute, hour, day of month, month, and day of week. The expression `0 2 * * *` means 'minute zero of hour two of every day of every month, regardless of weekday', which is 2am every night. Cron expressions are dense and easy to misread; production teams paste them through validators and add comments next to every one.
Cron Expression
Meaning
Typical Use
0 2 * * *
2:00 AM every day
Daily ETL run after most app traffic has passed
*/15 * * * *
Every 15 minutes, top of the quarter hour
Near-real-time pull from a rate-limited API
0 */6 * * *
Every 6 hours starting at midnight
Refresh of slow-changing reference data
0 9 * * 1
9:00 AM every Monday
Weekly executive report generation
0 0 1 * *
Midnight on the first day of every month
Monthly close, billing reconciliation
Interval Schedules
An interval schedule says 'run every N units' without anchoring to a wall-clock time. The first run happens when the DAG is deployed; subsequent runs happen every N units after the previous one. Intervals are the right schedule when the absolute time of a run does not matter, only the gap between runs. A pipeline that polls a Kafka offset every five minutes does not care whether the run starts at 2:00 or 2:03. The interval matters; the wall-clock alignment does not.
Event Triggers
Event triggers start a DAG when something external happens. The external thing can be a file landing in S3, a message arriving in a queue, an upstream DAG completing, a webhook firing from a third-party API. The schedule is conceptually 'when X occurs', not 'every Y minutes'. Event triggers are the most precise schedule type: the work runs exactly when there is work to do. They also introduce coupling to the external system's reliability, which is why they are usually paired with a fallback interval schedule that catches missed events.
The work must happen at a specific wall-clock time
Downstream consumers expect data to be available at a fixed time
Compliance or business rules require alignment with a calendar moment
Coordination with other systems uses the same calendar (close at month end)
✓Interval or Event Wins When
The work depends on data arrival, not on the clock
The source is event-driven (Kafka, webhooks, S3 notifications)
The cadence matters more than the alignment
Wall-clock alignment would force unnecessary delays or rushes
Schedule Drift and the Catchup Setting
What happens when the orchestrator was offline for six hours and the schedule said the DAG should have run six times? Two answers exist, and the one chosen has consequences. Catchup mode runs every missed schedule sequentially when the orchestrator comes back. No-catchup mode runs only the most recent missed schedule and skips the rest. Catchup is correct when every run produces a different output that downstream consumers need (a daily partition that must exist for every day). No-catchup is correct when the runs are idempotent reflections of current state and only the latest matters. Picking the wrong one causes either gaps in historical data or a thundering herd of catch-up runs that overwhelm the warehouse.
Schedule decisions to make explicit at deploy time:
▸Catchup or no-catchup: do missed runs matter?
▸Timezone: most orchestrators default to UTC; production teams often align with business hours
▸Maximum concurrent runs: a slow run should not stack two on top of each other
▸Backfill window: how far back can the schedule be re-run on demand?
CronIntervalEvent
Cron
Wall-clock alignment
Five-field expression naming a recurring time. Right when the consumer expects data at a fixed business moment (6am, end of month, Monday morning).
Interval
Relative cadence
Run every N units. Right when the gap between runs matters more than alignment with a clock. Common for polling sources and micro-batch consumers.
Event
Arrival-driven
Fire when an external condition becomes true: file lands, asset updates, webhook fires. Most precise schedule type; couples reliability to the external system.
TIP
When picking a schedule, name the consumer first. The freshness expectation of the consumer is what should decide the cadence, not the convenience of the producer.
Three schedule forms cover almost every production case: cron, interval, event-trigger.
Catchup and no-catchup are not interchangeable; each has a correct case and a wrong one.
Event triggers are the most precise schedule type but also the one most coupled to external reliability.
Task vs DAG vs Run for Retries
Daily Life
Interviews
Distinguish task, DAG, run, and task instance, and apply the distinction to retry decisions.
Three words appear in every conversation about orchestration: task, DAG, and run. New engineers use the three interchangeably, and most of the time the imprecision does not bite. It bites hard when retry semantics are at stake, because the orchestrator retries at exactly one of those three levels and the answer matters. The vocabulary below is precise on purpose.
The Three Words
Term
Definition
Example
Task
A unit of work declared in the DAG (extract_orders)
One Python operator with retry policy 3
DAG
The full graph of tasks and dependencies, declared once
daily_orders_by_region, with extract -> clean -> aggregate
Run
One execution of the DAG triggered by the schedule
The 2026-04-25 run of daily_orders_by_region
Task instance
One task within one run
extract_orders during the 2026-04-25 run
Retries Happen at the Task Instance Level
When an orchestrator retries a failure, it retries a task instance. It does not retry the task in the abstract, because the task is a definition, not a running thing. It does not retry the DAG, because the DAG is the structure of work, not a running thing either. It retries this task in this run on this date with these inputs. That precision is what makes idempotency possible: the retry has the same inputs as the failure, so a correct task produces the same outputs.
1
# Retry semantics tied to the task instance, not the DAG or the task definition
2
# When extract_orders fails on the 2026-04-25 run:
3
# - The task instance (extract_orders, 2026-04-25) is retried
4
# - retries=3 means three more attempts of THIS task on THIS date
5
# - Other dates' runs of extract_orders are unaffected
6
# - The DAG itself is not 'retried'; only the failed task instance is
7
8
extract=PythonOperator(
9
task_id='extract_orders',
10
python_callable=extract_orders_for_date,
11
op_kwargs={'run_date':'{{ ds }}'},# the run's logical date
12
retries=3,
13
retry_delay=timedelta(minutes=2),
14
)
Why Run-Level Retries Are a Mistake
A DAG run that fails halfway has produced partial output. The first instinct of an inexperienced operator is to mark the whole run as failed and re-run it. That instinct is sometimes wrong. If the first three of five tasks succeeded, re-running the entire run repeats those three successes, which costs compute and can produce unexpected results if any of those tasks is not perfectly idempotent. The orchestrator's retry mechanism, by default, only retries the failed task and resumes from there. Marking the whole run for retry overrides that default and is occasionally the right call, but it should be a deliberate choice, not a reflex.
✓Task-Instance Retry (Default)
Only the failed task is rerun
Successful tasks keep their state
Cheap: only re-pays for the failed work
Safe even when upstream tasks are not perfectly idempotent
•DAG-Run Retry (Manual)
Every task in the run is rerun
Successful tasks lose their state and rerun
Expensive: re-pays for everything
Required only when upstream task outputs are corrupted
Logical Date vs Execution Date
A run carries a logical date (the date the run represents) and an execution date (the date the run actually happened). For a daily DAG that runs at 2am to process the previous day's data, the run on 2026-04-26 at 2am represents 2026-04-25. The logical date is 2026-04-25; the execution date is 2026-04-26. Tasks should reference the logical date, not the execution date, because the logical date is what the data partition is keyed on. Confusing the two is the second most common bug after cycles in DAGs, and it always shows up first in backfills.
When task, DAG, and run get conflated, the bugs that follow:
▸An operator marks the whole run as failed when only one task failed, wasting compute
▸Retry policy is set on the DAG when it should be per-task, applying wrong policy to wrong work
▸A backfill is mistaken for a fresh run, so logical date is wrong and partitions are misaligned
▸Schema changes deployed to a task definition are confused with changes to a single task instance
1
# Show the difference between task instances of the same task across runs
2
# Each call represents one task instance: a task on a specific run date
3
4
classTaskInstance:
5
def__init__(self,task_id,run_date,max_retries=3):
6
self.task_id=task_id
7
self.run_date=run_date
8
self.max_retries=max_retries
9
self.attempts=0
10
self.status='pending'
11
12
defattempt(self,will_fail):
13
self.attempts+=1
14
ifwill_failandself.attempts<=self.max_retries:
15
self.status='failed_will_retry'
16
print(f' attempt {self.attempts} of {self.task_id} for {self.run_date}: failed, retry')
17
elifwill_fail:
18
self.status='failed_terminal'
19
print(f' attempt {self.attempts} of {self.task_id} for {self.run_date}: failed, no retries left')
20
else:
21
self.status='success'
22
print(f' attempt {self.attempts} of {self.task_id} for {self.run_date}: success')
23
24
# Same task, different run dates: each is a separate task instance with its own retry budget
print(f'\nApr 24 final status: {ti_apr24.status}')
36
print(f'Apr 25 final status: {ti_apr25.status}')
37
print('Note: Apr 24 retry budget is independent of Apr 25 retry budget.')
38
TaskDAGRunTask Instance
Task
The definition of work
Declared once in the DAG file. Names the operation, its retry policy, and its inputs. Lives in version control.
DAG
The structure of work
The directed acyclic graph of tasks and edges. Versioned together with the task definitions. The unit of deployment.
Run
An execution of the DAG
Triggered by the schedule. Carries a logical date. The thing that succeeds, fails, or gets backfilled.
Task Instance
One task in one run
The smallest addressable unit. Retries operate here. Logs are stored here. State is tracked here.
✓Do
Set retry policy at the task level, not the DAG level
Reference the logical date inside tasks; it is the date the data partition is keyed on
Re-run only the failed task instance when possible; preserve successful state
✗Don't
Re-run the whole DAG when a single task failed and other tasks finished cleanly
Use execution date when logical date is what the data uses (backfills break)
Apply one retry policy to every task in the DAG; tighten or loosen per task as needed
Retries operate on task instances, the smallest addressable unit; the DAG and the run are not retried.
Re-running an entire run repeats successful tasks and is rarely the right response to a single failure.
Logical date is what the data partition is keyed on; execution date is when the run actually fired.
extract
extract
transform
transform
load
load
Storage
warehouse
An orchestration DAG: tasks are nodes, dependencies are edges, and there are no cycles. The orchestrator runs each task only after its upstream finishes, retries failures, and backfills - things cron cannot do.
Cross-DAG Dependencies
Daily Life
Interviews
Choose between sensors, asset triggers, and time offsets for cross-DAG dependencies based on coupling and freshness needs.
Real production environments do not run one giant DAG. They run dozens of smaller DAGs, owned by different teams, on different cadences. Some of those DAGs depend on each other. The marketing analytics DAG reads tables produced by the orders DAG; the ML feature DAG reads tables produced by the events DAG. The dependency edge crosses a DAG boundary. Modeling that edge correctly is the difference between a system that scales across teams and one that breaks every time someone changes a schedule.
Three Ways to Express a Cross-DAG Edge
Mechanism
How It Works
Trade-off
Time offset
Schedule the downstream DAG to run after the upstream is expected to finish
Brittle: ignores the upstream's actual state, breaks when upstream runs slow
Sensor
A polling task in the downstream DAG waits for the upstream's run to succeed
Reliable but adds polling latency; ties downstream to upstream run state
Asset trigger
Downstream DAG runs when the asset (table, file) the upstream produces is updated
Decouples DAGs cleanly; depends on the orchestrator supporting asset-aware scheduling
Why Time Offsets Fail
Scheduling the downstream DAG at 3am because the upstream usually finishes by 2:45am is the cross-DAG version of the cron chain failure from the beginner tier. The bug is identical: when the upstream runs slow, the downstream still starts on schedule and reads stale state. Cross-DAG time offsets are the most common architectural anti-pattern in mature production environments because they accumulate one DAG at a time and look harmless until the day they fail. The fix is the same as inside a single DAG: encode the dependency, not the time.
Sensors: The Old Way
A sensor is a task that waits for an external condition to become true. The downstream DAG starts on its own schedule, but its first task is a sensor that polls the upstream DAG's run state every minute. The sensor task succeeds when the upstream's most recent run for the same logical date succeeded. Until then, the sensor sits in a poking state. Sensors are the reliable but slightly slow way to model cross-DAG dependencies. Polling adds minutes of lag, and a poorly tuned sensor can hold a worker slot indefinitely. Modern sensors use a deferrable mode that releases the slot while waiting, but the polling lag remains.
1
# Airflow ExternalTaskSensor: wait for upstream DAG's task to succeed for the same logical date
mode='reschedule',# release the worker slot between pokes
12
)
Asset Triggers: The Modern Way
Asset-based scheduling reframes the dependency. The downstream DAG does not depend on the upstream DAG running. It depends on the data being fresh. When the upstream produces (or refreshes) the asset (a Snowflake table, an S3 prefix, a feature in a feature store), the orchestrator marks the asset as updated. Any DAG that schedules on that asset triggers automatically. Airflow ships Datasets for this; Dagster builds the entire orchestration model around software-defined assets; Prefect supports it through deployments and triggers. The shift is from 'wait for the producer' to 'trigger when the data is fresh', and it decouples the producer and consumer cleanly.
Cross-DAG dependencies require a contract between producer and consumer about freshness. The producer commits to delivering the asset by a stated time (or with a stated cadence). The consumer commits to relying only on the contract, not on the producer's internal scheduling. The contract is what allows the producer to refactor (split a DAG, change cadence, switch tools) without coordinating with every downstream consumer. Without the contract, every change to a producer DAG is a breaking change to N downstream DAGs, and refactoring becomes a months-long project.
Cross-team coordination needed for any cadence shift
✓Asset-Triggered Cross-DAG
Downstream starts when the asset is fresh
Slow upstream delays downstream rather than corrupting it
Schedule changes upstream are transparent to downstream
The contract is the freshness guarantee, not the schedule
Signs that cross-DAG dependencies are hand-wired through time:
▸Downstream DAG schedule comments explain 'runs after the orders DAG, which usually finishes by 2:45'
▸Slack messages exist asking the orders team to delay a deploy because it would break attribution
▸Failed runs always blame 'the downstream ran before the upstream finished'
▸A schedule change in one DAG triggers a chain of schedule changes in others
When Sensors Are Still the Right Answer
Asset triggers are the modern default, but sensors still earn their keep in two cases. First, when the producer is outside the orchestrator (a third-party SaaS that drops a file in S3), an asset trigger may not be wired up; a sensor that polls for the file is the only option. Second, when the producer DAG is owned by another company or another orchestrator, sensors bridge the gap. Sensors are not legacy; they are the right tool for the boundary between the orchestrator and the world outside it.
TIP
When designing a new cross-DAG edge, default to an asset trigger. Reach for a sensor only when the producer is outside the orchestrator's control. Never reach for a time offset; the failure mode is too well-known.
Sensors and External Triggers
Daily Life
Interviews
Place sensors and external triggers correctly at the boundary of the orchestrator, with timeouts and the right execution mode.
A pipeline often has to wait for something outside its control. A vendor SFTP drops a file at an irregular time. A REST API publishes a daily endpoint that becomes available between 1am and 4am. A Kafka topic accumulates messages, and a downstream batch process should kick off when the offset crosses a threshold. Sensors are the orchestration primitive that turns 'wait for the world' into a task the DAG can schedule around. Knowing the families of sensors and their trade-offs is part of the intermediate vocabulary.
What a Sensor Is, Precisely
A sensor is a task that periodically checks an external condition and succeeds when the condition becomes true. The condition can be 'a file exists in this S3 prefix', 'this API endpoint returns 200', 'this Kafka offset is at least N', 'this upstream task succeeded'. The sensor blocks the DAG until its condition is met or its timeout fires. Most modern sensors use a deferrable execution model that releases the worker slot while waiting, so a sensor that pokes for two hours does not waste a slot for two hours.
Common Sensor Families
Sensor
What It Waits For
Typical Use
FileSensor
A file exists at a path or prefix
SFTP file drop from a partner, batch export from an external system
S3KeySensor / GCSObjectSensor
An object exists in cloud storage
Vendor data drop, batch export from another team's pipeline
HTTPSensor
An HTTP endpoint returns a success status
Wait for an external API to publish a daily endpoint
ExternalTaskSensor
Another DAG's task succeeded for the same logical date
Sensors come in three execution modes. Poking holds a worker slot and re-evaluates on a fixed interval. Reschedule releases the slot between pokes and asks the scheduler to run the sensor again later. Deferrable hands off to an asynchronous trigger system that watches the condition and resumes the task when it fires. Deferrable is the modern default because it scales to thousands of concurrent waits without burning worker capacity. Poking is the legacy default and is acceptable for short waits with few concurrent sensors. Reschedule is the middle ground.
•Poking Mode
Worker slot held continuously while waiting
Re-evaluation cost is small but accumulates
Hundreds of waiting sensors exhaust the worker pool
Requires a triggerer process running alongside the scheduler
External Triggers (Webhooks and Events)
An external trigger is the inverse of a sensor. Instead of polling for a condition, the orchestrator exposes an endpoint that an external system can call to start a DAG. Modern orchestrators ship with a REST API that accepts trigger requests, and webhook integrations route events from systems like Stripe, GitHub, or a SaaS app into DAG runs. External triggers are the right answer when the external system can call the orchestrator; sensors are the right answer when only the orchestrator can call the external system.
▸Sensor with no timeout that pokes forever when the upstream is permanently broken
▸Hundreds of poking sensors burning the entire worker pool on a slow morning
▸Sensor that polls a third-party API and trips its rate limit
▸Sensor whose success condition is a side effect (file exists) without a content check
When to Skip the Sensor Entirely
Sometimes the right design is no sensor. If the external system can call the orchestrator's API or send a webhook, an external trigger replaces the sensor. If the upstream is another DAG in the same orchestrator, an asset trigger replaces the sensor. Sensors earn their place at the boundary between the orchestrator and the outside world; everywhere else, alternatives are usually cleaner.
Every sensor should have a timeout. A sensor without a timeout is a worker slot that may never come back.
1
# Simulate a deferrable sensor: it does not hold a worker slot while waiting
2
# Compare with a poking sensor, which does
3
4
importtime
5
6
classPokingSensor:
7
def__init__(self,name):
8
self.name=name
9
self.slot_held_seconds=0
10
11
defwait(self,condition_at_seconds):
12
fortinrange(condition_at_seconds+1):
13
self.slot_held_seconds+=1
14
# In real Airflow, a worker slot is held the entire time
15
returnself.slot_held_seconds
16
17
classDeferrableSensor:
18
def__init__(self,name):
19
self.name=name
20
self.slot_held_seconds=0
21
22
defwait(self,condition_at_seconds):
23
# Trigger watches asynchronously; worker slot is released
24
# Slot is held only at start (1s) and at resume (1s)
25
self.slot_held_seconds=2
26
returnself.slot_held_seconds
27
28
p=PokingSensor('poke_for_file')
29
d=DeferrableSensor('defer_for_file')
30
31
forsensorin(p,d):
32
sensor.wait(condition_at_seconds=600)# condition true after 10 minutes
33
print(f'{sensor.name}: held worker slot for {sensor.slot_held_seconds}s')
34
35
print('\nPoking burns slots proportional to wait time. Deferrable scales.')
36
TIP
Use deferrable sensors when the orchestrator supports them, set a timeout that reflects the longest reasonable wait, and prefer external triggers when the upstream system can call the orchestrator.
Three Cadences, One Output
Daily Life
Interviews
Design a multi-cadence ingestion pattern with three upstream DAGs and one downstream DAG that joins on freshness rather than time offsets.
A real orchestration design rarely has one schedule. The example below builds a single downstream table that is fed by three sources, each on its own cadence. The shape is common in production: a daily executive table that combines streaming events, hourly Stripe data, and once-a-day Salesforce CRM. Designing this correctly requires every concept from the previous sections.
The Sources and Their Cadences
Source
Cadence
Why
Mobile events (Kafka)
Continuous; micro-batched into S3 every 5 minutes
Volume is too high for hourly; downstream wants sub-hour freshness
CRM data changes slowly; daily is more than enough
The Downstream Table
The downstream is mart.daily_revenue_by_account, a single table that the executive dashboard reads. One row per account per day, with columns for total events, total revenue, and account-level metadata from CRM. The freshness bar is 'available by 6am Pacific each morning'. Three sources at three cadences must produce this one table on time, every day, without time-offset hand-wiring.
Three DAGs, Three Cadences
The first design instinct is to put everything in one DAG that runs daily. That instinct is wrong. The hourly Stripe pull and the five-minute Kafka micro-batch should not run on a daily cadence; they should run on their natural cadences and feed durable raw tables that the daily DAG reads. The right shape is three DAGs at three cadences, one daily DAG that joins the latest of each, and asset triggers that connect them.
Each upstream DAG runs on its own schedule and produces an asset (raw.events, raw.payments, raw.accounts). The downstream DAG runs at 5am, after the latest natural runs of all three upstreams have completed for the previous day. It does not depend on the upstream DAGs running at any specific time; it depends on the data being fresh, which the upstream cadences guarantee. The 5am clock-based schedule is the right answer here precisely because the freshness bar (6am Pacific) is a wall-clock commitment, and the wall-clock anchor lives at the seam between the upstream cadences and the downstream commitment.
Stripe and events stay fresh continuously for other consumers too
Failures are isolated to the affected cadence
Daily mart is a small DAG that joins the latest, no exotic scheduling
Lessons from the worked example:
▸Cadences belong to sources; the downstream join consumes the latest of each
▸Asset triggers (or sensors) at the seam replace time-offset assumptions
▸Wall-clock SLAs anchor the downstream DAG's schedule, not the upstreams' schedules
▸Each upstream DAG's failure stays isolated; downstream waits or fails fast with a clear error
TIP
When sources have different natural cadences, do not force them into one DAG. Three small DAGs and one downstream join cooperate cleanly through asset triggers and beat any single mega-DAG.
❯❯❯PUTTING IT ALL TOGETHER
> A logistics company has six teams, each running their own DAGs. The shipments DAG runs hourly, the inventory DAG runs every six hours, the analytics DAG runs daily. The analytics DAG has been time-staggered to start after the shipments DAG 'usually' finishes. It worked for a year, but a Black Friday spike caused shipments to run for three hours and analytics produced wrong numbers. The CTO asks: 'How is this rebuilt so it never breaks again?'
First, name the schedule type for every DAG. Shipments and inventory are interval-based (their cadence is what matters). Analytics is event-driven (it cares about freshness of its inputs).
Replace the analytics DAG's time-offset start with an asset trigger that fires when the shipments and inventory raw tables are fresh for the run date. Combined with a 5am wall-clock fallback, the SLA is preserved even on slow upstream days.
Apply the task vs DAG vs run vocabulary precisely: when shipments fails for the 2026-04-25 run, only that task instance is retried, and the analytics DAG either waits or fails fast. The DAG itself is not rerun, and unaffected days are not touched.
Add a freshness contract between teams: shipments commits to having raw.shipments updated by 4am Pacific, inventory by 4:30am. The analytics team relies on the contract, not on the producer's schedule. Schedule changes upstream become transparent.
From Lesson 2, recognize that this architecture mixes batch (daily analytics, daily Salesforce) with near-streaming (Kafka events, 15-minute Stripe). Each cadence belongs in its own DAG so the freshness tier of each is preserved.
KEY TAKEAWAYS
Three schedule forms cover production: cron for wall-clock alignment, interval for relative cadence, event triggers for arrival-driven runs.
Task, DAG, and run are different things: retries operate on task instances, the DAG is a structure, and the run is one execution. Conflating them produces the wrong fix.
Cross-DAG dependencies belong on assets, not on time: asset triggers decouple producer and consumer through freshness; sensors fill the gap when the producer is outside the orchestrator.
Sensors should be deferrable and bounded: deferrable mode releases the worker slot, and a timeout prevents a slot from waiting forever.
Multi-cadence systems need multi-cadence DAGs: do not force three different freshness tiers into one schedule. Let each source run on its natural cadence and join on freshness in a small downstream DAG.
Schedules trigger DAGs, runs are the work, and dependencies cross every boundary
Category
Pipeline Architecture
Difficulty
intermediate
Duration
30 minutes
Challenges
0 hands-on challenges
Topics covered: Schedules: Cron, Interval, Event, Task vs DAG vs Run for Retries, Cross-DAG Dependencies, Sensors and External Triggers, Three Cadences, One Output
Every DAG has a schedule. The schedule decides when a run starts. Three forms cover almost every production case: cron expressions for repeated wall-clock times, intervals for relative cadences (every fifteen minutes, every hour), and event triggers for runs that start when something external happens. A modern orchestrator supports all three, and the right choice depends on the source's behavior, not on engineer preference. Cron Expressions A cron expression is five fields that name a recurring
Three words appear in every conversation about orchestration: task, DAG, and run. New engineers use the three interchangeably, and most of the time the imprecision does not bite. It bites hard when retry semantics are at stake, because the orchestrator retries at exactly one of those three levels and the answer matters. The vocabulary below is precise on purpose. The Three Words Retries Happen at the Task Instance Level When an orchestrator retries a failure, it retries a task instance. It does
Real production environments do not run one giant DAG. They run dozens of smaller DAGs, owned by different teams, on different cadences. Some of those DAGs depend on each other. The marketing analytics DAG reads tables produced by the orders DAG; the ML feature DAG reads tables produced by the events DAG. The dependency edge crosses a DAG boundary. Modeling that edge correctly is the difference between a system that scales across teams and one that breaks every time someone changes a schedule. T
A pipeline often has to wait for something outside its control. A vendor SFTP drops a file at an irregular time. A REST API publishes a daily endpoint that becomes available between 1am and 4am. A Kafka topic accumulates messages, and a downstream batch process should kick off when the offset crosses a threshold. Sensors are the orchestration primitive that turns 'wait for the world' into a task the DAG can schedule around. Knowing the families of sensors and their trade-offs is part of the inte
A real orchestration design rarely has one schedule. The example below builds a single downstream table that is fed by three sources, each on its own cadence. The shape is common in production: a daily executive table that combines streaming events, hourly Stripe data, and once-a-day Salesforce CRM. Designing this correctly requires every concept from the previous sections. The Sources and Their Cadences The Downstream Table The downstream is mart.daily_revenue_by_account, a single table that th