Orchestration and Dependencies: Advanced

A staff data engineer at a public marketplace inherited a 220-DAG Airflow deployment that was missing its 6am SLA roughly twice a week. The investigation turned up four separate problems coexisting in one symptom: a backfill job from a year ago was running every night and competing for warehouse slots, the SLA itself was declared on the wrong DAG, the priority weight on the executive marts was the same as on the experimental ML jobs, and a recent migration to asset-based scheduling had not been propagated to half the deployment. Fixing one without the others did nothing. Each one alone was a senior-engineer concern; together they formed the operational reality of large orchestration deployments. This lesson is about the four levers that a senior engineer pulls when single-DAG fluency is no longer enough: assets, backfills, SLAs, and concurrency.

Asset vs Task Orchestration

Daily Life
Interviews

Distinguish asset-based and task-based orchestration models, explain the operational consequences, and pick a model at the architectural level.

Two philosophies compete in modern orchestration. Task-based orchestration, exemplified by Airflow, treats the work as the primary object: define tasks, declare dependencies between them, and trust that the data will follow. Asset-based orchestration, exemplified by Dagster, treats the data as the primary object: declare the assets the pipeline produces, declare the dependencies between assets, and let the orchestrator infer the work. The two models produce equivalent pipelines on simple cases. They diverge sharply on lineage, observability, partial refreshes, and the operability of large deployments.

Task-Based: Work Is the Object

Airflow's model places tasks at the center. A task has a definition, a retry policy, an upstream dependency, and a downstream dependency. Tasks happen to read and write data, but the orchestrator does not require that fact to be declared. The graph is over tasks, not over the data they produce. The benefit is flexibility: any task can call any other system, and the orchestrator does not need to understand what the task does. The cost is that lineage is implicit: an engineer must read the task code to know which tables it touches, and the orchestrator cannot answer questions like 'which tasks contributed to fct_orders'.

Asset-Based: Data Is the Object

Dagster's model places assets at the center. An asset is a named piece of data (a table, a partition, a feature) and is declared as the output of a function. The orchestrator builds the dependency graph from the asset declarations: if asset B is declared as derived from asset A, the orchestrator infers that B's compute depends on A. Tasks still exist as the underlying execution, but the surface API is the asset graph. The benefit is that lineage is explicit: every asset has a declared origin, and the orchestrator can answer structural questions about the data. The cost is the up-front modeling discipline; engineers must declare assets rather than write tasks that happen to write data.
PropertyTask-Based (Airflow)Asset-Based (Dagster)
Primary objectTask: a unit of workAsset: a unit of data
Dependency modelTask -> Task; data follows implicitlyAsset -> Asset; tasks are inferred
LineageImplicit; reconstructed from code readingExplicit; queryable from the orchestrator metadata
Partial refreshRe-run a task; the data effects are a side effectRefresh an asset; the orchestrator computes what to recompute
Backfill semanticsRe-run task instances over a date rangeMaterialize asset partitions over a date range
Cross-DAG / asset supportDatasets and ExternalTaskSensor; later additionNative; assets are the cross-DAG primitive from day one

What Changes Operationally

An asset-based deployment lets an operator answer 'which assets are stale right now and why'. The orchestrator already knows the lineage, the freshness policy on each asset, and the producer task. A task-based deployment can answer the same question only by external lineage tooling (OpenLineage, Marquez, dbt's lineage) bolted onto the orchestrator. Both answers are achievable; the asset model gets there with less coordination because it never lost the connection between work and data.
1# Task-based (Airflow): the data is implicit
2@task
3def compute_fct_orders():
4 sql = 'INSERT INTO fct_orders SELECT ...'
5 snowflake.execute(sql)
6
7@task
8def compute_mart_revenue():
9 sql = 'INSERT INTO mart_revenue SELECT ... FROM fct_orders'
10 snowflake.execute(sql)
11
12compute_fct_orders() >> compute_mart_revenue() # the data dependency is in the engineer's head
13
14# Asset-based (Dagster): the data is the API
15@asset
16def fct_orders():
17 return snowflake_load('SELECT ... FROM raw.orders')
18
19@asset(deps=[fct_orders])
20def mart_revenue(fct_orders):
21 return snowflake_load('SELECT ... FROM fct_orders')
22# The orchestrator now knows mart_revenue is derived from fct_orders

Where Each Model Fits

Task-Based Fits When
  • Existing Airflow deployment with hundreds of DAGs
  • Workloads include heavy non-data orchestration (alerts, ML jobs, cross-system glue)
  • Tasks span many systems where the data shape is hard to model uniformly
  • The team is more comfortable thinking in tasks than in assets
Asset-Based Fits When
  • Lineage and data observability are first-order requirements
  • Many partitioned tables that need partial refreshes and backfills
  • Large data graph where task count would explode if modeled per task
  • New build with no orchestrator legacy

The Hybrid Reality

Airflow has been growing asset-aware features for years (Datasets, OpenLineage integration). Dagster supports task-style imperative work alongside its asset model. The two models are converging in the middle, and most large deployments end up hybrid in practice: a core asset graph for the data layer, plus task-style operators for non-data work like notifications, infrastructure jobs, and external API calls. The rigid 'pick one' framing is rare in production. The framing that matters is which model is the default at the seam where new work gets added.
Symptoms that the wrong model is in use:
  • Lineage tooling has to be bolted on with custom code; the orchestrator does not know which task touches which table
  • Asset refreshes require running entire DAGs even when only one table is stale
  • Partial backfills are expressed as ad-hoc CLI scripts rather than first-class operations
  • Cross-team data dependencies are wired with sensors that nobody fully trusts
Task-BasedAsset-BasedHybrid
Task-Based
Work-first model (Airflow)
Tasks are the primary object. Flexible, broadly compatible, lineage requires external tooling. The default of most existing enterprise deployments.
Asset-Based
Data-first model (Dagster)
Assets are the primary object. Lineage is explicit, partial refreshes are native, the asset graph mirrors the data graph. The default of most new builds focused on data observability.
Hybrid
Convergence in production
Asset graph for data work plus task operators for non-data glue. The pragmatic shape most large deployments end up in regardless of which orchestrator was chosen.
Do
  • Model the assets explicitly when the orchestrator supports it; lineage compounds in value
  • Mix task-style and asset-style where each fits, with the asset model as the default for data work
  • Pick the model at the architectural level, not per-DAG; consistency reduces operator load
Don't
  • Bolt lineage tooling onto a task-based deployment as an afterthought; design it in
  • Force every job into the asset model; some work (alerts, infra) is genuinely task-shaped
  • Migrate from one model to the other without a phased plan; partial migrations are worse than either pure model
check
Task-based and asset-based models converge in practice; the choice is which is the default at the seam where new work is added.
query
Asset-based deployments answer 'which assets are stale and why' from orchestrator metadata; task-based deployments answer the same question only with bolted-on lineage tooling.
alert
Partial migrations between the two models are worse than either pure model; plan the phased rollout before starting.
TIP
When building a new deployment, choose asset-based as the default and accept task-style operators where they fit. When inheriting a task-based deployment, introduce assets at the seams (Datasets, OpenLineage) before attempting a full migration.

Backfills as First-Class Operations

Daily Life
Interviews

Design pipelines that are backfillable: parameterize on date range, write only the target partition, and use orchestrator-managed backfill operations with appropriate concurrency.

A backfill is the operation of running a pipeline over a historical date range that it did not run for at the time. Backfills happen for three reasons: a bug in the pipeline produced wrong data and the corrected pipeline must reprocess the affected dates; a new column was added and the historical data must be regenerated to fill it; a new pipeline is launched and needs initial history. In production, backfills are run constantly, and a system that does not treat them as a first-class operation produces a different bug every time.

What Makes Backfills Different From Regular Runs

A regular run processes one logical date: today, or the most recent hour. A backfill processes many logical dates, sometimes years of them. The same task is invoked once per logical date, with different parameters, and the orchestrator must coordinate the parallelism so the warehouse does not collapse under thousands of simultaneous queries. The task itself must accept a date as a parameter; if the task hardcodes 'yesterday', it cannot be backfilled. This requirement is the entire reason idempotent task design matters.
PropertyIdempotent TaskNon-Idempotent Task
Run twiceSame result as running onceDifferent result; possibly duplicates or corruption
Partition strategyOverwrites the target partition for the run dateAppends; duplicates accumulate
Time referencesReads the run date as a parameterReads NOW(); each run sees a different time
Backfill safetyBackfilling a year is safe and resumableBackfilling a year produces a corrupted year

Parameterizing on Date Range

Every backfillable task takes the run date (or a date range) as input and reads or writes only the partitions for that range. The orchestrator passes the parameter explicitly: Airflow exposes the logical date as a Jinja template variable; Dagster passes a partition key into the asset function; Prefect parameterizes the flow run. Tasks that hardcode date logic inside the function body break backfills silently. The most common silent break is a SQL query that filters on `DATE(NOW()) - 1` instead of the parameterized run date; the query runs successfully on the backfill but reads today's data, producing rows for every backfill date that look identical.
1# Backfillable task: run_date is a parameter
2@task
3def aggregate_orders(run_date: str):
4 sql = f''' DELETE FROM mart.daily_orders WHERE order_date = '{run_date}'; INSERT INTO mart.daily_orders SELECT '{run_date}' AS order_date, COUNT(*) AS order_count FROM raw.orders WHERE DATE(order_timestamp) = '{run_date}'; '''
5 snowflake.execute(sql)
6
7# Non-backfillable variant: hardcoded date logic that does not accept the run date
8@task
9def aggregate_orders_broken():
10 sql = ''' INSERT INTO mart.daily_orders SELECT CURRENT_DATE - 1 AS order_date, COUNT(*) AS order_count FROM raw.orders WHERE DATE(order_timestamp) = CURRENT_DATE - 1; '''
11 snowflake.execute(sql)
12# The broken version writes today's data with yesterday's date label, every time it runs

Backfills as Orchestrator Operations

Modern orchestrators expose backfills as a first-class CLI and UI operation. In Airflow, the `airflow dags backfill` command takes a start date, an end date, and a DAG ID, then runs the DAG for every date in the range. In Dagster, the asset partition UI lets an operator select a range of partitions and click 'materialize'. The orchestrator handles the parallelism, the rate of submission, and the failure isolation. The engineer does not write a custom loop. This is the difference between backfills as a first-class operation and backfills as an ad-hoc script.
Backfill anti-patterns that show up in production:
  • Custom Python script that loops over dates and runs queries directly, bypassing the orchestrator
  • Tasks that hardcode CURRENT_DATE or NOW() and silently produce wrong dates on backfill
  • Backfills that re-run successful runs and overwrite manual fixes
  • Backfills with no concurrency limit that submit a year of queries at once and stall the warehouse

The Concurrency Knob on Backfills

A backfill of one year is 365 invocations of the same task. Submitting them simultaneously will overwhelm any warehouse. The orchestrator exposes a concurrency cap on backfills: how many runs of the DAG can be in flight at the same time. A typical setting is between four and sixteen, tuned to the warehouse's compute envelope. The right setting balances time-to-completion against contention with normal production runs. A backfill that runs for two days at concurrency four is usually preferable to a backfill that finishes in two hours at concurrency 100 and breaks every other pipeline running on the same warehouse.
1# Simulate a backfill with concurrency control
2# Compare unbounded submission against capped concurrency
3
4from collections import deque
5
6class BackfillSimulator:
7 def __init__(self, dates, concurrency, runtime_per_date_min=10, warehouse_capacity=8):
8 self.dates = dates
9 self.concurrency = concurrency
10 self.runtime = runtime_per_date_min
11 self.capacity = warehouse_capacity
12
13 def run(self):
14 pending = deque(self.dates)
15 in_flight = []
16 clock_min = 0
17 slowdowns = 0
18 while pending or in_flight:
19 while pending and len(in_flight) < self.concurrency:
20 in_flight.append(self.runtime)
21 # If the warehouse capacity is exceeded, every task slows down
22 slow_factor = max(1, len(in_flight) / self.capacity)
23 if slow_factor > 1:
24 slowdowns += 1
25 in_flight = [t - 1 / slow_factor for t in in_flight]
26 in_flight = [t for t in in_flight if t > 0]
27 clock_min += 1
28 return clock_min, slowdowns
29
30dates = list(range(60)) # backfill 60 days
31
32for c in (4, 16, 60):
33 minutes, slow = BackfillSimulator(dates, concurrency=c).run()
34 print(f'concurrency={c:>3}: {minutes} minutes elapsed, {slow} steps with warehouse contention')
35

When Backfills Are Not the Right Answer

Backfilling years of history at full fidelity is sometimes the wrong call. If only the last 90 days are queried, regenerating five years of partitions costs warehouse time that no consumer ever uses. A targeted backfill of the affected range is cheaper and faster. The discipline is to ask 'what range actually needs to be regenerated' before passing a start date of 2020-01-01 to the backfill command. A backfill is a tool, not a default.
Naive Backfill
  • Custom script loops over dates and runs SQL directly
  • No concurrency cap; submits everything at once
  • Tasks hardcode dates internally and produce wrong outputs
  • No tracking of which dates succeeded; partial failure leaves a mystery
First-Class Backfill
  • Orchestrator exposes a backfill CLI or UI command
  • Concurrency cap matches warehouse capacity
  • Tasks accept run_date as a parameter and write only their partition
  • Each date is a tracked task instance; partial failure is observable

Idempotency is the prerequisite for safe backfills. Without it, every backfill is a different kind of corruption.

TIP
Before triggering a backfill, confirm that every task in the DAG accepts the run date as a parameter and writes only that partition. If any task hardcodes the date or appends rather than overwrites, the backfill will silently corrupt history.
extract
extract
transform
transform
load
load
Storage
warehouse

An orchestration DAG: tasks are nodes, dependencies are edges, and there are no cycles. The orchestrator runs each task only after its upstream finishes, retries failures, and backfills - things cron cannot do.

SLAs at the Orchestrator Level

Daily Life
Interviews

Declare orchestrator-level SLAs on DAGs and assets, tune the thresholds against historical runtime distributions, and route misses to on-call with actionable context.

An SLA, in orchestration terms, is a commitment that a DAG (or an asset) finishes by a stated time. The marketing dashboard SLA might be 'mart.daily_revenue is fresh for the previous day by 6am Pacific'. The SLA is not a hope. It is a configured guarantee that the orchestrator monitors and alerts on when missed. Declaring SLAs at the orchestrator level rather than in a separate runbook turns a soft expectation into an enforced contract, and it changes the on-call response from 'someone noticed late' to 'the orchestrator paged at 6:01am'.

What an Orchestrator-Level SLA Specifies

ElementMeaningExample
TargetThe DAG, task, or asset the SLA applies tomart.daily_revenue_by_account
DeadlineWall-clock time by which the work must complete06:00 America/Los_Angeles
FrequencyHow often the deadline appliesEvery day for the previous day's logical date
ResponseWhat the orchestrator does if the deadline passesPage on-call, post to #data-incidents, mark run as SLA-missed
Recovery expectationHow quickly a missed SLA must be resolvedWithin 60 minutes of the breach

How Airflow Declares SLAs

Airflow's SLA mechanism attaches a deadline to a task. If the task does not finish within `sla` time of its expected start, Airflow fires an SLA miss event that can be routed to alerts, Slack, or PagerDuty. The mechanism is task-level by default, which is a known weakness: an SLA on the final publishing task is correct conceptually but does not catch failures earlier in the DAG. Production teams typically also configure DAG-level monitoring through external tools (Datadog, Monte Carlo, Bigeye) to bridge the gap. Newer Airflow features (Datasets with deadlines, the deadline feature added in 2.x) move toward asset-level deadlines.
1# Airflow task-level SLA
2from datetime import timedelta
3
4publish = PythonOperator(
5 task_id='publish_mart_revenue',
6 python_callable=publish_to_mart,
7 sla=timedelta(hours=4), # task must complete within 4 hours of scheduled start
8)
9
10# DAG-level: route SLA misses to a callback
11def sla_miss_callback(dag, task_list, blocking_task_list, slas, blocking_tis):
12 pagerduty.fire('Marketing revenue SLA missed', tasks=[s.task_id for s in slas])
13
14with DAG(
15 'daily_revenue_by_account',
16 schedule='0 5 * * *',
17 sla_miss_callback=sla_miss_callback,
18 ...
19) as dag:
20 ...

How Dagster Declares Freshness Policies

Dagster's asset-first model declares SLAs as freshness policies on assets. An asset can specify 'this asset must be no more than 4 hours stale by 6am every day'. The orchestrator monitors actual asset freshness and fires alerts when an asset is overdue. Because the policy is on the asset, it survives changes to which DAG produces the asset; refactoring the producer DAG does not require updating the SLA. This is one of the operational advantages of asset-based scheduling that becomes visible only when SLAs are declared.
1# Dagster freshness policy on the asset, not the task
2from dagster import asset, FreshnessPolicy
3
4@asset(
5 freshness_policy=FreshnessPolicy(
6 maximum_lag_minutes=240, # asset must be at most 4h old
7 cron_schedule='0 6 * * *', # by 6am daily
8 cron_schedule_timezone='America/Los_Angeles',
9 ),
10)
11def mart_daily_revenue_by_account():
12 return run_daily_revenue_query()
SLA failure modes that point at deeper problems:
  • SLA missed every Monday because Sunday's source had unusual volume that the schedule did not anticipate
  • SLA technically met but data is wrong because a task succeeded on stale upstream
  • SLA on the final task fires, but the actual failure was three tasks earlier and went unalerted
  • SLA threshold is too tight; alerts fire on normal slow runs and on-call ignores them
  • SLA threshold is too loose; misses are flagged hours after consumers already noticed

SLA Tuning

Setting an SLA is a tradeoff. Too tight, and every slow but successful run pages on-call, training the team to ignore alerts. Too loose, and missed SLAs are detected only after a consumer complains. The right setting is the worst-case acceptable runtime under normal conditions, plus a small buffer. A pipeline whose 95th-percentile runtime is 3 hours and whose worst observed runtime is 4.5 hours might set an SLA of 5 hours. The setting should be revisited quarterly as runtime distributions drift.
Tightly Tuned SLA
  • Detects misses promptly when they happen
  • False positives on normal slow runs train alert fatigue
  • Pages on-call for transient slowness that resolves naturally
  • Tightening below the runtime distribution produces noise, not signal
Loosely Tuned SLA
  • Misses caught only after consumers complain
  • Real outages get diagnosed late
  • Alert fatigue is low; trust in the alerts is high when they fire
  • Loosening above the runtime distribution produces silence, not safety

SLA Observability

An SLA without a dashboard is a setting that nobody calibrates. Production teams ship a dashboard that shows the historical distribution of runtimes for every SLA-bound DAG, the current SLA threshold, the count of misses in the last 30 days, and the trend. The dashboard is what lets the team tune the threshold over time as runtime distributions shift. SLA settings that go untouched for years drift out of date as the underlying work changes; observability is the antidote.
Do
  • Declare SLAs at the orchestrator level, not in a separate runbook
  • Tune SLAs to the 95th-percentile runtime plus a known buffer; revisit quarterly
  • Publish a dashboard of historical SLA performance for every bound pipeline
  • Route SLA misses to on-call with enough context to start investigation immediately
Don't
  • Set SLAs to the average runtime; half of normal runs will breach by definition
  • Page on every SLA event without distinguishing real misses from edge cases
  • Place the SLA on the final task only; failures earlier in the DAG go unalerted
  • Leave SLA settings unchanged for years; runtime distributions drift
TIP
Treat SLAs as living settings, not one-time configurations. The best-tuned SLAs are reviewed quarterly against the actual runtime distribution and adjusted before alert fatigue or detection lag sets in.

Concurrency, Pools, and Priority

Daily Life
Interviews

Configure DAG concurrency, named pools, and task priority so simultaneous workloads cooperate rather than compete on a shared warehouse.

Every shared system has finite capacity. A Snowflake warehouse has slot limits. A Spark cluster has executor limits. A Postgres replica has connection limits. An orchestrator that submits work without regard for those limits will, eventually, melt the system it depends on. Concurrency, pools, and priority are the three controls that let the orchestrator submit work in a way the downstream system can absorb. They are simple knobs that prevent the most expensive class of incident in mature deployments.

Three Levers, Three Scopes

ControlScopeWhat It Caps
ConcurrencyPer DAG or per taskHow many instances of this DAG or task can run simultaneously
PoolShared across many DAGsTotal parallel slots available to a named resource (e.g., 'snowflake_xl')
PriorityPer task instanceOrder in which tasks are scheduled when capacity is contended
ConcurrencyPoolPriority
Concurrency
Single-DAG cap
How many tasks within one DAG run at once, and how many runs of the same DAG can be in flight. Prevents fanout saturation and run stacking.
Pool
Cross-DAG resource budget
A named slot budget shared across DAGs. Caps total load on a downstream resource (Snowflake XL, Spark, Postgres) to a known number.
Priority
Queue ordering
Per-task or per-DAG weight that decides who runs next when the pool is full. Customer-facing work outranks experimental work.

Concurrency: The Single-DAG Cap

DAG concurrency caps how many tasks within a single DAG run at the same time, and DAG-run concurrency caps how many runs of the same DAG can be in flight. The first prevents a fanned-out DAG with 200 parallel tasks from saturating workers. The second prevents a slow run from stacking on top of itself: if today's daily run is still going at midnight and yesterday's never finished, the orchestrator should not start tomorrow's. Both caps are usually set per DAG, and the right values depend on the work's compute profile.
1with DAG(
2 'fanout_etl',
3 schedule='@daily',
4 max_active_runs=1, # only one run in flight at a time
5 concurrency=8, # at most 8 tasks running simultaneously within this DAG
6 ...
7) as dag:
8 ...

Pools: The Cross-DAG Cap

A pool is a named resource budget shared across DAGs. The 'snowflake_xl' pool might have 16 slots; any task that runs against the XL warehouse takes one slot for the duration. If 20 tasks across 5 DAGs all want the XL warehouse simultaneously, 16 run and 4 wait. The pool prevents a fanout in DAG A from starving DAG B, and it caps the total load on the warehouse to a known number. Pools are the most underused feature in many production deployments because they require explicit modeling of which tasks share which resources.
1# Tasks in different DAGs share a Snowflake XL pool
2# Pool 'snowflake_xl' has 16 slots configured globally
3
4# In dag_marketing:
5load_marketing = SnowflakeOperator(
6 task_id='load_marketing_aggregations',
7 sql='...',
8 pool='snowflake_xl',
9 pool_slots=1,
10)
11
12# In dag_finance:
13load_finance = SnowflakeOperator(
14 task_id='load_finance_close',
15 sql='...',
16 pool='snowflake_xl',
17 pool_slots=2, # heavier task takes 2 slots
18)
19# Together, the two DAGs cannot exceed 16 simultaneous slots in the pool

Priority: Who Goes First Under Contention

When the pool is full and tasks are waiting, the orchestrator picks one to run next. Priority weight controls that pick. A task with priority 10 runs before a task with priority 1 when both are waiting on the same pool. Priority is per-task or per-DAG depending on the orchestrator. The right pattern is: customer-facing pipelines (the executive dashboard, the customer email batch) are priority 10, regular production pipelines are 5, experimental and ad-hoc work is 1. Without priority tuning, contention resolves arbitrarily, which means the experimental backfill can starve the customer email run.
Priority decisions that pay off:
  • Customer-facing pipelines outrank internal analytics on the same warehouse
  • Backfills run at lower priority so they yield to live production runs
  • Ad-hoc and experimental work runs in a separate pool with lower priority
  • SLA-bound DAGs get a priority bump so they win contention against unbounded ones

When the Warehouse Melts

The classic incident is a Monday morning when a backfill, a normal daily run, and an experimental ML feature build all hit Snowflake at the same time. Without pools and priority, the warehouse queues every query, every query slows by 5x, and the daily SLA misses. With pools, the backfill is capped at 2 slots, the daily run is capped at 8, and the experimental work is in a different pool entirely. With priority, the customer-facing DAG runs first and finishes inside its SLA while the backfill stretches longer. The cost is one afternoon of pool configuration; the saving is every Monday that does not become an incident.
No Pools or Priority
  • Every task competes equally for warehouse slots
  • A fanout backfill starves the daily executive DAG
  • Contention resolves arbitrarily; SLA misses are unpredictable
  • On-call cannot tell which workload caused the slowdown
Pools and Priority Tuned
  • Pools partition resources by class (XL, ML, ad-hoc)
  • Backfills wait for live production work to finish
  • Priority routes the most important work first under contention
  • On-call sees pool utilization in the orchestrator UI
1# Simulate three workloads competing on a 16-slot pool
2# With and without priority
3
4from heapq import heappush, heappop
5
6class PoolScheduler:
7 def __init__(self, slots, use_priority):
8 self.slots = slots
9 self.use_priority = use_priority
10 self.queue = []
11 self.running = []
12
13 def submit(self, name, priority, duration):
14 # Higher priority = run first; negate for min-heap
15 order_key = -priority if self.use_priority else 0
16 heappush(self.queue, (order_key, name, duration))
17
18 def run_to_completion(self):
19 clock = 0
20 finish_times = {}
21 while self.queue or self.running:
22 while self.queue and len(self.running) < self.slots:
23 key, name, duration = heappop(self.queue)
24 self.running.append((clock + duration, name))
25 self.running.sort()
26 finish_clock, name = self.running.pop(0)
27 clock = finish_clock
28 finish_times[name] = clock
29 self.running = [(t, n) for t, n in self.running if t > clock]
30 return finish_times
31
32for priority in (False, True):
33 sched = PoolScheduler(slots=4, use_priority=priority)
34 # 8 backfill tasks, low priority, run for 30 min each
35 for i in range(8):
36 sched.submit(f'backfill_{i}', priority=1, duration=30)
37 # 1 customer dashboard task, high priority, runs for 10 min
38 sched.submit('customer_dashboard', priority=10, duration=10)
39 finish = sched.run_to_completion()
40 label = 'with priority' if priority else 'no priority'
41 print(f'{label}: customer_dashboard finishes at minute {finish["customer_dashboard"]}')
42
TIP
Treat the orchestrator's pool configuration as part of the warehouse's architecture, not as a one-time setup. Pool sizes, priorities, and SLA thresholds should be reviewed alongside warehouse credit usage every quarter.

Postmortem: Cadence Change Bug

Daily Life
Interviews

Diagnose a multi-cause orchestration failure by separating the seam, the freshness contract, the pool, and the SLA, and design a redesign that addresses each layer.

A real production incident, redacted from a fintech postmortem. A daily DAG named `daily_finance_close` had been running cleanly for fourteen months. In month fifteen, it began missing its 6am Pacific SLA twice a week. Nothing in the DAG had changed. The investigation revealed that an upstream Stripe ingestion DAG had been quietly migrated from a 30-minute cadence to a 5-minute cadence three weeks earlier. The change was a clear improvement upstream. It broke the downstream because of every assumption the original architecture had encoded.

The Original Architecture

1Upstream : Downstream : stripe_to_raw schedule : * / 30 daily_finance_close schedule : 0 5 * * *(every 30 MIN, FULL pull) start_at_5am outlets : raw.payments read raw.payments since 00 : 00
2JOIN, aggregate publish mart.daily_close
3
4CROSS - DAG mechanism : TIME
5OFFSET(downstream starts AT 5 am, upstream finishes BY 4 : 50 am)

What Changed Upstream and Why It Started Failing

The upstream team migrated `stripe_to_raw` from 30-minute full pulls to 5-minute incremental pulls. The migration was a clear improvement: smaller batches, lower latency, less load on the warehouse. The team announced it in their standup and merged the PR. No one downstream caught the announcement, because the cross-DAG dependency was invisible: the downstream's schedule did not reference the upstream at all.
The new 5-minute cadence created a small window of inconsistency every five minutes during which raw.payments was mid-write. The 5am downstream read sometimes caught a partial batch and other times caught a complete batch, depending on the exact second the join started. On clean reads, the daily_finance_close finished by 5:45am. On partial reads, the join produced rows with NULLs in the new payment columns, the row count threshold check triggered a row deletion and rewrite, the rewrite waited on a warehouse slot, and the SLA missed by 20 to 60 minutes. The bug only showed on the days when the read happened to land mid-batch.

The Five Things Wrong With the Original

IssueWhy It Was Latent for 14 MonthsWhy It Surfaced After the Change
Time-offset cross-DAG dependency30-min cadence had a 25-minute quiet window before the 5am read5-min cadence eliminated the quiet window
No asset-level freshness contractImplicit 'finishes by 4:50am' was reliable enoughCadence change broke the implicit contract; nobody noticed
No transactional read isolationSource rarely caught mid-write5-min cadence made mid-write reads frequent
Single-shared pool with no prioritySpare capacity absorbed slow runsSlow runs collided with backfills competing for the same pool
SLA on the final task onlyFinal-task SLA caught the symptomDetection at SLA breach is too late; root cause is upstream

The Redesign

The fix touched four of the five issues. The cross-DAG edge moved from time-offset to asset-trigger: daily_finance_close now starts when raw.payments is fresh for the previous day's logical date, not when the clock hits 5am. The freshness contract on raw.payments was declared explicitly: 'fresh for the previous calendar day by 4:30am Pacific'. The Stripe ingestion DAG was modified to write a `_partition_complete` marker when the day's data finished landing, and the downstream waits on the marker, not the raw rows. The downstream gained a higher priority than the backfill workloads on the same pool.
1Redesigned : stripe_to_raw schedule : * / 5(every 5 MIN, incremental pull) outlets : raw.payments(per - batch) separate marker task AT hour 4 writes : raw.payments_partition_complete date=YYYY-MM-DD daily_finance_close schedule : asset_trigger(payments_complete) read raw.payments
2
3WHERE DATE = previous_day
4JOIN, aggregate publish mart.daily_close Freshness policy
5
6 ON raw.payments_partition_complete : must exist for previous day BY 04 : 30 America / Los_Angeles Freshness policy
7 ON mart.daily_close : must exist for previous day BY 06 : 00 America / Los_Angeles Pool : finance_close AT priority 10 ; backfills AT priority 1
1/* The downstream DAG checks the partition_complete marker before reading raw.payments */
2/* This SQL returns the rows the daily_finance_close should process, gated on completeness */
3/* The marker table is small (one row per partition) and cheap to check */
4WITH partition_marker AS (
5 SELECT
6 '2026-04-24' AS partition_date,
7 'complete' AS status,
8 '2026-04-25 04:18:33' AS completed_at
9
10 UNION ALL
11
12 SELECT
13 '2026-04-23',
14 'complete',
15 '2026-04-24 04:21:07'
16
17 UNION ALL
18
19 SELECT
20 '2026-04-22',
21 'complete',
22 '2026-04-23 04:14:51'
23),
24raw_payments AS (
25 SELECT
26 'p1' AS payment_id,
27 'a1' AS account_id,
28 1500 AS amount_cents,
29 DATE('2026-04-24') AS pay_date
30
31 UNION ALL
32
33 SELECT
34 'p2',
35 'a1',
36 800,
37 DATE('2026-04-24')
38
39 UNION ALL
40
41 SELECT
42 'p3',
43 'a2',
44 2200,
45 DATE('2026-04-24')
46)
47
48SELECT
49 p.account_id,
50 COUNT(*) AS payments,
51 SUM(p.amount_cents) / 100 AS revenue_usd
52FROM raw_payments AS p
53INNER JOIN partition_marker AS m
54 ON m.partition_date = CAST(
55 p.pay_date
56 AS VARCHAR
57 )
58WHERE m.status = 'complete'
59AND p.pay_date = DATE('2026-04-24')
60GROUP BY p.account_id
61ORDER BY p.account_id

What the Redesign Costs

The cost was real. The Stripe team had to add the partition_complete marker task and accept a small additional cost on every run. The finance team had to migrate the daily_finance_close DAG from time-offset to asset-trigger and validate that the new behavior was correct. The platform team had to add a separate pool for finance_close work, configure priority, and document the rationale. Together the changes took about three weeks of engineering time. The recurring SLA misses, by contrast, were costing roughly six engineer-hours per week in firefighting plus the executive-team cost of a stale dashboard, so the fix paid for itself in eight weeks.
Original (Time-Offset, Single Pool)
  • Downstream starts at 5am regardless of upstream state
  • Implicit freshness contract; cadence changes silently break it
  • All work shares one pool; backfills can starve customer-facing work
  • SLA on the final task; root causes stay invisible
Redesigned (Asset-Trigger, Pool + Priority)
  • Downstream starts when partition_complete marker is fresh
  • Explicit freshness policies on both upstream and downstream assets
  • Separate pool for finance work; priority 10 over backfills
  • SLA on the asset; misses surface at the actual layer where freshness slipped
Lessons from the postmortem:
  • An upstream cadence change is a contract change; it should be PR-reviewed by every downstream owner
  • Time-offset cross-DAG dependencies hide the seam; asset triggers make it visible
  • An idempotent producer with explicit partition-complete markers protects every downstream from mid-batch reads
  • Pools and priority are how customer-facing work survives backfill contention
  • An SLA on the final task is necessary but not sufficient; SLAs at the seam catch root causes earlier
alert
Cadence changes upstream are silent breakers of time-offset cross-DAG dependencies.
check
Asset triggers, partition-complete markers, and pool-priority tuning together convert a fragile seam into a contract.
query
An SLA at the seam catches root causes earlier than an SLA on the final task; both are usually needed.
TIP
When a DAG starts failing without itself changing, look upstream for a cadence change. The failure mode is almost always at the seam between cadences, not inside the DAG that fires the alert.
PUTTING IT ALL TOGETHER

> A staff engineer at a public marketplace owns 220 production DAGs. The marketing executive dashboard misses its 6am SLA twice a week. Investigation reveals four overlapping issues: a year-old backfill is still running nightly and competing for warehouse slots, the SLA is declared on the final task instead of the producing asset, customer-facing DAGs share priority with experimental ML jobs, and a recent partial migration to asset-based scheduling left half the deployment on time-offset cross-DAG edges. The CTO asks: 'What is the smallest set of changes that fixes the 6am SLA without breaking the rest?'

Stop the year-old backfill. From Lesson 5 (idempotency and backfill), every backfill should be a bounded operation with a clear end. A nightly recurring backfill is a sign that the original pipeline is not idempotent and the workaround has calcified into a permanent cost.
Move SLAs from final tasks to assets. The marketing dashboard's SLA belongs on the mart asset; the producing DAG's tasks already have SLAs that catch internal slowness. The asset-level SLA catches misses that final-task SLAs miss when the failure is in the seam.
Split the warehouse pool by workload class: customer_facing (priority 10), regular_production (priority 5), experimental (priority 1, separate pool). Backfills run in the experimental pool. Customer-facing DAGs win contention without starving the rest.
Complete the asset-based migration on the seams that touch SLA-bound assets first. Time-offset cross-DAG edges remain the highest-risk anti-pattern (Lesson 4 intermediate showed why). Asset triggers replace them and stabilize the schedule.
Apply the four-role and undercurrents framing from Lesson 1. Orchestration is the undercurrent that runs through every node; the redesign is the orchestration layer catching up to the architecture the rest of the pipeline already shows.
Treat the change as a pipeline-as-product rollout (Lesson 1 advanced): every affected DAG gets a contract update, a deprecation path for the old cross-DAG edges, and an explicit owner. The shift is operational discipline, not a single PR.
KEY TAKEAWAYS
Asset-based and task-based orchestration are different defaults: task-based fits existing Airflow deployments; asset-based fits new builds where lineage and partial refreshes matter. Most large deployments end up hybrid.
Backfills demand idempotent task design: every task takes the run date as a parameter and writes only that partition. Without idempotency, a backfill is a fresh form of corruption.
SLAs declared at the orchestrator level are enforceable: the orchestrator detects the miss, routes the alert, and surfaces it on a tunable schedule. Loose runbooks do none of that.
Concurrency, pools, and priority prevent warehouse contention: concurrency caps a single DAG, pools share resources across DAGs, priority resolves the queue. The three together stop simultaneous workloads from melting the warehouse.
Cross-DAG cadence changes are silent contract changes: an upstream cadence shift breaks downstream time-offset dependencies on the day it ships. Asset triggers and explicit freshness contracts at the seam are the durable fix.

Assets, backfills, SLAs, and concurrency are the levers a senior engineer pulls

Category
Pipeline Architecture
Difficulty
advanced
Duration
40 minutes
Challenges
0 hands-on challenges

Topics covered: Asset vs Task Orchestration, Backfills as First-Class Operations, SLAs at the Orchestrator Level, Concurrency, Pools, and Priority, Postmortem: Cadence Change Bug

Lesson Sections

  1. Asset vs Task Orchestration (concepts: paDagOrchestration)

    Two philosophies compete in modern orchestration. Task-based orchestration, exemplified by Airflow, treats the work as the primary object: define tasks, declare dependencies between them, and trust that the data will follow. Asset-based orchestration, exemplified by Dagster, treats the data as the primary object: declare the assets the pipeline produces, declare the dependencies between assets, and let the orchestrator infer the work. The two models produce equivalent pipelines on simple cases.

  2. Backfills as First-Class Operations (concepts: paBackfill)

    A backfill is the operation of running a pipeline over a historical date range that it did not run for at the time. Backfills happen for three reasons: a bug in the pipeline produced wrong data and the corrected pipeline must reprocess the affected dates; a new column was added and the historical data must be regenerated to fill it; a new pipeline is launched and needs initial history. In production, backfills are run constantly, and a system that does not treat them as a first-class operation p

  3. SLAs at the Orchestrator Level (concepts: paMonitoring)

    An SLA, in orchestration terms, is a commitment that a DAG (or an asset) finishes by a stated time. The marketing dashboard SLA might be 'mart.daily_revenue is fresh for the previous day by 6am Pacific'. The SLA is not a hope. It is a configured guarantee that the orchestrator monitors and alerts on when missed. Declaring SLAs at the orchestrator level rather than in a separate runbook turns a soft expectation into an enforced contract, and it changes the on-call response from 'someone noticed l

  4. Concurrency, Pools, and Priority (concepts: paDagOrchestration)

    Every shared system has finite capacity. A Snowflake warehouse has slot limits. A Spark cluster has executor limits. A Postgres replica has connection limits. An orchestrator that submits work without regard for those limits will, eventually, melt the system it depends on. Concurrency, pools, and priority are the three controls that let the orchestrator submit work in a way the downstream system can absorb. They are simple knobs that prevent the most expensive class of incident in mature deploym

  5. Postmortem: Cadence Change Bug (concepts: paDagOrchestration)

    A real production incident, redacted from a fintech postmortem. A daily DAG named `daily_finance_close` had been running cleanly for fourteen months. In month fifteen, it began missing its 6am Pacific SLA twice a week. Nothing in the DAG had changed. The investigation revealed that an upstream Stripe ingestion DAG had been quietly migrated from a 30-minute cadence to a 5-minute cadence three weeks earlier. The change was a clear improvement upstream. It broke the downstream because of every assu