Orchestration and Dependencies: Advanced

A staff data engineer at a public marketplace inherited a 220-DAG Airflow deployment that was missing its 6am SLA roughly twice a week. The investigation turned up four separate problems coexisting in one symptom: a backfill job from a year ago was running every night and competing for warehouse slots, the SLA itself was declared on the wrong DAG, the priority weight on the executive marts was the same as on the experimental ML jobs, and a recent migration to asset-based scheduling had not been propagated to half the deployment. Fixing one without the others did nothing. Each one alone was a senior-engineer concern; together they formed the operational reality of large orchestration deployments. This lesson is about the four levers that a senior engineer pulls when single-DAG fluency is no longer enough: assets, backfills, SLAs, and concurrency.

What you will be able to do

Distinguish asset-based and task-based orchestration models and the operational consequences of each

Design backfills as a first-class operation, including idempotency requirements and parameterization

Apply orchestrator-level SLAs and concurrency controls so deadlines are met under contention

Asset vs Task Orchestration

Daily Life

Interviews

Distinguish asset-based and task-based orchestration models, explain the operational consequences, and pick a model at the architectural level.

Two philosophies compete in modern orchestration. Task-based orchestration, exemplified by Airflow, treats the work as the primary object: define tasks, declare dependencies between them, and trust that the data will follow. Asset-based orchestration, exemplified by Dagster, treats the data as the primary object: declare the assets the pipeline produces, declare the dependencies between assets, and let the orchestrator infer the work. The two models produce equivalent pipelines on simple cases. They diverge sharply on lineage, observability, partial refreshes, and the operability of large deployments.

Task-Based: Work Is the Object

Airflow's model places tasks at the center. A task has a definition, a retry policy, an upstream dependency, and a downstream dependency. Tasks happen to read and write data, but the orchestrator does not require that fact to be declared. The graph is over tasks, not over the data they produce. The benefit is flexibility: any task can call any other system, and the orchestrator does not need to understand what the task does. The cost is that lineage is implicit: an engineer must read the task code to know which tables it touches, and the orchestrator cannot answer questions like 'which tasks contributed to fct_orders'.

Asset-Based: Data Is the Object

Dagster's model places assets at the center. An asset is a named piece of data (a table, a partition, a feature) and is declared as the output of a function. The orchestrator builds the dependency graph from the asset declarations: if asset B is declared as derived from asset A, the orchestrator infers that B's compute depends on A. Tasks still exist as the underlying execution, but the surface API is the asset graph. The benefit is that lineage is explicit: every asset has a declared origin, and the orchestrator can answer structural questions about the data. The cost is the up-front modeling discipline; engineers must declare assets rather than write tasks that happen to write data.

Property	Task-Based (Airflow)	Asset-Based (Dagster)
Primary object	Task: a unit of work	Asset: a unit of data
Dependency model	Task -> Task; data follows implicitly	Asset -> Asset; tasks are inferred
Lineage	Implicit; reconstructed from code reading	Explicit; queryable from the orchestrator metadata
Partial refresh	Re-run a task; the data effects are a side effect	Refresh an asset; the orchestrator computes what to recompute
Backfill semantics	Re-run task instances over a date range	Materialize asset partitions over a date range
Cross-DAG / asset support	Datasets and ExternalTaskSensor; later addition	Native; assets are the cross-DAG primitive from day one

What Changes Operationally

An asset-based deployment lets an operator answer 'which assets are stale right now and why'. The orchestrator already knows the lineage, the freshness policy on each asset, and the producer task. A task-based deployment can answer the same question only by external lineage tooling (OpenLineage, Marquez, dbt's lineage) bolted onto the orchestrator. Both answers are achievable; the asset model gets there with less coordination because it never lost the connection between work and data.

	# Task-based (Airflow): the data is implicit
	@task
	def compute_fct_orders():
	sql = 'INSERT INTO fct_orders SELECT ...'
	snowflake.execute(sql)

	@task
	def compute_mart_revenue():
	sql = 'INSERT INTO mart_revenue SELECT ... FROM fct_orders'
	snowflake.execute(sql)

	compute_fct_orders() >> compute_mart_revenue() # the data dependency is in the engineer's head

	# Asset-based (Dagster): the data is the API
	@asset
	def fct_orders():
	return snowflake_load('SELECT ... FROM raw.orders')

	@asset(deps=[fct_orders])
	def mart_revenue(fct_orders):
	return snowflake_load('SELECT ... FROM fct_orders')
	# The orchestrator now knows mart_revenue is derived from fct_orders

Where Each Model Fits

•Task-Based Fits When

Existing Airflow deployment with hundreds of DAGs
Workloads include heavy non-data orchestration (alerts, ML jobs, cross-system glue)
Tasks span many systems where the data shape is hard to model uniformly
The team is more comfortable thinking in tasks than in assets

✓Asset-Based Fits When

Lineage and data observability are first-order requirements
Many partitioned tables that need partial refreshes and backfills
Large data graph where task count would explode if modeled per task
New build with no orchestrator legacy

The Hybrid Reality

Airflow has been growing asset-aware features for years (Datasets, OpenLineage integration). Dagster supports task-style imperative work alongside its asset model. The two models are converging in the middle, and most large deployments end up hybrid in practice: a core asset graph for the data layer, plus task-style operators for non-data work like notifications, infrastructure jobs, and external API calls. The rigid 'pick one' framing is rare in production. The framing that matters is which model is the default at the seam where new work gets added.

Symptoms that the wrong model is in use:

▸Lineage tooling has to be bolted on with custom code; the orchestrator does not know which task touches which table
▸Asset refreshes require running entire DAGs even when only one table is stale
▸Partial backfills are expressed as ad-hoc CLI scripts rather than first-class operations
▸Cross-team data dependencies are wired with sensors that nobody fully trusts

Task-BasedAsset-BasedHybrid

Task-Based

Work-first model (Airflow)

Tasks are the primary object. Flexible, broadly compatible, lineage requires external tooling. The default of most existing enterprise deployments.

Asset-Based

Data-first model (Dagster)

Assets are the primary object. Lineage is explicit, partial refreshes are native, the asset graph mirrors the data graph. The default of most new builds focused on data observability.

Hybrid

Convergence in production

Asset graph for data work plus task operators for non-data glue. The pragmatic shape most large deployments end up in regardless of which orchestrator was chosen.

✓Do

Model the assets explicitly when the orchestrator supports it; lineage compounds in value
Mix task-style and asset-style where each fits, with the asset model as the default for data work
Pick the model at the architectural level, not per-DAG; consistency reduces operator load

✗Don't

Bolt lineage tooling onto a task-based deployment as an afterthought; design it in
Force every job into the asset model; some work (alerts, infra) is genuinely task-shaped
Migrate from one model to the other without a phased plan; partial migrations are worse than either pure model

Task-based and asset-based models converge in practice; the choice is which is the default at the seam where new work is added.

Asset-based deployments answer 'which assets are stale and why' from orchestrator metadata; task-based deployments answer the same question only with bolted-on lineage tooling.

Partial migrations between the two models are worse than either pure model; plan the phased rollout before starting.

TIP

When building a new deployment, choose asset-based as the default and accept task-style operators where they fit. When inheriting a task-based deployment, introduce assets at the seams (Datasets, OpenLineage) before attempting a full migration.

Backfills as First-Class Operations

Daily Life

Interviews

Design pipelines that are backfillable: parameterize on date range, write only the target partition, and use orchestrator-managed backfill operations with appropriate concurrency.

A backfill is the operation of running a pipeline over a historical date range that it did not run for at the time. Backfills happen for three reasons: a bug in the pipeline produced wrong data and the corrected pipeline must reprocess the affected dates; a new column was added and the historical data must be regenerated to fill it; a new pipeline is launched and needs initial history. In production, backfills are run constantly, and a system that does not treat them as a first-class operation produces a different bug every time.

What Makes Backfills Different From Regular Runs

A regular run processes one logical date: today, or the most recent hour. A backfill processes many logical dates, sometimes years of them. The same task is invoked once per logical date, with different parameters, and the orchestrator must coordinate the parallelism so the warehouse does not collapse under thousands of simultaneous queries. The task itself must accept a date as a parameter; if the task hardcodes 'yesterday', it cannot be backfilled. This requirement is the entire reason idempotent task design matters.

Property	Idempotent Task	Non-Idempotent Task
Run twice	Same result as running once	Different result; possibly duplicates or corruption
Partition strategy	Overwrites the target partition for the run date	Appends; duplicates accumulate
Time references	Reads the run date as a parameter	Reads NOW(); each run sees a different time
Backfill safety	Backfilling a year is safe and resumable	Backfilling a year produces a corrupted year

Parameterizing on Date Range

Every backfillable task takes the run date (or a date range) as input and reads or writes only the partitions for that range. The orchestrator passes the parameter explicitly: Airflow exposes the logical date as a Jinja template variable; Dagster passes a partition key into the asset function; Prefect parameterizes the flow run. Tasks that hardcode date logic inside the function body break backfills silently. The most common silent break is a SQL query that filters on `DATE(NOW()) - 1` instead of the parameterized run date; the query runs successfully on the backfill but reads today's data, producing rows for every backfill date that look identical.

	# Backfillable task: run_date is a parameter
	@task
	def aggregate_orders(run_date: str):
	sql = f''' DELETE FROM mart.daily_orders WHERE order_date = '{run_date}'; INSERT INTO mart.daily_orders SELECT '{run_date}' AS order_date, COUNT(*) AS order_count FROM raw.orders WHERE DATE(order_timestamp) = '{run_date}'; '''
	snowflake.execute(sql)

	# Non-backfillable variant: hardcoded date logic that does not accept the run date
	@task
	def aggregate_orders_broken():
	sql = ''' INSERT INTO mart.daily_orders SELECT CURRENT_DATE - 1 AS order_date, COUNT(*) AS order_count FROM raw.orders WHERE DATE(order_timestamp) = CURRENT_DATE - 1; '''
	snowflake.execute(sql)
	# The broken version writes today's data with yesterday's date label, every time it runs

Backfills as Orchestrator Operations

Modern orchestrators expose backfills as a first-class CLI and UI operation. In Airflow, the `airflow dags backfill` command takes a start date, an end date, and a DAG ID, then runs the DAG for every date in the range. In Dagster, the asset partition UI lets an operator select a range of partitions and click 'materialize'. The orchestrator handles the parallelism, the rate of submission, and the failure isolation. The engineer does not write a custom loop. This is the difference between backfills as a first-class operation and backfills as an ad-hoc script.

Backfill anti-patterns that show up in production:

▸Custom Python script that loops over dates and runs queries directly, bypassing the orchestrator
▸Tasks that hardcode CURRENT_DATE or NOW() and silently produce wrong dates on backfill
▸Backfills that re-run successful runs and overwrite manual fixes
▸Backfills with no concurrency limit that submit a year of queries at once and stall the warehouse

The Concurrency Knob on Backfills

A backfill of one year is 365 invocations of the same task. Submitting them simultaneously will overwhelm any warehouse. The orchestrator exposes a concurrency cap on backfills: how many runs of the DAG can be in flight at the same time. A typical setting is between four and sixteen, tuned to the warehouse's compute envelope. The right setting balances time-to-completion against contention with normal production runs. A backfill that runs for two days at concurrency four is usually preferable to a backfill that finishes in two hours at concurrency 100 and breaks every other pipeline running on the same warehouse.

	# Simulate a backfill with concurrency control
	# Compare unbounded submission against capped concurrency

	from collections import deque

	class BackfillSimulator:
	def __init__(self, dates, concurrency, runtime_per_date_min=10, warehouse_capacity=8):
	self.dates = dates
	self.concurrency = concurrency
	self.runtime = runtime_per_date_min
	self.capacity = warehouse_capacity

	def run(self):
	pending = deque(self.dates)
	in_flight = []
	clock_min = 0
	slowdowns = 0
	while pending or in_flight:
	while pending and len(in_flight) < self.concurrency:
	in_flight.append(self.runtime)
	# If the warehouse capacity is exceeded, every task slows down
	slow_factor = max(1, len(in_flight) / self.capacity)
	if slow_factor > 1:
	slowdowns += 1
	in_flight = [t - 1 / slow_factor for t in in_flight]
	in_flight = [t for t in in_flight if t > 0]
	clock_min += 1
	return clock_min, slowdowns

	dates = list(range(60)) # backfill 60 days

	for c in (4, 16, 60):
	minutes, slow = BackfillSimulator(dates, concurrency=c).run()
	print(f'concurrency={c:>3}: {minutes} minutes elapsed, {slow} steps with warehouse contention')

When Backfills Are Not the Right Answer

Backfilling years of history at full fidelity is sometimes the wrong call. If only the last 90 days are queried, regenerating five years of partitions costs warehouse time that no consumer ever uses. A targeted backfill of the affected range is cheaper and faster. The discipline is to ask 'what range actually needs to be regenerated' before passing a start date of 2020-01-01 to the backfill command. A backfill is a tool, not a default.

•Naive Backfill

Custom script loops over dates and runs SQL directly
No concurrency cap; submits everything at once
Tasks hardcode dates internally and produce wrong outputs
No tracking of which dates succeeded; partial failure leaves a mystery

✓First-Class Backfill

Orchestrator exposes a backfill CLI or UI command
Concurrency cap matches warehouse capacity
Tasks accept run_date as a parameter and write only their partition
Each date is a tracked task instance; partial failure is observable

Idempotency is the prerequisite for safe backfills. Without it, every backfill is a different kind of corruption.

TIP

Before triggering a backfill, confirm that every task in the DAG accepts the run date as a parameter and writes only that partition. If any task hardcodes the date or appends rather than overwrites, the backfill will silently corrupt history.

extract

transform

load

Storage

warehouse

An orchestration DAG: tasks are nodes, dependencies are edges, and there are no cycles. The orchestrator runs each task only after its upstream finishes, retries failures, and backfills - things cron cannot do.

SLAs at the Orchestrator Level

Daily Life

Interviews

Declare orchestrator-level SLAs on DAGs and assets, tune the thresholds against historical runtime distributions, and route misses to on-call with actionable context.

An SLA, in orchestration terms, is a commitment that a DAG (or an asset) finishes by a stated time. The marketing dashboard SLA might be 'mart.daily_revenue is fresh for the previous day by 6am Pacific'. The SLA is not a hope. It is a configured guarantee that the orchestrator monitors and alerts on when missed. Declaring SLAs at the orchestrator level rather than in a separate runbook turns a soft expectation into an enforced contract, and it changes the on-call response from 'someone noticed late' to 'the orchestrator paged at 6:01am'.

What an Orchestrator-Level SLA Specifies

Element	Meaning	Example
Target	The DAG, task, or asset the SLA applies to	mart.daily_revenue_by_account
Deadline	Wall-clock time by which the work must complete	06:00 America/Los_Angeles
Frequency	How often the deadline applies	Every day for the previous day's logical date
Response	What the orchestrator does if the deadline passes	Page on-call, post to #data-incidents, mark run as SLA-missed
Recovery expectation	How quickly a missed SLA must be resolved	Within 60 minutes of the breach

How Airflow Declares SLAs

Airflow's SLA mechanism attaches a deadline to a task. If the task does not finish within `sla` time of its expected start, Airflow fires an SLA miss event that can be routed to alerts, Slack, or PagerDuty. The mechanism is task-level by default, which is a known weakness: an SLA on the final publishing task is correct conceptually but does not catch failures earlier in the DAG. Production teams typically also configure DAG-level monitoring through external tools (Datadog, Monte Carlo, Bigeye) to bridge the gap. Newer Airflow features (Datasets with deadlines, the deadline feature added in 2.x) move toward asset-level deadlines.

	# Airflow task-level SLA
	from datetime import timedelta

	publish = PythonOperator(
	task_id='publish_mart_revenue',
	python_callable=publish_to_mart,
	sla=timedelta(hours=4), # task must complete within 4 hours of scheduled start
	)

	# DAG-level: route SLA misses to a callback
	def sla_miss_callback(dag, task_list, blocking_task_list, slas, blocking_tis):
	pagerduty.fire('Marketing revenue SLA missed', tasks=[s.task_id for s in slas])

	with DAG(
	'daily_revenue_by_account',
	schedule='0 5 * * *',
	sla_miss_callback=sla_miss_callback,
	...
	) as dag:
	...

How Dagster Declares Freshness Policies

Dagster's asset-first model declares SLAs as freshness policies on assets. An asset can specify 'this asset must be no more than 4 hours stale by 6am every day'. The orchestrator monitors actual asset freshness and fires alerts when an asset is overdue. Because the policy is on the asset, it survives changes to which DAG produces the asset; refactoring the producer DAG does not require updating the SLA. This is one of the operational advantages of asset-based scheduling that becomes visible only when SLAs are declared.

	# Dagster freshness policy on the asset, not the task
	from dagster import asset, FreshnessPolicy

	@asset(
	freshness_policy=FreshnessPolicy(
	maximum_lag_minutes=240, # asset must be at most 4h old
	cron_schedule='0 6 * * *', # by 6am daily
	cron_schedule_timezone='America/Los_Angeles',
	),
	)
	def mart_daily_revenue_by_account():
	return run_daily_revenue_query()

SLA failure modes that point at deeper problems:

▸SLA missed every Monday because Sunday's source had unusual volume that the schedule did not anticipate
▸SLA technically met but data is wrong because a task succeeded on stale upstream
▸SLA on the final task fires, but the actual failure was three tasks earlier and went unalerted
▸SLA threshold is too tight; alerts fire on normal slow runs and on-call ignores them
▸SLA threshold is too loose; misses are flagged hours after consumers already noticed

SLA Tuning

Setting an SLA is a tradeoff. Too tight, and every slow but successful run pages on-call, training the team to ignore alerts. Too loose, and missed SLAs are detected only after a consumer complains. The right setting is the worst-case acceptable runtime under normal conditions, plus a small buffer. A pipeline whose 95th-percentile runtime is 3 hours and whose worst observed runtime is 4.5 hours might set an SLA of 5 hours. The setting should be revisited quarterly as runtime distributions drift.

•Tightly Tuned SLA

Detects misses promptly when they happen
False positives on normal slow runs train alert fatigue
Pages on-call for transient slowness that resolves naturally
Tightening below the runtime distribution produces noise, not signal

•Loosely Tuned SLA

Misses caught only after consumers complain
Real outages get diagnosed late
Alert fatigue is low; trust in the alerts is high when they fire
Loosening above the runtime distribution produces silence, not safety

SLA Observability

An SLA without a dashboard is a setting that nobody calibrates. Production teams ship a dashboard that shows the historical distribution of runtimes for every SLA-bound DAG, the current SLA threshold, the count of misses in the last 30 days, and the trend. The dashboard is what lets the team tune the threshold over time as runtime distributions shift. SLA settings that go untouched for years drift out of date as the underlying work changes; observability is the antidote.

✓Do

Declare SLAs at the orchestrator level, not in a separate runbook
Tune SLAs to the 95th-percentile runtime plus a known buffer; revisit quarterly
Publish a dashboard of historical SLA performance for every bound pipeline
Route SLA misses to on-call with enough context to start investigation immediately

✗Don't

Set SLAs to the average runtime; half of normal runs will breach by definition
Page on every SLA event without distinguishing real misses from edge cases
Place the SLA on the final task only; failures earlier in the DAG go unalerted
Leave SLA settings unchanged for years; runtime distributions drift

TIP

Treat SLAs as living settings, not one-time configurations. The best-tuned SLAs are reviewed quarterly against the actual runtime distribution and adjusted before alert fatigue or detection lag sets in.

Concurrency, Pools, and Priority

Daily Life

Interviews

Configure DAG concurrency, named pools, and task priority so simultaneous workloads cooperate rather than compete on a shared warehouse.

Every shared system has finite capacity. A Snowflake warehouse has slot limits. A Spark cluster has executor limits. A Postgres replica has connection limits. An orchestrator that submits work without regard for those limits will, eventually, melt the system it depends on. Concurrency, pools, and priority are the three controls that let the orchestrator submit work in a way the downstream system can absorb. They are simple knobs that prevent the most expensive class of incident in mature deployments.

Three Levers, Three Scopes

Control	Scope	What It Caps
Concurrency	Per DAG or per task	How many instances of this DAG or task can run simultaneously
Pool	Shared across many DAGs	Total parallel slots available to a named resource (e.g., 'snowflake_xl')
Priority	Per task instance	Order in which tasks are scheduled when capacity is contended

ConcurrencyPoolPriority

Concurrency

Single-DAG cap

How many tasks within one DAG run at once, and how many runs of the same DAG can be in flight. Prevents fanout saturation and run stacking.

Pool

Cross-DAG resource budget

A named slot budget shared across DAGs. Caps total load on a downstream resource (Snowflake XL, Spark, Postgres) to a known number.

Priority

Queue ordering

Per-task or per-DAG weight that decides who runs next when the pool is full. Customer-facing work outranks experimental work.

Concurrency: The Single-DAG Cap

DAG concurrency caps how many tasks within a single DAG run at the same time, and DAG-run concurrency caps how many runs of the same DAG can be in flight. The first prevents a fanned-out DAG with 200 parallel tasks from saturating workers. The second prevents a slow run from stacking on top of itself: if today's daily run is still going at midnight and yesterday's never finished, the orchestrator should not start tomorrow's. Both caps are usually set per DAG, and the right values depend on the work's compute profile.

	with DAG(
	'fanout_etl',
	schedule='@daily',
	max_active_runs=1, # only one run in flight at a time
	concurrency=8, # at most 8 tasks running simultaneously within this DAG
	...
	) as dag:
	...

Pools: The Cross-DAG Cap

A pool is a named resource budget shared across DAGs. The 'snowflake_xl' pool might have 16 slots; any task that runs against the XL warehouse takes one slot for the duration. If 20 tasks across 5 DAGs all want the XL warehouse simultaneously, 16 run and 4 wait. The pool prevents a fanout in DAG A from starving DAG B, and it caps the total load on the warehouse to a known number. Pools are the most underused feature in many production deployments because they require explicit modeling of which tasks share which resources.

	# Tasks in different DAGs share a Snowflake XL pool
	# Pool 'snowflake_xl' has 16 slots configured globally

	# In dag_marketing:
	load_marketing = SnowflakeOperator(
	task_id='load_marketing_aggregations',
	sql='...',
	pool='snowflake_xl',
	pool_slots=1,
	)

	# In dag_finance:
	load_finance = SnowflakeOperator(
	task_id='load_finance_close',
	sql='...',
	pool='snowflake_xl',
	pool_slots=2, # heavier task takes 2 slots
	)
	# Together, the two DAGs cannot exceed 16 simultaneous slots in the pool

Priority: Who Goes First Under Contention

When the pool is full and tasks are waiting, the orchestrator picks one to run next. Priority weight controls that pick. A task with priority 10 runs before a task with priority 1 when both are waiting on the same pool. Priority is per-task or per-DAG depending on the orchestrator. The right pattern is: customer-facing pipelines (the executive dashboard, the customer email batch) are priority 10, regular production pipelines are 5, experimental and ad-hoc work is 1. Without priority tuning, contention resolves arbitrarily, which means the experimental backfill can starve the customer email run.

Priority decisions that pay off:

▸Customer-facing pipelines outrank internal analytics on the same warehouse
▸Backfills run at lower priority so they yield to live production runs
▸Ad-hoc and experimental work runs in a separate pool with lower priority
▸SLA-bound DAGs get a priority bump so they win contention against unbounded ones

When the Warehouse Melts

The classic incident is a Monday morning when a backfill, a normal daily run, and an experimental ML feature build all hit Snowflake at the same time. Without pools and priority, the warehouse queues every query, every query slows by 5x, and the daily SLA misses. With pools, the backfill is capped at 2 slots, the daily run is capped at 8, and the experimental work is in a different pool entirely. With priority, the customer-facing DAG runs first and finishes inside its SLA while the backfill stretches longer. The cost is one afternoon of pool configuration; the saving is every Monday that does not become an incident.

•No Pools or Priority

Every task competes equally for warehouse slots
A fanout backfill starves the daily executive DAG
Contention resolves arbitrarily; SLA misses are unpredictable
On-call cannot tell which workload caused the slowdown

✓Pools and Priority Tuned

Pools partition resources by class (XL, ML, ad-hoc)
Backfills wait for live production work to finish
Priority routes the most important work first under contention
On-call sees pool utilization in the orchestrator UI

	# Simulate three workloads competing on a 16-slot pool
	# With and without priority

	from heapq import heappush, heappop

	class PoolScheduler:
	def __init__(self, slots, use_priority):
	self.slots = slots
	self.use_priority = use_priority
	self.queue = []
	self.running = []

	def submit(self, name, priority, duration):
	# Higher priority = run first; negate for min-heap
	order_key = -priority if self.use_priority else 0
	heappush(self.queue, (order_key, name, duration))

	def run_to_completion(self):
	clock = 0
	finish_times = {}
	while self.queue or self.running:
	while self.queue and len(self.running) < self.slots:
	key, name, duration = heappop(self.queue)
	self.running.append((clock + duration, name))
	self.running.sort()
	finish_clock, name = self.running.pop(0)
	clock = finish_clock
	finish_times[name] = clock
	self.running = [(t, n) for t, n in self.running if t > clock]
	return finish_times

	for priority in (False, True):
	sched = PoolScheduler(slots=4, use_priority=priority)
	# 8 backfill tasks, low priority, run for 30 min each
	for i in range(8):
	sched.submit(f'backfill_{i}', priority=1, duration=30)
	# 1 customer dashboard task, high priority, runs for 10 min
	sched.submit('customer_dashboard', priority=10, duration=10)
	finish = sched.run_to_completion()
	label = 'with priority' if priority else 'no priority'
	print(f'{label}: customer_dashboard finishes at minute {finish["customer_dashboard"]}')

TIP

Treat the orchestrator's pool configuration as part of the warehouse's architecture, not as a one-time setup. Pool sizes, priorities, and SLA thresholds should be reviewed alongside warehouse credit usage every quarter.

Postmortem: Cadence Change Bug

Daily Life

Interviews

Diagnose a multi-cause orchestration failure by separating the seam, the freshness contract, the pool, and the SLA, and design a redesign that addresses each layer.

A real production incident, redacted from a fintech postmortem. A daily DAG named `daily_finance_close` had been running cleanly for fourteen months. In month fifteen, it began missing its 6am Pacific SLA twice a week. Nothing in the DAG had changed. The investigation revealed that an upstream Stripe ingestion DAG had been quietly migrated from a 30-minute cadence to a 5-minute cadence three weeks earlier. The change was a clear improvement upstream. It broke the downstream because of every assumption the original architecture had encoded.

The Original Architecture

	Upstream : Downstream : stripe_to_raw schedule : * / 30 daily_finance_close schedule : 0 5 * * *(every 30 MIN, FULL pull) start_at_5am outlets : raw.payments read raw.payments since 00 : 00
	JOIN, aggregate publish mart.daily_close

	CROSS - DAG mechanism : TIME
	OFFSET(downstream starts AT 5 am, upstream finishes BY 4 : 50 am)

What Changed Upstream and Why It Started Failing

The upstream team migrated `stripe_to_raw` from 30-minute full pulls to 5-minute incremental pulls. The migration was a clear improvement: smaller batches, lower latency, less load on the warehouse. The team announced it in their standup and merged the PR. No one downstream caught the announcement, because the cross-DAG dependency was invisible: the downstream's schedule did not reference the upstream at all.

The new 5-minute cadence created a small window of inconsistency every five minutes during which raw.payments was mid-write. The 5am downstream read sometimes caught a partial batch and other times caught a complete batch, depending on the exact second the join started. On clean reads, the daily_finance_close finished by 5:45am. On partial reads, the join produced rows with NULLs in the new payment columns, the row count threshold check triggered a row deletion and rewrite, the rewrite waited on a warehouse slot, and the SLA missed by 20 to 60 minutes. The bug only showed on the days when the read happened to land mid-batch.

The Five Things Wrong With the Original

Issue	Why It Was Latent for 14 Months	Why It Surfaced After the Change
Time-offset cross-DAG dependency	30-min cadence had a 25-minute quiet window before the 5am read	5-min cadence eliminated the quiet window
No asset-level freshness contract	Implicit 'finishes by 4:50am' was reliable enough	Cadence change broke the implicit contract; nobody noticed
No transactional read isolation	Source rarely caught mid-write	5-min cadence made mid-write reads frequent
Single-shared pool with no priority	Spare capacity absorbed slow runs	Slow runs collided with backfills competing for the same pool
SLA on the final task only	Final-task SLA caught the symptom	Detection at SLA breach is too late; root cause is upstream

The Redesign

The fix touched four of the five issues. The cross-DAG edge moved from time-offset to asset-trigger: daily_finance_close now starts when raw.payments is fresh for the previous day's logical date, not when the clock hits 5am. The freshness contract on raw.payments was declared explicitly: 'fresh for the previous calendar day by 4:30am Pacific'. The Stripe ingestion DAG was modified to write a `_partition_complete` marker when the day's data finished landing, and the downstream waits on the marker, not the raw rows. The downstream gained a higher priority than the backfill workloads on the same pool.

	Redesigned : stripe_to_raw schedule : * / 5(every 5 MIN, incremental pull) outlets : raw.payments(per - batch) separate marker task AT hour 4 writes : raw.payments_partition_complete date=YYYY-MM-DD daily_finance_close schedule : asset_trigger(payments_complete) read raw.payments

	WHERE DATE = previous_day
	JOIN, aggregate publish mart.daily_close Freshness policy

	ON raw.payments_partition_complete : must exist for previous day BY 04 : 30 America / Los_Angeles Freshness policy
	ON mart.daily_close : must exist for previous day BY 06 : 00 America / Los_Angeles Pool : finance_close AT priority 10 ; backfills AT priority 1

	/* The downstream DAG checks the partition_complete marker before reading raw.payments */
	/* This SQL returns the rows the daily_finance_close should process, gated on completeness */
	/* The marker table is small (one row per partition) and cheap to check */
	WITH partition_marker AS (
	SELECT
	'2026-04-24' AS partition_date,
	'complete' AS status,
	'2026-04-25 04:18:33' AS completed_at

	UNION ALL

	SELECT
	'2026-04-23',
	'complete',
	'2026-04-24 04:21:07'

	UNION ALL

	SELECT
	'2026-04-22',
	'complete',
	'2026-04-23 04:14:51'
	),
	raw_payments AS (
	SELECT
	'p1' AS payment_id,
	'a1' AS account_id,
	1500 AS amount_cents,
	DATE('2026-04-24') AS pay_date

	UNION ALL

	SELECT
	'p2',
	'a1',
	800,
	DATE('2026-04-24')

	UNION ALL

	SELECT
	'p3',
	'a2',
	2200,
	DATE('2026-04-24')
	)

	SELECT
	p.account_id,
	COUNT(*) AS payments,
	SUM(p.amount_cents) / 100 AS revenue_usd
	FROM raw_payments AS p
	INNER JOIN partition_marker AS m
	ON m.partition_date = CAST(
	p.pay_date
	AS VARCHAR
	)
	WHERE m.status = 'complete'
	AND p.pay_date = DATE('2026-04-24')
	GROUP BY p.account_id
	ORDER BY p.account_id

What the Redesign Costs

The cost was real. The Stripe team had to add the partition_complete marker task and accept a small additional cost on every run. The finance team had to migrate the daily_finance_close DAG from time-offset to asset-trigger and validate that the new behavior was correct. The platform team had to add a separate pool for finance_close work, configure priority, and document the rationale. Together the changes took about three weeks of engineering time. The recurring SLA misses, by contrast, were costing roughly six engineer-hours per week in firefighting plus the executive-team cost of a stale dashboard, so the fix paid for itself in eight weeks.

•Original (Time-Offset, Single Pool)

Downstream starts at 5am regardless of upstream state
Implicit freshness contract; cadence changes silently break it
All work shares one pool; backfills can starve customer-facing work
SLA on the final task; root causes stay invisible

✓Redesigned (Asset-Trigger, Pool + Priority)

Downstream starts when partition_complete marker is fresh
Explicit freshness policies on both upstream and downstream assets
Separate pool for finance work; priority 10 over backfills
SLA on the asset; misses surface at the actual layer where freshness slipped

Lessons from the postmortem:

▸An upstream cadence change is a contract change; it should be PR-reviewed by every downstream owner
▸Time-offset cross-DAG dependencies hide the seam; asset triggers make it visible
▸An idempotent producer with explicit partition-complete markers protects every downstream from mid-batch reads
▸Pools and priority are how customer-facing work survives backfill contention
▸An SLA on the final task is necessary but not sufficient; SLAs at the seam catch root causes earlier

Cadence changes upstream are silent breakers of time-offset cross-DAG dependencies.

Asset triggers, partition-complete markers, and pool-priority tuning together convert a fragile seam into a contract.

An SLA at the seam catches root causes earlier than an SLA on the final task; both are usually needed.

TIP

When a DAG starts failing without itself changing, look upstream for a cadence change. The failure mode is almost always at the seam between cadences, not inside the DAG that fires the alert.

❯❯❯PUTTING IT ALL TOGETHER

> A staff engineer at a public marketplace owns 220 production DAGs. The marketing executive dashboard misses its 6am SLA twice a week. Investigation reveals four overlapping issues: a year-old backfill is still running nightly and competing for warehouse slots, the SLA is declared on the final task instead of the producing asset, customer-facing DAGs share priority with experimental ML jobs, and a recent partial migration to asset-based scheduling left half the deployment on time-offset cross-DAG edges. The CTO asks: 'What is the smallest set of changes that fixes the 6am SLA without breaking the rest?'

Stop the year-old backfill. From Lesson 5 (idempotency and backfill), every backfill should be a bounded operation with a clear end. A nightly recurring backfill is a sign that the original pipeline is not idempotent and the workaround has calcified into a permanent cost.

Move SLAs from final tasks to assets. The marketing dashboard's SLA belongs on the mart asset; the producing DAG's tasks already have SLAs that catch internal slowness. The asset-level SLA catches misses that final-task SLAs miss when the failure is in the seam.

Split the warehouse pool by workload class: customer_facing (priority 10), regular_production (priority 5), experimental (priority 1, separate pool). Backfills run in the experimental pool. Customer-facing DAGs win contention without starving the rest.

Complete the asset-based migration on the seams that touch SLA-bound assets first. Time-offset cross-DAG edges remain the highest-risk anti-pattern (Lesson 4 intermediate showed why). Asset triggers replace them and stabilize the schedule.

Apply the four-role and undercurrents framing from Lesson 1. Orchestration is the undercurrent that runs through every node; the redesign is the orchestration layer catching up to the architecture the rest of the pipeline already shows.

Treat the change as a pipeline-as-product rollout (Lesson 1 advanced): every affected DAG gets a contract update, a deprecation path for the old cross-DAG edges, and an explicit owner. The shift is operational discipline, not a single PR.

KEY TAKEAWAYS

Asset-based and task-based orchestration are different defaults: task-based fits existing Airflow deployments; asset-based fits new builds where lineage and partial refreshes matter. Most large deployments end up hybrid.

Backfills demand idempotent task design: every task takes the run date as a parameter and writes only that partition. Without idempotency, a backfill is a fresh form of corruption.

SLAs declared at the orchestrator level are enforceable: the orchestrator detects the miss, routes the alert, and surfaces it on a tunable schedule. Loose runbooks do none of that.

Concurrency, pools, and priority prevent warehouse contention: concurrency caps a single DAG, pools share resources across DAGs, priority resolves the queue. The three together stop simultaneous workloads from melting the warehouse.

Cross-DAG cadence changes are silent contract changes: an upstream cadence shift breaks downstream time-offset dependencies on the day it ships. Asset triggers and explicit freshness contracts at the seam are the durable fix.

Assets, backfills, SLAs, and concurrency are the levers a senior engineer pulls

Category: Pipeline Architecture
Difficulty: advanced
Duration: 40 minutes
Challenges: 0 hands-on challenges

Topics covered: Asset vs Task Orchestration, Backfills as First-Class Operations, SLAs at the Orchestrator Level, Concurrency, Pools, and Priority, Postmortem: Cadence Change Bug

Lesson Sections

Asset vs Task Orchestration (concepts: paDagOrchestration)
Two philosophies compete in modern orchestration. Task-based orchestration, exemplified by Airflow, treats the work as the primary object: define tasks, declare dependencies between them, and trust that the data will follow. Asset-based orchestration, exemplified by Dagster, treats the data as the primary object: declare the assets the pipeline produces, declare the dependencies between assets, and let the orchestrator infer the work. The two models produce equivalent pipelines on simple cases.
Backfills as First-Class Operations (concepts: paBackfill)
A backfill is the operation of running a pipeline over a historical date range that it did not run for at the time. Backfills happen for three reasons: a bug in the pipeline produced wrong data and the corrected pipeline must reprocess the affected dates; a new column was added and the historical data must be regenerated to fill it; a new pipeline is launched and needs initial history. In production, backfills are run constantly, and a system that does not treat them as a first-class operation p
SLAs at the Orchestrator Level (concepts: paMonitoring)
An SLA, in orchestration terms, is a commitment that a DAG (or an asset) finishes by a stated time. The marketing dashboard SLA might be 'mart.daily_revenue is fresh for the previous day by 6am Pacific'. The SLA is not a hope. It is a configured guarantee that the orchestrator monitors and alerts on when missed. Declaring SLAs at the orchestrator level rather than in a separate runbook turns a soft expectation into an enforced contract, and it changes the on-call response from 'someone noticed l
Concurrency, Pools, and Priority (concepts: paDagOrchestration)
Every shared system has finite capacity. A Snowflake warehouse has slot limits. A Spark cluster has executor limits. A Postgres replica has connection limits. An orchestrator that submits work without regard for those limits will, eventually, melt the system it depends on. Concurrency, pools, and priority are the three controls that let the orchestrator submit work in a way the downstream system can absorb. They are simple knobs that prevent the most expensive class of incident in mature deploym
Postmortem: Cadence Change Bug (concepts: paDagOrchestration)
A real production incident, redacted from a fintech postmortem. A daily DAG named `daily_finance_close` had been running cleanly for fourteen months. In month fifteen, it began missing its 6am Pacific SLA twice a week. Nothing in the DAG had changed. The investigation revealed that an upstream Stripe ingestion DAG had been quietly migrated from a 30-minute cadence to a 5-minute cadence three weeks earlier. The change was a clear improvement upstream. It broke the downstream because of every assu