A staff data engineer at a public marketplace inherited a 220-DAG Airflow deployment that was missing its 6am SLA roughly twice a week. The investigation turned up four separate problems coexisting in one symptom: a backfill job from a year ago was running every night and competing for warehouse slots, the SLA itself was declared on the wrong DAG, the priority weight on the executive marts was the same as on the experimental ML jobs, and a recent migration to asset-based scheduling had not been propagated to half the deployment. Fixing one without the others did nothing. Each one alone was a senior-engineer concern; together they formed the operational reality of large orchestration deployments. This lesson is about the four levers that a senior engineer pulls when single-DAG fluency is no longer enough: assets, backfills, SLAs, and concurrency.
Asset vs Task Orchestration
Distinguish asset-based and task-based orchestration models, explain the operational consequences, and pick a model at the architectural level.
Two philosophies compete in modern orchestration. Task-based orchestration, exemplified by Airflow, treats the work as the primary object: define tasks, declare dependencies between them, and trust that the data will follow. Asset-based orchestration, exemplified by Dagster, treats the data as the primary object: declare the assets the pipeline produces, declare the dependencies between assets, and let the orchestrator infer the work. The two models produce equivalent pipelines on simple cases. They diverge sharply on lineage, observability, partial refreshes, and the operability of large deployments.
Task-Based: Work Is the Object
Airflow's model places tasks at the center. A task has a definition, a retry policy, an upstream dependency, and a downstream dependency. Tasks happen to read and write data, but the orchestrator does not require that fact to be declared. The graph is over tasks, not over the data they produce. The benefit is flexibility: any task can call any other system, and the orchestrator does not need to understand what the task does. The cost is that lineage is implicit: an engineer must read the task code to know which tables it touches, and the orchestrator cannot answer questions like 'which tasks contributed to fct_orders'.
Asset-Based: Data Is the Object
Dagster's model places assets at the center. An asset is a named piece of data (a table, a partition, a feature) and is declared as the output of a function. The orchestrator builds the dependency graph from the asset declarations: if asset B is declared as derived from asset A, the orchestrator infers that B's compute depends on A. Tasks still exist as the underlying execution, but the surface API is the asset graph. The benefit is that lineage is explicit: every asset has a declared origin, and the orchestrator can answer structural questions about the data. The cost is the up-front modeling discipline; engineers must declare assets rather than write tasks that happen to write data.
| Property | Task-Based (Airflow) | Asset-Based (Dagster) |
| --- | --- | --- |
| Primary object | Task: a unit of work | Asset: a unit of data |
| Dependency model | Task -> Task; data follows implicitly | Asset -> Asset; tasks are inferred |
| Lineage | Implicit; reconstructed from code reading | Explicit; queryable from the orchestrator metadata |
| Partial refresh | Re-run a task; the data effects are a side effect | Refresh an asset; the orchestrator computes what to recompute |
| Backfill semantics | Re-run task instances over a date range | Materialize asset partitions over a date range |
| Cross-DAG / asset support | Datasets and ExternalTaskSensor; later addition | Native; assets are the cross-DAG primitive from day one |
What Changes Operationally
An asset-based deployment lets an operator answer 'which assets are stale right now and why'. The orchestrator already knows the lineage, the freshness policy on each asset, and the producer task. A task-based deployment can answer the same question only by external lineage tooling (OpenLineage, Marquez, dbt's lineage) bolted onto the orchestrator. Both answers are achievable; the asset model gets there with less coordination because it never lost the connection between work and data.
# Task-based (Airflow): the data is implicit
# `snowflake` and `snowflake_load` are assumed to be project-level helpers
from airflow.decorators import task

@task
def compute_fct_orders():
    sql = 'INSERT INTO fct_orders SELECT ...'
    snowflake.execute(sql)

@task
def compute_mart_revenue():
    sql = 'INSERT INTO mart_revenue SELECT ... FROM fct_orders'
    snowflake.execute(sql)

# Inside a DAG definition:
compute_fct_orders() >> compute_mart_revenue()  # the data dependency is in the engineer's head

# Asset-based (Dagster): the data is the API
from dagster import asset

@asset
def fct_orders():
    return snowflake_load('SELECT ... FROM raw.orders')

# Declare the dependency once. `deps` is for upstream assets that are read
# from the warehouse rather than passed in as a function argument.
@asset(deps=[fct_orders])
def mart_revenue():
    return snowflake_load('SELECT ... FROM fct_orders')
# The orchestrator now knows mart_revenue is derived from fct_orders
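To make the lineage point concrete, here is a minimal sketch of the kind of question an asset graph can answer once dependencies are declared. It is plain Python, not an orchestrator API; the `ASSET_DEPS` mapping and the `mart_customer_ltv` asset are hypothetical names added for illustration.

```python
from collections import deque

# Hypothetical asset graph: each asset maps to the assets it is derived from.
ASSET_DEPS = {
    "raw.orders": [],
    "fct_orders": ["raw.orders"],
    "mart_revenue": ["fct_orders"],
    "mart_customer_ltv": ["fct_orders"],  # illustrative asset, not in the lesson
}

def upstream_of(asset, deps=ASSET_DEPS):
    """Every asset that contributes to `asset`, directly or transitively."""
    seen, queue = set(), deque(deps[asset])
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(deps[current])
    return seen

def downstream_of(asset, deps=ASSET_DEPS):
    """Every asset that would need recomputing if `asset` changed."""
    return {name for name in deps if asset in upstream_of(name, deps)}

print(sorted(upstream_of("mart_revenue")))
print(sorted(downstream_of("fct_orders")))
```

In a task-based deployment the equivalent of `ASSET_DEPS` has to be reconstructed by reading task code or by external lineage tooling; in an asset-based deployment it is the declared graph itself.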
Where Each Model Fits
•Task-Based Fits When
Existing Airflow deployment with hundreds of DAGs
Workloads include heavy non-data orchestration (alerts, ML jobs, cross-system glue)
Tasks span many systems where the data shape is hard to model uniformly
The team is more comfortable thinking in tasks than in assets
✓Asset-Based Fits When
Lineage and data observability are first-order requirements
Many partitioned tables that need partial refreshes and backfills
Large data graph where task count would explode if modeled per task
New build with no orchestrator legacy
The Hybrid Reality
Airflow has been growing asset-aware features for years (Datasets, OpenLineage integration). Dagster supports task-style imperative work alongside its asset model. The two models are converging in the middle, and most large deployments end up hybrid in practice: a core asset graph for the data layer, plus task-style operators for non-data work like notifications, infrastructure jobs, and external API calls. The rigid 'pick one' framing is rare in production. The framing that matters is which model is the default at the seam where new work gets added.
Symptoms that the wrong model is in use:
▸Lineage tooling has to be bolted on with custom code; the orchestrator does not know which task touches which table
▸Asset refreshes require running entire DAGs even when only one table is stale
▸Partial backfills are expressed as ad-hoc CLI scripts rather than first-class operations
▸Cross-team data dependencies are wired with sensors that nobody fully trusts
Task-Based
Work-first model (Airflow)
Tasks are the primary object. Flexible, broadly compatible, lineage requires external tooling. The default of most existing enterprise deployments.
Asset-Based
Data-first model (Dagster)
Assets are the primary object. Lineage is explicit, partial refreshes are native, the asset graph mirrors the data graph. The default of most new builds focused on data observability.
Hybrid
Convergence in production
Asset graph for data work plus task operators for non-data glue. The pragmatic shape most large deployments end up in regardless of which orchestrator was chosen.
✓Do
Model the assets explicitly when the orchestrator supports it; lineage compounds in value
Mix task-style and asset-style where each fits, with the asset model as the default for data work
Pick the model at the architectural level, not per-DAG; consistency reduces operator load
✗Don't
Bolt lineage tooling onto a task-based deployment as an afterthought; design it in
Force every job into the asset model; some work (alerts, infra) is genuinely task-shaped
Migrate from one model to the other without a phased plan; partial migrations are worse than either pure model
Task-based and asset-based models converge in practice; the choice is which is the default at the seam where new work is added.
Asset-based deployments answer 'which assets are stale and why' from orchestrator metadata; task-based deployments answer the same question only with bolted-on lineage tooling.
Partial migrations between the two models are worse than either pure model; plan the phased rollout before starting.
TIP
When building a new deployment, choose asset-based as the default and accept task-style operators where they fit. When inheriting a task-based deployment, introduce assets at the seams (Datasets, OpenLineage) before attempting a full migration.
Backfills as First-Class Operations
Design pipelines that are backfillable: parameterize on date range, write only the target partition, and use orchestrator-managed backfill operations with appropriate concurrency.
A backfill is the operation of running a pipeline over a historical date range that it did not run for at the time. Backfills happen for three reasons: a bug in the pipeline produced wrong data and the corrected pipeline must reprocess the affected dates; a new column was added and the historical data must be regenerated to fill it; a new pipeline is launched and needs initial history. In production, backfills are run constantly, and a system that does not treat them as a first-class operation produces a different bug every time.
What Makes Backfills Different From Regular Runs
A regular run processes one logical date: today, or the most recent hour. A backfill processes many logical dates, sometimes years of them. The same task is invoked once per logical date, with different parameters, and the orchestrator must coordinate the parallelism so the warehouse does not collapse under thousands of simultaneous queries. The task itself must accept a date as a parameter; if the task hardcodes 'yesterday', it cannot be backfilled. This requirement is the entire reason idempotent task design matters.
| Property | Idempotent Task | Non-Idempotent Task |
| --- | --- | --- |
| Run twice | Same result as running once | Different result; possibly duplicates or corruption |
| Partition strategy | Overwrites the target partition for the run date | Appends; duplicates accumulate |
| Time references | Reads the run date as a parameter | Reads NOW(); each run sees a different time |
| Backfill safety | Backfilling a year is safe and resumable | Backfilling a year produces a corrupted year |
Parameterizing on Date Range
Every backfillable task takes the run date (or a date range) as input and reads or writes only the partitions for that range. The orchestrator passes the parameter explicitly: Airflow exposes the logical date as a Jinja template variable; Dagster passes a partition key into the asset function; Prefect parameterizes the flow run. Tasks that hardcode date logic inside the function body break backfills silently. The most common silent break is a SQL query that filters on `DATE(NOW()) - 1` instead of the parameterized run date; the query runs successfully on the backfill but always reads wall-clock yesterday, so every backfilled date produces the same rows.
# Backfillable task: run_date is a parameter
@task
def aggregate_orders(run_date: str):
    # run_date is supplied by the orchestrator as an ISO date string;
    # the DELETE + INSERT pair overwrites only that partition
    sql = f'''
        DELETE FROM mart.daily_orders WHERE order_date = '{run_date}';
        INSERT INTO mart.daily_orders
        SELECT '{run_date}' AS order_date, COUNT(*) AS order_count
        FROM raw.orders
        WHERE DATE(order_timestamp) = '{run_date}';
    '''
    snowflake.execute(sql)

# Non-backfillable variant: hardcoded date logic that does not accept the run date
@task
def aggregate_orders_broken():
    sql = '''
        INSERT INTO mart.daily_orders
        SELECT CURRENT_DATE - 1 AS order_date, COUNT(*) AS order_count
        FROM raw.orders
        WHERE DATE(order_timestamp) = CURRENT_DATE - 1;
    '''
    snowflake.execute(sql)
# The broken version ignores the logical date: every run, including every backfill run,
# appends another row for wall-clock yesterday
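The difference between overwriting a partition and appending to it is easy to check locally. The following is a small sketch using SQLite as a stand-in for the warehouse; the table names mirror the example above, while the sample rows and the helper names (`make_db`, `rows_after_two_runs`) are assumptions for illustration.

```python
import sqlite3

def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_orders (order_id TEXT, order_date TEXT)")
    conn.execute("CREATE TABLE daily_orders (order_date TEXT, order_count INTEGER)")
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?, ?)",
        [("o1", "2024-03-01"), ("o2", "2024-03-01"), ("o3", "2024-03-02")],
    )
    return conn

def aggregate_idempotent(conn, run_date):
    # Overwrite only the partition for run_date, inside one transaction.
    with conn:
        conn.execute("DELETE FROM daily_orders WHERE order_date = ?", (run_date,))
        conn.execute(
            "INSERT INTO daily_orders "
            "SELECT ?, COUNT(*) FROM raw_orders WHERE order_date = ?",
            (run_date, run_date),
        )

def aggregate_append_only(conn, run_date):
    # Appends without clearing the partition: every re-run adds a duplicate row.
    with conn:
        conn.execute(
            "INSERT INTO daily_orders "
            "SELECT ?, COUNT(*) FROM raw_orders WHERE order_date = ?",
            (run_date, run_date),
        )

def rows_after_two_runs(task):
    conn = make_db()
    task(conn, "2024-03-01")
    task(conn, "2024-03-01")  # simulate a retry or a backfill over the same date
    return conn.execute("SELECT order_date, order_count FROM daily_orders").fetchall()

print(rows_after_two_runs(aggregate_idempotent))
print(rows_after_two_runs(aggregate_append_only))
```

The idempotent version leaves one row for the date no matter how many times it runs; the append-only version leaves one row per run, which is the duplicate accumulation described in the table above.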
Backfills as Orchestrator Operations
Modern orchestrators expose backfills as a first-class CLI and UI operation. In Airflow 2.x, the `airflow dags backfill` command takes a start date, an end date, and a DAG ID, then runs the DAG for every date in the range (Airflow 3 replaces it with a scheduler-managed backfill command). In Dagster, the asset partition UI lets an operator select a range of partitions and click 'materialize'. The orchestrator handles the parallelism, the rate of submission, and the failure isolation. The engineer does not write a custom loop. This is the difference between backfills as a first-class operation and backfills as an ad-hoc script.
Backfill anti-patterns that show up in production:
▸Custom Python script that loops over dates and runs queries directly, bypassing the orchestrator
▸Tasks that hardcode CURRENT_DATE or NOW() and silently produce wrong dates on backfill
▸Backfills that re-run successful runs and overwrite manual fixes
▸Backfills with no concurrency limit that submit a year of queries at once and stall the warehouse
The Concurrency Knob on Backfills
A backfill of one year is 365 invocations of the same task. Submitting them simultaneously will overwhelm any warehouse. The orchestrator exposes a concurrency cap on backfills: how many runs of the DAG can be in flight at the same time. A typical setting is between four and sixteen, tuned to the warehouse's compute envelope. The right setting balances time-to-completion against contention with normal production runs. A backfill that runs for two days at concurrency four is usually preferable to a backfill that finishes in two hours at concurrency 100 and breaks every other pipeline running on the same warehouse.
# Simulate a backfill with concurrency control
# Compare unbounded submission against capped concurrency
# ...
print(f'concurrency={c:>3}: {minutes} minutes elapsed, {slow} steps with warehouse contention')
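A minimal sketch of such a simulation follows. It is a toy model under stated assumptions, not a reconstruction of the lesson's original code: the warehouse can serve `capacity` runs at full speed, each run needs `work_per_run` minutes of work, and when more runs are in flight than the warehouse can serve, all of them slow down proportionally. Those parameters and the function name are hypothetical.

```python
def simulate_backfill(total_runs, concurrency, capacity=8, work_per_run=10.0):
    """Return (minutes elapsed, minutes spent with the warehouse over capacity)."""
    pending = total_runs
    in_flight = []  # remaining work, in minutes, for each running invocation
    minutes = 0
    slow = 0
    while pending or in_flight:
        # Start new runs up to the backfill's concurrency cap.
        while pending and len(in_flight) < concurrency:
            in_flight.append(work_per_run)
            pending -= 1
        # Over capacity, every run progresses proportionally slower.
        rate = min(1.0, capacity / len(in_flight))
        if rate < 1.0:
            slow += 1
        in_flight = [w - rate for w in in_flight if w - rate > 1e-9]
        minutes += 1
    return minutes, slow

for c in (4, 8, 365):
    minutes, slow = simulate_backfill(total_runs=365, concurrency=c)
    print(f'concurrency={c:>3}: {minutes} minutes elapsed, {slow} steps with warehouse contention')
```

In this model a cap at or below warehouse capacity never causes contention, while unbounded submission keeps the warehouse over capacity for the entire backfill. The model ignores everything else running on the warehouse, which is exactly the work that suffers in practice.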
When Backfills Are Not the Right Answer
Backfilling years of history at full fidelity is sometimes the wrong call. If only the last 90 days are queried, regenerating five years of partitions costs warehouse time that no consumer ever uses. A targeted backfill of the affected range is cheaper and faster. The discipline is to ask 'what range actually needs to be regenerated' before passing a start date of 2020-01-01 to the backfill command. A backfill is a tool, not a default.
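One way to make the "what range actually needs regenerating" question mechanical is to intersect the affected range with the window consumers actually query. The sketch below assumes a fixed lookback window; the function name and the 90-day default are illustrative choices taken from the example in the text.

```python
from datetime import date, timedelta

def targeted_backfill_dates(affected_start, affected_end, today, lookback_days=90):
    """Dates that are both affected by the bug and still inside the queried window."""
    oldest_queried = today - timedelta(days=lookback_days)
    start = max(affected_start, oldest_queried)
    end = min(affected_end, today - timedelta(days=1))
    days = (end - start).days + 1
    return [start + timedelta(days=i) for i in range(max(days, 0))]

dates = targeted_backfill_dates(
    affected_start=date(2020, 1, 1),
    affected_end=date(2024, 6, 30),
    today=date(2024, 7, 1),
)
print(len(dates), dates[0], dates[-1])
```

The resulting first and last dates are what would be passed to the orchestrator's backfill operation instead of the full historical range.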
•Naive Backfill
Custom script loops over dates and runs SQL directly
No concurrency cap; submits everything at once
Tasks hardcode dates internally and produce wrong outputs
No tracking of which dates succeeded; partial failure leaves a mystery
✓First-Class Backfill
Orchestrator exposes a backfill CLI or UI command
Concurrency cap matches warehouse capacity
Tasks accept run_date as a parameter and write only their partition
Each date is a tracked task instance; partial failure is observable
Idempotency is the prerequisite for safe backfills. Without it, every backfill is a different kind of corruption.
TIP
Before triggering a backfill, confirm that every task in the DAG accepts the run date as a parameter and writes only that partition. If any task hardcodes the date or appends rather than overwrites, the backfill will silently corrupt history.
[Figure: extract -> transform -> load -> warehouse (storage)]
An orchestration DAG: tasks are nodes, dependencies are edges, and there are no cycles. The orchestrator runs each task only after its upstream finishes, retries failures, and backfills: things cron cannot do.
SLAs at the Orchestrator Level
Declare orchestrator-level SLAs on DAGs and assets, tune the thresholds against historical runtime distributions, and route misses to on-call with actionable context.
An SLA, in orchestration terms, is a commitment that a DAG (or an asset) finishes by a stated time. The marketing dashboard SLA might be 'mart.daily_revenue is fresh for the previous day by 6am Pacific'. The SLA is not a hope. It is a configured guarantee that the orchestrator monitors and alerts on when missed. Declaring SLAs at the orchestrator level rather than in a separate runbook turns a soft expectation into an enforced contract, and it changes the on-call response from 'someone noticed late' to 'the orchestrator paged at 6:01am'.
What an Orchestrator-Level SLA Specifies
| Element | Meaning | Example |
| --- | --- | --- |
| Target | The DAG, task, or asset the SLA applies to | mart.daily_revenue_by_account |
| Deadline | Wall-clock time by which the work must complete | 06:00 America/Los_Angeles |
| Frequency | How often the deadline applies | Every day for the previous day's logical date |
| Response | What the orchestrator does if the deadline passes | Page on-call, post to #data-incidents, mark run as SLA-missed |
| Recovery expectation | How quickly a missed SLA must be resolved | Within 60 minutes of the breach |
How Airflow Declares SLAs
Airflow's SLA mechanism (the `sla` parameter in Airflow 2.x) attaches a deadline to a task. If the task has not finished within the `sla` interval, measured from the scheduled start of the DAG run, Airflow records an SLA miss that can be routed to alerts, Slack, or PagerDuty. The mechanism is task-level by default, which is a known weakness: an SLA on the final publishing task is correct conceptually but does not catch failures earlier in the DAG. Production teams typically also configure DAG-level monitoring through external tools (Datadog, Monte Carlo, Bigeye) to bridge the gap. Airflow 3.0 removed the task-level `sla` parameter, and later 3.x releases introduce Deadline Alerts as its replacement; asset-aware scheduling (Datasets in 2.4+, renamed Assets in 3.0) moves the model toward data-level deadlines.
# Airflow 2.x task-level SLA
from datetime import timedelta

from airflow.operators.python import PythonOperator

publish = PythonOperator(
    task_id='publish_mart_revenue',
    python_callable=publish_to_mart,  # assumed to be defined elsewhere in the DAG file
    sla=timedelta(hours=4),  # task must complete within 4 hours of scheduled start
)
Dagster's asset-first model declares SLAs as freshness policies on assets. An asset can specify 'this asset must be no more than 4 hours stale by 6am every day'. The orchestrator monitors actual asset freshness and fires alerts when an asset is overdue. Because the policy is on the asset, it survives changes to which job produces the asset; refactoring the producer does not require updating the SLA. This is one of the operational advantages of asset-based scheduling that becomes visible only when SLAs are declared. The `FreshnessPolicy(maximum_lag_minutes=..., cron_schedule=...)` form shown here is the original API; later Dagster releases deprecated it in favor of freshness checks, so confirm the signature against the installed version.
# Dagster freshness policy on the asset, not the task
from dagster import asset, FreshnessPolicy

@asset(
    freshness_policy=FreshnessPolicy(
        maximum_lag_minutes=240,  # asset must be at most 4h old
        cron_schedule='0 6 * * *',  # by 6am daily
        cron_schedule_timezone='America/Los_Angeles',
    ),
)
def mart_daily_revenue_by_account():
    return run_daily_revenue_query()
SLA failure modes that point at deeper problems:
▸SLA missed every Monday because Sunday's source had unusual volume that the schedule did not anticipate
▸SLA technically met but data is wrong because a task succeeded on stale upstream
▸SLA on the final task fires, but the actual failure was three tasks earlier and went unalerted
▸SLA threshold is too tight; alerts fire on normal slow runs and on-call ignores them
▸SLA threshold is too loose; misses are flagged hours after consumers already noticed
SLA Tuning
Setting an SLA is a tradeoff. Too tight, and every slow but successful run pages on-call, training the team to ignore alerts. Too loose, and missed SLAs are detected only after a consumer complains. The right setting is the worst-case acceptable runtime under normal conditions, plus a small buffer. A pipeline whose 95th-percentile runtime is 3 hours and whose worst observed runtime is 4.5 hours might set an SLA of 5 hours. The setting should be revisited quarterly as runtime distributions drift.
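The tuning rule above can be sketched as a small calculation over historical runtimes. This is an illustration, not a prescribed formula: the nearest-rank percentile, the half-hour buffer, and the sample runtimes are all assumptions chosen to reproduce the numbers in the paragraph.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def suggest_sla_hours(runtimes_hours, buffer_hours=0.5):
    """Worst observed runtime under normal conditions, plus a small buffer."""
    worst = max(runtimes_hours)
    return {
        "p95": percentile(runtimes_hours, 95),
        "worst": worst,
        "sla": worst + buffer_hours,
    }

# Hypothetical history: 18 ordinary runs, one slow run, one worst-case run.
history = [2.0] * 18 + [3.0, 4.5]
print(suggest_sla_hours(history))
```

Runs that failed or were caused by an incident should be excluded from the history before the calculation; otherwise one outage inflates the threshold for the following quarter.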
•Tightly Tuned SLA
Detects misses promptly when they happen
False positives on normal slow runs train alert fatigue
Pages on-call for transient slowness that resolves naturally
Tightening below the runtime distribution produces noise, not signal
•Loosely Tuned SLA
Misses caught only after consumers complain
Real outages get diagnosed late
Alert fatigue is low; trust in the alerts is high when they fire
Loosening above the runtime distribution produces silence, not safety
SLA Observability
An SLA without a dashboard is a setting that nobody calibrates. Production teams ship a dashboard that shows the historical distribution of runtimes for every SLA-bound DAG, the current SLA threshold, the count of misses in the last 30 days, and the trend. The dashboard is what lets the team tune the threshold over time as runtime distributions shift. SLA settings that go untouched for years drift out of date as the underlying work changes; observability is the antidote.
✓Do
Declare SLAs at the orchestrator level, not in a separate runbook
Tune SLAs to the 95th-percentile runtime plus a known buffer; revisit quarterly
Publish a dashboard of historical SLA performance for every bound pipeline
Route SLA misses to on-call with enough context to start investigation immediately
✗Don't
Set SLAs to the average runtime; roughly half of normal runs will breach
Page on every SLA event without distinguishing real misses from edge cases
Place the SLA on the final task only; failures earlier in the DAG go unalerted
Leave SLA settings unchanged for years; runtime distributions drift
TIP
Treat SLAs as living settings, not one-time configurations. The best-tuned SLAs are reviewed quarterly against the actual runtime distribution and adjusted before alert fatigue or detection lag sets in.
Concurrency, Pools, and Priority
Configure DAG concurrency, named pools, and task priority so simultaneous workloads cooperate rather than compete on a shared warehouse.
Every shared system has finite capacity. A Snowflake warehouse has slot limits. A Spark cluster has executor limits. A Postgres replica has connection limits. An orchestrator that submits work without regard for those limits will, eventually, melt the system it depends on. Concurrency, pools, and priority are the three controls that let the orchestrator submit work in a way the downstream system can absorb. They are simple knobs that prevent the most expensive class of incident in mature deployments.
Three Levers, Three Scopes
| Control | Scope | What It Caps |
| --- | --- | --- |
| Concurrency | Per DAG or per task | How many instances of this DAG or task can run simultaneously |
| Pool | Shared across many DAGs | Total parallel slots available to a named resource (e.g., 'snowflake_xl') |
| Priority | Per task instance | Order in which tasks are scheduled when capacity is contended |
Concurrency
Single-DAG cap
How many tasks within one DAG run at once, and how many runs of the same DAG can be in flight. Prevents fanout saturation and run stacking.
Pool
Cross-DAG resource budget
A named slot budget shared across DAGs. Caps total load on a downstream resource (Snowflake XL, Spark, Postgres) to a known number.
Priority
Queue ordering
Per-task or per-DAG weight that decides who runs next when the pool is full. Customer-facing work outranks experimental work.
Concurrency: The Single-DAG Cap
DAG concurrency caps how many tasks within a single DAG run at the same time, and DAG-run concurrency caps how many runs of the same DAG can be in flight. The first prevents a fanned-out DAG with 200 parallel tasks from saturating workers. The second prevents a slow run from stacking on top of itself: if yesterday's daily run is still in flight when today's comes due, the orchestrator should not start today's run on top of it. Both caps are usually set per DAG, and the right values depend on the work's compute profile.
with DAG(
    'fanout_etl',
    schedule='@daily',
    max_active_runs=1,  # only one run in flight at a time
    max_active_tasks=8,  # at most 8 tasks running simultaneously within this DAG (formerly `concurrency`)
    # ... other DAG arguments
) as dag:
    ...
Pools: The Cross-DAG Cap
A pool is a named resource budget shared across DAGs. The 'snowflake_xl' pool might have 16 slots; any task that runs against the XL warehouse takes one slot for the duration. If 20 tasks across 5 DAGs all want the XL warehouse simultaneously, 16 run and 4 wait. The pool prevents a fanout in DAG A from starving DAG B, and it caps the total load on the warehouse to a known number. Pools are the most underused feature in many production deployments because they require explicit modeling of which tasks share which resources.
# Tasks in different DAGs share a Snowflake XL pool
# Pool 'snowflake_xl' has 16 slots configured globally

# In dag_marketing:
load_marketing = SnowflakeOperator(
    task_id='load_marketing_aggregations',
    sql='...',
    pool='snowflake_xl',
    pool_slots=1,
)

# In dag_finance:
load_finance = SnowflakeOperator(
    task_id='load_finance_close',
    sql='...',
    pool='snowflake_xl',
    pool_slots=2,  # heavier task takes 2 slots
)
# Together, the two DAGs cannot exceed 16 simultaneous slots in the pool
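The admission rule behind a pool can be sketched in a few lines of plain Python. This is a simplified model rather than Airflow's scheduler logic: tasks are considered in scheduling order, each declares how many slots it needs, and a task that does not fit waits (a smaller task later in the list may still be admitted). The function and task names are hypothetical.

```python
def admit(tasks, pool_slots):
    """tasks: list of (task_id, slots_needed) in scheduling order.

    Returns (running, waiting) for a single scheduling pass.
    """
    running, waiting, used = [], [], 0
    for task_id, needed in tasks:
        if used + needed <= pool_slots:
            running.append(task_id)
            used += needed
        else:
            waiting.append(task_id)
    return running, waiting

# 20 single-slot tasks across several DAGs, all targeting the 16-slot pool.
tasks = [(f"task_{i}", 1) for i in range(20)]
running, waiting = admit(tasks, pool_slots=16)
print(len(running), "running;", len(waiting), "waiting")
```

The point of the model is that the cap is enforced on the sum of slots across every DAG that names the pool, which is why a fanout in one DAG cannot push total warehouse load past the configured number.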
Priority: Who Goes First Under Contention
When the pool is full and tasks are waiting, the orchestrator picks one to run next. Priority weight controls that pick. A task with priority 10 runs before a task with priority 1 when both are waiting on the same pool. Priority is per-task or per-DAG depending on the orchestrator. The right pattern is: customer-facing pipelines (the executive dashboard, the customer email batch) are priority 10, regular production pipelines are 5, experimental and ad-hoc work is 1. Without priority tuning, contention resolves arbitrarily, which means the experimental backfill can starve the customer email run.
Priority decisions that pay off:
▸Customer-facing pipelines outrank internal analytics on the same warehouse
▸Backfills run at lower priority so they yield to live production runs
▸Ad-hoc and experimental work runs in a separate pool with lower priority
▸SLA-bound DAGs get a priority bump so they win contention against unbounded ones
When the Warehouse Melts
The classic incident is a Monday morning when a backfill, a normal daily run, and an experimental ML feature build all hit Snowflake at the same time. Without pools and priority, the warehouse queues every query, every query slows by 5x, and the daily SLA misses. With pools, the backfill is capped at 2 slots, the daily run is capped at 8, and the experimental work is in a different pool entirely. With priority, the customer-facing DAG runs first and finishes inside its SLA while the backfill stretches longer. The cost is one afternoon of pool configuration; the saving is every Monday that does not become an incident.
•No Pools or Priority
Every task competes equally for warehouse slots
A fanout backfill starves the daily executive DAG
Contention resolves arbitrarily; SLA misses are unpredictable
On-call cannot tell which workload caused the slowdown
✓Pools and Priority Tuned
Pools partition resources by class (XL, ML, ad-hoc)
Backfills wait for live production work to finish
Priority routes the most important work first under contention
On-call sees pool utilization in the orchestrator UI
# Simulate three workloads competing on a 16-slot pool
# With and without priority

from heapq import heappush, heappop

class PoolScheduler:
    def __init__(self, slots, use_priority):
        self.slots = slots
        self.use_priority = use_priority
        self.queue = []
        self.running = []

    def submit(self, name, priority, duration):
        # Higher priority = run first; negate for min-heap
        ...

# ...
print(f'{label}: customer_dashboard finishes at minute {finish["customer_dashboard"]}')
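A compact, self-contained sketch of the same idea follows. It is an illustration under stated assumptions, not the lesson's original simulation: all tasks are submitted at minute 0, each takes one slot, and the workload mix (16 backfill tasks, 4 ML tasks, one dashboard task) and durations are hypothetical.

```python
import heapq

def simulate_pool(tasks, slots, use_priority):
    """tasks: list of (name, priority, duration_minutes), all submitted at minute 0.

    Returns a dict mapping task name to the minute it finishes.
    """
    queue = []
    for order, (name, priority, duration) in enumerate(tasks):
        # With priority: highest weight first, ties broken by submission order.
        # Without priority: plain submission order (FIFO).
        key = (-priority, order) if use_priority else (order,)
        heapq.heappush(queue, (key, name, duration))

    running = []  # min-heap of (finish_minute, name)
    finish = {}
    now = 0
    while queue or running:
        # Fill free slots from the front of the queue.
        while queue and len(running) < slots:
            _, name, duration = heapq.heappop(queue)
            heapq.heappush(running, (now + duration, name))
        # Advance time to the next task completion.
        now, name = heapq.heappop(running)
        finish[name] = now
    return finish

workload = (
    [(f"backfill_{i}", 1, 30) for i in range(16)]
    + [(f"ml_features_{i}", 1, 20) for i in range(4)]
    + [("customer_dashboard", 10, 15)]
)

for label, use_priority in (("fifo", False), ("priority", True)):
    finish = simulate_pool(workload, slots=16, use_priority=use_priority)
    print(f'{label}: customer_dashboard finishes at minute {finish["customer_dashboard"]}')
```

Without priority the dashboard task sits behind everything submitted before it and cannot start until backfill slots free up; with priority it takes a slot immediately and one backfill task waits instead.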
TIP
Treat the orchestrator's pool configuration as part of the warehouse's architecture, not as a one-time setup. Pool sizes, priorities, and SLA thresholds should be reviewed alongside warehouse credit usage every quarter.
Postmortem: Cadence Change Bug
Diagnose a multi-cause orchestration failure by separating the seam, the freshness contract, the pool, and the SLA, and propose a redesign that addresses each layer.
The following is a real production incident, redacted from a fintech postmortem. A daily DAG named `daily_finance_close` had been running cleanly for fourteen months. In month fifteen, it began missing its 6am Pacific SLA twice a week. Nothing in the DAG had changed. The investigation revealed that an upstream Stripe ingestion DAG had been quietly migrated from a 30-minute cadence to a 5-minute cadence three weeks earlier. The downstream broke because the original architecture had encoded assumptions about the old cadence.
The upstream team migrated `stripe_to_raw` from 30-minute full pulls to 5-minute incremental pulls. The migration was a clear improvement: smaller batches, lower latency, less load on the warehouse. The team announced it in their standup and merged the PR. No one downstream caught the announcement, because the cross-DAG dependency was invisible: the downstream's schedule did not reference the upstream at all.
The new 5-minute cadence created a small window of inconsistency every five minutes during which raw.payments was mid-write. The 5am downstream read sometimes caught a partial batch and other times caught a complete batch, depending on the exact second the join started. On clean reads, the daily_finance_close finished by 5:45am. On partial reads, the join produced rows with NULLs in the new payment columns, the row count threshold check triggered a row deletion and rewrite, the rewrite waited on a warehouse slot, and the SLA missed by 20 to 60 minutes. The bug only showed on the days when the read happened to land mid-batch.
The Five Things Wrong With the Original
| Issue | Why It Was Latent for 14 Months | Why It Surfaced After the Change |
| --- | --- | --- |
| Time-offset cross-DAG dependency | 30-min cadence had a 25-minute quiet window before the 5am read | 5-min cadence eliminated the quiet window |
| No asset-level freshness contract | Implicit 'finishes by 4:50am' was reliable enough | Cadence change broke the implicit contract; nobody noticed |
| No transactional read isolation | Source rarely caught mid-write | 5-min cadence made mid-write reads frequent |
| Single-shared pool with no priority | Spare capacity absorbed slow runs | Slow runs collided with backfills competing for the same pool |
| SLA on the final task only | Final-task SLA caught the symptom | Detection at SLA breach is too late; root cause is upstream |
The Redesign
The fix touched four of the five issues. The cross-DAG edge moved from time-offset to asset-trigger: daily_finance_close now starts when raw.payments is fresh for the previous day's logical date, not when the clock hits 5am. The freshness contract on raw.payments was declared explicitly: 'fresh for the previous calendar day by 4:30am Pacific'. The Stripe ingestion DAG was modified to write a `_partition_complete` marker when the day's data finished landing, and the downstream waits on the marker, not the raw rows. The downstream gained a higher priority than the backfill workloads on the same pool.
/* The downstream DAG checks the partition_complete marker before reading raw.payments */
/* This SQL returns the rows the daily_finance_close should process, gated on completeness */
/* The marker table is small (one row per partition) and cheap to check */
WITH partition_marker AS (
    SELECT
        '2026-04-24' AS partition_date,
        'complete' AS status,
        '2026-04-25 04:18:33' AS completed_at
    UNION ALL
    SELECT '2026-04-23', 'complete', '2026-04-24 04:21:07'
    UNION ALL
    SELECT '2026-04-22', 'complete', '2026-04-23 04:14:51'
),
raw_payments AS (
    SELECT
        'p1' AS payment_id,
        'a1' AS account_id,
        1500 AS amount_cents,
        DATE('2026-04-24') AS pay_date
    UNION ALL
    SELECT 'p2', 'a1', 800, DATE('2026-04-24')
    UNION ALL
    SELECT 'p3', 'a2', 2200, DATE('2026-04-24')
)

SELECT
    p.account_id,
    COUNT(*) AS payments,
    SUM(p.amount_cents) / 100 AS revenue_usd
FROM raw_payments AS p
INNER JOIN partition_marker AS m
    ON m.partition_date = CAST(p.pay_date AS VARCHAR)
WHERE m.status = 'complete'
    AND p.pay_date = DATE('2026-04-24')
GROUP BY p.account_id
ORDER BY p.account_id
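The same gate can be expressed on the consumer side in Python. This is a minimal sketch, not the postmortem team's implementation: the `PartitionNotReady` exception and both function names are hypothetical, and the sample rows are the ones from the SQL above.

```python
class PartitionNotReady(Exception):
    """Raised when the producer has not yet marked the partition complete."""

def read_complete_partition(markers, rows, partition_date):
    """Return rows for partition_date only if its completion marker is present.

    markers: dict of partition_date -> status
    rows: list of dicts with a 'pay_date' key
    """
    if markers.get(partition_date) != "complete":
        raise PartitionNotReady(partition_date)
    return [r for r in rows if r["pay_date"] == partition_date]

def revenue_by_account(rows):
    totals = {}
    for r in rows:
        totals[r["account_id"]] = totals.get(r["account_id"], 0) + r["amount_cents"]
    return {account: cents / 100 for account, cents in sorted(totals.items())}

markers = {"2026-04-24": "complete", "2026-04-23": "complete", "2026-04-22": "complete"}
raw_payments = [
    {"payment_id": "p1", "account_id": "a1", "amount_cents": 1500, "pay_date": "2026-04-24"},
    {"payment_id": "p2", "account_id": "a1", "amount_cents": 800, "pay_date": "2026-04-24"},
    {"payment_id": "p3", "account_id": "a2", "amount_cents": 2200, "pay_date": "2026-04-24"},
]

print(revenue_by_account(read_complete_partition(markers, raw_payments, "2026-04-24")))
```

Raising on a missing marker, rather than returning an empty result, is a deliberate choice in this sketch: it lets the orchestrator retry or wait instead of publishing a close computed from a partial batch.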
What the Redesign Costs
The cost was real. The Stripe team had to add the partition_complete marker task and accept a small additional cost on every run. The finance team had to migrate the daily_finance_close DAG from time-offset to asset-trigger and validate that the new behavior was correct. The platform team had to add a separate pool for finance_close work, configure priority, and document the rationale. Together the changes took about three weeks of engineering time. The recurring SLA misses, by contrast, were costing roughly six engineer-hours per week in firefighting plus the executive-team cost of a stale dashboard, so the fix paid for itself in eight weeks.
• Original (Time-Offset, Single Pool)
  ▸ Downstream starts at 5am regardless of upstream state
  ▸ Implicit freshness contract; cadence changes silently break it
  ▸ All work shares one pool; backfills can starve customer-facing work
  ▸ SLA on the final task; root causes stay invisible
✓ Redesigned (Asset-Trigger, Pool + Priority)
  ▸ Downstream starts when partition_complete marker is fresh
  ▸ Explicit freshness policies on both upstream and downstream assets
  ▸ Separate pool for finance work; priority 10 over backfills
  ▸ SLA on the asset; misses surface at the actual layer where freshness slipped
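To make "misses surface at the actual layer where freshness slipped" concrete, here is a small sketch of an asset-level freshness check. It assumes each asset declares a wall-clock deadline and that completion times for today's run are available as a dictionary. `FreshnessPolicy` and `first_stale_asset` are hypothetical names, not a Dagster or Airflow API, and the datetimes are naive (a real check would carry the Pacific time zone).

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import Optional


@dataclass(frozen=True)
class FreshnessPolicy:
    asset: str
    deadline: time  # wall-clock time by which today's partition must be complete


def first_stale_asset(
    policies: list[FreshnessPolicy],
    completed_at: dict[str, datetime],
    now: datetime,
) -> Optional[str]:
    """Return the most upstream asset that missed its deadline, or None.

    `policies` must be ordered upstream to downstream, and `completed_at`
    is assumed to hold only completions for today's run.
    """
    for policy in policies:
        deadline = datetime.combine(now.date(), policy.deadline)
        finished = completed_at.get(policy.asset)
        on_time = finished is not None and finished <= deadline
        if now >= deadline and not on_time:
            return policy.asset
    return None
```

With a 4:30am policy on `raw.payments` ahead of a 6am policy on the downstream asset, a late upstream partition is reported as `raw.payments` rather than as a miss on the final task.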
Lessons from the postmortem:
▸ An upstream cadence change is a contract change; it should be PR-reviewed by every downstream owner
▸ Time-offset cross-DAG dependencies hide the seam; asset triggers make it visible
▸ An idempotent producer with explicit partition-complete markers protects every downstream from mid-batch reads
▸ Pools and priority are how customer-facing work survives backfill contention
▸ An SLA on the final task is necessary but not sufficient; SLAs at the seam catch root causes earlier
Cadence changes upstream are silent breakers of time-offset cross-DAG dependencies.
Asset triggers, partition-complete markers, and pool-priority tuning together convert a fragile seam into a contract.
An SLA at the seam catches root causes earlier than an SLA on the final task; both are usually needed.
TIP
When a DAG starts failing without itself changing, look upstream for a cadence change. The failure mode is almost always at the seam between cadences, not inside the DAG that fires the alert.
❯❯❯ PUTTING IT ALL TOGETHER
> A staff engineer at a public marketplace owns 220 production DAGs. The marketing executive dashboard misses its 6am SLA twice a week. Investigation reveals four overlapping issues: a year-old backfill is still running nightly and competing for warehouse slots, the SLA is declared on the final task instead of the producing asset, customer-facing DAGs share priority with experimental ML jobs, and a recent partial migration to asset-based scheduling left half the deployment on time-offset cross-DAG edges. The CTO asks: 'What is the smallest set of changes that fixes the 6am SLA without breaking the rest?'
Stop the year-old backfill. As Lesson 5 (idempotency and backfill) established, every backfill should be a bounded operation with a clear end. A nightly recurring backfill is a sign that the original pipeline is not idempotent and the workaround has calcified into a permanent cost.
Move SLAs from final tasks to assets. The marketing dashboard's SLA belongs on the mart asset; the producing DAG's tasks already have SLAs that catch internal slowness. An asset-level SLA also catches failures at the seam between DAGs, which final-task SLAs do not see.
Split the warehouse pool by workload class: customer_facing (priority 10), regular_production (priority 5), experimental (priority 1, separate pool). Backfills run in the experimental pool. Customer-facing DAGs win contention without starving the rest.
Complete the asset-based migration on the seams that touch SLA-bound assets first. Time-offset cross-DAG edges remain the highest-risk anti-pattern (Lesson 4 intermediate showed why). Asset triggers replace them and stabilize the schedule.
Apply the four-role and undercurrents framing from Lesson 1. Orchestration is the undercurrent that runs through every node; the redesign is the orchestration layer catching up to the architecture the rest of the pipeline already shows.
Treat the change as a pipeline-as-product rollout (Lesson 1 advanced): every affected DAG gets a contract update, a deprecation path for the old cross-DAG edges, and an explicit owner. The shift is operational discipline, not a single PR.
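The pool split in the steps above can be sketched as a toy admission loop. This is an illustration of the idea, not the Airflow scheduler: `QueuedTask` and `admit` are hypothetical names, the slot counts are assumed, and the sketch takes one reading of the split in which customer-facing and regular production work share a pool while experimental work and backfills get their own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuedTask:
    name: str
    pool: str
    priority: int  # higher wins when tasks compete for slots


def admit(queue: list[QueuedTask], slots: dict[str, int]) -> list[str]:
    """Pick the tasks that start now: highest priority first, capped per pool."""
    free = dict(slots)  # copy so the caller's slot counts are not mutated
    started: list[str] = []
    # sorted() is stable, so equal-priority tasks keep their queue order.
    for task in sorted(queue, key=lambda t: -t.priority):
        if free.get(task.pool, 0) > 0:
            free[task.pool] -= 1
            started.append(task.name)
    return started


queue = [
    QueuedTask("backfill_2025", pool="experimental", priority=1),
    QueuedTask("ml_experiment", pool="experimental", priority=1),
    QueuedTask("exec_mart", pool="production", priority=10),
    QueuedTask("nightly_rollup", pool="production", priority=5),
    QueuedTask("adhoc_export", pool="production", priority=5),
]
started = admit(queue, {"production": 2, "experimental": 1})
# started == ["exec_mart", "nightly_rollup", "backfill_2025"]
```

The executive mart always gets a production slot, and the backfill can only consume the single experimental slot no matter how many backfill tasks are queued. In Airflow the corresponding task arguments are `pool` and `priority_weight`.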
KEY TAKEAWAYS
Asset-based and task-based orchestration are different defaults: task-based fits existing Airflow deployments; asset-based fits new builds where lineage and partial refreshes matter. Most large deployments end up hybrid.
Backfills demand idempotent task design: every task takes the run date as a parameter and writes only that partition. Without idempotency, a backfill is a fresh form of corruption.
SLAs declared at the orchestrator level are enforceable: the orchestrator detects the miss, routes the alert, and surfaces it on a tunable schedule. Loose runbooks do none of that.
Concurrency, pools, and priority prevent warehouse contention: concurrency caps a single DAG, pools share resources across DAGs, priority resolves the queue. The three together stop simultaneous workloads from melting the warehouse.
Cross-DAG cadence changes are silent contract changes: an upstream cadence shift breaks downstream time-offset dependencies on the day it ships. Asset triggers and explicit freshness contracts at the seam are the durable fix.
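The backfill takeaway above (take the run date as a parameter, write only that partition) can be sketched in a few lines. The dictionary standing in for a partitioned table and the names `load_partition` and `backfill` are hypothetical; the property being illustrated is that rerunning any date range leaves the table in the same state and never touches dates outside the range.

```python
from datetime import date, timedelta

# Hypothetical partitioned table: one list of rows per run date.
warehouse: dict[date, list[dict]] = {}


def load_partition(run_date: date, source_rows: list[dict]) -> None:
    """Replace exactly one partition; never append and never touch other dates."""
    warehouse[run_date] = [r for r in source_rows if r["pay_date"] == run_date]


def backfill(start: date, end: date, source_rows: list[dict]) -> None:
    """Rerun the same task over a bounded, inclusive date range."""
    day = start
    while day <= end:
        load_partition(day, source_rows)
        day += timedelta(days=1)
```

An append-based `load_partition` would double the rows on the second run, which is the corruption the takeaway warns about.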
Assets, backfills, SLAs, and concurrency are the levers a senior engineer pulls
Category: Pipeline Architecture
Difficulty: advanced
Duration: 40 minutes
Challenges: 0 hands-on challenges
Topics covered: Asset vs Task Orchestration, Backfills as First-Class Operations, SLAs at the Orchestrator Level, Concurrency, Pools, and Priority, Postmortem: Cadence Change Bug