An e-commerce company at Series C scale ran a nightly ETL by chaining seven cron jobs at staggered times: extract at 1am, clean at 2am, join at 3am, aggregate at 4am, publish at 5am, and so on. The whole stack stayed glued together for eighteen months because the jobs always finished inside their windows. One Tuesday in November, Black Friday traffic doubled the volume of the extract. The 1am job ran until 2:47am. The 2am clean job started on schedule, found no extract output, ran on yesterday's data, and emitted a clean run with stale numbers. The 3am join silently joined fresh tables to stale ones. By 9am the executive dashboard was wrong, and nothing in the system noticed. The fix was not a bigger cron schedule or a longer window; it was an orchestrator that knows about dependencies between tasks. This lesson is about that mental shift: from running things on a clock to running things in an order that the system understands.
Why Cron Is Not an Orchestrator
Recognize the canonical failure mode of cron chains and name what an orchestrator adds beyond a schedule.
The first scheduled job most engineers ever write is a cron job. Cron is a Unix utility that runs a command at a fixed time. It is small, reliable, and has been part of every Unix system since 1975. For a single command that runs once a day, cron is the right tool. The trouble starts when several commands need to run in a particular order, and especially when the order has to hold even if one of them runs late. Cron does not know about order. Cron knows about clock time.
What Cron Does and Does Not Do
| Capability | Cron | Orchestrator |
| --- | --- | --- |
| Run a command at a fixed time | Yes | Yes |
| Wait for an upstream job to finish before starting | No | Yes |
| Retry a failing command with backoff | No | Yes |
| Show whether last night's run succeeded | No (just a log file) | Yes (a UI) |
| Re-run a failed task without re-running the whole pipeline | No | Yes |
| Backfill historical date ranges (rerun the pipeline for past days, e.g., the last week of October) | No | Yes |
The Failure That Always Comes First
Engineers who chain cron jobs by clock time eventually hit the same bug. Job A is scheduled at 1am and is expected to finish in thirty minutes. Job B is scheduled at 2am because it reads what A produces. One night A runs slow because the source had more data than usual. A finishes at 2:47am. B started at 2:00am, found no new output, and either failed or, worse, ran successfully on stale data. The whole chain produced numbers that looked complete and were wrong. This is the canonical first cron failure, and every team that runs more than two scheduled jobs hits it within a year.
Hidden costs of a cron chain that nobody notices on day one:
▸ Slow upstream jobs cause silent stale reads downstream
▸ A failed job in the middle does not block the rest of the chain
▸ There is no UI; status lives in log files spread across servers
▸ Re-running step 3 after a fix means manually re-running steps 4 and 5
▸ Backfilling last Tuesday means writing a one-off script that mimics the chain
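The stale-read failure can be reproduced with nothing more than start and finish times. The sketch below is illustrative only: the job names, the minute offsets, and the `find_stale_reads` helper are assumptions made for this example, not part of cron or of any orchestrator.

```python
# Times are minutes after midnight. All names and numbers are illustrative.
cron_chain = [
    # (job, scheduled start, actual runtime, upstream job)
    ("extract", 60, 107, None),      # starts 1:00am, runs until 2:47am
    ("clean", 120, 20, "extract"),   # starts 2:00am because the clock says so
    ("join", 180, 20, "clean"),      # starts 3:00am because the clock says so
]


def find_stale_reads(chain):
    """Return the jobs that started before their upstream job had finished."""
    finish = {job: start + runtime for job, start, runtime, _ in chain}
    stale = []
    for job, start, _, upstream in chain:
        if upstream is not None and start < finish[upstream]:
            stale.append(job)
    return stale


print(find_stale_reads(cron_chain))
```

Only `clean` is flagged here. `join` starts after `clean` has finished, so by the clock it looks healthy even though it inherits stale input, which is why this failure tends to go unnoticed.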
Why Adding Sleeps Does Not Fix the Problem
The first patch every team tries is to widen the windows. If the extract sometimes takes ninety minutes, schedule the clean job at 3am instead of 2am. Then the clean job sometimes takes longer, so the join job moves to 5am. After a few months of widening, the pipeline starts at 1am and finishes at 9am, which is the moment the dashboard is supposed to be ready. The widening solves no actual problem; it spends slack to mask a missing dependency model. Real orchestration encodes the dependency directly: the clean job runs when the extract job finishes successfully, not at a fixed clock time.
•Cron Chain (Time-Based)
Each job runs at a fixed clock time, regardless of upstream state
Slow upstream produces stale downstream silently
Failures do not stop later jobs
Re-running step N requires re-running everything after it by hand
✓Orchestrator (Dependency-Based)
Each task runs after its declared dependencies succeed
Slow upstream delays downstream rather than corrupting it
A failed task halts dependent tasks automatically
The orchestrator can re-run a failed task and resume the rest
The Mental Shift
An orchestrator changes the question from 'what time does this run' to 'what does this depend on.' The schedule still exists, but it sits at the top of the dependency graph: the daily DAG starts at 1am, and every task inside it runs after its dependencies are satisfied. The clock triggers the start of the graph; the graph triggers the order of the tasks. Cron handles only the first half. An orchestrator handles both.
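One way to picture "the clock triggers the graph, the graph triggers the tasks" is to compute start times from dependencies instead of assigning them. This is a minimal sketch under assumed durations; `dependency_start_times` and the task names are hypothetical.

```python
# Durations in minutes; the DAG and the numbers are illustrative assumptions.
durations = {"extract": 107, "clean": 20, "join": 20}
upstream = {"extract": [], "clean": ["extract"], "join": ["clean"]}


def dependency_start_times(dag_start, durations, upstream):
    """Start each task as soon as its slowest upstream task has finished."""
    start, finish = {}, {}
    pending = set(durations)
    while pending:
        ready = [t for t in pending if all(d in finish for d in upstream[t])]
        if not ready:
            raise ValueError("no runnable task: the graph has a cycle")
        for task in ready:
            # Tasks with no upstream start when the schedule fires.
            start[task] = max((finish[d] for d in upstream[task]), default=dag_start)
            finish[task] = start[task] + durations[task]
            pending.discard(task)
    return start


# The schedule fires once, at minute 60 (1:00am); everything else follows.
print(dependency_start_times(60, durations, upstream))
```

With the slow 107-minute extract, `clean` starts at minute 167 (2:47am) rather than at 2:00am. The run is late, but no task reads stale input.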
The first cron chain failure is always the same: slow upstream produces stale downstream silently.
An orchestrator runs tasks when their dependencies finish, not when the clock says so.
Widening cron windows is a workaround that spends slack to hide a missing dependency model.
TIP
Any pipeline with more than two scheduled jobs that depend on each other has outgrown cron. The next failure is not a question of if; it is a question of when the upstream job runs slow.
The DAG: Tasks, Edges, No Cycles
Identify nodes, edges, and the acyclic property in a DAG, and read a small DAG declaration that encodes a real pipeline.
Every modern orchestrator models a pipeline as a directed acyclic graph, abbreviated DAG. The structure is a small mathematical object with four properties. It has nodes (the tasks). It has edges (the dependencies). The edges point in one direction. And the edges cannot form a loop. Those properties are not stylistic preferences. They are the conditions that make the graph computable: a structure with cycles cannot be scheduled to completion, and a structure without direction cannot be ordered.
Vocabulary, Once and Precisely
| Term | Meaning | Concrete Example |
| --- | --- | --- |
| Task (node) | A single unit of work the orchestrator schedules | Run a SQL query, run a Python script, copy a file |
| Dependency (edge) | A rule that says one task waits for another | join_orders runs after extract_orders finishes |
| Directed | Edges point from upstream to downstream | Data and dependency both flow one way |
| Acyclic | No path leads back to its starting node | extract -> clean -> join, never join -> extract |
| DAG | The whole structure: tasks plus their dependencies | The pipeline a daily ETL declares to the orchestrator |
Why Direction Matters
A pipeline that runs tasks in any order produces undefined results. The clean step needs the raw rows; running the clean step before the extract finishes means cleaning yesterday's rows or no rows at all. The direction on each edge encodes the temporal order. The orchestrator reads the directed graph, computes a valid ordering, and starts each task only when every upstream task with an edge pointing into it has already succeeded.
Why Cycles Are Forbidden
If task A depends on task B, and task B depends on task A, neither can start. Each is waiting for the other. There is no valid order for the orchestrator to compute. The acyclic constraint is what guarantees the orchestrator can compute a starting set (tasks with no upstream dependencies) and proceed from there until every task has run. Tasks that sit on a cycle never become eligible, because each one is always waiting on another. The validation that catches cycles before deploy is not a nicety; it is the only way the runtime can be sure work will progress.
How cycles sneak into a DAG accidentally:
▸ A new task is added that reads from a table another task in the same DAG overwrites
▸ Two engineers add edges in the same week without seeing each other's changes
▸ A backfill task is added at the end of the DAG and depends on the start
▸ A circular reference in the data model leaks into the dependency graph
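The idea of a starting set can be made concrete with Kahn's algorithm, which repeatedly runs whatever has zero unsatisfied upstream tasks. This is a sketch, not how any particular orchestrator validates graphs; `schedulable_order` and the example graphs are assumptions for illustration.

```python
from collections import deque


def schedulable_order(graph):
    """Kahn's algorithm over a {task: [downstream tasks]} mapping.

    Returns (order, stuck): the tasks that could be scheduled, in order,
    and the tasks that never became eligible.
    """
    nodes = set(graph) | {c for children in graph.values() for c in children}
    unmet = {n: 0 for n in nodes}          # count of unfinished upstream tasks
    for children in graph.values():
        for child in children:
            unmet[child] += 1

    queue = deque(sorted(n for n in nodes if unmet[n] == 0))  # the starting set
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in graph.get(node, []):
            unmet[child] -= 1
            if unmet[child] == 0:
                queue.append(child)
    return order, sorted(nodes - set(order))


valid = {"extract": ["clean"], "clean": ["join"], "join": []}
looped = {"extract": ["clean"], "clean": ["join"], "join": ["extract"]}

print(schedulable_order(valid))
print(schedulable_order(looped))
```

The valid chain yields a full order and nothing stuck. Once `join` points back at `extract`, the starting set is empty and all three tasks end up in the stuck list.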
A First DAG, in Words
Consider a daily pipeline with three tasks. The first extracts orders from Postgres into a raw S3 zone. The second cleans the raw rows and writes a curated table in Snowflake. The third aggregates the curated table into a daily summary. Two edges describe the dependencies. Extract points to clean. Clean points to aggregate. The graph has three nodes, two edges, no cycles. The orchestrator can run it.
[Diagram: extract_orders (source) -> clean (transform) -> aggregate (publish)]
The Same DAG, in Code
```python
# Airflow-style declaration: the >> operator declares a dependency edge
extract >> clean >> aggregate
```
Three Python objects, two arrows, one schedule. The orchestrator reads this file, validates the graph, and runs the tasks in order every day starting at midnight. If the extract fails, clean does not run. If clean fails, aggregate does not run. If a task fails halfway, the orchestrator can retry it without rerunning the tasks that already succeeded. The single line `extract >> clean >> aggregate` is the entire dependency model, and the orchestrator handles the rest.
Node
A unit of scheduled work
One task. A SQL query, a Python function, a shell command. The smallest piece the orchestrator can run, retry, or skip independently.
Edge
A declared dependency
An arrow from upstream to downstream. The downstream task waits for the upstream task to succeed before it runs.
Acyclic
No paths loop back
The constraint that makes scheduling possible. A starting set of tasks always exists, and progress is guaranteed.
```python
# Detect a cycle in a small DAG before the orchestrator tries to schedule it
# A valid DAG returns a topological order; an invalid DAG raises

def topological_order(graph):
    visited = set()    # nodes whose whole subtree has been processed
    on_stack = set()   # nodes on the current depth-first path
    order = []

    def visit(node):
        if node in on_stack:
            raise ValueError(f'cycle detected at node: {node}')
        if node in visited:
            return
        on_stack.add(node)
        for child in graph.get(node, []):
            visit(child)
        on_stack.discard(node)
        visited.add(node)
        order.append(node)

    for node in graph:
        visit(node)
    return list(reversed(order))


valid_dag = {
    'extract': ['clean'],
    'clean': ['aggregate'],
    'aggregate': [],
}
print('valid DAG order:', topological_order(valid_dag))
```
✓Do
Declare every dependency explicitly in the DAG; never rely on clock-based ordering inside one DAG
Validate the DAG at deploy time so cycles are caught before they reach production
Keep tasks small enough that a retry is cheap
✗Don't
Use one giant task that does extract, clean, and aggregate together
Add an edge that points back into a task earlier in the DAG
Treat the schedule as the dependency mechanism within a single DAG
What an Orchestrator Does
Distinguish the four responsibilities an orchestrator owns from the work the orchestrator delegates to other systems.
An orchestrator is the system that owns four responsibilities: deciding when work runs, running it in the right order, retrying it when it fails, and showing what happened. The four are not separate features bolted together. They reinforce each other. A retry is meaningful only if dependencies are tracked. A schedule is operable only if a UI exists to inspect it. Visibility is useful only if failures are recorded as events the system can react to. Every orchestrator that ships sells the same four properties under different brands.
Retries only produce the same answer as a single run when the work is idempotent: running it twice gives the same result as running it once. That property is the subject of Lesson 5 (idempotency and backfill).
Responsibility 1: Scheduling
The orchestrator owns when a DAG starts. The schedule is usually a cron expression, an interval, or an external trigger. When the schedule fires, the orchestrator creates a run instance and begins traversing the DAG. The schedule applies to the whole DAG; the order of tasks inside the DAG is governed by the dependency graph, not the clock.
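The relationship between a schedule and run instances can be sketched in a few lines. This is an illustration only: `daily_runs` and the run identifier format are hypothetical and do not mirror any orchestrator's naming scheme.

```python
from datetime import datetime, timedelta


def daily_runs(first_fire_time, count):
    """Yield one run identifier per firing of a daily schedule."""
    for i in range(count):
        fire_time = first_fire_time + timedelta(days=i)
        # One run instance covers the whole DAG, not an individual task.
        yield f"daily_etl__{fire_time:%Y-%m-%dT%H:%M}"


runs = list(daily_runs(datetime(2026, 4, 1, 1, 0), 3))
print(runs)
```

Each identifier stands for one traversal of the whole graph. Nothing in it says when an individual task starts; that is left to dependency resolution.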
Responsibility 2: Dependency Resolution
The orchestrator looks at the DAG and computes which tasks have no unsatisfied dependencies. Those tasks become eligible to run. As tasks finish, more tasks become eligible. The traversal is the topological sort discussed in the previous section, executed at runtime. This is the responsibility cron does not have. It is also the responsibility that, once present, makes most other features possible.
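A minimal sketch of the eligibility check, assuming a hypothetical four-task diamond and a plain dictionary of task states (real orchestrators persist this state in a database):

```python
def ready_tasks(upstream, state):
    """Pending tasks whose upstream tasks have all succeeded."""
    return sorted(
        task
        for task, deps in upstream.items()
        if state[task] == "pending" and all(state[d] == "success" for d in deps)
    )


# Hypothetical diamond: one extract feeds two cleans, which feed one join.
upstream = {
    "extract": [],
    "clean_orders": ["extract"],
    "clean_customers": ["extract"],
    "join": ["clean_orders", "clean_customers"],
}
state = {task: "pending" for task in upstream}

print(ready_tasks(upstream, state))   # only the task with no upstream
state["extract"] = "success"
print(ready_tasks(upstream, state))   # both cleans become eligible together
state["clean_orders"] = "success"
print(ready_tasks(upstream, state))   # join still waits for clean_customers
```

Calling `ready_tasks` after every state change is the runtime form of the topological sort: the order emerges from which tasks become eligible, not from a precomputed timetable.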
Responsibility 3: Retries
Tasks fail. Networks blip, sources go down for thirty seconds, a query times out under unusual load. A naive system fails the whole pipeline on the first error. An orchestrator distinguishes transient failures (worth retrying) from terminal failures (not worth retrying) and applies a configured retry policy. Common configurations include a maximum number of retries, a delay between retries, and an exponential backoff that doubles the delay each time. The retry policy is set per task, because the right answer is not the same for an HTTP fetch and a SQL transform.
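```python
# Retry policy declared at the task level
from datetime import timedelta

from airflow.operators.python import PythonOperator

# fetch_payments is assumed to be defined elsewhere in the project
fetch_stripe = PythonOperator(
    task_id='fetch_stripe_payments',
    python_callable=fetch_payments,
    retries=5,
    retry_delay=timedelta(seconds=30),
    retry_exponential_backoff=True,
    max_retry_delay=timedelta(minutes=10),
)
# Up to 5 retries after the first attempt (6 attempts in total).
# Delays roughly double from 30s (about 30s, 60s, 120s, 240s, 480s), never above the 600s cap.
```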
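The arithmetic behind a doubling backoff with a cap can be checked by hand. The helper below is an idealized sketch, not Airflow's exact formula (real schedulers may add jitter); `backoff_delays` is a hypothetical name.

```python
def backoff_delays(base_seconds, retries, cap_seconds):
    """Delay before each retry: doubles every time, never above the cap."""
    return [min(base_seconds * 2 ** n, cap_seconds) for n in range(retries)]


print(backoff_delays(30, 5, 600))
print(backoff_delays(30, 7, 600))
```

With five retries the cap is never reached; the sixth and seventh delays would be clipped to 600 seconds.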
Responsibility 4: Visibility
An orchestrator without a UI is a black box that runs jobs and produces log files. Modern orchestrators ship with a web UI that shows every DAG, every run, every task instance, and the status of each. Operators can inspect why a task failed, view its logs, re-run it, mark it as successful manually, or pause an entire DAG. The UI is the on-call surface. When something is wrong, the engineer who is paged opens the UI first.
| Visibility Surface | What It Shows | Why It Matters |
| --- | --- | --- |
| DAG list | Every pipeline registered with the orchestrator and its current state | On-call sees at a glance which pipelines are healthy |
| Run history | Every prior execution of a DAG with timestamps and status | Trends are visible: a job that gets slower week over week |
| Task instance log | Stdout and stderr of a single task on a single run | The first place a debugger goes when a task fails |
| Graph view | The DAG drawn with nodes colored by state | The shape of the failure is visible: which branch broke |
What an orchestrator owns:
▸ Scheduling: when a DAG starts
▸ Dependency resolution: in what order tasks within the DAG run
▸ Retries: what happens when a task fails transiently
▸ Visibility: how operators see what ran, what failed, and why
```python
# Simulate the retry responsibility of an orchestrator
# A task fails twice with transient errors, then succeeds on the third try
# ...
print('Note: the orchestrator owned the retry decision, not the task code.')
```
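The retry responsibility can be sketched as a loop that lives outside the task. The following is one possible version under stated assumptions: `TransientError`, `make_flaky_task`, and `run_with_retries` are hypothetical names, and the waits between attempts are omitted to keep the example fast.

```python
class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def make_flaky_task(failures_before_success):
    """Build a task that fails a fixed number of times, then succeeds."""
    calls = {"n": 0}

    def task():
        calls["n"] += 1
        if calls["n"] <= failures_before_success:
            raise TransientError(f"attempt {calls['n']} timed out")
        return "ok"

    return task


def run_with_retries(task, retries):
    """Orchestrator side: call the task, retrying on transient errors only."""
    for attempt in range(1, retries + 2):   # first try plus `retries` retries
        try:
            return attempt, task()
        except TransientError as exc:
            print(f"attempt {attempt} failed: {exc}")
    return retries + 1, "failed"


print(run_with_retries(make_flaky_task(2), retries=3))
```

The task body contains no retry logic at all. It simply raises, and the surrounding loop decides whether another attempt is allowed.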
What the Orchestrator Does Not Own
An orchestrator is not a transformation engine. It does not know how to clean a customer record or aggregate a fact table. It calls out to other systems (a Snowflake warehouse, a Spark cluster, a Python container) that do the actual work, and tracks whether those calls succeeded. The line between orchestrator and worker is sharp and intentional. The orchestrator stays small and reliable; the heavy compute lives elsewhere. Confusing the two leads to orchestrators that try to do everything and fail at the one thing they were chosen for.
✓Orchestrator Owns
When a DAG starts
What order tasks run in
Retry policy and failure routing
The UI that shows run state
•Worker Systems Own
The actual transform: SQL, Spark, Python
Reading from sources and writing to destinations
Heavy compute and memory
The data shape itself
An orchestrator without a UI is a black box. The UI is not a nice-to-have; it is the on-call surface that turns failures into something a human can act on.
Four responsibilities define an orchestrator: schedule, resolve, retry, and show.
Retries belong to the orchestrator because they require knowledge of the dependency graph.
The orchestrator delegates compute; it does not perform transforms itself.
The Major Orchestrators by Name
Name the three major orchestrators, describe what each emphasizes, and explain the shared model that makes them interchangeable in concept.
Three orchestrators dominate modern data engineering: Airflow, Dagster, and Prefect. Each ships the four responsibilities described in the previous section, but they make different choices in the API and the philosophy. Knowing the names matters because production environments have already chosen one (or, more often, are slowly migrating from one to another). Knowing what they have in common matters more, because the choice of tool changes which buttons are pressed, not what the buttons do.
Apache Airflow
Airflow is the oldest and most widely deployed of the three. Maxime Beauchemin started it at Airbnb in 2014, and it entered the Apache Incubator in 2016. Pipelines are declared as Python files; tasks are operators (PythonOperator, BashOperator, SQLExecuteQueryOperator) connected with the >> operator. The model is task-centric: tasks are the unit of scheduling, and dependencies are between tasks. Strengths include an enormous community, broad operator coverage, and a stable production track record. Trade-offs include a steeper learning curve, an older imperative model, and a tendency for DAGs to grow into procedural Python code that drifts from declarative dependency definition.
Dagster
Dagster, started by Nick Schrock at Elementl in 2018, takes an asset-first view. The unit of declaration is the data asset (a table, a file, a feature) and the orchestrator computes which assets need to be refreshed and how. Tasks still exist underneath, but the API foregrounds the data, not the work. Strengths include a typed pipeline model, software-defined assets, strong local testing story, and an asset graph view that mirrors the data lineage. Trade-offs include a smaller community than Airflow, more conceptual overhead for engineers used to the task-first model, and fewer pre-built integrations.
Prefect
Prefect, started by Jeremiah Lowin in 2018, was built as a reaction to Airflow's quirks. Pipelines are flows, tasks are decorated Python functions, and the orchestration model is dynamic: the flow can decide at runtime which tasks to run. Strengths include a Pythonic API, a clean dynamic execution model, and a hybrid execution architecture where the orchestrator runs in the cloud and the workers run in the company's own infrastructure. Trade-offs include a smaller deployment footprint than Airflow, faster API churn between major versions, and less mature support for the long-tail of niche source systems.
| Orchestrator | Origin | Model | Best Fit |
| --- | --- | --- | --- |
| Airflow | Airbnb, 2014 | Task-centric, imperative DAG | Large existing deployments, broad integration needs, stable production |
| Dagster | Elementl, 2018 | Asset-first, typed, software-defined | New builds emphasizing data lineage and testability |
| Prefect | Prefect Technologies, 2018 | Flow-centric, Pythonic, dynamic | Teams that want a hybrid cloud-orchestration model and dynamic graphs |
What They Have in Common
All three model pipelines as DAGs. All three take a schedule and produce runs. All three offer retries, dependency resolution, and a UI. All three integrate with Snowflake, BigQuery, S3, dbt, Spark, Kubernetes, and the rest of the modern data stack. The shared shape matters more than the API differences. An engineer who has internalized one orchestrator can be productive in another within days, because the four responsibilities behave the same way in all three. The brand is a tool choice; the model is the same.
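The shared shape can be shown without any orchestrator installed. The sketch below writes the same pipeline once in a task-first style and once in an asset-first style, then resolves both with the standard library's `graphlib`. Neither dictionary is the real Airflow or Dagster API; the task and asset names are assumptions for illustration.

```python
from graphlib import TopologicalSorter

# Task-first view: each task lists the tasks it waits for.
task_view = {
    "extract": [],
    "clean": ["extract"],
    "aggregate": ["clean"],
}

# Asset-first view: each data asset lists the assets it is built from.
asset_view = {
    "raw_orders": [],
    "stg_orders": ["raw_orders"],
    "orders_by_region": ["stg_orders"],
}

task_order = list(TopologicalSorter(task_view).static_order())
asset_order = list(TopologicalSorter(asset_view).static_order())

print(task_order)
print(asset_order)
```

Both views reduce to a graph that is sorted the same way. What differs is whether the engineer names the work or the data it produces.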
Airflow
Task-centric, oldest, biggest community
Python-coded DAGs with task operators connected by >>. Default in many enterprises. Mature, broad integration, sometimes verbose.
Dagster
Asset-first, typed, software-defined
Pipelines are graphs of data assets. Strong local testing and lineage view. Lower task-level boilerplate, sharper conceptual model.
Prefect
Flow-centric, dynamic, Pythonic
Decorated Python functions become tasks. Hybrid cloud-control plane with self-hosted workers. Dynamic execution and modern API.
Other names worth knowing in passing:
▸ Argo Workflows: Kubernetes-native orchestrator, used heavily in ML and CI/CD
▸ Luigi: Spotify's predecessor to Airflow, still in legacy deployments
▸ Mage: newer, lower-code orchestrator aimed at smaller teams
▸ Temporal: a workflow engine often used for application orchestration rather than data pipelines
▸ Cloud-native: AWS Step Functions, Google Cloud Composer (managed Airflow), Azure Data Factory
How to Choose Between Them
For most teams, the choice is decided by what already runs in production. Migrating from one orchestrator to another costs months of engineer time and rarely earns the cost back. New builds at companies without an existing orchestrator usually pick Dagster or Prefect for the asset-aware, modern API; companies with deep Airflow expertise extend Airflow because retraining a team is expensive. The wrong question is 'which orchestrator is best.' The right question is 'which orchestrator fits this organization, this data stack, and the engineers who will operate it for the next three years.'
•Pick Airflow When
The team already runs Airflow at scale
A specific operator is needed (rare third-party source)
Stability and community size outweigh API freshness
Cloud Composer or MWAA is already in the stack
✓Pick Dagster or Prefect When
A new build with no existing orchestrator
Asset lineage and software-defined data assets matter (Dagster)
A hybrid cloud-control plane is preferred (Prefect)
Local testing and typed pipelines are priorities
TIP
Spend the first week with the orchestrator the team already uses. Read the UI, run a backfill, fail a task on purpose, watch the retry happen. The four responsibilities are the same everywhere; the muscle memory transfers.
First DAG: 3 Tasks, 1 Schedule
Build a three-task DAG with one schedule and one retry policy, and explain the order of execution from the dependencies.
Vocabulary becomes useful when applied. The example below builds a tiny but complete DAG end to end. A retail company wants a daily summary of orders by region. Three tasks chain together: extract orders from Postgres, clean and standardize the rows, aggregate to one row per region per day. The DAG runs once a day at 2am Pacific. Every concept from the previous sections shows up in working code.
Step 1: Name the Tasks
| Task ID | What It Does | Where It Reads From | Where It Writes To |
| --- | --- | --- | --- |
| extract_orders | Pulls new orders from Postgres since the last run | production.orders (Postgres) | raw.orders (Snowflake) |
| clean_orders | Standardizes country codes, drops test accounts | raw.orders | stg.orders |
| aggregate_orders | Counts orders by region for the run date | stg.orders | mart.orders_by_region |
Step 2: Declare the Dependencies
The dependency graph is a chain. Clean reads what extract produces, so clean depends on extract. Aggregate reads what clean produces, so aggregate depends on clean. Two edges, three nodes, no cycles. The DAG is the smallest non-trivial example: a straight line.
[Timeline diagram, typical run: extract_orders starts at 2:00am; later markers at 2:14am and 2:21am]
Step 3: Write the Airflow Code
```python
from datetime import timedelta

import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator

# extract_orders_since_last_run, clean_raw_orders, and aggregate_to_region
# are assumed to be defined or imported elsewhere in the project.

default_args = {
    'owner': 'data-platform',
    'retries': 3,
    'retry_delay': timedelta(minutes=2),
    'retry_exponential_backoff': True,
}

with DAG(
    dag_id='daily_orders_by_region',
    # A timezone-aware start_date makes the cron expression fire at 2am Pacific;
    # a naive datetime would be interpreted in UTC by default.
    start_date=pendulum.datetime(2026, 4, 1, tz='America/Los_Angeles'),
    schedule='0 2 * * *',  # 2am every day
    catchup=False,
    default_args=default_args,
) as dag:

    extract = PythonOperator(
        task_id='extract_orders',
        python_callable=extract_orders_since_last_run,
    )

    clean = PythonOperator(
        task_id='clean_orders',
        python_callable=clean_raw_orders,
    )

    aggregate = PythonOperator(
        task_id='aggregate_orders',
        python_callable=aggregate_to_region,
    )

    # Two edges: extract -> clean -> aggregate
    extract >> clean >> aggregate
```
Three operators, one chain, one schedule. The default_args block applies retry policy uniformly. The chain on the last line is the entire dependency model. When 2am Pacific arrives, Airflow creates a run, the scheduler looks at the DAG, and only extract is ready (it has no dependencies). When extract finishes, clean becomes ready. When clean finishes, aggregate becomes ready. When aggregate finishes, the run is complete.
Step 4: Run It and Watch the UI
```python
# A simulation of a tiny orchestrator running the three-task DAG
# Read the code, then run it to see the order tasks execute
# ...
        raise Exception('No ready tasks: cycle or missing dependency')
    task = ready[0]
    print(f't={clock:>3}m start {task}')
    clock += tasks[task]['duration']
    print(f't={clock:>3}m done {task}')
    tasks[task]['status'] = 'success'
    done.add(task)

print(f'\nAll tasks complete at t={clock}m')
```
The simulation above is a skeleton of what every orchestrator does. It tracks which tasks are ready, runs them in dependency order, and progresses. Real orchestrators add scheduling, retries, parallel execution across multiple workers, persistence, a UI, and dozens of other concerns. The core loop is the same.
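For readers who want something they can paste and run, the core loop can be written out in full as follows. This is a sketch in the same spirit, not the lesson's original listing: the durations are made-up numbers, and `run_dag` is a hypothetical helper.

```python
# Durations are illustrative minutes, not measurements.
tasks = {
    "extract_orders": {"deps": [], "duration": 14, "status": "pending"},
    "clean_orders": {"deps": ["extract_orders"], "duration": 7, "status": "pending"},
    "aggregate_orders": {"deps": ["clean_orders"], "duration": 4, "status": "pending"},
}


def run_dag(tasks):
    """Run tasks one at a time in dependency order; return (order, minutes)."""
    clock, done, order = 0, set(), []
    while len(done) < len(tasks):
        # A task is ready when it has not run and all its dependencies have.
        ready = [
            name for name, spec in tasks.items()
            if name not in done and all(dep in done for dep in spec["deps"])
        ]
        if not ready:
            raise RuntimeError("No ready tasks: cycle or missing dependency")
        task = ready[0]
        print(f"t={clock:>3}m start {task}")
        clock += tasks[task]["duration"]
        print(f"t={clock:>3}m done {task}")
        tasks[task]["status"] = "success"
        done.add(task)
        order.append(task)
    return order, clock


order, total_minutes = run_dag(tasks)
print(f"All tasks complete at t={total_minutes}m")
```

The loop runs one task at a time for clarity. Running every ready task at once, instead of only `ready[0]`, is the step that turns this into parallel execution.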
Step 5: Handle a Failure
When the clean task fails (perhaps the Snowflake warehouse hit a query timeout), Airflow checks the retry policy. The default_args set retries to 3 with exponential backoff. The orchestrator waits about two minutes, retries the clean task, and if it succeeds, the run continues to aggregate. If the initial attempt and all three retries fail, clean is marked failed, aggregate is set to the 'upstream_failed' state, and the run ends in a failed state. An alert fires. On-call opens the UI, sees that clean failed on all four attempts, reads the log, fixes the underlying issue, and re-runs only the failed task. The aggregate task picks up automatically because its only dependency is now satisfied.
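The state bookkeeping in that scenario can be sketched as follows. This is a simplification under stated assumptions: `run_once` is a hypothetical helper, tasks are listed in dependency order, and each task's outcome is given up front as the number of attempts that fail before one succeeds.

```python
def run_once(upstream, failures, retries):
    """Return the final state of every task in one run.

    failures[task] is how many attempts fail before one would succeed.
    A task is allowed retries + 1 attempts in total.
    """
    state = {}
    for task in upstream:   # assumes the dict lists tasks in dependency order
        if any(state[dep] != "success" for dep in upstream[task]):
            state[task] = "upstream_failed"   # never attempted
        elif failures.get(task, 0) > retries:
            state[task] = "failed"            # every attempt was used up
        else:
            state[task] = "success"
    return state


upstream = {
    "extract_orders": [],
    "clean_orders": ["extract_orders"],
    "aggregate_orders": ["clean_orders"],
}

print(run_once(upstream, {"clean_orders": 99}, retries=3))  # persistent failure
print(run_once(upstream, {"clean_orders": 2}, retries=3))   # transient blip
```

In the first run the downstream task is never attempted. In the second, two transient failures are absorbed by the retry budget and the run ends green.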
What this tiny DAG demonstrates:
▸ Three tasks, two edges, no cycles
▸ A schedule (2am daily) that triggers the start of the DAG
▸ A retry policy applied uniformly to every task
▸ Dependencies enforced by the orchestrator, not by clock time
▸ Failure isolation: one failed task halts dependents, not unrelated work
A first DAG can fit in a few dozen lines and still demonstrate every core orchestration concept.
The chain operator >> turns Python objects into a dependency graph the orchestrator can schedule.
Failure isolation is the property that makes a DAG more reliable than a script: dependents wait, unrelated work proceeds.
TIP
Build the smallest DAG first, see it run end to end, fail one task on purpose, and watch the retry happen. Every later complexity is an extension of the same loop.
❯❯❯PUTTING IT ALL TOGETHER
> A startup data team has six cron jobs that run nightly: pull from Postgres, pull from Stripe, clean orders, clean payments, join the two, publish a fact table. The chain has been working for a year. Last week the Postgres pull ran two hours long because of a backfill, and the dashboard showed yesterday's numbers because the downstream jobs ran on stale data. The team asks: 'What is the smallest set of changes that would have prevented this?'
Replace the time-staggered cron schedule with a single DAG. Six tasks, edges that encode the actual data dependencies. The clean tasks wait for their respective extracts; the join waits for both cleans; the publish waits for the join.
Pick one orchestrator and stick with it: Airflow if Cloud Composer or MWAA is already in the stack, Dagster or Prefect for a fresh build. The choice matters less than the consistency.
Set a retry policy on every task: three retries, exponential backoff, alert on final failure. Most transient blips become invisible to operators; real problems still surface.
Use the four pipeline roles from Lesson 1 to label the DAG: extracts are sources, joins and aggregates are transforms, the fact table is curated storage, and the dashboard is the consumer. The shape that the orchestrator schedules is the same shape the architecture diagram shows.
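The six-task DAG from the first point can be written down and checked with the standard library's `graphlib`. The task names below are assumptions, as is the pairing of orders with Postgres and payments with Stripe; the scenario only states that each clean waits for its respective extract.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it must wait for.
dag = {
    "pull_postgres": set(),
    "pull_stripe": set(),
    "clean_orders": {"pull_postgres"},
    "clean_payments": {"pull_stripe"},
    "join": {"clean_orders", "clean_payments"},
    "publish_fact_table": {"join"},
}

sorter = TopologicalSorter(dag)
sorter.prepare()                 # raises graphlib.CycleError if the graph loops

waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())   # everything runnable right now
    waves.append(ready)
    sorter.done(*ready)                  # mark the whole wave as finished

for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

The two pulls form the first wave and the two cleans the second. A two-hour Postgres pull would delay `clean_orders`, `join`, and `publish_fact_table`, but none of them could start on stale data.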
KEY TAKEAWAYS
Cron is a schedule, not an orchestrator: the first cron failure is always the same, slow upstream produces stale downstream silently.
A DAG has nodes, edges, direction, and no cycles: those four properties are what makes scheduling computable in finite time with a defined result.
An orchestrator owns four responsibilities: scheduling, dependency resolution, retries, and visibility. The compute itself is delegated to other systems.
Three orchestrators dominate: Airflow (task-centric, oldest), Dagster (asset-first, typed), Prefect (flow-centric, dynamic). The model is the same; the API differs.
The smallest useful DAG is three tasks chained: extract, transform, publish. One schedule. One retry policy. Every later complexity is an extension of this loop.
Orchestration and Dependencies: Beginner
What runs, when, in what order, and what happens when something fails
Category: Pipeline Architecture
Difficulty: Beginner
Duration: 25 minutes
Challenges: 0 hands-on challenges
Topics covered: Why Cron Is Not an Orchestrator, The DAG: Tasks, Edges, No Cycles, What an Orchestrator Does, The Major Orchestrators by Name, First DAG: 3 Tasks, 1 Schedule