Orchestration and Dependencies: Beginner

An e-commerce company at Series C scale ran a nightly ETL by chaining seven cron jobs at staggered times: extract at 1am, clean at 2am, join at 3am, aggregate at 4am, publish at 5am, and so on. The whole stack stayed glued together for eighteen months because the jobs always finished inside their windows. One Tuesday in late November, Black Friday week traffic doubled the volume of the extract. The 1am job ran until 2:47am. The 2am clean job started on schedule, found no extract output, ran on yesterday's data, and reported a successful run with stale numbers. The 3am join silently joined fresh tables to stale ones. By 9am the executive dashboard was wrong, and nothing in the system noticed. The fix was not a bigger cron schedule or a longer window; it was an orchestrator that knows about dependencies between tasks. This lesson is about that mental shift: from running things on a clock to running things in an order that the system understands.

What you will be able to do

Recognize why scheduling alone is not orchestration and where naive cron chains break

Identify a directed acyclic graph of tasks as the standard model for orchestrated work

Distinguish what an orchestrator owns: scheduling, retries, dependency resolution, and visibility

Why Cron Is Not an Orchestrator

Recognize the canonical failure mode of cron chains and name what an orchestrator adds beyond a schedule.

The first scheduled job most engineers ever write is a cron job. Cron is a Unix utility that runs a command at a fixed time. It is small, reliable, and has been part of every Unix system since 1975. For a single command that runs once a day, cron is the right tool. The trouble starts when several commands need to run in a particular order, and especially when the order has to hold even if one of them runs late. Cron does not know about order. Cron knows about clock time.

What Cron Does and Does Not Do

Capability	Cron	Orchestrator
Run a command at a fixed time	Yes	Yes
Wait for an upstream job to finish before starting	No	Yes
Retry a failing command with backoff	No	Yes
Show whether last night's run succeeded	No (just a log file)	Yes (a UI)
Re-run a failed task without re-running the whole pipeline	No	Yes
Backfill historical date ranges (rerun the pipeline for past days, e.g., the last week of October)	No	Yes

The Failure That Always Comes First

Engineers who chain cron jobs by clock time eventually hit the same bug. Job A is scheduled at 1am and is expected to finish in thirty minutes. Job B is scheduled at 2am because it reads what A produces. One night A runs slow because the source had more data than usual. A finishes at 2:47am. B started at 2:00am, found no new output, and either failed or, worse, ran successfully on stale data. The whole chain produced numbers that looked complete and were wrong. This is the canonical first cron failure, and every team that runs more than two scheduled jobs hits it within a year.

	# A cron chain that works most days and fails badly on the bad ones
	0 1 * * * /usr/local/bin/extract_orders.sh >> /var/log/extract.log 2>&1
	0 2 * * * /usr/local/bin/clean_orders.sh >> /var/log/clean.log 2>&1
	0 3 * * * /usr/local/bin/join_orders.sh >> /var/log/join.log 2>&1
	0 4 * * * /usr/local/bin/aggregate_orders.sh >> /var/log/agg.log 2>&1
	0 5 * * * /usr/local/bin/publish_orders.sh >> /var/log/pub.log 2>&1
	# The 2am job assumes the 1am job finished. Cron has no way to enforce that.

Hidden costs of a cron chain that nobody notices on day one:

▸Slow upstream jobs cause silent stale reads downstream
▸A failed job in the middle does not block the rest of the chain
▸There is no UI; status lives in log files spread across servers
▸Re-running step 3 after a fix means manually re-running steps 4 and 5
▸Backfilling last Tuesday means writing a one-off script that mimics the chain

Why Adding Sleeps Does Not Fix the Problem

The first patch every team tries is to widen the windows. If the extract sometimes takes ninety minutes, schedule the clean job at 3am instead of 2am. Then the clean job sometimes takes longer, so the join job moves to 5am. After a few months of widening, the pipeline starts at 1am and finishes at 9am, which is the moment the dashboard is supposed to be ready. The widening solves no actual problem; it spends slack to mask a missing dependency model. Real orchestration encodes the dependency directly: the clean job runs when the extract job finishes successfully, not at a fixed clock time.
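The difference can be made concrete with a small simulation. The sketch below is illustrative only: the job names, durations, and `simulate_*` helpers are assumptions, and times are minutes after midnight. It replays a slow-extract night under both models.

```python
# Illustrative durations in minutes; extract is having a slow night (107 min).
DURATIONS = {'extract': 107, 'clean': 20, 'join': 15}
CRON_STARTS = {'extract': 60, 'clean': 120, 'join': 180}  # 1am, 2am, 3am
CHAIN = ['extract', 'clean', 'join']  # each job reads the previous job's output


def simulate_cron(durations, starts, chain):
    """Every job starts at its fixed clock time, whatever upstream is doing."""
    finish, report = {}, {}
    upstream_fresh = True
    for i, job in enumerate(chain):
        start = starts[job]
        if i > 0:
            prev = chain[i - 1]
            # Input is fresh only if upstream finished before this job started
            # and upstream itself worked on fresh input.
            upstream_fresh = upstream_fresh and finish[prev] <= start
        finish[job] = start + durations[job]
        report[job] = (start, 'fresh' if upstream_fresh else 'STALE')
    return report


def simulate_dependencies(durations, first_start, chain):
    """Every job starts when its upstream job finishes."""
    report, clock = {}, first_start
    for job in chain:
        report[job] = (clock, 'fresh')
        clock += durations[job]
    return report


cron_report = simulate_cron(DURATIONS, CRON_STARTS, CHAIN)
dag_report = simulate_dependencies(DURATIONS, 60, CHAIN)
for job in CHAIN:
    print(f'{job:8} cron: start={cron_report[job][0]:>3} {cron_report[job][1]:5}  '
          f'dag: start={dag_report[job][0]:>3} {dag_report[job][1]}')
```

Under the clock-based model the clean and join jobs start on time and read stale input; under the dependency-based model they start later and read fresh input. No amount of window widening changes that unless the window happens to exceed the worst-case run.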

•Cron Chain (Time-Based)

Each job runs at a fixed clock time, regardless of upstream state
Slow upstream produces stale downstream silently
Failures do not stop later jobs
Re-running step N requires re-running everything after it by hand

✓Orchestrator (Dependency-Based)

Each task runs after its declared dependencies succeed
Slow upstream delays downstream rather than corrupting it
A failed task halts dependent tasks automatically
The orchestrator can re-run a failed task and resume the rest

The Mental Shift

An orchestrator changes the question from 'what time does this run' to 'what does this depend on.' The schedule still exists, but it sits at the top of the dependency graph: the daily DAG starts at 1am, and every task inside it runs after its dependencies are satisfied. The clock triggers the start of the graph; the graph triggers the order of the tasks. Cron handles only the first half. An orchestrator handles both.

The first cron chain failure is always the same: slow upstream produces stale downstream silently.

An orchestrator runs tasks when their dependencies finish, not when the clock says so.

Widening cron windows is a workaround that spends slack to hide a missing dependency model.

TIP

Any pipeline with more than two scheduled jobs that depend on each other has outgrown cron. The next failure is not a question of if; it is a question of when the upstream job runs slow.

The DAG: Tasks, Edges, No Cycles

Identify nodes, edges, and the acyclic property in a DAG, and read a small DAG declaration that encodes a real pipeline.

Every modern orchestrator models a pipeline as a directed acyclic graph, abbreviated DAG. The structure is a small mathematical object with four properties. It has nodes (the tasks). It has edges (the dependencies). Every edge points in one direction. And the edges cannot form a loop. Those properties are not stylistic preferences. They are the conditions that make the graph computable: a structure with cycles cannot be scheduled to completion, and a structure without direction cannot be ordered.

Vocabulary, Once and Precisely

Term	Meaning	Concrete Example
Task (node)	A single unit of work the orchestrator schedules	Run a SQL query, run a Python script, copy a file
Dependency (edge)	A rule that says one task waits for another	join_orders runs after extract_orders finishes
Directed	Edges point from upstream to downstream	Data and dependency both flow one way
Acyclic	No path leads back to its starting node	extract -> clean -> join, never join -> extract
DAG	The whole structure: tasks plus their dependencies	The pipeline a daily ETL declares to the orchestrator

Why Direction Matters

A pipeline that runs tasks in any order produces undefined results. The clean step needs the raw rows; running the clean step before the extract finishes means cleaning yesterday's rows or no rows at all. The direction on each edge encodes the temporal order. The orchestrator reads the directed graph, computes a valid ordering, and starts each task only when every upstream task (every task with an edge pointing to it) has already succeeded.
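One way to picture how direction drives readiness: the sketch below stores edges as (upstream, downstream) pairs, inverts them into an upstream map, and asks which tasks may start given the set of tasks that have succeeded. The edge list and helper names are illustrative assumptions, not any orchestrator's API.

```python
from collections import defaultdict

# Each edge is (upstream, downstream): data flows from the first to the second.
EDGES = [('extract', 'clean'), ('clean', 'aggregate')]


def upstream_map(edges):
    """Map each task to the set of tasks that must succeed before it."""
    upstream = defaultdict(set)
    for up, down in edges:
        upstream[down].add(up)
        upstream[up]  # touching the key registers tasks that have no upstream
    return dict(upstream)


def ready_tasks(upstream, succeeded):
    """Tasks not yet run whose upstream tasks have all succeeded."""
    return sorted(task for task, deps in upstream.items()
                  if task not in succeeded and deps <= succeeded)


UPSTREAM = upstream_map(EDGES)
print(ready_tasks(UPSTREAM, set()))                 # nothing has run yet
print(ready_tasks(UPSTREAM, {'extract'}))           # extract has succeeded
print(ready_tasks(UPSTREAM, {'extract', 'clean'}))  # extract and clean succeeded
```

Reversing any pair in `EDGES` changes which task is ready first, which is the sense in which the direction of an edge carries the ordering.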

Why Cycles Are Forbidden

If task A depends on task B, and task B depends on task A, neither can start. Each is waiting for the other. The orchestrator has no defined behavior in that case because no valid order exists. The acyclic constraint is what guarantees the orchestrator can compute a starting set (tasks with no upstream dependencies) and proceed from there until every task has run. In a graph with a cycle, the tasks on the cycle never become eligible, and if every task sits on a cycle there is no starting set at all. The validation that catches cycles before deploy is not a nicety; it is the only way the runtime can be sure work will progress.

How cycles sneak into a DAG accidentally:

▸A new task is added that reads from a table another task in the same DAG overwrites
▸Two engineers add edges in the same week without seeing each other's changes
▸A backfill task is added at the end of the DAG and depends on the start
▸A circular reference in the data model leaks into the dependency graph
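The first item on that list can be checked mechanically. The sketch below assumes each task declares the tables it reads and writes (the `TASKS` layout, the `repair_orders` task, and the table names are hypothetical), derives an edge from every writer to every reader of the same table, and lets Python's standard `graphlib` module (Python 3.9+) report the cycle.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical declarations: which tables each task reads and writes.
TASKS = {
    'extract_orders': {'reads': [], 'writes': ['raw.orders']},
    'clean_orders': {'reads': ['raw.orders'], 'writes': ['stg.orders']},
    # Newly added task: reads the staged table but overwrites the raw one.
    'repair_orders': {'reads': ['stg.orders'], 'writes': ['raw.orders']},
}


def derive_upstreams(tasks):
    """A task depends on every other task that writes a table it reads."""
    upstreams = {name: set() for name in tasks}
    for reader, spec in tasks.items():
        for writer, other in tasks.items():
            if writer != reader and set(spec['reads']) & set(other['writes']):
                upstreams[reader].add(writer)
    return upstreams


def find_cycle(tasks):
    """Return the tasks on a cycle, or None if the graph is a valid DAG."""
    try:
        list(TopologicalSorter(derive_upstreams(tasks)).static_order())
    except CycleError as err:
        return err.args[1]  # graphlib stores the offending node list here
    return None


print('cycle:', find_cycle(TASKS))
```

Run at deploy time, a check of this kind rejects the new task before the scheduler ever sees the graph.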

A First DAG, in Words

Consider a daily pipeline with three tasks. The first extracts orders from Postgres into a raw S3 zone. The second cleans the raw rows and writes a curated table in Snowflake. The third aggregates the curated table into a daily summary. Two edges describe the dependencies. Extract points to clean. Clean points to aggregate. The graph has three nodes, two edges, no cycles. The orchestrator can run it.

extract_orders (source) -> clean_orders (transform) -> aggregate_orders (publish)

The Same DAG, in Code

	# Airflow-style declaration: the >> operator declares a dependency edge
	from datetime import datetime

	from airflow import DAG
	from airflow.operators.python import PythonOperator

	# Placeholder callables so the file parses; the real task logic goes here
	def run_extract(): ...
	def run_clean(): ...
	def run_aggregate(): ...

	with DAG('daily_orders', start_date=datetime(2026, 4, 1), schedule='@daily') as dag:
	    extract = PythonOperator(task_id='extract_orders', python_callable=run_extract)
	    clean = PythonOperator(task_id='clean_orders', python_callable=run_clean)
	    aggregate = PythonOperator(task_id='aggregate_orders', python_callable=run_aggregate)

	    extract >> clean >> aggregate

Three Python objects, two arrows, one schedule. The orchestrator reads this file, validates the graph, and runs the tasks in order every day starting at midnight. If the extract fails, clean does not run. If clean fails, aggregate does not run. If a task fails halfway, the orchestrator can retry it without rerunning the tasks that already succeeded. The single line `extract >> clean >> aggregate` is the entire dependency model, and the orchestrator handles the rest.

Node

A unit of scheduled work

One task. A SQL query, a Python function, a shell command. The smallest piece the orchestrator can run, retry, or skip independently.

Edge

A declared dependency

An arrow from upstream to downstream. The downstream task waits for the upstream task to succeed before it runs.

Acyclic

No paths loop back

The constraint that makes scheduling possible. A starting set of tasks always exists, and progress is guaranteed.

	# Detect a cycle in a small DAG before the orchestrator tries to schedule it
	# A valid DAG returns a topological order; an invalid DAG raises

	def topological_order(graph):
	    visited = set()
	    on_stack = set()
	    order = []

	    def visit(node):
	        if node in on_stack:
	            raise ValueError(f'cycle detected at node: {node}')
	        if node in visited:
	            return
	        on_stack.add(node)
	        for child in graph.get(node, []):
	            visit(child)
	        on_stack.discard(node)
	        visited.add(node)
	        order.append(node)

	    for node in graph:
	        visit(node)
	    return list(reversed(order))

	valid_dag = {
	    'extract': ['clean'],
	    'clean': ['aggregate'],
	    'aggregate': [],
	}
	print('valid DAG order:', topological_order(valid_dag))

	cyclic_dag = {
	    'extract': ['clean'],
	    'clean': ['aggregate'],
	    'aggregate': ['extract'],  # cycle: aggregate -> extract -> clean -> aggregate
	}
	try:
	    topological_order(cyclic_dag)
	except ValueError as e:
	    print(f'cyclic DAG rejected: {e}')

✓Do

Declare every dependency explicitly in the DAG; never rely on clock-based ordering inside one DAG
Validate the DAG at deploy time so cycles are caught before they reach production
Keep tasks small enough that a retry is cheap

✗Don't

Use one giant task that does extract, clean, and aggregate together
Add an edge that points back into a task earlier in the DAG
Treat the schedule as the dependency mechanism within a single DAG

[Figure: extract -> transform -> load, writing to warehouse storage]

An orchestration DAG: tasks are nodes, dependencies are edges, and there are no cycles. The orchestrator runs each task only after its upstream finishes, retries failures, and backfills, all things cron cannot do.

What an Orchestrator Does

Distinguish the four responsibilities an orchestrator owns from the work the orchestrator delegates to other systems.

An orchestrator is the system that owns four responsibilities: deciding when work runs, running it in the right order, retrying it when it fails, and showing what happened. The four are not separate features bolted together. They reinforce each other. A retry is meaningful only if dependencies are tracked. A schedule is operable only if a UI exists to inspect it. Visibility is useful only if failures are recorded as events the system can react to. Every orchestrator that ships sells the same four properties under different brands.

Retries only produce the same answer as a single run when the work is idempotent: running it twice gives the same result as running it once. That property is the subject of Lesson 5 (idempotency and backfill).
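A minimal sketch of that distinction, using an in-memory dict as a stand-in for a table (the function names and rows are illustrative assumptions): an append-style load changes the result when a retry runs it twice, while a load that overwrites the partition for its run date does not.

```python
def append_load(table, run_date, rows):
    """Not idempotent: every run adds the rows again."""
    table.setdefault(run_date, []).extend(rows)


def overwrite_load(table, run_date, rows):
    """Idempotent: every run replaces the partition for run_date."""
    table[run_date] = list(rows)


rows = [{'order_id': 1}, {'order_id': 2}]

appended, overwritten = {}, {}
for _ in range(2):  # the original attempt plus one retry
    append_load(appended, '2026-04-01', rows)
    overwrite_load(overwritten, '2026-04-01', rows)

print(len(appended['2026-04-01']))     # the retry duplicated the rows
print(len(overwritten['2026-04-01']))  # same as a single run
```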

Responsibility 1: Scheduling

The orchestrator owns when a DAG starts. The schedule is usually a cron expression, an interval, or an external trigger. When the schedule fires, the orchestrator creates a run instance and begins traversing the DAG. The schedule applies to the whole DAG; the order of tasks inside the DAG is governed by the dependency graph, not the clock.
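As a rough illustration of a schedule creating run instances, the sketch below expands a daily 1am schedule into one run per day between two dates. The `daily_runs` helper and the `run_id` format are made up for this example; real orchestrators parse full cron expressions and handle time zones.

```python
from datetime import datetime, timedelta


def daily_runs(start, until, hour=1):
    """Yield one run instance per day from start up to (not including) until."""
    fire_time = start.replace(hour=hour, minute=0, second=0, microsecond=0)
    if fire_time < start:
        fire_time += timedelta(days=1)  # the first firing falls on the next day
    while fire_time < until:
        yield {'run_id': f'scheduled__{fire_time:%Y-%m-%dT%H:%M}',
               'fire_time': fire_time}
        fire_time += timedelta(days=1)


runs = list(daily_runs(datetime(2026, 4, 1), datetime(2026, 4, 4)))
for run in runs:
    print(run['run_id'])
```

Each run instance then traverses the same DAG. Nothing in the schedule says anything about the order of tasks inside a run, and the same expansion over a past date range is the basis of a backfill.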

Responsibility 2: Dependency Resolution

The orchestrator looks at the DAG and computes which tasks have no unsatisfied dependencies. Those tasks become eligible to run. As tasks finish, more tasks become eligible. The traversal is the topological sort discussed in the previous section, executed at runtime. This is the responsibility cron does not have. It is also the responsibility that, once present, makes most other features possible.
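The runtime traversal can be sketched with the standard library's `graphlib.TopologicalSorter` (Python 3.9+), which exposes this "what is eligible now" loop directly. The task names are illustrative, and a real orchestrator would hand each eligible task to a worker instead of marking it done immediately.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on (its upstream tasks).
DEPENDS_ON = {
    'extract_orders': set(),
    'extract_payments': set(),
    'clean_orders': {'extract_orders'},
    'clean_payments': {'extract_payments'},
    'join': {'clean_orders', 'clean_payments'},
}

sorter = TopologicalSorter(DEPENDS_ON)
sorter.prepare()  # validates the graph; raises CycleError on a cycle

waves = []
while sorter.is_active():
    eligible = sorted(sorter.get_ready())  # no unsatisfied dependencies left
    waves.append(eligible)
    for task in eligible:
        sorter.done(task)  # pretend the task ran and succeeded

for number, wave in enumerate(waves, start=1):
    print(f'wave {number}: {wave}')
```

Tasks in the same wave have no dependency on each other, so an orchestrator with more than one worker is free to run them in parallel.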

Responsibility 3: Retries

Tasks fail. Networks blip, sources go down for thirty seconds, a query times out under unusual load. A naive system fails the whole pipeline on the first error. An orchestrator distinguishes transient failures (worth retrying) from terminal failures (not worth retrying) and applies a configured retry policy. Common configurations include a maximum number of retries, a delay between retries, and an exponential backoff that doubles the delay each time. The retry policy is set per task, because the right answer is not the same for an HTTP fetch and a SQL transform.

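	# Retry policy declared at the task level
	from datetime import timedelta

	from airflow.operators.python import PythonOperator

	def fetch_payments(): ...  # placeholder for the real Stripe fetch

	fetch_stripe = PythonOperator(
	    task_id='fetch_stripe_payments',
	    python_callable=fetch_payments,
	    retries=5,
	    retry_delay=timedelta(seconds=30),
	    retry_exponential_backoff=True,
	    max_retry_delay=timedelta(minutes=10),
	)
	# Up to 5 retries after the first attempt (6 attempts in total).
	# Waits roughly double each time (about 30s, 60s, 120s, 240s, 480s) and never exceed the 600s cap.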
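The arithmetic behind an exponential backoff with a cap can be sketched in a few lines. The `backoff_delays` helper is hypothetical and models only plain doubling; production schedulers typically add jitter, so real waits will not match these numbers exactly.

```python
def backoff_delays(base_seconds, retries, cap_seconds):
    """Delay before each retry: the base doubles each time, up to the cap."""
    return [min(base_seconds * 2 ** n, cap_seconds) for n in range(retries)]


delays = backoff_delays(base_seconds=30, retries=5, cap_seconds=600)
print(delays)       # one delay per retry
print(sum(delays))  # total seconds spent waiting if every retry is needed

# With more retries the cap starts to matter.
print(backoff_delays(base_seconds=30, retries=7, cap_seconds=600))
```

The total wait is worth computing when choosing a policy: a task that can spend a quarter of an hour in backoff delays everything downstream of it by the same amount.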

Responsibility 4: Visibility

An orchestrator without a UI is a black box that runs jobs and produces log files. Modern orchestrators ship with a web UI that shows every DAG, every run, every task instance, and the status of each. Operators can inspect why a task failed, view its logs, re-run it, mark it as successful manually, or pause an entire DAG. The UI is the on-call surface. When something is wrong, the engineer who is paged opens the UI first.

Visibility Surface	What It Shows	Why It Matters
DAG list	Every pipeline registered with the orchestrator and its current state	On-call sees at a glance which pipelines are healthy
Run history	Every prior execution of a DAG with timestamps and status	Trends are visible: a job that gets slower week over week
Task instance log	Stdout and stderr of a single task on a single run	The first place a debugger goes when a task fails
Graph view	The DAG drawn with nodes colored by state	The shape of the failure is visible: which branch broke

What an orchestrator owns:

▸Scheduling: when a DAG starts
▸Dependency resolution: in what order tasks within the DAG run
▸Retries: what happens when a task fails transiently
▸Visibility: how operators see what ran, what failed, and why

	# Simulate the retry responsibility of an orchestrator
	# A task fails twice with transient errors, then succeeds on the third try

	def run_task_with_retry(task_name, max_retries=3, delay_seconds=2):
	    print(f'Running task {task_name}:')
	    attempt = 0
	    while attempt <= max_retries:
	        attempt += 1
	        # Simulate transient failure on the first two attempts
	        will_fail = attempt < 3
	        status = 'failed' if will_fail else 'success'
	        print(f'  attempt {attempt}: status={status}')
	        if status == 'success':
	            return 'success'
	        if attempt > max_retries:
	            return 'failed_terminal'
	        # Exponential backoff: the wait doubles on each retry (printed, not slept)
	        wait = delay_seconds * (2 ** (attempt - 1))
	        print(f'  retry in {wait}s (exponential backoff)')
	    return 'failed_terminal'

	result = run_task_with_retry('fetch_stripe_payments')
	print(f'\nFinal status: {result}')
	print('Note: the orchestrator owned the retry decision, not the task code.')

What the Orchestrator Does Not Own

An orchestrator is not a transformation engine. It does not know how to clean a customer record or aggregate a fact table. It calls out to other systems (a Snowflake warehouse, a Spark cluster, a Python container) that do the actual work, and tracks whether those calls succeeded. The line between orchestrator and worker is sharp and intentional. The orchestrator stays small and reliable; the heavy compute lives elsewhere. Confusing the two leads to orchestrators that try to do everything and fail at the one thing they were chosen for.
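The split can be sketched with a task that hands the real work to another process and records only whether it succeeded. Here a child Python process stands in for a warehouse query or a Spark job; `run_delegated` and the inline commands are illustrative assumptions.

```python
import subprocess
import sys


def run_delegated(task_id, command):
    """Launch the work elsewhere and track only the outcome."""
    result = subprocess.run(command, capture_output=True, text=True)
    status = 'success' if result.returncode == 0 else 'failed'
    return {'task_id': task_id, 'status': status,
            'log': (result.stdout + result.stderr).strip()}


# The child process does the computation; this side never touches the rows.
ok = run_delegated('aggregate_orders',
                   [sys.executable, '-c', 'print(sum([3, 4, 5]))'])
bad = run_delegated('clean_orders',
                    [sys.executable, '-c', 'raise SystemExit(2)'])

print(ok['status'], ok['log'])
print(bad['status'])
```

The orchestrator side needs only an exit status and a log to decide what happens next, which is why it can stay small while the compute scales independently.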

✓Orchestrator Owns

When a DAG starts
What order tasks run in
Retry policy and failure routing
The UI that shows run state

•Worker Systems Own

The actual transform: SQL, Spark, Python
Reading from sources and writing to destinations
Heavy compute and memory
The data shape itself

An orchestrator without a UI is a black box. The UI is not a nice-to-have; it is the on-call surface that turns failures into something a human can act on.

Four responsibilities define an orchestrator: schedule, resolve, retry, and show.

Retries belong to the orchestrator because they require knowledge of the dependency graph.

The orchestrator delegates compute; it does not perform transforms itself.

The Major Orchestrators by Name

Name the three major orchestrators, describe what each emphasizes, and explain the shared model that makes them interchangeable in concept.

Three orchestrators dominate modern data engineering: Airflow, Dagster, and Prefect. Each ships the four responsibilities described in the previous section, but they make different choices in the API and the philosophy. Knowing the names matters because production environments have already chosen one (or, more often, are slowly migrating from one to another). Knowing what they have in common matters more, because the choice of tool changes which buttons are pressed, not what the buttons do.

Apache Airflow

Airflow is the oldest and most widely deployed of the three. Maxime Beauchemin started it at Airbnb in 2014; it entered the Apache Incubator in 2016 and became a top-level Apache project in 2019. Pipelines are declared as Python files; tasks are operators (PythonOperator, BashOperator, SQLExecuteQueryOperator) connected with the >> operator. The model is task-centric: tasks are the unit of scheduling, and dependencies are between tasks. Strengths include an enormous community, broad operator coverage, and a stable production track record. Trade-offs include a steeper learning curve, an older imperative model, and a tendency for DAGs to grow into procedural Python code that drifts from declarative dependency definition.

Dagster

Dagster, started by Nick Schrock at Elementl in 2018, takes an asset-first view. The unit of declaration is the data asset (a table, a file, a feature) and the orchestrator computes which assets need to be refreshed and how. Tasks still exist underneath, but the API foregrounds the data, not the work. Strengths include a typed pipeline model, software-defined assets, strong local testing story, and an asset graph view that mirrors the data lineage. Trade-offs include a smaller community than Airflow, more conceptual overhead for engineers used to the task-first model, and fewer pre-built integrations.

Prefect

Prefect, started by Jeremiah Lowin in 2018, was built as a reaction to Airflow's quirks. Pipelines are flows, tasks are decorated Python functions, and the orchestration model is dynamic: the flow can decide at runtime which tasks to run. Strengths include a Pythonic API, a clean dynamic execution model, and a hybrid execution architecture where the orchestrator runs in the cloud and the workers run in the company's own infrastructure. Trade-offs include a smaller installed base than Airflow, faster API churn between major versions, and less mature support for the long tail of niche source systems.

Orchestrator	Origin	Model	Best Fit
Airflow	Airbnb, 2014	Task-centric, imperative DAG	Large existing deployments, broad integration needs, stable production
Dagster	Elementl, 2018	Asset-first, typed, software-defined	New builds emphasizing data lineage and testability
Prefect	Prefect Technologies, 2018	Flow-centric, Pythonic, dynamic	Teams that want a hybrid cloud-orchestration model and dynamic graphs

What They Have in Common

All three model pipelines as DAGs. All three take a schedule and produce runs. All three offer retries, dependency resolution, and a UI. All three integrate with Snowflake, BigQuery, S3, dbt, Spark, Kubernetes, and the rest of the modern data stack. The shared shape matters more than the API differences. An engineer who has internalized one orchestrator can be productive in another within days, because the four responsibilities behave the same way in all three. The brand is a tool choice; the model is the same.

Airflow

Task-centric, oldest, biggest community

Python-coded DAGs with task operators connected by >>. Default in many enterprises. Mature, broad integration, sometimes verbose.

Dagster

Asset-first, typed, software-defined

Pipelines are graphs of data assets. Strong local testing and lineage view. Lower task-level boilerplate, sharper conceptual model.

Prefect

Flow-centric, dynamic, Pythonic

Decorated Python functions become tasks. Hybrid cloud-control plane with self-hosted workers. Dynamic execution and modern API.

Other names worth knowing in passing:

▸Argo Workflows: Kubernetes-native orchestrator, used heavily in ML and CI/CD
▸Luigi: Spotify's orchestrator, which predates Airflow and still runs in legacy deployments
▸Mage: newer, lower-code orchestrator aimed at smaller teams
▸Temporal: a workflow engine often used for application orchestration rather than data pipelines
▸Cloud-native: AWS Step Functions, Google Cloud Composer (managed Airflow), Azure Data Factory

How to Choose Between Them

For most teams, the choice is decided by what already runs in production. Migrating from one orchestrator to another costs months of engineer time and rarely earns the cost back. New builds at companies without an existing orchestrator usually pick Dagster or Prefect for the asset-aware, modern API; companies with deep Airflow expertise extend Airflow because retraining a team is expensive. The wrong question is 'which orchestrator is best.' The right question is 'which orchestrator fits this organization, this data stack, and the engineers who will operate it for the next three years.'

•Pick Airflow When

The team already runs Airflow at scale
A specific operator is needed (rare third-party source)
Stability and community size outweigh API freshness
Cloud Composer or MWAA is already in the stack

✓Pick Dagster or Prefect When

A new build with no existing orchestrator
Asset lineage and software-defined data assets matter (Dagster)
A hybrid cloud-control plane is preferred (Prefect)
Local testing and typed pipelines are priorities

TIP

Spend the first week with the orchestrator the team already uses. Read the UI, run a backfill, fail a task on purpose, watch the retry happen. The four responsibilities are the same everywhere; the muscle memory transfers.

First DAG: 3 Tasks, 1 Schedule

Build a three-task DAG with one schedule and one retry policy, and explain the order of execution from the dependencies.

Vocabulary becomes useful when applied. The example below builds a tiny but complete DAG end to end. A retail company wants a daily summary of orders by region. Three tasks chain together: extract orders from Postgres, clean and standardize the rows, aggregate to one row per region per day. The DAG runs once a day at 2am Pacific. Every concept from the previous sections shows up in working code.

Step 1: Name the Tasks

Task ID	What It Does	Where It Reads From	Where It Writes To
extract_orders	Pulls new orders from Postgres since the last run	production.orders (Postgres)	raw.orders (Snowflake)
clean_orders	Standardizes country codes, drops test accounts	raw.orders	stg.orders
aggregate_orders	Counts orders by region for the run date	stg.orders	mart.orders_by_region

Step 2: Declare the Dependencies

The dependency graph is a chain. Clean reads what extract produces, so clean depends on extract. Aggregate reads what clean produces, so aggregate depends on clean. Two edges, three nodes, no cycles. The DAG is the smallest non-trivial example: a straight line.

extract_orders (2:00 am) -> clean_orders (2:14 am) -> aggregate_orders (2:21 am), start times on a typical run

Step 3: Write the Airflow Code

	from datetime import timedelta

	import pendulum
	from airflow import DAG
	from airflow.operators.python import PythonOperator

	# Placeholder callables; the real extract, clean, and aggregate logic goes here
	def extract_orders_since_last_run(): ...
	def clean_raw_orders(): ...
	def aggregate_to_region(): ...

	default_args = {
	    'owner': 'data-platform',
	    'retries': 3,
	    'retry_delay': timedelta(minutes=2),
	    'retry_exponential_backoff': True,
	}

	with DAG(
	    dag_id='daily_orders_by_region',
	    # A timezone-aware start_date makes the cron expression fire in Pacific time;
	    # a naive datetime would be interpreted as UTC by default
	    start_date=pendulum.datetime(2026, 4, 1, tz='America/Los_Angeles'),
	    schedule='0 2 * * *',  # 2am every day
	    catchup=False,
	    default_args=default_args,
	) as dag:

	    extract = PythonOperator(
	        task_id='extract_orders',
	        python_callable=extract_orders_since_last_run,
	    )

	    clean = PythonOperator(
	        task_id='clean_orders',
	        python_callable=clean_raw_orders,
	    )

	    aggregate = PythonOperator(
	        task_id='aggregate_orders',
	        python_callable=aggregate_to_region,
	    )

	    extract >> clean >> aggregate

Three operators, one chain, one schedule. The default_args block applies retry policy uniformly. The chain on the last line is the entire dependency model. When 2am Pacific arrives, Airflow creates a run, the scheduler looks at the DAG, and only extract is ready (it has no dependencies). When extract finishes, clean becomes ready. When clean finishes, aggregate becomes ready. When aggregate finishes, the run is complete.

Step 4: Run It and Watch the UI

	# A simulation of a tiny orchestrator running the three-task DAG
	# Read the code, then run it to see the order tasks execute

	tasks = {
	    'extract_orders': {'deps': set(), 'duration': 14, 'status': 'pending'},
	    'clean_orders': {'deps': {'extract_orders'}, 'duration': 7, 'status': 'pending'},
	    'aggregate_orders': {'deps': {'clean_orders'}, 'duration': 3, 'status': 'pending'},
	}

	done = set()
	clock = 0  # minutes since the run started
	while len(done) < len(tasks):
	    # A task is ready when it is still pending and all of its deps are done
	    ready = [t for t, info in tasks.items()
	             if info['status'] == 'pending' and info['deps'].issubset(done)]
	    if not ready:
	        raise RuntimeError('No ready tasks: cycle or missing dependency')
	    task = ready[0]
	    print(f't={clock:>3}m start {task}')
	    clock += tasks[task]['duration']
	    print(f't={clock:>3}m done {task}')
	    tasks[task]['status'] = 'success'
	    done.add(task)

	print(f'\nAll tasks complete at t={clock}m')

The simulation above is a skeleton of what every orchestrator does. It tracks which tasks are ready, runs them in dependency order, and progresses. Real orchestrators add scheduling, retries, parallel execution across multiple workers, persistence, a UI, and dozens of other concerns. The core loop is the same.

Step 5: Handle a Failure

When the clean task fails (perhaps the Snowflake warehouse hit a query timeout), Airflow checks the retry policy. The default_args set retries to 3 with exponential backoff. The orchestrator waits about two minutes, retries the clean task, and if it succeeds, the run continues to aggregate. If all three retries also fail, clean is marked failed after four attempts in total, aggregate is set to the 'upstream_failed' state, and the run ends in a failed state. An alert fires, provided alerting is configured. On-call opens the UI, sees that clean exhausted its retries, reads the log, fixes the underlying issue, and clears the failed task so that it runs again; extract is not re-run. In Airflow, clearing a task from the UI normally clears its downstream tasks as well, so aggregate is reset and runs once clean succeeds.
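The state transitions in that story can be sketched as a tiny state machine. The state names mirror Airflow's `failed` and `upstream_failed`, but the `run_dag` and `clear` helpers are simplified stand-ins written for this example, not Airflow functions, and retries are collapsed into a single failed outcome.

```python
from collections import Counter

ORDER = ['extract_orders', 'clean_orders', 'aggregate_orders']  # linear chain
executions = Counter()  # how many times each task actually ran


def run_dag(states, failing=frozenset()):
    """Run pending tasks in order; a failure blocks every later task."""
    upstream_ok = True
    for task in ORDER:
        if states[task] == 'success':
            continue  # finished on an earlier pass, so it is not re-run
        if not upstream_ok:
            states[task] = 'upstream_failed'
        elif states[task] == 'pending':
            executions[task] += 1
            states[task] = 'failed' if task in failing else 'success'
        upstream_ok = upstream_ok and states[task] == 'success'
    return states


def clear(states, task):
    """Reset a task and everything downstream of it to pending."""
    for name in ORDER[ORDER.index(task):]:
        states[name] = 'pending'
    return states


states = {task: 'pending' for task in ORDER}
first_pass = dict(run_dag(states, failing={'clean_orders'}))
print(first_pass)

# On-call fixes the root cause, then clears the failed task and its dependents.
second_pass = dict(run_dag(clear(states, 'clean_orders')))
print(second_pass)
print(dict(executions))
```

The execution counts show the point of failure isolation: the extract ran once, the clean step ran twice, and the aggregate ran only after its upstream had actually succeeded.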

What this tiny DAG demonstrates:

▸Three tasks, two edges, no cycles
▸A schedule (2am daily) that triggers the start of the DAG
▸A retry policy applied uniformly to every task
▸Dependencies enforced by the orchestrator, not by clock time
▸Failure isolation: one failed task halts dependents, not unrelated work

A first DAG can fit in twenty lines and still demonstrate every core orchestration concept.

The chain operator >> turns Python objects into a dependency graph the orchestrator can schedule.

Failure isolation is the property that makes a DAG more reliable than a script: dependents wait, unrelated work proceeds.

TIP

Build the smallest DAG first, see it run end to end, fail one task on purpose, and watch the retry happen. Every later complexity is an extension of the same loop.

❯❯❯PUTTING IT ALL TOGETHER

> A startup data team has six cron jobs that run nightly: pull from Postgres, pull from Stripe, clean orders, clean payments, join the two, publish a fact table. The chain has been working for a year. Last week the Postgres pull ran two hours long because of a backfill, and the dashboard showed yesterday's numbers because the downstream jobs ran on stale data. The team asks: 'What is the smallest set of changes that would have prevented this?'

Replace the time-staggered cron schedule with a single DAG. Six tasks, edges that encode the actual data dependencies. The clean tasks wait for their respective extracts; the join waits for both cleans; the publish waits for the join.

Pick one orchestrator and stick with it: Airflow if Cloud Composer or MWAA is already in the stack, Dagster or Prefect for a fresh build. The choice matters less than the consistency.

Set a retry policy on every task: three retries, exponential backoff, alert on final failure. Most transient blips become invisible to operators; real problems still surface.

Use the four pipeline roles from Lesson 1 to label the DAG: extracts are sources, joins and aggregates are transforms, the fact table is curated storage, and the dashboard is the consumer. The shape that the orchestrator schedules is the same shape the architecture diagram shows.
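The single six-task DAG proposed in the first step can be sketched with the standard library's `graphlib` (Python 3.9+) and one simple rule: a task starts when its slowest upstream finishes. Task names and durations are illustrative assumptions; times are minutes after the DAG starts, on a night when the Postgres pull runs two hours long.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it waits for.
DEPENDS_ON = {
    'pull_postgres': set(),
    'pull_stripe': set(),
    'clean_orders': {'pull_postgres'},
    'clean_payments': {'pull_stripe'},
    'join': {'clean_orders', 'clean_payments'},
    'publish': {'join'},
}
# Illustrative durations in minutes; pull_postgres is far slower than usual.
DURATION = {'pull_postgres': 150, 'pull_stripe': 20, 'clean_orders': 10,
            'clean_payments': 10, 'join': 15, 'publish': 5}


def timeline(depends_on, duration):
    """Return {task: (start, finish)} when tasks start as soon as allowed."""
    times = {}
    for task in TopologicalSorter(depends_on).static_order():
        start = max((times[up][1] for up in depends_on[task]), default=0)
        times[task] = (start, start + duration[task])
    return times


TIMES = timeline(DEPENDS_ON, DURATION)
for task, (start, finish) in TIMES.items():
    print(f'{task:15} start={start:>3} finish={finish:>3}')
```

The payments branch finishes early and waits; the join starts only after both cleans are done. The fact table is published late on the slow night, but it is never built from stale input.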

KEY TAKEAWAYS

Cron is a schedule, not an orchestrator: the first cron failure is always the same, slow upstream produces stale downstream silently.

A DAG has nodes, edges, direction, and no cycles: those four properties are what make scheduling computable in finite time with a defined result.

An orchestrator owns four responsibilities: scheduling, dependency resolution, retries, and visibility. The compute itself is delegated to other systems.

Three orchestrators dominate: Airflow (task-centric, oldest), Dagster (asset-first, typed), Prefect (flow-centric, dynamic). The model is the same; the API differs.

The smallest useful DAG is three tasks chained: extract, transform, publish. One schedule. One retry policy. Every later complexity is an extension of this loop.

What runs, when, in what order, and what happens when something fails

Category: Pipeline Architecture
Difficulty: beginner
Duration: 25 minutes
Challenges: 0 hands-on challenges

Topics covered: Why Cron Is Not an Orchestrator, The DAG: Tasks, Edges, No Cycles, What an Orchestrator Does, The Major Orchestrators by Name, First DAG: 3 Tasks, 1 Schedule
