Orchestration and Dependencies: Beginner

What runs, when, in what order, and what happens when something fails

Category
Pipeline Architecture
Difficulty
beginner
Duration
25 minutes
Challenges
0 hands-on challenges

Topics covered: Why Cron Is Not an Orchestrator, The DAG: Tasks, Edges, No Cycles, What an Orchestrator Does, The Major Orchestrators by Name, First DAG: 3 Tasks, 1 Schedule

Lesson Sections

  1. Why Cron Is Not an Orchestrator (concepts: paOrchestrationVsCron, paDependencyModel)

    The first scheduled job most engineers ever write is a cron job. Cron is a Unix utility that runs a command at a fixed time. It is small, reliable, and has been part of every Unix system since 1975. For a single command that runs once a day, cron is the right tool. The trouble starts when several commands need to run in a particular order, and especially when the order has to hold even if one of them runs late. Cron does not know about order. Cron knows about clock time. What Cron Does and Does
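
    The clock-time problem can be illustrated with a little arithmetic. The sketch below is not cron itself. It assumes two hypothetical cron entries, one starting an extract job at 02:00 and one starting a clean job at 02:30, and asks whether the extract has finished by the time the clean job starts. The job names, start times, and durations are made up for illustration.

```python
# Hypothetical fixed start times, in minutes after midnight,
# as two independent cron entries would define them.
CRON_START = {"extract": 120, "clean": 150}  # 02:00 and 02:30


def clean_input_ready(extract_duration_min: int) -> bool:
    """Return True if extract has finished by the time cron starts clean."""
    extract_end = CRON_START["extract"] + extract_duration_min
    return extract_end <= CRON_START["clean"]


normal_night = clean_input_ready(20)  # extract ends at 02:20, before clean starts
slow_night = clean_input_ready(45)    # extract ends at 02:45, after clean has started
print(normal_night, slow_night)
```

    On the slow night the clean job still starts at 02:30, because nothing in a clock-based schedule records that it should wait for the extract.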

  2. The DAG: Tasks, Edges, No Cycles (concepts: paDagOrchestration, paTaskDependency)

    Every modern orchestrator models a pipeline as a directed acyclic graph, abbreviated DAG. The structure is a small mathematical object with three properties. It has nodes (the tasks). It has edges (the dependencies). The edges point in one direction, and they cannot form a loop. Those properties are not stylistic preferences. They are the conditions that make the graph computable: a structure with cycles cannot be scheduled at all, and a structure without direction cannot be ordered. Vocabulary,
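
    The two conditions, direction and no cycles, can be checked with Python's standard-library `graphlib` module (Python 3.9 or later). This is a minimal sketch and not how any particular orchestrator is implemented. The task names `extract`, `clean`, and `aggregate` are illustrative.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task (node); its value is the set of tasks it depends on (edges).
dag = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
}

# A directed graph without cycles can always be put into a valid run order.
order = list(TopologicalSorter(dag).static_order())
print(order)

# Making extract depend on aggregate closes a loop, so no run order exists.
cyclic = {**dag, "extract": {"aggregate"}}
try:
    list(TopologicalSorter(cyclic).static_order())
    cycle_detected = False
except CycleError:
    cycle_detected = True
print(cycle_detected)
```

    The first graph yields an order in which every task comes after the tasks it depends on. The second raises `CycleError`, which is the computable version of "a structure with cycles cannot be scheduled".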

  3. What an Orchestrator Does (concepts: paOrchestratorRoles, paRetryPolicy)

    An orchestrator is the system that owns four responsibilities: deciding when work runs, running it in the right order, retrying it when it fails, and showing what happened. The four are not separate features bolted together. They reinforce each other. A retry is meaningful only if dependencies are tracked. A schedule is operable only if a UI exists to inspect it. Visibility is useful only if failures are recorded as events the system can react to. Every orchestrator that ships sells the same fou
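
    Three of the four responsibilities (ordering, retrying, and recording what happened) fit in a short sketch. Scheduling is left out, and `run_pipeline`, its parameters, and the task functions are hypothetical names, not the API of any real orchestrator.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, depends_on, max_retries=2):
    """Run tasks in dependency order, retrying failures and logging every attempt."""
    events = []     # (task, attempt, outcome) tuples: the visibility record
    failed = set()  # tasks that did not succeed, so downstream tasks can be skipped
    for name in TopologicalSorter(depends_on).static_order():
        # A task is skipped if anything directly upstream did not succeed.
        if depends_on.get(name, set()) & failed:
            events.append((name, 0, "skipped"))
            failed.add(name)
            continue
        # One initial attempt plus up to max_retries retries.
        for attempt in range(1, max_retries + 2):
            try:
                tasks[name]()
                events.append((name, attempt, "success"))
                break
            except Exception:
                events.append((name, attempt, "failed"))
        else:
            failed.add(name)  # every attempt failed
    return events


calls = {"clean": 0}


def extract():
    pass


def clean():
    # Fails on the first call only, to simulate a transient error.
    calls["clean"] += 1
    if calls["clean"] == 1:
        raise RuntimeError("transient failure")


def aggregate():
    pass


events = run_pipeline(
    {"extract": extract, "clean": clean, "aggregate": aggregate},
    {"extract": set(), "clean": {"extract"}, "aggregate": {"clean"}},
)
for event in events:
    print(event)
```

    The retry of `clean` only makes sense because the runner knows `aggregate` depends on it, and the event list is what a UI would display.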

  4. The Major Orchestrators by Name (concepts: paOrchestratorTools, paAirflowDagsterPrefect)

    Three orchestrators dominate modern data engineering: Airflow, Dagster, and Prefect. Each ships the four responsibilities described in the previous section, but they make different choices in the API and the philosophy. Knowing the names matters because production environments have already chosen one (or, more often, are slowly migrating from one to another). Knowing what they have in common matters more, because the choice of tool changes which buttons are pressed, not what the buttons do. Apac

  5. First DAG: 3 Tasks, 1 Schedule (concepts: paFirstDag, paChainedDependencies)

    Vocabulary becomes useful when applied. The example below builds a tiny but complete DAG end to end. A retail company wants a daily summary of orders by region. Three tasks chain together: extract orders from Postgres, clean and standardize the rows, aggregate to one row per region per day. The DAG runs once a day at 2am Pacific. Every concept from the previous sections shows up in working code.

    Step 1: Name the Tasks

    Step 2: Declare the Dependencies

    The dependency graph is a chain. Clean reads
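
    The shape of the three-task chain can be sketched in plain Python, without committing to any orchestrator's API. Everything here is an assumption made for illustration: the in-memory rows stand in for the Postgres query, the column names `region` and `amount` are invented, and `SCHEDULE` only records the intended schedule (a cron expression for 2am daily plus the IANA timezone name for Pacific time) rather than triggering anything.

```python
from collections import defaultdict
from graphlib import TopologicalSorter

# Intended schedule, recorded as data only: daily at 02:00 Pacific.
SCHEDULE = {"cron": "0 2 * * *", "timezone": "America/Los_Angeles"}


def extract(_):
    # Stand-in for a Postgres query; these rows are made up.
    return [
        {"region": " west ", "amount": "10.00"},
        {"region": "West", "amount": "5.50"},
        {"region": "east", "amount": "7.25"},
    ]


def clean(rows):
    # Standardize region names and convert amounts to numbers.
    return [
        {"region": r["region"].strip().lower(), "amount": float(r["amount"])}
        for r in rows
    ]


def aggregate(rows):
    # One total per region for the day's rows.
    totals = defaultdict(float)
    for r in rows:
        totals[r["region"]] += r["amount"]
    return dict(totals)


TASKS = {"extract": extract, "clean": clean, "aggregate": aggregate}
DEPENDS_ON = {"extract": set(), "clean": {"extract"}, "aggregate": {"clean"}}


def run_once():
    """Run the chain in dependency order, passing each output to the next task."""
    result = None
    for name in TopologicalSorter(DEPENDS_ON).static_order():
        result = TASKS[name](result)
    return result


summary = run_once()
print(summary)
```

    Passing each result straight to the next task works only because the graph is a chain. A DAG with branches would need to track outputs per task.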