What a Data Pipeline Is: Advanced
Pipelines are products with owners, contracts, and lifecycles, not scripts that move data
- Category: Pipeline Architecture
- Difficulty: advanced
- Duration: 35 minutes
- Challenges: 0 hands-on challenges
Topics covered: Pipelines as Products, The Cross-Cutting Undercurrents, When to Split, When to Merge, Build vs Buy at Each Layer, Redesigning a Tangled Graph
Lesson Sections
- Pipelines as Products (concepts: paPipelineAsProduct, paDataContracts)
A script copies data; a pipeline serves consumers. The difference is not size. The difference is the existence of a contract. A contract names the consumer, names the producer, names what is delivered, names how often, and names what happens when the delivery fails. Pipelines without contracts accumulate, drift, and rot. The accumulated rot is the largest hidden cost in the data engineering organizations of mature companies. The discipline of treating pipelines as products is the only known anti
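The five elements of a contract named in the Pipelines as Products preview (consumer, producer, what is delivered, how often, and what happens on failure) can be written down as a small record. The following is a minimal sketch, not a standard contract format: the class name `DataContract`, its field names, and the example values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DataContract:
    """Hypothetical contract record; all field names are illustrative."""

    producer: str        # who is responsible for the delivery
    consumer: str        # who depends on the delivery
    deliverable: str     # what is delivered
    cadence: timedelta   # how often a fresh delivery is promised
    on_failure: str      # what happens when the delivery fails

    def is_late(self, age: timedelta) -> bool:
        """Return True if the latest delivery is older than the promised cadence."""
        return age > self.cadence


# Example values are invented for this sketch.
orders_contract = DataContract(
    producer="orders-ingest team",
    consumer="finance dashboard",
    deliverable="analytics.daily_orders table",
    cadence=timedelta(hours=24),
    on_failure="page the producer on-call; consumer keeps the last good snapshot",
)

print(orders_contract.is_late(timedelta(hours=25)))
print(orders_contract.is_late(timedelta(hours=23)))
```

Writing the contract as data rather than prose means a freshness check like `is_late` can be run by a monitor instead of being remembered by a person.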
- The Cross-Cutting Undercurrents (concepts: paUndercurrents, paPipelineLifecycle)
The four roles (source, transform, storage, consumer) describe what a pipeline does. They do not describe the cross-cutting concerns that touch every role. Joe Reis and Matt Housley call these concerns 'undercurrents' in their data engineering lifecycle framework, and the term is apt: they run beneath the surface of every layer. A pipeline that addresses the four roles but ignores the undercurrents is a pipeline that works on the demo and breaks in production. Senior engineers spend much of thei
- When to Split, When to Merge (concepts: paDagBoundaries, paAssetTriggers)
Two pipeline architectures are equivalent in what they produce and very different in how they operate. One large DAG with sixty tasks runs as a single unit. Six DAGs with ten tasks each run as separate units. The choice is one of the most consequential architectural decisions a senior engineer makes, and it cannot be made once for all time; the right boundary changes as the system grows. The principle is simple to state and hard to apply: split when the cost of coupling exceeds the cost of coord
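The trade-off between one large DAG and several smaller ones can be made concrete with two crude proxies. The sketch below assumes a hypothetical task graph and treats the number of dependency edges that cross DAG boundaries as a stand-in for coordination cost, and the size of the largest DAG as a stand-in for coupling (how many tasks run and fail as one unit). Both proxies, the task names, and the helper functions are assumptions made for illustration; the lesson itself does not prescribe these measures.

```python
from collections import Counter

# Hypothetical task graph: each task maps to the tasks it depends on.
deps = {
    "load_orders": [],
    "load_users": [],
    "clean_orders": ["load_orders"],
    "clean_users": ["load_users"],
    "orders_report": ["clean_orders", "clean_users"],
}


def cross_dag_edges(deps, assignment):
    """Count dependency edges whose two endpoints are assigned to different DAGs."""
    return sum(
        1
        for task, upstreams in deps.items()
        for upstream in upstreams
        if assignment[task] != assignment[upstream]
    )


def largest_dag_size(assignment):
    """Return the number of tasks in the biggest DAG of an assignment."""
    return max(Counter(assignment.values()).values())


# Candidate 1: everything in a single DAG.
one_dag = {task: "monolith" for task in deps}

# Candidate 2: split by domain (an illustrative boundary choice).
split = {
    "load_orders": "orders",
    "clean_orders": "orders",
    "orders_report": "orders",
    "load_users": "users",
    "clean_users": "users",
}

for name, assignment in [("one_dag", one_dag), ("split", split)]:
    print(name, cross_dag_edges(deps, assignment), largest_dag_size(assignment))
```

In this toy graph the split removes two tasks from the largest unit at the price of one cross-DAG dependency that now has to be coordinated some other way, for example with a sensor or an asset-based trigger.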
- Build vs Buy at Each Layer (concepts: paBuildVsBuy)
Every layer of a pipeline can be built in-house or bought from a vendor. The choice is rarely all build or all buy; the right answer differs per layer. Ingestion has mature SaaS options (Fivetran, Airbyte) that solve the boring 80% of source extraction at a real per-row cost. Orchestration has open-source options (Airflow, Dagster, Prefect) that have absorbed most of what custom schedulers used to do. Storage and warehousing have been almost entirely commoditized into Snowflake, BigQuery, Databr
- Redesigning a Tangled Graph (concepts: paArchitectureRedesign)
The synthesis exercise is a realistic problem. A mid-size company has accumulated 80 production DAGs over four years. The data team has grown from three engineers to twelve. The new tech lead has been asked to make the system operable. The exercise walks through the diagnosis and the redesign, using every concept from the lesson and the prior tiers.
The Symptoms
Diagnosis: What Each Symptom Reveals
Redesign Step 1: Establish the Layered Shape
The first move is to introduce a shared raw layer a