A senior data engineer at a public fintech inherited 412 production DAGs. Roughly 80 of them were mission critical and had owners. The other 332 had no owner, no documentation, and no consumers anyone could identify. Some of them had been running for four years. Turning any of them off was risky because nobody could prove the absence of a hidden consumer. The engineer's job for the next year was less about building new pipelines and more about deprecating old ones, which required a contract for what a pipeline was, who owned it, and how it would be retired. This lesson is about the framing that turns a pipeline from a script into a product with a lifecycle. The framing is the difference between a 200-pipeline environment that scales and a 200-pipeline environment that paralyzes.
Pipelines as Products
Daily Life
Interviews
Apply the pipeline-as-product framing: name the consumer, write the contract, and define the deprecation path before the pipeline ships.
A script copies data; a pipeline serves consumers. The difference is not size. The difference is the existence of a contract. A contract names the consumer, names the producer, names what is delivered, names how often, and names what happens when the delivery fails. Pipelines without contracts accumulate, drift, and rot. The accumulated rot is the largest hidden cost in the data engineering organizations of mature companies. The discipline of treating pipelines as products is the only known antidote.
What a Pipeline Contract Contains
Element
What It Specifies
Why It Matters
Producer
The team that owns the pipeline and is paged when it fails
Without an owner, no one fixes failures; orphaned pipelines rot
Consumer
The named downstream that depends on the output
If no consumer can be named, the pipeline is dead code
Schema
Column names, types, nullability, primary key
Consumer code depends on the shape; schema changes break consumers
Freshness SLA
How current the data is guaranteed to be (e.g., 'within the last hour')
Consumer planning depends on freshness; without a stated bar, every delay is a crisis
Quality SLA
Row count bounds, null thresholds, distribution checks
The pipeline is allowed to fail; what is not allowed is silent corruption
Backfill policy
How far back data can be re-derived and at what cost
Consumers ask for fixes to historical periods; the policy answers in advance
Deprecation policy
How the pipeline ends: notice period, migration path, sunset date
Without an end-of-life policy, every pipeline runs forever
Why the Contract Has to Be Written Down
Implicit contracts are not contracts. A consumer team that 'knows' the data is updated daily because it has been daily for two years is not party to a contract; they are party to a habit. Habits change. The producer who decides to switch to hourly to save cost has no way to know that a downstream model retrains daily and now retrains six times per cost cycle. The producer who decides to drop a column because nobody seems to use it has no way to know that a quarterly finance report depends on it once a quarter. Written contracts catch these collisions before they ship. The cost of writing them down is the cost of the conversations they force, which are conversations that need to happen anyway.
Contracts fail in two directions. The producer fails to deliver: the pipeline runs late, the row count drops below the floor, a column starts arriving with too many nulls. The consumer fails to abide: someone reads a column that was marked deprecated, or assumes a freshness tighter than the SLA. Both failures are recoverable when the contract is written, because the contract names the failure mode and the response. Both failures are unrecoverable when the contract is implicit, because the conversation about who is wrong starts after the production incident is already in progress.
Signals that a pipeline is being treated as a script, not a product:
▸Asking 'who owns this?' returns a long pause or a name of someone who left the company
▸The freshness expectation is in someone's head, not in a YAML file
▸Consumer teams maintain their own copies of the same logic 'just in case'
▸Schema changes are announced in standups rather than in PRs
▸Deprecating a pipeline requires a months-long archaeology project
The Maturity Model
Level
Pipeline Treated As
Visible Behavior
0
A script someone wrote
Owner is whoever last touched it; failures are firefighting
1
A scheduled job
Owner is named; failures route to a Slack channel; no SLA
2
A service
Freshness SLA exists; alerts route to on-call; consumers expect uptime
3
A product
Contract is written; quality SLAs exist; deprecation has a process; new consumers sign on explicitly
Most companies operate at level 1 or 2 for most of their pipelines. Level 3 is achievable but requires sustained investment in tooling and discipline. The shift from level 2 to level 3 is the shift from 'we build pipelines' to 'we publish data products,' and it changes the conversation between the data team and the rest of the company. The data team becomes a producer with named consumers, not a service desk that fields one-off requests.
✓Do
Write a contract for every new pipeline at the time it is built, not after
Name the consumer; if no consumer can be named, do not build the pipeline
Treat schema, freshness, and quality SLAs as PR-reviewable artifacts
✗Don't
Build pipelines in service of 'someone might want this later'
Allow contracts to live in tribal knowledge; they leave the company when people leave
Treat deprecation as an afterthought; build the end-of-life path with the start-of-life path
The Cross-Cutting Undercurrents
Daily Life
Interviews
Identify the six cross-cutting undercurrents and map each one onto the four pipeline roles.
The four roles (source, transform, storage, consumer) describe what a pipeline does. They do not describe the cross-cutting concerns that touch every role. Joe Reis and Matt Housley call these concerns 'undercurrents' in their data engineering lifecycle framework, and the term is apt: they run beneath the surface of every layer. A pipeline that addresses the four roles but ignores the undercurrents is a pipeline that works on the demo and breaks in production. Senior engineers spend much of their time on the undercurrents, not on the roles, because the roles are well-understood and the undercurrents are where the failure modes live.
DAG scheduling, dependency resolution, retries, and failure isolation. Without orchestration, a pipeline is a script that someone has to remember to run.
Observability
Did it run, did it succeed, was the output right
Logs, metrics, traces, and lineage. Without observability, failure modes are invisible until consumers complain. Five pillars: freshness, volume, schema, distribution, lineage.
Security
Who can access what data, in what role
Authentication, authorization, encryption, PII handling. Without security, the pipeline is a compliance liability waiting to be discovered.
Data management
Catalog, lineage, governance, quality
The metadata layer that lets consumers discover and trust the data. Without it, the same data is rebuilt repeatedly because no one knows it exists.
DataOps
Deployment, testing, version control, CI/CD
The engineering discipline applied to data work. Without it, every change is a production change and rollbacks are manual.
Cost
Who pays for what compute and storage
Attribution and optimization. Without cost discipline, pipelines silently consume budget that nobody knows belongs to them.
Where the Undercurrents Touch Each Role
Undercurrent
At the Source
At the Transform
At the Consumer
Orchestration
When to extract; how to coordinate with the source's availability
DAG dependencies; retries; partial failure recovery
Notify when fresh data is available; trigger downstream
Observability
Did the extract pull the expected volume; source schema check
Did the transform produce the expected row count and distributions
Did the consumer-facing table update; did dashboards stay green
Security
Who has read access to the source; how is the credential stored
PII redaction during transform; column-level encryption
Access control on consumer tables; row-level security
Data management
Cataloging the source; lineage from source to raw
Lineage between transforms; quality tests at each step
Discoverability of consumer tables; ownership metadata
DataOps
Source change management; sandbox environments
Version-controlled transform code; PR review; CI tests
Backwards-compatible schema changes; deprecation process
Cost
Egress costs from source systems
Compute cost of transforms; warehouse credits
Storage cost of consumer tables; query cost
The Cost of Ignoring an Undercurrent
An ignored undercurrent does not stop the pipeline from running. It changes how the pipeline fails. A pipeline with no observability fails silently and is discovered when a consumer asks 'why are these numbers strange.' A pipeline with no cost discipline runs successfully every day and produces a budget overrun at the end of the quarter. A pipeline with no DataOps practice runs successfully every day until the day a transform change is rolled out and breaks downstream consumers, and the rollback takes hours because there is no test environment. The undercurrents are the reason for the gap between 'this code works' and 'this code is in production.'
•Pipeline Without Undercurrents
Runs successfully on the demo, fails silently in production
Failure modes are discovered by consumers, not by the pipeline
Cost is unknown until the cloud bill arrives
Schema changes are deployed without testing
✓Pipeline With Undercurrents
Failures route to on-call within minutes via observability
Quality issues caught at the transform, not at the dashboard
Cost is tagged per pipeline; budget overruns are predictable
Schema changes go through CI; consumers are notified before deploy
The Senior Engineer's Allocation
An engineer early in their career spends most of their time on the four roles: writing extracts, writing transforms, choosing storage, building consumer-facing tables. A senior engineer spends most of their time on the undercurrents: orchestrating dependencies, instrumenting observability, hardening security, governing data, applying DataOps, controlling cost. The shift is not a change in interest. It is a change in where the leverage is. The roles can be built once and reused. The undercurrents have to be reinforced continuously, and a senior engineer's time is best spent on the layer that compounds across every pipeline rather than on individual pipelines.
TIP
When reviewing a new pipeline design, ask 'what is the orchestration story, what is the observability story, what is the security story' before asking 'what does the transform do.' The transform is usually fine. The undercurrents are where designs fail review.
An ignored undercurrent does not prevent the pipeline from running; it changes how the pipeline fails.
The four roles describe what a pipeline does; the six undercurrents describe how it survives in production.
Senior leverage compounds when invested in undercurrents (platforms, tooling, governance) rather than individual pipelines.
When to Split, When to Merge
Daily Life
Interviews
Decide when to split a pipeline into multiple DAGs and when to merge several into one, based on ownership, cadence, and operational cost.
Two pipeline architectures are equivalent in what they produce and very different in how they operate. One large DAG with sixty tasks runs as a single unit. Six DAGs with ten tasks each run as separate units. The choice is one of the most consequential architectural decisions a senior engineer makes, and it cannot be made once for all time; the right boundary changes as the system grows. The principle is simple to state and hard to apply: split when the cost of coupling exceeds the cost of coordination.
Costs of a Single Large Pipeline
Cost
What It Looks Like
Blast radius
One task fails, the whole DAG halts; unrelated downstream work is delayed
Deployment friction
Any change requires testing the whole DAG; small changes carry the risk of large ones
Ownership ambiguity
When sixty tasks span four teams, no one team owns the DAG
Scheduling rigidity
All tasks run on the same cadence; faster cadences are awkward to add
Long tail
The end-to-end latency is the sum of every task; tail latency dominates
Costs of Many Small Pipelines
Cost
What It Looks Like
Cross-DAG coordination
DAG B depends on DAG A; sensors or asset triggers add lag and complexity
Discoverability
Twenty DAGs are harder to find than one; lineage is fragmented
Operational overhead
Each DAG has its own alerts, runbooks, on-call rotation
Drift
Two DAGs that should share logic implement it twice
Latency from sensors
Polling sensors add minutes of lag at every cross-DAG boundary
The Right Place to Split
The boundary that ages well is the boundary of ownership. A pipeline that is owned by one team should be one DAG. A pipeline that crosses team boundaries should split at the team boundary, with a clear contract at the seam. The rule has the same shape as Conway's law applied to pipelines: the architecture mirrors the org chart, and pretending it does not is more expensive than embracing it. A second principle is the boundary of cadence. Tasks that must run hourly belong in a different DAG from tasks that must run daily. Mixing cadences in one DAG forces the slower cadence on faster work or wastes compute by over-running slow work.
Signals it is time to split a pipeline:
▸Two teams contribute to the same DAG and disagree about deployment cadence
▸The DAG mixes hourly and daily work; the slow tasks dominate the schedule
▸A single failure in one branch halts unrelated downstream branches
▸On-call cannot tell which team to page when the DAG fails
▸End-to-end latency is dominated by tasks that no consumer waits for
The Right Place to Merge
Merging two DAGs into one is the reverse decision and the rarer one. Merging makes sense when two DAGs are owned by the same team, share most of their inputs, must always run together, and have no independent consumers. The signal is operational pain from cross-DAG coordination: the sensors that connect them are unreliable, the latency between them is consistent and avoidable, and the failure modes always involve both DAGs. Merging is the right answer in those cases, and it is wrong every other time. Many engineers reach for merging because one DAG feels simpler, but the simplicity is local; the operational consequences are global.
The Cross-DAG Contract
When a pipeline does split, the seam between the two DAGs is a contract in itself. DAG A produces a table at a stated time with a stated schema; DAG B reads it. The seam is observable: 'fct_orders updated at 2:14am' is a fact that DAG B can act on. Modern orchestrators (Dagster, Airflow with asset triggers, Prefect with results) support asset-based triggering: DAG B starts when the asset DAG A produces is updated, regardless of which DAG produced it. The asset model decouples the two DAGs more cleanly than a sensor on DAG A's run state, because the consumer cares about the data being fresh, not about the producer DAG having run.
1
# Asset-based cross-DAG dependency, Dagster-style
2
@asset(deps=[fct_orders])
3
defrevenue_dashboard_mart(fct_orders):
4
# Runs whenever fct_orders is updated, regardless of which DAG updated it
5
...
6
7
# vs. sensor-based, which polls the producer DAG's run state
Tasks share most inputs and have no independent consumers
End-to-end latency is acceptable as the sum of all tasks
✓Split into Multiple
Multiple teams contribute to the work
Different parts must run at different cadences
Branches have independent consumers and independent failure tolerance
Tasks have no shared inputs and could deploy independently
TIP
Decide split versus merge by the question 'who is paged when this fails.' If two different teams should be paged for two different parts, split. If the same person is paged either way, leave it as one DAG.
Build vs Buy at Each Layer
Daily Life
Interviews
Apply the build-versus-buy decision per layer, weighing differentiation, total cost of ownership, and strategic risk.
Every layer of a pipeline can be built in-house or bought from a vendor. The choice is rarely all build or all buy; the right answer differs per layer. Ingestion has mature SaaS options (Fivetran, Airbyte) that solve the boring 80% of source extraction at a real per-row cost. Orchestration has open-source options (Airflow, Dagster, Prefect) that have absorbed most of what custom schedulers used to do. Storage and warehousing have been almost entirely commoditized into Snowflake, BigQuery, Databricks, and a few others. Transform tooling has converged on dbt for SQL-shaped work and Spark for non-SQL. The build-versus-buy conversation has shifted from 'should we build a warehouse' (no) to 'should we build the connector for this one weird source' (sometimes).
The Calculus
Factor
Pushes Toward Build
Pushes Toward Buy
Differentiation
The capability is core to the company's product
The capability is generic infrastructure
Volume
Volume is so high that vendor pricing exceeds engineering cost
Volume is moderate; vendor pricing is reasonable
Cycle time
Iteration speed matters more than total cost of ownership
Cycle time is acceptable at vendor pace
Compliance
Data cannot leave the company's environment
Vendor offers compliant deployment options
Engineer time
Engineers are available and want to build
Engineers are scarce and the build cost is opportunity cost
Fivetran, Airbyte, and similar services handle the long tail of SaaS source connectors. Build only for sources unique to the company (proprietary IoT data, custom internal systems).
Storage
Almost always buy
Snowflake, BigQuery, Databricks, Redshift. The economics of building a warehouse are unfavorable for any company that is not in the business of selling warehouses. Self-managed Postgres for tiny analytics is the only common build case.
Transform
Buy the framework, build the logic
dbt for SQL transforms, Spark for non-SQL. The framework is bought; the actual transform logic (the business rules) is always built, because nobody else knows the business.
Orchestration
Buy or use open-source; rarely build from scratch
Airflow, Dagster, Prefect, Mage. Custom schedulers exist at very large scale (Airbnb's original Airflow, Lyft's Flyte) but the marginal benefit shrinks every year as the open-source options mature.
Observability
Mixed; depends on scale
Monte Carlo, Bigeye, Soda for buy. Custom dbt tests plus PagerDuty for build. Smaller teams build; larger teams that have outgrown homegrown tooling buy.
Catalog/Lineage
Increasingly buy
DataHub, Atlan, Collibra. Building a catalog is a multi-year project that almost always underperforms the bought version.
The Hidden Cost of Building
The most common error in build-versus-buy is comparing the cost of writing the first version to the cost of buying the vendor's product. The right comparison is the cost of writing, maintaining, debugging, monitoring, on-calling, and eventually replacing the in-house version against the vendor's price. Built infrastructure has a tail of operational cost that is invisible at the moment of building. A typical custom data ingestion framework, three years in, is consuming the equivalent of one full-time engineer in maintenance and is on its second or third major rewrite. The vendor was always cheaper; the spreadsheet at year zero did not show it.
The Hidden Cost of Buying
Buying is not free of hidden costs either. Vendor lock-in is the most discussed: switching from Fivetran to Airbyte is non-trivial because connector configurations, monitoring, and downstream lineage all assume the vendor. Pricing surprises are real: a Snowflake bill that grows 4x in a quarter because a careless query pattern blew through compute credits is common in companies that have not invested in cost observability. Vendor outages cause direct outages: when Snowflake has a regional outage, every consumer in the company is affected at once, and the company has no recourse beyond waiting. The build-versus-buy conversation should price these tail risks, not the per-row cost alone.
Hidden Cost
Build
Buy
Maintenance
Tail of engineering effort that grows with feature surface
Locked into the company's own implementation; rewrite is the only exit
Locked into the vendor; switching cost depends on the layer
Outage
The team owns every outage and every fix
Vendor outages are external; no internal recourse
Hiring signal
Engineers want to work on novel problems, not internal Airflow clones
Engineers want to work on the latest tools, which vendors usually offer
The Decision Framework
A useful frame is the test of strategic differentiation. Anything that differentiates the company's product belongs in build. Anything that does not differentiate the company's product belongs in buy if a viable vendor exists. A streaming media company should not be building its own data warehouse, but it might build its own ML feature platform if recommendation quality is the product. A bank should not be building its own orchestrator, but it might build its own catalog if regulatory lineage is core to its compliance posture. The strategic differentiation test is faster than the spreadsheet and tends to produce the same answer.
Build the business logic, the differentiated transform, and the company-specific source connector
Price the tail: maintenance, on-call, lock-in, outage, hiring signal
✗Don't
Build a warehouse, an orchestrator, or a generic ingestion framework in 2026
Buy a black-box solution for the company's most differentiated capability
Compare a year-zero build cost to a year-zero buy cost; the tail dominates
Redesigning a Tangled Graph
Daily Life
Interviews
Diagnose a tangled production pipeline graph and apply the lesson's framings to redesign it for operability.
The synthesis exercise is a real-shaped problem. A mid-size company has accumulated 80 production DAGs over four years. The data team has grown from three engineers to twelve. The new tech lead has been asked to make the system operable. The exercise walks through the diagnosis and the redesign, using every concept from the lesson and the prior tiers.
The Symptoms
Operational symptoms reported by the data team:
▸On-call gets paged 8 to 12 times per night, half for failures with no clear owner
▸Three different teams compute weekly active users; numbers disagree by 4 to 7 percent
▸Schema changes in Postgres orders break six different DAGs simultaneously
▸The Snowflake bill grew 3.5x year over year with no clear cause
▸Adding a new source takes four weeks of cross-team coordination
▸Two engineers are leaving and the team is losing institutional memory
Diagnosis: What Each Symptom Reveals
Symptom
Underlying Cause
Concept From This Lesson
Pages with no clear owner
Pipelines treated as scripts; no contract names the producer
Pipelines as products
Three teams compute WAU differently
No shared curated layer; each team rebuilds the logic
The shared middle layer (intermediate tier)
Schema changes break six DAGs
Direct dependencies on raw schema; no decoupling at curated layer
Layered architecture; raw zone decoupling
Snowflake bill 3.5x
No cost observability; no cost attribution per pipeline
Cost as an undercurrent
Four weeks to add a new source
Each source built bespoke; no ingestion platform
Build vs buy at the ingestion layer
Losing institutional memory
Knowledge not captured in contracts, lineage, or catalog
Data management as an undercurrent
Redesign Step 1: Establish the Layered Shape
The first move is to introduce a shared raw layer and a shared curated layer if they do not already exist in clean form. Audit every existing DAG. Each one becomes a candidate for relocation: the parts that read sources move into a shared raw landing layer, the parts that produce business-ready tables move into a shared curated layer, and the parts that produce consumer-specific shapes stay with their owning team in the serving layer. Most DAGs decompose cleanly along these lines. Some do not, and those are the ones that need redesign rather than relocation.
Redesign Step 2: Write Contracts for the Top 10 Pipelines
The 80/20 rule applies. Roughly ten pipelines produce the data that powers most of the consumer experience. Write contracts for those ten. Name producers, name consumers, name freshness and quality SLAs. Publish the contracts and require new consumers to read them. The remaining seventy can be triaged later: half are probably orphans that can be deprecated; the rest will eventually get contracts as they become important enough to warrant the work.
Redesign Step 3: Buy What Should Be Bought
The four-week onboarding for a new source is a build-versus-buy signal. Audit the existing custom ingestion code. If it is solving generic SaaS connector problems (Stripe, Salesforce, Google Ads, Shopify), replace it with a managed ingestion service. The team's engineering capacity should go into the differentiated work, not into reimplementing connectors that twenty other companies have already built. The savings appear as a reduction in maintenance burden, faster time-to-onboard for new sources, and an end to the recurring 'why is this connector broken' Slack thread.
Redesign Step 4: Split the Mega-DAGs
Identify the DAGs that span team boundaries and split them at the ownership seam. Replace cross-DAG dependencies with asset triggers so the consumer DAG runs when the producer DAG's output is updated, not when the producer DAG's run state changes. The split reduces blast radius, clarifies on-call ownership, and lets each team deploy independently. The cost is a small amount of cross-DAG coordination overhead, which is now the right cost to pay because the savings in deployment friction and on-call clarity exceed it.
Redesign Step 5: Instrument the Undercurrents
Add cost attribution by tagging every Snowflake query with the pipeline ID. The 3.5x bill growth becomes legible immediately because the spend can be attributed to specific pipelines. Add observability via the five pillars (freshness, volume, schema, distribution, lineage) so failure modes route to alerts rather than to dashboards that nobody is watching. Add a catalog so consumers can discover existing curated tables before building new ones. The undercurrent investments do not produce immediate visible output, but they are the layer where the future operational pain or relief lives.
The Result, Six Months Later
•Before the Redesign
80 DAGs, no clear ownership
Three definitions of WAU
On-call paged 8 to 12 times per night
Snowflake bill growing 3.5x year over year
Four weeks to onboard a new source
Schema changes break six DAGs
✓After the Redesign
Top 10 pipelines have contracts; remainder triaged for deprecation
One canonical fact_active_users in the curated layer
On-call paged 1 to 2 times per night, all with named owners
Cost attributed per pipeline; growth budgeted and explained
Two days to onboard a new source via managed ingestion
Schema changes propagate via the raw layer; one DAG affected
The Underlying Lesson
The redesign did not introduce any tool the team did not already have. It introduced a shape: layers, contracts, ownership, asset triggers, and instrumented undercurrents. The shape is what the rest of this curriculum elaborates. Every later lesson, from idempotency to schema evolution to data quality, is an instance of one of the framings in this lesson, applied to a specific operational concern. A senior data engineer is not someone who knows more tools. A senior data engineer is someone who reaches for the right framing first and the right tool second.
TIP
Redesigning a tangled pipeline graph rarely starts with new tools. It starts with naming what each pipeline is supposed to be, who owns it, and what would happen if it stopped running. The naming exposes the redesign.
❯❯❯PUTTING IT ALL TOGETHER
> A senior data engineer joins a fintech that has 412 production DAGs accumulated over four years. Roughly 80 are mission critical; the other 332 have no clear ownership. The CTO asks: 'How do we make this system operable, and how do we make sure we never end up here again?'
Treat the top 80 as products. Write contracts that name producer, consumer, schema, freshness SLA, quality SLA, backfill policy, and deprecation policy. The contract is the difference between a script and a product.
Triage the remaining 332. Many are orphans. The deprecation policy from the contract framework defines the path: notice period, migration target, sunset date. Without a deprecation policy, every pipeline runs forever.
Establish the layered shape. Shared raw layer, shared curated layer, team-owned serving layer. The layered shape is the architectural pivot from N×M point-to-point to N+M layered. (Beginner-tier four-roles concept; intermediate-tier shared middle layer.)
Split DAGs at ownership boundaries; use asset triggers across DAGs. The split reduces blast radius, clarifies on-call, and lets teams deploy independently. (Intermediate-tier DAG concept extended to cross-DAG coordination.)
Buy what is commodity (ingestion connectors, warehouse, orchestrator, catalog). Build the differentiated logic. The strategic differentiation test is faster than the spreadsheet and tends to produce the same answer.
Instrument the six undercurrents: orchestration, observability, security, data management, DataOps, cost. The undercurrents are why a system that worked yesterday still works today. The 412-DAG mess accumulated because nobody invested in the undercurrents.
KEY TAKEAWAYS
Pipelines are products with contracts: producer, consumer, schema, freshness SLA, quality SLA, backfill policy, deprecation policy. Implicit contracts are not contracts.
Six undercurrents run beneath every layer: orchestration, observability, security, data management, DataOps, cost. Senior leverage compounds in the undercurrents.
Split DAGs at ownership and cadence boundaries: use asset triggers across DAGs to decouple producers from consumers.
Build vs buy is per-layer: buy commodity (warehouse, orchestrator, ingestion); build differentiated business logic. Price the tail, not the year-zero cost alone.
Redesign starts with naming, not with tools: what each pipeline is supposed to be, who owns it, what would happen if it stopped running. The naming exposes the redesign.
Pipelines are products with owners, contracts, and lifecycles, not scripts that move data
Category
Pipeline Architecture
Difficulty
advanced
Duration
35 minutes
Challenges
0 hands-on challenges
Topics covered: Pipelines as Products, The Cross-Cutting Undercurrents, When to Split, When to Merge, Build vs Buy at Each Layer, Redesigning a Tangled Graph
A script copies data; a pipeline serves consumers. The difference is not size. The difference is the existence of a contract. A contract names the consumer, names the producer, names what is delivered, names how often, and names what happens when the delivery fails. Pipelines without contracts accumulate, drift, and rot. The accumulated rot is the largest hidden cost in the data engineering organizations of mature companies. The discipline of treating pipelines as products is the only known anti
The four roles (source, transform, storage, consumer) describe what a pipeline does. They do not describe the cross-cutting concerns that touch every role. Joe Reis and Matt Housley call these concerns 'undercurrents' in their data engineering lifecycle framework, and the term is apt: they run beneath the surface of every layer. A pipeline that addresses the four roles but ignores the undercurrents is a pipeline that works on the demo and breaks in production. Senior engineers spend much of thei
Two pipeline architectures are equivalent in what they produce and very different in how they operate. One large DAG with sixty tasks runs as a single unit. Six DAGs with ten tasks each run as separate units. The choice is one of the most consequential architectural decisions a senior engineer makes, and it cannot be made once for all time; the right boundary changes as the system grows. The principle is simple to state and hard to apply: split when the cost of coupling exceeds the cost of coord
Every layer of a pipeline can be built in-house or bought from a vendor. The choice is rarely all build or all buy; the right answer differs per layer. Ingestion has mature SaaS options (Fivetran, Airbyte) that solve the boring 80% of source extraction at a real per-row cost. Orchestration has open-source options (Airflow, Dagster, Prefect) that have absorbed most of what custom schedulers used to do. Storage and warehousing have been almost entirely commoditized into Snowflake, BigQuery, Databr
The synthesis exercise is a real-shaped problem. A mid-size company has accumulated 80 production DAGs over four years. The data team has grown from three engineers to twelve. The new tech lead has been asked to make the system operable. The exercise walks through the diagnosis and the redesign, using every concept from the lesson and the prior tiers. The Symptoms Diagnosis: What Each Symptom Reveals Redesign Step 1: Establish the Layered Shape The first move is to introduce a shared raw layer a