What a Data Pipeline Is: Advanced

A senior data engineer at a public fintech inherited 412 production DAGs. Roughly 80 of them were mission critical and had owners. The other 332 had no owner, no documentation, and no consumers anyone could identify. Some of them had been running for four years. Turning any of them off was risky because nobody could prove the absence of a hidden consumer. The engineer's job for the next year was less about building new pipelines and more about deprecating old ones, which required a contract for what a pipeline was, who owned it, and how it would be retired. This lesson is about the framing that turns a pipeline from a script into a product with a lifecycle. The framing is the difference between a 200-pipeline environment that scales and a 200-pipeline environment that paralyzes.

What you will be able to do

Apply the pipeline-as-product framing: contracts, owners, SLAs, deprecation

Distinguish the cross-cutting undercurrents that run through every layer

Decide when to split a pipeline, when to merge several, and when to buy instead of build

Pipelines as Products

Apply the pipeline-as-product framing: name the consumer, write the contract, and define the deprecation path before the pipeline ships.

A script copies data; a pipeline serves consumers. The difference is not size. The difference is the existence of a contract. A contract names the consumer, names the producer, names what is delivered, names how often, and names what happens when the delivery fails. Pipelines without contracts accumulate, drift, and rot. The accumulated rot is the largest hidden cost in the data engineering organizations of mature companies. The discipline of treating pipelines as products is the only known antidote.

What a Pipeline Contract Contains

Element	What It Specifies	Why It Matters
Producer	The team that owns the pipeline and is paged when it fails	Without an owner, no one fixes failures; orphaned pipelines rot
Consumer	The named downstream that depends on the output	If no consumer can be named, the pipeline is dead code
Schema	Column names, types, nullability, primary key	Consumer code depends on the shape; schema changes break consumers
Freshness SLA	How current the data is guaranteed to be (e.g., 'within the last hour')	Consumer planning depends on freshness; without a stated bar, every delay is a crisis
Quality SLA	Row count bounds, null thresholds, distribution checks	The pipeline is allowed to fail; what is not allowed is silent corruption
Backfill policy	How far back data can be re-derived and at what cost	Consumers ask for fixes to historical periods; the policy answers in advance
Deprecation policy	How the pipeline ends: notice period, migration path, sunset date	Without an end-of-life policy, every pipeline runs forever

Why the Contract Has to Be Written Down

Implicit contracts are not contracts. A consumer team that 'knows' the data is updated daily because it has been daily for two years is not party to a contract; they are party to a habit. Habits change. The producer who decides to refresh less often to save cost has no way to know that a downstream model retrains daily and will now spend most of its retraining runs on data that has not changed. The producer who decides to drop a column because nobody seems to use it has no way to know that a quarterly finance report depends on it. Written contracts catch these collisions before they ship. The cost of writing them down is the cost of the conversations they force, which are conversations that need to happen anyway.
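One way to catch the dropped-column collision in review is to diff a proposed schema against the written contract before merge. The sketch below is illustrative only: the `Column` type, the `breaking_changes` helper, and the rule that added columns are non-breaking are assumptions, not the behavior of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    type: str
    nullable: bool


def breaking_changes(contract: list[Column], proposed: list[Column]) -> list[str]:
    """Return human-readable reasons the proposed schema breaks the contract."""
    proposed_by_name = {c.name: c for c in proposed}
    problems = []
    for col in contract:
        new = proposed_by_name.get(col.name)
        if new is None:
            # A consumer may still read this column, even if only once a quarter.
            problems.append(f"column dropped: {col.name}")
        elif new.type != col.type:
            problems.append(f"type changed: {col.name} {col.type} -> {new.type}")
        elif new.nullable and not col.nullable:
            problems.append(f"column became nullable: {col.name}")
    # Columns that exist only in `proposed` are treated as additive, not breaking.
    return problems
```

A CI step could fail the PR when the returned list is non-empty, which forces the conversation with the named consumers before the change ships.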

The Producer-Consumer Relationship

    # A real pipeline contract, checked into the repo next to the dbt model
    contract:
      pipeline_id: fct_orders_v3
      producer:
        team: data-platform
        on_call: '#data-platform-oncall'
      consumers:
        - team: finance-analytics
          use_case: monthly close
          contact: '#finance-data'
        - team: ml-platform
          use_case: revenue forecast feature
          contact: '#ml-features'
      schema:
        primary_key: order_id
        columns:
          - name: order_id
            type: string
            nullable: false
          - name: order_timestamp
            type: timestamp
            nullable: false
      sla:
        freshness: '<= 4 hours'
        quality:
          row_count_min_per_day: 1000
          null_rate_max_per_column: 0.01
      backfill:
        available_history: '24 months'
        cost_per_day_backfilled_usd: 12
      deprecation:
        notice_period_days: 90
        migration_target: fct_orders_v4

When Contracts Fail

Contracts fail in two directions. The producer fails to deliver: the pipeline runs late, the row count drops below the floor, a column starts arriving with too many nulls. The consumer fails to abide: someone reads a column that was marked deprecated, or assumes a freshness tighter than the SLA. Both failures are recoverable when the contract is written, because the contract names the failure mode and the response. Both failures are unrecoverable when the contract is implicit, because the conversation about who is wrong starts after the production incident is already in progress.
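The three producer-side failures named above (late delivery, low row count, too many nulls) can be checked mechanically once the thresholds are written down. This is a minimal sketch under assumptions: the `Sla` and `Delivery` types and the `sla_violations` function are hypothetical names, and the thresholds in the usage note mirror the example contract rather than any recommended values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Sla:
    max_staleness: timedelta
    row_count_min_per_day: int
    null_rate_max_per_column: float


@dataclass(frozen=True)
class Delivery:
    loaded_at: datetime
    row_count: int
    null_rates: dict[str, float]  # column name -> fraction of null values


def sla_violations(sla: Sla, delivery: Delivery, now: datetime) -> list[str]:
    """List every way this delivery breaks the producer's side of the contract."""
    violations = []
    if now - delivery.loaded_at > sla.max_staleness:
        violations.append("freshness")
    if delivery.row_count < sla.row_count_min_per_day:
        violations.append("row_count")
    for column, rate in sorted(delivery.null_rates.items()):
        if rate > sla.null_rate_max_per_column:
            violations.append(f"null_rate:{column}")
    return violations
```

With `Sla(timedelta(hours=4), 1000, 0.01)`, a delivery loaded six hours ago with ten rows would be reported as both a freshness and a row-count violation, so the page that goes to the producer's on-call names the failure mode instead of leaving it to be argued during the incident.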

Signals that a pipeline is being treated as a script, not a product:

▸Asking 'who owns this?' returns a long pause or a name of someone who left the company
▸The freshness expectation is in someone's head, not in a YAML file
▸Consumer teams maintain their own copies of the same logic 'just in case'
▸Schema changes are announced in standups rather than in PRs
▸Deprecating a pipeline requires a months-long archaeology project

The Maturity Model

Level	Pipeline Treated As	Visible Behavior
0	A script someone wrote	Owner is whoever last touched it; failures are firefighting
1	A scheduled job	Owner is named; failures route to a Slack channel; no SLA
2	A service	Freshness SLA exists; alerts route to on-call; consumers expect uptime
3	A product	Contract is written; quality SLAs exist; deprecation has a process; new consumers sign on explicitly

Most companies operate at level 1 or 2 for most of their pipelines. Level 3 is achievable but requires sustained investment in tooling and discipline. The shift from level 2 to level 3 is the shift from 'we build pipelines' to 'we publish data products,' and it changes the conversation between the data team and the rest of the company. The data team becomes a producer with named consumers, not a service desk that fields one-off requests.

✓Do

Write a contract for every new pipeline at the time it is built, not after
Name the consumer; if no consumer can be named, do not build the pipeline
Treat schema, freshness, and quality SLAs as PR-reviewable artifacts

✗Don't

Build pipelines in service of 'someone might want this later'
Allow contracts to live in tribal knowledge; they leave the company when people leave
Treat deprecation as an afterthought; build the end-of-life path with the start-of-life path

The Cross-Cutting Undercurrents

Identify the six cross-cutting undercurrents and map each one onto the pipeline roles it touches.

The four roles (source, transform, storage, consumer) describe what a pipeline does. They do not describe the cross-cutting concerns that touch every role. Joe Reis and Matt Housley call these concerns 'undercurrents' in their data engineering lifecycle framework, and the term is apt: they run beneath the surface of every layer. A pipeline that addresses the four roles but ignores the undercurrents is a pipeline that works on the demo and breaks in production. Senior engineers spend much of their time on the undercurrents, not on the roles, because the roles are well-understood and the undercurrents are where the failure modes live.

The Six Undercurrents

Orchestration

What runs, when, in what order

DAG scheduling, dependency resolution, retries, and failure isolation. Without orchestration, a pipeline is a script that someone has to remember to run.

Observability

Did it run, did it succeed, was the output right

Logs, metrics, traces, and lineage. Without observability, failure modes are invisible until consumers complain. Five pillars: freshness, volume, schema, distribution, lineage.

Security

Who can access what data, in what role

Authentication, authorization, encryption, PII handling. Without security, the pipeline is a compliance liability waiting to be discovered.

Data management

Catalog, lineage, governance, quality

The metadata layer that lets consumers discover and trust the data. Without it, the same data is rebuilt repeatedly because no one knows it exists.

DataOps

Deployment, testing, version control, CI/CD

The engineering discipline applied to data work. Without it, every change is a production change and rollbacks are manual.

Cost

Who pays for what compute and storage

Attribution and optimization. Without cost discipline, pipelines silently consume budget that nobody knows belongs to them.

Where the Undercurrents Touch Each Role

Undercurrent	At the Source	At the Transform	At the Consumer
Orchestration	When to extract; how to coordinate with the source's availability	DAG dependencies; retries; partial failure recovery	Notify when fresh data is available; trigger downstream
Observability	Did the extract pull the expected volume; source schema check	Did the transform produce the expected row count and distributions	Did the consumer-facing table update; did dashboards stay green
Security	Who has read access to the source; how is the credential stored	PII redaction during transform; column-level encryption	Access control on consumer tables; row-level security
Data management	Cataloging the source; lineage from source to raw	Lineage between transforms; quality tests at each step	Discoverability of consumer tables; ownership metadata
DataOps	Source change management; sandbox environments	Version-controlled transform code; PR review; CI tests	Backwards-compatible schema changes; deprecation process
Cost	Egress costs from source systems	Compute cost of transforms; warehouse credits	Storage cost of consumer tables; query cost

The Cost of Ignoring an Undercurrent

An ignored undercurrent does not stop the pipeline from running. It changes how the pipeline fails. A pipeline with no observability fails silently and is discovered when a consumer asks 'why are these numbers strange.' A pipeline with no cost discipline runs successfully every day and produces a budget overrun at the end of the quarter. A pipeline with no DataOps practice runs successfully every day until the day a transform change is rolled out and breaks downstream consumers, and the rollback takes hours because there is no test environment. The undercurrents are the reason for the gap between 'this code works' and 'this code is in production.'

•Pipeline Without Undercurrents

Runs successfully on the demo, fails silently in production
Failure modes are discovered by consumers, not by the pipeline
Cost is unknown until the cloud bill arrives
Schema changes are deployed without testing

✓Pipeline With Undercurrents

Failures route to on-call within minutes via observability
Quality issues caught at the transform, not at the dashboard
Cost is tagged per pipeline; budget overruns are predictable
Schema changes go through CI; consumers are notified before deploy

The Senior Engineer's Allocation

An engineer early in their career spends most of their time on the four roles: writing extracts, writing transforms, choosing storage, building consumer-facing tables. A senior engineer spends most of their time on the undercurrents: orchestrating dependencies, instrumenting observability, hardening security, governing data, applying DataOps, controlling cost. The shift is not a change in interest. It is a change in where the leverage is. The roles can be built once and reused. The undercurrents have to be reinforced continuously, and a senior engineer's time is best spent on the layer that compounds across every pipeline rather than on individual pipelines.

TIP

When reviewing a new pipeline design, ask 'what is the orchestration story, what is the observability story, what is the security story' before asking 'what does the transform do.' The transform is usually fine. The undercurrents are where designs fail review.

An ignored undercurrent does not prevent the pipeline from running; it changes how the pipeline fails.

The four roles describe what a pipeline does; the six undercurrents describe how it survives in production.

Senior leverage compounds when invested in undercurrents (platforms, tooling, governance) rather than individual pipelines.

Figure: app DB (source) → extract → clean/shape (transform) → warehouse → dashboard (consumer). The four roles every pipeline has: a source produces data, transforms reshape it, storage holds it, and a consumer reads it. Data flows left to right.

When to Split, When to Merge

Decide when to split a pipeline into multiple DAGs and when to merge several into one, based on ownership, cadence, and operational cost.

Two pipeline architectures are equivalent in what they produce and very different in how they operate. One large DAG with sixty tasks runs as a single unit. Six DAGs with ten tasks each run as separate units. The choice is one of the most consequential architectural decisions a senior engineer makes, and it cannot be made once for all time; the right boundary changes as the system grows. The principle is simple to state and hard to apply: split when the cost of coupling exceeds the cost of coordination.

Costs of a Single Large Pipeline

Cost	What It Looks Like
Blast radius	One task fails, the whole DAG halts; unrelated downstream work is delayed
Deployment friction	Any change requires testing the whole DAG; small changes carry the risk of large ones
Ownership ambiguity	When sixty tasks span four teams, no one team owns the DAG
Scheduling rigidity	All tasks run on the same cadence; faster cadences are awkward to add
Long tail	End-to-end latency is set by the slowest chain of dependent tasks; one slow task delays every consumer of the DAG
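How much work a single failure blocks depends on the dependency graph. As a rough sketch, the blast radius of a failed task is every task reachable downstream of it. The `blast_radius` function and the adjacency-map input are assumptions for illustration, not an orchestrator API.

```python
from collections import deque


def blast_radius(downstream: dict[str, list[str]], failed_task: str) -> set[str]:
    """Return every task that cannot run because failed_task did not succeed."""
    blocked: set[str] = set()
    queue = deque(downstream.get(failed_task, []))
    while queue:
        task = queue.popleft()
        if task in blocked:
            continue  # already reached through another branch
        blocked.add(task)
        queue.extend(downstream.get(task, []))
    return blocked
```

Running this over each task in a large DAG gives a quick picture of which tasks are single points of failure for the most downstream work, and which branches are unrelated enough to live in their own DAG.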

Costs of Many Small Pipelines

Cost	What It Looks Like
Cross-DAG coordination	DAG B depends on DAG A; sensors or asset triggers add lag and complexity
Discoverability	Twenty DAGs are harder to find than one; lineage is fragmented
Operational overhead	Each DAG has its own alerts, runbooks, on-call rotation
Drift	Two DAGs that should share logic implement it twice
Latency from sensors	Polling sensors add minutes of lag at every cross-DAG boundary

The Right Place to Split

The boundary that ages well is the boundary of ownership. A pipeline that is owned by one team should be one DAG. A pipeline that crosses team boundaries should split at the team boundary, with a clear contract at the seam. The rule has the same shape as Conway's law applied to pipelines: the architecture mirrors the org chart, and pretending it does not is more expensive than embracing it. A second principle is the boundary of cadence. Tasks that must run hourly belong in a different DAG from tasks that must run daily. Mixing cadences in one DAG forces the slower cadence on faster work or wastes compute by over-running slow work.
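The two boundaries described here, ownership and cadence, can be turned into a first-pass grouping of tasks. The sketch below assumes each task carries an owner and a cadence label; the `Task` type and `propose_dags` function are hypothetical, and a real split would still need a contract at each seam.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    owner: str
    cadence: str  # e.g. "hourly" or "daily"


def propose_dags(tasks: list[Task]) -> dict[tuple[str, str], list[str]]:
    """Group tasks so each proposed DAG has one owning team and one cadence."""
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for task in tasks:
        groups[(task.owner, task.cadence)].append(task.name)
    return dict(groups)
```

If the function returns a single group, the DAG already sits inside one ownership and cadence boundary. More than one group is a prompt to look at the seams, not a mandate to split at every one of them.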

Signals it is time to split a pipeline:

▸Two teams contribute to the same DAG and disagree about deployment cadence
▸The DAG mixes hourly and daily work; the slow tasks dominate the schedule
▸A single failure in one branch halts unrelated downstream branches
▸On-call cannot tell which team to page when the DAG fails
▸End-to-end latency is dominated by tasks that no consumer waits for

The Right Place to Merge

Merging two DAGs into one is the reverse decision and the rarer one. Merging makes sense when two DAGs are owned by the same team, share most of their inputs, must always run together, and have no independent consumers. The signal is operational pain from cross-DAG coordination: the sensors that connect them are unreliable, the latency between them is consistent and avoidable, and the failure modes always involve both DAGs. Merging is the right answer in those cases, and it is wrong every other time. Many engineers reach for merging because one DAG feels simpler, but the simplicity is local; the operational consequences are global.

The Cross-DAG Contract

When a pipeline does split, the seam between the two DAGs is a contract in itself. DAG A produces a table at a stated time with a stated schema; DAG B reads it. The seam is observable: 'fct_orders updated at 2:14am' is a fact that DAG B can act on. Modern orchestrators (Dagster, Airflow with asset triggers, Prefect with results) support asset-based triggering: DAG B starts when the asset DAG A produces is updated, regardless of which DAG produced it. The asset model decouples the two DAGs more cleanly than a sensor on DAG A's run state, because the consumer cares about the data being fresh, not about the producer DAG having run.

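	# Asset-based cross-DAG dependency, Dagster-style
	from dagster import AssetSelection, AutomationCondition, RunRequest, asset, sensor

	# deps alone only declares lineage; the eager condition (available in recent
	# Dagster releases) is what requests a run when fct_orders is updated
	@asset(deps=["fct_orders"], automation_condition=AutomationCondition.eager())
	def revenue_dashboard_mart():
	    # Runs whenever fct_orders is updated, regardless of which DAG updated it
	    ...

	# vs. sensor-based, which polls the producer DAG's run state
	@sensor(asset_selection=AssetSelection.assets(revenue_dashboard_mart))
	def trigger_on_orders_dag_success(context):
	    # last_run_status is a placeholder helper, not a Dagster API
	    if last_run_status("orders_dag") == "success":
	        return RunRequest(...)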

The Decision in Practice

✓Keep One DAG

Single team owns the entire pipeline
All tasks run on the same cadence
Tasks share most inputs and have no independent consumers
End-to-end latency of the whole DAG is acceptable to every consumer

✓Split into Multiple

Multiple teams contribute to the work
Different parts must run at different cadences
Branches have independent consumers and independent failure tolerance
Tasks have no shared inputs and could deploy independently

TIP

Decide split versus merge by the question 'who is paged when this fails.' If two different teams should be paged for two different parts, split. If the same person is paged either way, leave it as one DAG.

Build vs Buy at Each Layer

Apply the build-versus-buy decision per layer, weighing differentiation, total cost of ownership, and strategic risk.

Every layer of a pipeline can be built in-house or bought from a vendor. The choice is rarely all build or all buy; the right answer differs per layer. Ingestion has mature SaaS options (Fivetran, Airbyte) that solve the boring 80% of source extraction at a real per-row cost. Orchestration has open-source options (Airflow, Dagster, Prefect) that have absorbed most of what custom schedulers used to do. Storage and warehousing have been almost entirely commoditized into Snowflake, BigQuery, Databricks, and a few others. Transform tooling has converged on dbt for SQL-shaped work and Spark for non-SQL. The build-versus-buy conversation has shifted from 'should we build a warehouse' (no) to 'should we build the connector for this one weird source' (sometimes).

The Calculus

Factor	Pushes Toward Build	Pushes Toward Buy
Differentiation	The capability is core to the company's product	The capability is generic infrastructure
Volume	Volume is so high that vendor pricing exceeds engineering cost	Volume is moderate; vendor pricing is reasonable
Cycle time	Iteration speed matters more than total cost of ownership	Cycle time is acceptable at vendor pace
Compliance	Data cannot leave the company's environment	Vendor offers compliant deployment options
Engineer time	Engineers are available and want to build	Engineers are scarce and the build cost is opportunity cost

Layer-by-Layer

Ingestion

Mostly buy, except for differentiated sources

Fivetran, Airbyte, and similar services handle the long tail of SaaS source connectors. Build only for sources unique to the company (proprietary IoT data, custom internal systems).

Storage

Almost always buy

Snowflake, BigQuery, Databricks, Redshift. The economics of building a warehouse are unfavorable for any company that is not in the business of selling warehouses. Self-managed Postgres for tiny analytics is the only common build case.

Transform

Buy the framework, build the logic

dbt for SQL transforms, Spark for non-SQL. The framework is bought; the actual transform logic (the business rules) is always built, because nobody else knows the business.

Orchestration

Buy or use open-source; rarely build from scratch

Airflow, Dagster, Prefect, Mage. Custom schedulers exist at very large scale (Airbnb's original Airflow, Lyft's Flyte) but the marginal benefit shrinks every year as the open-source options mature.

Observability

Mixed; depends on scale

Monte Carlo, Bigeye, Soda for buy. Custom dbt tests plus PagerDuty for build. Smaller teams build; larger teams that have outgrown homegrown tooling buy.

Catalog/Lineage

Increasingly buy

DataHub, Atlan, Collibra. Building a catalog is a multi-year project that almost always underperforms the bought version.

The Hidden Cost of Building

The most common error in build-versus-buy is comparing the cost of writing the first version to the cost of buying the vendor's product. The right comparison is the cost of writing, maintaining, debugging, monitoring, on-calling, and eventually replacing the in-house version against the vendor's price. Built infrastructure has a tail of operational cost that is invisible at the moment of building. A typical custom data ingestion framework, three years in, is consuming the equivalent of one full-time engineer in maintenance and is on its second or third major rewrite. The vendor was always cheaper; the spreadsheet at year zero did not show it.
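The comparison described here is cumulative, not year-zero. A small worked sketch makes the shape visible; every parameter name and number is a hypothetical input for illustration, not a benchmark.

```python
def cumulative_costs(
    years: int,
    build_initial: float,
    build_maintenance_per_year: float,
    buy_first_year: float,
    buy_growth_rate: float,
) -> list[tuple[float, float]]:
    """Return (build_total, buy_total) at the end of each year."""
    totals = []
    build_total = build_initial  # cost of writing the first version
    buy_total = 0.0
    buy_price = buy_first_year
    for _ in range(years):
        build_total += build_maintenance_per_year  # the operational tail
        buy_total += buy_price
        totals.append((build_total, buy_total))
        buy_price *= 1 + buy_growth_rate  # vendor invoice grows with usage
    return totals
```

With an assumed 100,000 initial build, 200,000 per year of maintenance (roughly one engineer), and a vendor price starting at 120,000 and growing 50% a year, the build side is ahead only at year zero. Different inputs can flip the result, which is why the maintenance term belongs in the spreadsheet from the start.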

The Hidden Cost of Buying

Buying is not free of hidden costs either. Vendor lock-in is the most discussed: switching from Fivetran to Airbyte is non-trivial because connector configurations, monitoring, and downstream lineage all assume the vendor. Pricing surprises are real: a Snowflake bill that grows 4x in a quarter because a careless query pattern blew through compute credits is common in companies that have not invested in cost observability. Vendor outages cause direct outages: when Snowflake has a regional outage, every consumer in the company is affected at once, and the company has no recourse beyond waiting. The build-versus-buy conversation should price these tail risks, not the per-row cost alone.

Hidden Cost	Build	Buy
Maintenance	Tail of engineering effort that grows with feature surface	Vendor invoices that scale with usage
On-call	Every internal tool needs an on-call rotation	Vendor handles infrastructure on-call; usage on-call remains
Lock-in	Locked into the company's own implementation; rewrite is the only exit	Locked into the vendor; switching cost depends on the layer
Outage	The team owns every outage and every fix	Vendor outages are external; no internal recourse
Hiring signal	Engineers want to work on novel problems, not internal Airflow clones	Engineers want to work on the latest tools, which vendors usually offer

The Decision Framework

A useful frame is the test of strategic differentiation. Anything that differentiates the company's product belongs in build. Anything that does not differentiate the company's product belongs in buy if a viable vendor exists. A streaming media company should not be building its own data warehouse, but it might build its own ML feature platform if recommendation quality is the product. A bank should not be building its own orchestrator, but it might build its own catalog if regulatory lineage is core to its compliance posture. The strategic differentiation test is faster than the spreadsheet and tends to produce the same answer.

✓Do

Buy commodity infrastructure (warehouses, orchestrators, ingestion connectors)
Build the business logic, the differentiated transform, and the company-specific source connector
Price the tail: maintenance, on-call, lock-in, outage, hiring signal

✗Don't

Build a warehouse, an orchestrator, or a generic ingestion framework in 2026
Buy a black-box solution for the company's most differentiated capability
Compare a year-zero build cost to a year-zero buy cost; the tail dominates

Redesigning a Tangled Graph

Diagnose a tangled production pipeline graph and apply the lesson's framings to redesign it for operability.

The synthesis exercise is a real-shaped problem. A mid-size company has accumulated 80 production DAGs over four years. The data team has grown from three engineers to twelve. The new tech lead has been asked to make the system operable. The exercise walks through the diagnosis and the redesign, using every concept from the lesson and the prior tiers.

The Symptoms

Operational symptoms reported by the data team:

▸On-call gets paged 8 to 12 times per night, half for failures with no clear owner
▸Three different teams compute weekly active users; numbers disagree by 4 to 7 percent
▸Schema changes in Postgres orders break six different DAGs simultaneously
▸The Snowflake bill grew 3.5x year over year with no clear cause
▸Adding a new source takes four weeks of cross-team coordination
▸Two engineers are leaving and the team is losing institutional memory

Diagnosis: What Each Symptom Reveals

Symptom	Underlying Cause	Concept From This Lesson
Pages with no clear owner	Pipelines treated as scripts; no contract names the producer	Pipelines as products
Three teams compute WAU differently	No shared curated layer; each team rebuilds the logic	The shared middle layer (intermediate tier)
Schema changes break six DAGs	Direct dependencies on raw schema; no decoupling at curated layer	Layered architecture; raw zone decoupling
Snowflake bill 3.5x	No cost observability; no cost attribution per pipeline	Cost as an undercurrent
Four weeks to add a new source	Each source built bespoke; no ingestion platform	Build vs buy at the ingestion layer
Losing institutional memory	Knowledge not captured in contracts, lineage, or catalog	Data management as an undercurrent

Redesign Step 1: Establish the Layered Shape

The first move is to introduce a shared raw layer and a shared curated layer if they do not already exist in clean form. Audit every existing DAG. Each one becomes a candidate for relocation: the parts that read sources move into a shared raw landing layer, the parts that produce business-ready tables move into a shared curated layer, and the parts that produce consumer-specific shapes stay with their owning team in the serving layer. Most DAGs decompose cleanly along these lines. Some do not, and those are the ones that need redesign rather than relocation.

Redesign Step 2: Write Contracts for the Top 10 Pipelines

The 80/20 rule applies. Roughly ten pipelines produce the data that powers most of the consumer experience. Write contracts for those ten. Name producers, name consumers, name freshness and quality SLAs. Publish the contracts and require new consumers to read them. The remaining seventy can be triaged later: half are probably orphans that can be deprecated; the rest will eventually get contracts as they become important enough to warrant the work.
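The triage of the remaining pipelines can start from whatever ownership and consumer metadata already exists. The following is a sketch under assumptions: the `Pipeline` record, the three action labels, and the 90-day default notice period (borrowed from the example contract) are all illustrative choices.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class Pipeline:
    pipeline_id: str
    owner: Optional[str]
    consumers: tuple[str, ...]


def triage(
    pipelines: list[Pipeline], today: date, notice_period_days: int = 90
) -> dict[str, tuple[str, Optional[date]]]:
    """Map pipeline_id to (action, proposed sunset date or None)."""
    plan: dict[str, tuple[str, Optional[date]]] = {}
    for p in pipelines:
        if not p.consumers:
            # No named consumer: announce deprecation and start the notice clock.
            plan[p.pipeline_id] = ("deprecate", today + timedelta(days=notice_period_days))
        elif p.owner is None:
            plan[p.pipeline_id] = ("assign_owner", None)
        else:
            plan[p.pipeline_id] = ("write_contract", None)
    return plan
```

The notice period is what makes deprecating a pipeline with no provable consumer tolerable: a hidden consumer has a stated window in which to object before the sunset date.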

Redesign Step 3: Buy What Should Be Bought

The four-week onboarding for a new source is a build-versus-buy signal. Audit the existing custom ingestion code. If it is solving generic SaaS connector problems (Stripe, Salesforce, Google Ads, Shopify), replace it with a managed ingestion service. The team's engineering capacity should go into the differentiated work, not into reimplementing connectors that twenty other companies have already built. The savings appear as a reduction in maintenance burden, faster time-to-onboard for new sources, and an end to the recurring 'why is this connector broken' Slack thread.

Redesign Step 4: Split the Mega-DAGs

Identify the DAGs that span team boundaries and split them at the ownership seam. Replace cross-DAG dependencies with asset triggers so the consumer DAG runs when the producer DAG's output is updated, not when the producer DAG's run state changes. The split reduces blast radius, clarifies on-call ownership, and lets each team deploy independently. The cost is a small amount of cross-DAG coordination overhead, which is now the right cost to pay because the savings in deployment friction and on-call clarity exceed it.

Redesign Step 5: Instrument the Undercurrents

Add cost attribution by tagging every Snowflake query with the pipeline ID. The 3.5x bill growth becomes legible immediately because the spend can be attributed to specific pipelines. Add observability via the five pillars (freshness, volume, schema, distribution, lineage) so failure modes route to alerts rather than to dashboards that nobody is watching. Add a catalog so consumers can discover existing curated tables before building new ones. The undercurrent investments do not produce immediate visible output, but they are the layer where the future operational pain or relief lives.
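Once queries carry a pipeline tag, attribution is a group-by. Snowflake exposes a `QUERY_TAG` session parameter for this kind of tagging; the record shape used below (`query_tag`, `credits`) is an assumed export format, not a Snowflake schema, and both function names are hypothetical.

```python
from collections import defaultdict


def credits_by_pipeline(query_records: list[dict]) -> dict[str, float]:
    """Sum credits per pipeline tag; untagged queries are grouped separately."""
    totals: dict[str, float] = defaultdict(float)
    for record in query_records:
        tag = record.get("query_tag") or "untagged"
        totals[tag] += record["credits"]
    return dict(totals)


def top_spenders(totals: dict[str, float], n: int = 3) -> list[tuple[str, float]]:
    """Return the n pipelines with the highest attributed spend."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]
```

A large `untagged` bucket is itself a finding: it measures how much spend still belongs to nobody.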

The Result, Six Months Later

•Before the Redesign

80 DAGs, no clear ownership
Three definitions of WAU
On-call paged 8 to 12 times per night
Snowflake bill growing 3.5x year over year
Four weeks to onboard a new source
Schema changes break six DAGs

✓After the Redesign

Top 10 pipelines have contracts; remainder triaged for deprecation
One canonical fact_active_users in the curated layer
On-call paged 1 to 2 times per night, all with named owners
Cost attributed per pipeline; growth budgeted and explained
Two days to onboard a new source via managed ingestion
Schema changes propagate via the raw layer; one DAG affected

The Underlying Lesson

The redesign did not introduce any tool the team did not already have. It introduced a shape: layers, contracts, ownership, asset triggers, and instrumented undercurrents. The shape is what the rest of this curriculum elaborates. Every later lesson, from idempotency to schema evolution to data quality, is an instance of one of the framings in this lesson, applied to a specific operational concern. A senior data engineer is not someone who knows more tools. A senior data engineer is someone who reaches for the right framing first and the right tool second.

TIP

Redesigning a tangled pipeline graph rarely starts with new tools. It starts with naming what each pipeline is supposed to be, who owns it, and what would happen if it stopped running. The naming exposes the redesign.

❯❯❯PUTTING IT ALL TOGETHER

> A senior data engineer joins a fintech that has 412 production DAGs accumulated over four years. Roughly 80 are mission critical; the other 332 have no clear ownership. The CTO asks: 'How do we make this system operable, and how do we make sure we never end up here again?'

Treat the top 80 as products. Write contracts that name producer, consumer, schema, freshness SLA, quality SLA, backfill policy, and deprecation policy. The contract is the difference between a script and a product.

Triage the remaining 332. Many are orphans. The deprecation policy from the contract framework defines the path: notice period, migration target, sunset date. Without a deprecation policy, every pipeline runs forever.

Establish the layered shape. Shared raw layer, shared curated layer, team-owned serving layer. The layered shape is the architectural pivot from N×M point-to-point to N+M layered. (Beginner-tier four-roles concept; intermediate-tier shared middle layer.)

Split DAGs at ownership boundaries; use asset triggers across DAGs. The split reduces blast radius, clarifies on-call, and lets teams deploy independently. (Intermediate-tier DAG concept extended to cross-DAG coordination.)

Buy what is commodity (ingestion connectors, warehouse, orchestrator, catalog). Build the differentiated logic. The strategic differentiation test is faster than the spreadsheet and tends to produce the same answer.

Instrument the six undercurrents: orchestration, observability, security, data management, DataOps, cost. The undercurrents are why a system that worked yesterday still works today. The 412-DAG mess accumulated because nobody invested in the undercurrents.

KEY TAKEAWAYS

Pipelines are products with contracts: producer, consumer, schema, freshness SLA, quality SLA, backfill policy, deprecation policy. Implicit contracts are not contracts.

Six undercurrents run beneath every layer: orchestration, observability, security, data management, DataOps, cost. Senior leverage compounds in the undercurrents.

Split DAGs at ownership and cadence boundaries: use asset triggers across DAGs to decouple producers from consumers.

Build vs buy is per-layer: buy commodity (warehouse, orchestrator, ingestion); build differentiated business logic. Price the tail, not the year-zero cost alone.

Redesign starts with naming, not with tools: what each pipeline is supposed to be, who owns it, what would happen if it stopped running. The naming exposes the redesign.

Pipelines are products with owners, contracts, and lifecycles, not scripts that move data

Category: Pipeline Architecture
Difficulty: advanced
Duration: 35 minutes
Challenges: 0 hands-on challenges

Topics covered: Pipelines as Products, The Cross-Cutting Undercurrents, When to Split, When to Merge, Build vs Buy at Each Layer, Redesigning a Tangled Graph
