What a Data Pipeline Is: Advanced

A senior data engineer at a public fintech inherited 412 production DAGs. Roughly 80 of them were mission critical and had owners. The other 332 had no owner, no documentation, and no consumers anyone could identify. Some of them had been running for four years. Turning any of them off was risky because nobody could prove the absence of a hidden consumer. The engineer's job for the next year was less about building new pipelines and more about deprecating old ones, which required a contract for what a pipeline was, who owned it, and how it would be retired. This lesson is about the framing that turns a pipeline from a script into a product with a lifecycle. The framing is the difference between a 200-pipeline environment that scales and a 200-pipeline environment that paralyzes.

Pipelines as Products

Daily Life
Interviews

Apply the pipeline-as-product framing: name the consumer, write the contract, and define the deprecation path before the pipeline ships.

A script copies data; a pipeline serves consumers. The difference is not size. The difference is the existence of a contract. A contract names the consumer, names the producer, names what is delivered, names how often, and names what happens when the delivery fails. Pipelines without contracts accumulate, drift, and rot. The accumulated rot is the largest hidden cost in the data engineering organizations of mature companies. The discipline of treating pipelines as products is the only known antidote.

What a Pipeline Contract Contains

ElementWhat It SpecifiesWhy It Matters
ProducerThe team that owns the pipeline and is paged when it failsWithout an owner, no one fixes failures; orphaned pipelines rot
ConsumerThe named downstream that depends on the outputIf no consumer can be named, the pipeline is dead code
SchemaColumn names, types, nullability, primary keyConsumer code depends on the shape; schema changes break consumers
Freshness SLAHow current the data is guaranteed to be (e.g., 'within the last hour')Consumer planning depends on freshness; without a stated bar, every delay is a crisis
Quality SLARow count bounds, null thresholds, distribution checksThe pipeline is allowed to fail; what is not allowed is silent corruption
Backfill policyHow far back data can be re-derived and at what costConsumers ask for fixes to historical periods; the policy answers in advance
Deprecation policyHow the pipeline ends: notice period, migration path, sunset dateWithout an end-of-life policy, every pipeline runs forever

Why the Contract Has to Be Written Down

Implicit contracts are not contracts. A consumer team that 'knows' the data is updated daily because it has been daily for two years is not party to a contract; they are party to a habit. Habits change. The producer who decides to switch to hourly to save cost has no way to know that a downstream model retrains daily and now retrains six times per cost cycle. The producer who decides to drop a column because nobody seems to use it has no way to know that a quarterly finance report depends on it once a quarter. Written contracts catch these collisions before they ship. The cost of writing them down is the cost of the conversations they force, which are conversations that need to happen anyway.

The Producer-Consumer Relationship

1# A REAL pipeline contract, checked INTO the repo next to the dbt model contract : pipeline_id : fct_orders_v3 producer : team : data - platform on_call : '#data-platform-oncall' consumers : - team : finance - analytics use_case : monthly close contact : '#finance-data' - team : ml - platform use_case : revenue forecast feature contact : '#ml-features' SCHEMA : primary_key : order_id columns : - name : order_id type : string nullable : FALSE - name : order_timestamp type : TIMESTAMP nullable : FALSE sla : freshness : '<= 4 hours' quality : row_count_min_per_day : 1000 null_rate_max_per_column : 0.01 backfill : available_history : '24 months' cost_per_day_backfilled_usd : 12 deprecation : notice_period_days : 90 migration_target : fct_orders_v4

When Contracts Fail

Contracts fail in two directions. The producer fails to deliver: the pipeline runs late, the row count drops below the floor, a column starts arriving with too many nulls. The consumer fails to abide: someone reads a column that was marked deprecated, or assumes a freshness tighter than the SLA. Both failures are recoverable when the contract is written, because the contract names the failure mode and the response. Both failures are unrecoverable when the contract is implicit, because the conversation about who is wrong starts after the production incident is already in progress.
Signals that a pipeline is being treated as a script, not a product:
  • Asking 'who owns this?' returns a long pause or a name of someone who left the company
  • The freshness expectation is in someone's head, not in a YAML file
  • Consumer teams maintain their own copies of the same logic 'just in case'
  • Schema changes are announced in standups rather than in PRs
  • Deprecating a pipeline requires a months-long archaeology project

The Maturity Model

LevelPipeline Treated AsVisible Behavior
0A script someone wroteOwner is whoever last touched it; failures are firefighting
1A scheduled jobOwner is named; failures route to a Slack channel; no SLA
2A serviceFreshness SLA exists; alerts route to on-call; consumers expect uptime
3A productContract is written; quality SLAs exist; deprecation has a process; new consumers sign on explicitly
Most companies operate at level 1 or 2 for most of their pipelines. Level 3 is achievable but requires sustained investment in tooling and discipline. The shift from level 2 to level 3 is the shift from 'we build pipelines' to 'we publish data products,' and it changes the conversation between the data team and the rest of the company. The data team becomes a producer with named consumers, not a service desk that fields one-off requests.
Do
  • Write a contract for every new pipeline at the time it is built, not after
  • Name the consumer; if no consumer can be named, do not build the pipeline
  • Treat schema, freshness, and quality SLAs as PR-reviewable artifacts
Don't
  • Build pipelines in service of 'someone might want this later'
  • Allow contracts to live in tribal knowledge; they leave the company when people leave
  • Treat deprecation as an afterthought; build the end-of-life path with the start-of-life path

The Cross-Cutting Undercurrents

Daily Life
Interviews

Identify the six cross-cutting undercurrents and map each one onto the four pipeline roles.

The four roles (source, transform, storage, consumer) describe what a pipeline does. They do not describe the cross-cutting concerns that touch every role. Joe Reis and Matt Housley call these concerns 'undercurrents' in their data engineering lifecycle framework, and the term is apt: they run beneath the surface of every layer. A pipeline that addresses the four roles but ignores the undercurrents is a pipeline that works on the demo and breaks in production. Senior engineers spend much of their time on the undercurrents, not on the roles, because the roles are well-understood and the undercurrents are where the failure modes live.

The Six Undercurrents

OrchestrationObservabilitySecurityData managementDataOpsCost
Orchestration
What runs, when, in what order
DAG scheduling, dependency resolution, retries, and failure isolation. Without orchestration, a pipeline is a script that someone has to remember to run.
Observability
Did it run, did it succeed, was the output right
Logs, metrics, traces, and lineage. Without observability, failure modes are invisible until consumers complain. Five pillars: freshness, volume, schema, distribution, lineage.
Security
Who can access what data, in what role
Authentication, authorization, encryption, PII handling. Without security, the pipeline is a compliance liability waiting to be discovered.
Data management
Catalog, lineage, governance, quality
The metadata layer that lets consumers discover and trust the data. Without it, the same data is rebuilt repeatedly because no one knows it exists.
DataOps
Deployment, testing, version control, CI/CD
The engineering discipline applied to data work. Without it, every change is a production change and rollbacks are manual.
Cost
Who pays for what compute and storage
Attribution and optimization. Without cost discipline, pipelines silently consume budget that nobody knows belongs to them.

Where the Undercurrents Touch Each Role

UndercurrentAt the SourceAt the TransformAt the Consumer
OrchestrationWhen to extract; how to coordinate with the source's availabilityDAG dependencies; retries; partial failure recoveryNotify when fresh data is available; trigger downstream
ObservabilityDid the extract pull the expected volume; source schema checkDid the transform produce the expected row count and distributionsDid the consumer-facing table update; did dashboards stay green
SecurityWho has read access to the source; how is the credential storedPII redaction during transform; column-level encryptionAccess control on consumer tables; row-level security
Data managementCataloging the source; lineage from source to rawLineage between transforms; quality tests at each stepDiscoverability of consumer tables; ownership metadata
DataOpsSource change management; sandbox environmentsVersion-controlled transform code; PR review; CI testsBackwards-compatible schema changes; deprecation process
CostEgress costs from source systemsCompute cost of transforms; warehouse creditsStorage cost of consumer tables; query cost

The Cost of Ignoring an Undercurrent

An ignored undercurrent does not stop the pipeline from running. It changes how the pipeline fails. A pipeline with no observability fails silently and is discovered when a consumer asks 'why are these numbers strange.' A pipeline with no cost discipline runs successfully every day and produces a budget overrun at the end of the quarter. A pipeline with no DataOps practice runs successfully every day until the day a transform change is rolled out and breaks downstream consumers, and the rollback takes hours because there is no test environment. The undercurrents are the reason for the gap between 'this code works' and 'this code is in production.'
Pipeline Without Undercurrents
  • Runs successfully on the demo, fails silently in production
  • Failure modes are discovered by consumers, not by the pipeline
  • Cost is unknown until the cloud bill arrives
  • Schema changes are deployed without testing
Pipeline With Undercurrents
  • Failures route to on-call within minutes via observability
  • Quality issues caught at the transform, not at the dashboard
  • Cost is tagged per pipeline; budget overruns are predictable
  • Schema changes go through CI; consumers are notified before deploy

The Senior Engineer's Allocation

An engineer early in their career spends most of their time on the four roles: writing extracts, writing transforms, choosing storage, building consumer-facing tables. A senior engineer spends most of their time on the undercurrents: orchestrating dependencies, instrumenting observability, hardening security, governing data, applying DataOps, controlling cost. The shift is not a change in interest. It is a change in where the leverage is. The roles can be built once and reused. The undercurrents have to be reinforced continuously, and a senior engineer's time is best spent on the layer that compounds across every pipeline rather than on individual pipelines.
TIP
When reviewing a new pipeline design, ask 'what is the orchestration story, what is the observability story, what is the security story' before asking 'what does the transform do.' The transform is usually fine. The undercurrents are where designs fail review.
alert
An ignored undercurrent does not prevent the pipeline from running; it changes how the pipeline fails.
check
The four roles describe what a pipeline does; the six undercurrents describe how it survives in production.
query
Senior leverage compounds when invested in undercurrents (platforms, tooling, governance) rather than individual pipelines.

When to Split, When to Merge

Daily Life
Interviews

Decide when to split a pipeline into multiple DAGs and when to merge several into one, based on ownership, cadence, and operational cost.

Two pipeline architectures are equivalent in what they produce and very different in how they operate. One large DAG with sixty tasks runs as a single unit. Six DAGs with ten tasks each run as separate units. The choice is one of the most consequential architectural decisions a senior engineer makes, and it cannot be made once for all time; the right boundary changes as the system grows. The principle is simple to state and hard to apply: split when the cost of coupling exceeds the cost of coordination.

Costs of a Single Large Pipeline

CostWhat It Looks Like
Blast radiusOne task fails, the whole DAG halts; unrelated downstream work is delayed
Deployment frictionAny change requires testing the whole DAG; small changes carry the risk of large ones
Ownership ambiguityWhen sixty tasks span four teams, no one team owns the DAG
Scheduling rigidityAll tasks run on the same cadence; faster cadences are awkward to add
Long tailThe end-to-end latency is the sum of every task; tail latency dominates

Costs of Many Small Pipelines

CostWhat It Looks Like
Cross-DAG coordinationDAG B depends on DAG A; sensors or asset triggers add lag and complexity
DiscoverabilityTwenty DAGs are harder to find than one; lineage is fragmented
Operational overheadEach DAG has its own alerts, runbooks, on-call rotation
DriftTwo DAGs that should share logic implement it twice
Latency from sensorsPolling sensors add minutes of lag at every cross-DAG boundary

The Right Place to Split

The boundary that ages well is the boundary of ownership. A pipeline that is owned by one team should be one DAG. A pipeline that crosses team boundaries should split at the team boundary, with a clear contract at the seam. The rule has the same shape as Conway's law applied to pipelines: the architecture mirrors the org chart, and pretending it does not is more expensive than embracing it. A second principle is the boundary of cadence. Tasks that must run hourly belong in a different DAG from tasks that must run daily. Mixing cadences in one DAG forces the slower cadence on faster work or wastes compute by over-running slow work.
Signals it is time to split a pipeline:
  • Two teams contribute to the same DAG and disagree about deployment cadence
  • The DAG mixes hourly and daily work; the slow tasks dominate the schedule
  • A single failure in one branch halts unrelated downstream branches
  • On-call cannot tell which team to page when the DAG fails
  • End-to-end latency is dominated by tasks that no consumer waits for

The Right Place to Merge

Merging two DAGs into one is the reverse decision and the rarer one. Merging makes sense when two DAGs are owned by the same team, share most of their inputs, must always run together, and have no independent consumers. The signal is operational pain from cross-DAG coordination: the sensors that connect them are unreliable, the latency between them is consistent and avoidable, and the failure modes always involve both DAGs. Merging is the right answer in those cases, and it is wrong every other time. Many engineers reach for merging because one DAG feels simpler, but the simplicity is local; the operational consequences are global.

The Cross-DAG Contract

When a pipeline does split, the seam between the two DAGs is a contract in itself. DAG A produces a table at a stated time with a stated schema; DAG B reads it. The seam is observable: 'fct_orders updated at 2:14am' is a fact that DAG B can act on. Modern orchestrators (Dagster, Airflow with asset triggers, Prefect with results) support asset-based triggering: DAG B starts when the asset DAG A produces is updated, regardless of which DAG produced it. The asset model decouples the two DAGs more cleanly than a sensor on DAG A's run state, because the consumer cares about the data being fresh, not about the producer DAG having run.
1# Asset-based cross-DAG dependency, Dagster-style
2@asset(deps=[fct_orders])
3def revenue_dashboard_mart(fct_orders):
4 # Runs whenever fct_orders is updated, regardless of which DAG updated it
5 ...
6
7# vs. sensor-based, which polls the producer DAG's run state
8@sensor(asset_selection=AssetSelection.assets(revenue_dashboard_mart))
9def trigger_on_orders_dag_success(context):
10 if last_run_status('orders_dag') == 'success':
11 return RunRequest(...)

The Decision in Practice

Keep One DAG
  • Single team owns the entire pipeline
  • All tasks run on the same cadence
  • Tasks share most inputs and have no independent consumers
  • End-to-end latency is acceptable as the sum of all tasks
Split into Multiple
  • Multiple teams contribute to the work
  • Different parts must run at different cadences
  • Branches have independent consumers and independent failure tolerance
  • Tasks have no shared inputs and could deploy independently
TIP
Decide split versus merge by the question 'who is paged when this fails.' If two different teams should be paged for two different parts, split. If the same person is paged either way, leave it as one DAG.

Build vs Buy at Each Layer

Daily Life
Interviews

Apply the build-versus-buy decision per layer, weighing differentiation, total cost of ownership, and strategic risk.

Every layer of a pipeline can be built in-house or bought from a vendor. The choice is rarely all build or all buy; the right answer differs per layer. Ingestion has mature SaaS options (Fivetran, Airbyte) that solve the boring 80% of source extraction at a real per-row cost. Orchestration has open-source options (Airflow, Dagster, Prefect) that have absorbed most of what custom schedulers used to do. Storage and warehousing have been almost entirely commoditized into Snowflake, BigQuery, Databricks, and a few others. Transform tooling has converged on dbt for SQL-shaped work and Spark for non-SQL. The build-versus-buy conversation has shifted from 'should we build a warehouse' (no) to 'should we build the connector for this one weird source' (sometimes).

The Calculus

FactorPushes Toward BuildPushes Toward Buy
DifferentiationThe capability is core to the company's productThe capability is generic infrastructure
VolumeVolume is so high that vendor pricing exceeds engineering costVolume is moderate; vendor pricing is reasonable
Cycle timeIteration speed matters more than total cost of ownershipCycle time is acceptable at vendor pace
ComplianceData cannot leave the company's environmentVendor offers compliant deployment options
Engineer timeEngineers are available and want to buildEngineers are scarce and the build cost is opportunity cost

Layer-by-Layer

IngestionStorageTransformOrchestrationObservabilityCatalog/Lineage
Ingestion
Mostly buy, except for differentiated sources
Fivetran, Airbyte, and similar services handle the long tail of SaaS source connectors. Build only for sources unique to the company (proprietary IoT data, custom internal systems).
Storage
Almost always buy
Snowflake, BigQuery, Databricks, Redshift. The economics of building a warehouse are unfavorable for any company that is not in the business of selling warehouses. Self-managed Postgres for tiny analytics is the only common build case.
Transform
Buy the framework, build the logic
dbt for SQL transforms, Spark for non-SQL. The framework is bought; the actual transform logic (the business rules) is always built, because nobody else knows the business.
Orchestration
Buy or use open-source; rarely build from scratch
Airflow, Dagster, Prefect, Mage. Custom schedulers exist at very large scale (Airbnb's original Airflow, Lyft's Flyte) but the marginal benefit shrinks every year as the open-source options mature.
Observability
Mixed; depends on scale
Monte Carlo, Bigeye, Soda for buy. Custom dbt tests plus PagerDuty for build. Smaller teams build; larger teams that have outgrown homegrown tooling buy.
Catalog/Lineage
Increasingly buy
DataHub, Atlan, Collibra. Building a catalog is a multi-year project that almost always underperforms the bought version.

The Hidden Cost of Building

The most common error in build-versus-buy is comparing the cost of writing the first version to the cost of buying the vendor's product. The right comparison is the cost of writing, maintaining, debugging, monitoring, on-calling, and eventually replacing the in-house version against the vendor's price. Built infrastructure has a tail of operational cost that is invisible at the moment of building. A typical custom data ingestion framework, three years in, is consuming the equivalent of one full-time engineer in maintenance and is on its second or third major rewrite. The vendor was always cheaper; the spreadsheet at year zero did not show it.

The Hidden Cost of Buying

Buying is not free of hidden costs either. Vendor lock-in is the most discussed: switching from Fivetran to Airbyte is non-trivial because connector configurations, monitoring, and downstream lineage all assume the vendor. Pricing surprises are real: a Snowflake bill that grows 4x in a quarter because a careless query pattern blew through compute credits is common in companies that have not invested in cost observability. Vendor outages cause direct outages: when Snowflake has a regional outage, every consumer in the company is affected at once, and the company has no recourse beyond waiting. The build-versus-buy conversation should price these tail risks, not the per-row cost alone.
Hidden CostBuildBuy
MaintenanceTail of engineering effort that grows with feature surfaceVendor invoices that scale with usage
On-callEvery internal tool needs an on-call rotationVendor handles infrastructure on-call; usage on-call remains
Lock-inLocked into the company's own implementation; rewrite is the only exitLocked into the vendor; switching cost depends on the layer
OutageThe team owns every outage and every fixVendor outages are external; no internal recourse
Hiring signalEngineers want to work on novel problems, not internal Airflow clonesEngineers want to work on the latest tools, which vendors usually offer

The Decision Framework

A useful frame is the test of strategic differentiation. Anything that differentiates the company's product belongs in build. Anything that does not differentiate the company's product belongs in buy if a viable vendor exists. A streaming media company should not be building its own data warehouse, but it might build its own ML feature platform if recommendation quality is the product. A bank should not be building its own orchestrator, but it might build its own catalog if regulatory lineage is core to its compliance posture. The strategic differentiation test is faster than the spreadsheet and tends to produce the same answer.
Do
  • Buy commodity infrastructure (warehouses, orchestrators, ingestion connectors)
  • Build the business logic, the differentiated transform, and the company-specific source connector
  • Price the tail: maintenance, on-call, lock-in, outage, hiring signal
Don't
  • Build a warehouse, an orchestrator, or a generic ingestion framework in 2026
  • Buy a black-box solution for the company's most differentiated capability
  • Compare a year-zero build cost to a year-zero buy cost; the tail dominates

Redesigning a Tangled Graph

Daily Life
Interviews

Diagnose a tangled production pipeline graph and apply the lesson's framings to redesign it for operability.

The synthesis exercise is a real-shaped problem. A mid-size company has accumulated 80 production DAGs over four years. The data team has grown from three engineers to twelve. The new tech lead has been asked to make the system operable. The exercise walks through the diagnosis and the redesign, using every concept from the lesson and the prior tiers.

The Symptoms

Operational symptoms reported by the data team:
  • On-call gets paged 8 to 12 times per night, half for failures with no clear owner
  • Three different teams compute weekly active users; numbers disagree by 4 to 7 percent
  • Schema changes in Postgres orders break six different DAGs simultaneously
  • The Snowflake bill grew 3.5x year over year with no clear cause
  • Adding a new source takes four weeks of cross-team coordination
  • Two engineers are leaving and the team is losing institutional memory

Diagnosis: What Each Symptom Reveals

SymptomUnderlying CauseConcept From This Lesson
Pages with no clear ownerPipelines treated as scripts; no contract names the producerPipelines as products
Three teams compute WAU differentlyNo shared curated layer; each team rebuilds the logicThe shared middle layer (intermediate tier)
Schema changes break six DAGsDirect dependencies on raw schema; no decoupling at curated layerLayered architecture; raw zone decoupling
Snowflake bill 3.5xNo cost observability; no cost attribution per pipelineCost as an undercurrent
Four weeks to add a new sourceEach source built bespoke; no ingestion platformBuild vs buy at the ingestion layer
Losing institutional memoryKnowledge not captured in contracts, lineage, or catalogData management as an undercurrent

Redesign Step 1: Establish the Layered Shape

The first move is to introduce a shared raw layer and a shared curated layer if they do not already exist in clean form. Audit every existing DAG. Each one becomes a candidate for relocation: the parts that read sources move into a shared raw landing layer, the parts that produce business-ready tables move into a shared curated layer, and the parts that produce consumer-specific shapes stay with their owning team in the serving layer. Most DAGs decompose cleanly along these lines. Some do not, and those are the ones that need redesign rather than relocation.

Redesign Step 2: Write Contracts for the Top 10 Pipelines

The 80/20 rule applies. Roughly ten pipelines produce the data that powers most of the consumer experience. Write contracts for those ten. Name producers, name consumers, name freshness and quality SLAs. Publish the contracts and require new consumers to read them. The remaining seventy can be triaged later: half are probably orphans that can be deprecated; the rest will eventually get contracts as they become important enough to warrant the work.

Redesign Step 3: Buy What Should Be Bought

The four-week onboarding for a new source is a build-versus-buy signal. Audit the existing custom ingestion code. If it is solving generic SaaS connector problems (Stripe, Salesforce, Google Ads, Shopify), replace it with a managed ingestion service. The team's engineering capacity should go into the differentiated work, not into reimplementing connectors that twenty other companies have already built. The savings appear as a reduction in maintenance burden, faster time-to-onboard for new sources, and an end to the recurring 'why is this connector broken' Slack thread.

Redesign Step 4: Split the Mega-DAGs

Identify the DAGs that span team boundaries and split them at the ownership seam. Replace cross-DAG dependencies with asset triggers so the consumer DAG runs when the producer DAG's output is updated, not when the producer DAG's run state changes. The split reduces blast radius, clarifies on-call ownership, and lets each team deploy independently. The cost is a small amount of cross-DAG coordination overhead, which is now the right cost to pay because the savings in deployment friction and on-call clarity exceed it.

Redesign Step 5: Instrument the Undercurrents

Add cost attribution by tagging every Snowflake query with the pipeline ID. The 3.5x bill growth becomes legible immediately because the spend can be attributed to specific pipelines. Add observability via the five pillars (freshness, volume, schema, distribution, lineage) so failure modes route to alerts rather than to dashboards that nobody is watching. Add a catalog so consumers can discover existing curated tables before building new ones. The undercurrent investments do not produce immediate visible output, but they are the layer where the future operational pain or relief lives.

The Result, Six Months Later

Before the Redesign
  • 80 DAGs, no clear ownership
  • Three definitions of WAU
  • On-call paged 8 to 12 times per night
  • Snowflake bill growing 3.5x year over year
  • Four weeks to onboard a new source
  • Schema changes break six DAGs
After the Redesign
  • Top 10 pipelines have contracts; remainder triaged for deprecation
  • One canonical fact_active_users in the curated layer
  • On-call paged 1 to 2 times per night, all with named owners
  • Cost attributed per pipeline; growth budgeted and explained
  • Two days to onboard a new source via managed ingestion
  • Schema changes propagate via the raw layer; one DAG affected

The Underlying Lesson

The redesign did not introduce any tool the team did not already have. It introduced a shape: layers, contracts, ownership, asset triggers, and instrumented undercurrents. The shape is what the rest of this curriculum elaborates. Every later lesson, from idempotency to schema evolution to data quality, is an instance of one of the framings in this lesson, applied to a specific operational concern. A senior data engineer is not someone who knows more tools. A senior data engineer is someone who reaches for the right framing first and the right tool second.
TIP
Redesigning a tangled pipeline graph rarely starts with new tools. It starts with naming what each pipeline is supposed to be, who owns it, and what would happen if it stopped running. The naming exposes the redesign.
PUTTING IT ALL TOGETHER

> A senior data engineer joins a fintech that has 412 production DAGs accumulated over four years. Roughly 80 are mission critical; the other 332 have no clear ownership. The CTO asks: 'How do we make this system operable, and how do we make sure we never end up here again?'

Treat the top 80 as products. Write contracts that name producer, consumer, schema, freshness SLA, quality SLA, backfill policy, and deprecation policy. The contract is the difference between a script and a product.
Triage the remaining 332. Many are orphans. The deprecation policy from the contract framework defines the path: notice period, migration target, sunset date. Without a deprecation policy, every pipeline runs forever.
Establish the layered shape. Shared raw layer, shared curated layer, team-owned serving layer. The layered shape is the architectural pivot from N×M point-to-point to N+M layered. (Beginner-tier four-roles concept; intermediate-tier shared middle layer.)
Split DAGs at ownership boundaries; use asset triggers across DAGs. The split reduces blast radius, clarifies on-call, and lets teams deploy independently. (Intermediate-tier DAG concept extended to cross-DAG coordination.)
Buy what is commodity (ingestion connectors, warehouse, orchestrator, catalog). Build the differentiated logic. The strategic differentiation test is faster than the spreadsheet and tends to produce the same answer.
Instrument the six undercurrents: orchestration, observability, security, data management, DataOps, cost. The undercurrents are why a system that worked yesterday still works today. The 412-DAG mess accumulated because nobody invested in the undercurrents.
KEY TAKEAWAYS
Pipelines are products with contracts: producer, consumer, schema, freshness SLA, quality SLA, backfill policy, deprecation policy. Implicit contracts are not contracts.
Six undercurrents run beneath every layer: orchestration, observability, security, data management, DataOps, cost. Senior leverage compounds in the undercurrents.
Split DAGs at ownership and cadence boundaries: use asset triggers across DAGs to decouple producers from consumers.
Build vs buy is per-layer: buy commodity (warehouse, orchestrator, ingestion); build differentiated business logic. Price the tail, not the year-zero cost alone.
Redesign starts with naming, not with tools: what each pipeline is supposed to be, who owns it, what would happen if it stopped running. The naming exposes the redesign.

Pipelines are products with owners, contracts, and lifecycles, not scripts that move data

Category
Pipeline Architecture
Difficulty
advanced
Duration
35 minutes
Challenges
0 hands-on challenges

Topics covered: Pipelines as Products, The Cross-Cutting Undercurrents, When to Split, When to Merge, Build vs Buy at Each Layer, Redesigning a Tangled Graph

Lesson Sections

  1. Pipelines as Products (concepts: paPipelineAsProduct, paDataContracts)

    A script copies data; a pipeline serves consumers. The difference is not size. The difference is the existence of a contract. A contract names the consumer, names the producer, names what is delivered, names how often, and names what happens when the delivery fails. Pipelines without contracts accumulate, drift, and rot. The accumulated rot is the largest hidden cost in the data engineering organizations of mature companies. The discipline of treating pipelines as products is the only known anti

  2. The Cross-Cutting Undercurrents (concepts: paUndercurrents, paPipelineLifecycle)

    The four roles (source, transform, storage, consumer) describe what a pipeline does. They do not describe the cross-cutting concerns that touch every role. Joe Reis and Matt Housley call these concerns 'undercurrents' in their data engineering lifecycle framework, and the term is apt: they run beneath the surface of every layer. A pipeline that addresses the four roles but ignores the undercurrents is a pipeline that works on the demo and breaks in production. Senior engineers spend much of thei

  3. When to Split, When to Merge (concepts: paDagBoundaries, paAssetTriggers)

    Two pipeline architectures are equivalent in what they produce and very different in how they operate. One large DAG with sixty tasks runs as a single unit. Six DAGs with ten tasks each run as separate units. The choice is one of the most consequential architectural decisions a senior engineer makes, and it cannot be made once for all time; the right boundary changes as the system grows. The principle is simple to state and hard to apply: split when the cost of coupling exceeds the cost of coord

  4. Build vs Buy at Each Layer (concepts: paBuildVsBuy)

    Every layer of a pipeline can be built in-house or bought from a vendor. The choice is rarely all build or all buy; the right answer differs per layer. Ingestion has mature SaaS options (Fivetran, Airbyte) that solve the boring 80% of source extraction at a real per-row cost. Orchestration has open-source options (Airflow, Dagster, Prefect) that have absorbed most of what custom schedulers used to do. Storage and warehousing have been almost entirely commoditized into Snowflake, BigQuery, Databr

  5. Redesigning a Tangled Graph (concepts: paArchitectureRedesign)

    The synthesis exercise is a real-shaped problem. A mid-size company has accumulated 80 production DAGs over four years. The data team has grown from three engineers to twelve. The new tech lead has been asked to make the system operable. The exercise walks through the diagnosis and the redesign, using every concept from the lesson and the prior tiers. The Symptoms Diagnosis: What Each Symptom Reveals Redesign Step 1: Establish the Layered Shape The first move is to introduce a shared raw layer a