What a Data Pipeline Is: Intermediate

A growth-stage company with 200 engineers had eight sources, fifty-three transforms, and around a hundred dashboards. The architecture diagram on the wall looked like a circuit board. Two engineers had quit because the on-call rotation was unpredictable. The CTO asked the new head of data engineering what was wrong. The answer was the diagram. There was no shared raw zone, no canonical curated layer, and no agreement on whether transforms ran in the warehouse or in Spark. Every team had built its own pipeline its own way. The fix was not new tools. The fix was a shared shape: one place where raw data landed, one place where curated data lived, one direction of flow. This lesson is about the shape that makes many pipelines feel like one system instead of fifty disconnected ones.

What you will be able to do

Recognize when multiple sources and consumers force a shared middle layer

Distinguish ETL from ELT and choose between them based on cost and flexibility

Read a pipeline as a directed acyclic graph and explain why cycles are forbidden

Many Sources, One Curated Layer

Recognize when a shared middle layer becomes necessary and design the three-layer pattern that supports many sources and many consumers.

A first pipeline is one source, one transform, one consumer. The vocabulary is small enough to fit on a napkin. A real production environment has many of each, and the question changes from 'what should this pipeline do' to 'how do these pipelines fit together so each one does not solve the same problem in a slightly different way.' The answer is almost always a shared middle layer that every pipeline writes to and reads from. Without that shared layer, the same data ends up extracted three times, cleaned three different ways, and reconciled in spreadsheets at quarter end.

The Combinatorial Problem

Five sources times ten consumers is fifty potential point-to-point pipelines. Five sources feeding one shared curated layer, plus ten consumers reading from it, is fifteen. The reduction is not about pipeline count alone; it is about ownership and consistency. With fifty point-to-point pipelines, every consumer team owns its own extract from every source. With one curated layer, the data engineering team owns the extracts and the consumer teams own the analysis. The split is more sustainable because it matches expertise to responsibility.

Architecture	Pipelines Required	Consequence
Point-to-point: every consumer extracts directly	Sources × Consumers (e.g., 5 × 10 = 50)	Same source extracted N times, inconsistent definitions, brittle
Hub-and-spoke: shared curated layer	Sources + Consumers (e.g., 5 + 10 = 15)	One canonical version of each dataset, consistent definitions
Layered (raw + curated + serving)	Sources + (curated tables) + Consumers	Decoupling at every layer; debugging follows the layers
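
The arithmetic in the table can be checked with a short sketch. The function names are illustrative, and the second pair of inputs reuses the eight sources and roughly one hundred dashboards from the opening story as an upper bound, on the assumption that every dashboard touches every source.

```python
def point_to_point(sources: int, consumers: int) -> int:
    """Every consumer extracts directly from every source."""
    return sources * consumers


def hub_and_spoke(sources: int, consumers: int) -> int:
    """Each source lands once; each consumer reads the shared layer once."""
    return sources + consumers


for n_sources, n_consumers in [(5, 10), (8, 100)]:
    print(
        f"{n_sources} sources, {n_consumers} consumers: "
        f"{point_to_point(n_sources, n_consumers)} point-to-point vs "
        f"{hub_and_spoke(n_sources, n_consumers)} hub-and-spoke"
    )
```

The gap widens as either side grows, which is why the pattern matters more at one hundred dashboards than at ten.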

The Three-Layer Pattern

The pattern that nearly every modern data architecture converges on has three layers. A raw layer holds source data unchanged, partitioned by ingestion date. A curated layer holds cleaned, joined, deduplicated, business-ready tables. A serving layer holds shapes specific to particular consumers (a dashboard mart, a feature store table). The Databricks community calls this the medallion architecture (bronze, silver, gold), but the idea predates the name and the tooling by decades. The shape is what matters.

Layer 1: Raw

Source data unchanged

Files or tables that mirror the source. Partitioned by ingestion date. The source of truth for everything downstream. Cheap to store, cheap to replay.

Layer 2: Curated

Cleaned, joined, conformed

Business-ready datasets that multiple consumers read. One canonical fact_orders, one canonical dim_customer. The contracts the rest of the company relies on.

Layer 3: Serving

Consumer-specific marts

Pre-aggregated tables and feature store features shaped to a particular dashboard, model, or application. Cheap to query because the work is precomputed.
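
A minimal in-memory sketch of the three layers, assuming a tiny hypothetical orders feed. The sample rows, the `_ingest_date` column, and the function names are illustrative only; a real implementation would live in object storage and a warehouse rather than Python lists.

```python
from collections import defaultdict

# Raw layer: rows exactly as the source produced them (hypothetical sample data).
raw_orders = [
    {"order_id": "o1", "customer_id": "c1", "amount_cents": "1200", "_ingest_date": "2026-04-25"},
    {"order_id": "o1", "customer_id": "c1", "amount_cents": "1200", "_ingest_date": "2026-04-25"},  # duplicate
    {"order_id": "o2", "customer_id": "c2", "amount_cents": "800", "_ingest_date": "2026-04-25"},
]


def build_curated(raw_rows):
    """Curated layer: deduplicate on order_id and cast amounts to integers."""
    by_id = {}
    for row in raw_rows:
        by_id[row["order_id"]] = {
            "order_id": row["order_id"],
            "customer_id": row["customer_id"],
            "amount_cents": int(row["amount_cents"]),
        }
    return list(by_id.values())


def build_serving(curated_rows):
    """Serving layer: revenue per customer, precomputed for one dashboard."""
    totals = defaultdict(int)
    for row in curated_rows:
        totals[row["customer_id"]] += row["amount_cents"]
    return dict(totals)


fact_orders = build_curated(raw_orders)
revenue_by_customer = build_serving(fact_orders)
print(fact_orders)
print(revenue_by_customer)
```

The raw list is never modified, so the curated and serving outputs can be rebuilt at any time by rerunning the two functions.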

Why the Middle Layer Has to Be Shared

The case against a shared layer is real and worth taking seriously. A shared layer is a coordination point. Multiple teams have to agree on what fact_orders means, what counts as a customer, how returns are handled. Coordination is slow. Without coordination, every team moves faster in the short term. The trap is that the speed compounds into incompatibility. Three months in, the marketing team's revenue number does not match the finance team's revenue number, and a meeting gets booked to figure out why. The meeting takes longer than the coordination would have. The shared layer is not a tool choice; it is a decision to pay coordination cost up front instead of reconciliation cost forever.

•Without a Shared Curated Layer

Three teams compute revenue three different ways
Source schema changes break N pipelines simultaneously
New consumers re-implement the same joins from scratch
Debugging starts with 'whose pipeline owns this column'

✓With a Shared Curated Layer

One canonical revenue definition; debates resolved in the curated layer
Source changes break one pipeline (the extract); curated tables protect downstream
New consumers read existing curated tables; little new pipeline work
Debugging follows the layer boundary; ownership is clear

When to Skip the Curated Layer

The shared curated layer earns its cost when more than one consumer reads from it. A company with one analyst reading one dashboard does not need a curated layer; the dashboard's SQL is already the curated logic. A company with three analysts and four dashboards probably does, because the same logic is starting to be repeated. The decision rule is the same as the four-question pipeline test from the beginner tier, applied at the architecture level: if the same prepared data is read by more than one consumer, the prepared data deserves to live in a shared layer.

Signals that a curated layer is overdue:

▸Two teams report different numbers for the same metric
▸A schema change in a source breaks more than one downstream report
▸Analysts ask 'what is the right fact_orders table' and get different answers
▸The same five-table join shows up in multiple dashboards' SQL
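
The decision rule can be sketched as a small check: list which prepared datasets each consumer reads, then flag any dataset with more than one reader. The consumer and dataset names below are hypothetical.

```python
from collections import defaultdict


def datasets_needing_shared_layer(reads):
    """reads maps each consumer to the set of prepared datasets it reads.

    Returns the datasets read by more than one consumer, with their readers.
    """
    readers = defaultdict(list)
    for consumer, datasets in reads.items():
        for dataset in datasets:
            readers[dataset].append(consumer)
    return {d: sorted(c) for d, c in readers.items() if len(c) > 1}


# Hypothetical inventory of who reads what.
reads = {
    "finance_dashboard": {"orders_joined", "refunds"},
    "marketing_dashboard": {"orders_joined"},
    "ops_report": {"shipments"},
}
shared = datasets_needing_shared_layer(reads)
print(shared)
```

Anything that appears in the result is a candidate for a curated table; datasets with a single reader can stay in that consumer's own SQL for now.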

TIP

Build the raw layer first, build the first curated table second, and only add a serving layer when a specific consumer's performance needs it. Premature serving layers are the most common cause of stale pre-aggregations that nobody trusts.

ETL vs ELT

Distinguish ETL from ELT, explain why cloud warehouses shifted the default, and pick the right model per transform.

The two acronyms ETL and ELT differ by a single letter, but the architectural implications are large. ETL extracts data from sources, transforms it on a separate compute layer, and loads the transformed result into the destination. ELT extracts the data, loads it into the destination warehouse first, and runs the transforms inside that warehouse. The order is the entire difference, and that order changes which system bears the cost of the transform work.

Step	ETL	ELT
Extract	Pull from source into a staging area	Pull from source into the warehouse
Transform	Run on a separate compute layer (Spark, Python, an ETL tool)	Run inside the warehouse using SQL
Load	Write the transformed result to the warehouse	Already loaded; transform produces tables in the warehouse
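
The difference in ordering can be made concrete with a small sketch that uses an in-memory SQLite database as a stand-in for the warehouse. SQLite is only an assumption for illustration; the table names and sample rows are hypothetical.

```python
import sqlite3

# Hypothetical source rows: (order_id, amount_cents, order_status)
source_rows = [("o1", 1250, "paid"), ("o2", 900, "cart_abandoned"), ("o3", 4000, "paid")]


def run_etl(rows):
    """ETL: transform in Python first, then load only the finished result."""
    transformed = [
        (order_id, cents / 100.0)
        for order_id, cents, status in rows
        if status != "cart_abandoned"
    ]
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE fct_orders (order_id TEXT, amount REAL)")
    warehouse.executemany("INSERT INTO fct_orders VALUES (?, ?)", transformed)
    return warehouse


def run_elt(rows):
    """ELT: load raw rows unchanged, then transform with SQL inside the warehouse."""
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute(
        "CREATE TABLE raw_orders (order_id TEXT, amount_cents INTEGER, order_status TEXT)"
    )
    warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)
    warehouse.execute(
        "CREATE TABLE fct_orders AS "
        "SELECT order_id, amount_cents / 100.0 AS amount "
        "FROM raw_orders WHERE order_status != 'cart_abandoned'"
    )
    return warehouse


query = "SELECT order_id, amount FROM fct_orders ORDER BY order_id"
etl_result = run_etl(source_rows).execute(query).fetchall()
elt_warehouse = run_elt(source_rows)
elt_result = elt_warehouse.execute(query).fetchall()
raw_row_count = elt_warehouse.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
print(etl_result == elt_result, raw_row_count)
```

Both paths produce the same fct_orders table. The ELT warehouse also still holds the three raw rows, which is what lets other consumers inspect or reprocess them later.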

Why ETL Was the Default

Before cloud warehouses, storing data was expensive and querying it was even more expensive. On-premises warehouses such as Teradata or Oracle were licensed and sized by capacity, so every gigabyte stored and every CPU cycle spent inside the warehouse was costly. The economics pushed data engineering teams to transform data outside the warehouse, on cheaper general-purpose hardware, and load only the small, finished result. The compute happened on a dedicated ETL server such as Informatica, a Hadoop or Spark cluster, or a custom Python runner. The warehouse received only what was needed. ETL was less a preference than the only economically viable design for most teams.

Why ELT Took Over

Cloud warehouses changed the math. Snowflake and BigQuery separate storage from compute, as does Redshift on its RA3 and serverless offerings; all three charge comparatively little per terabyte stored and scale compute on demand. Loading raw data is now cheap. Transforming that data using SQL on warehouse compute is fast and elastic. The old constraint that pushed transforms outside the warehouse no longer applies. Modern stacks load the raw data first and use the warehouse itself as the transform engine. dbt is the most widely used tool for this pattern, and it works precisely because the warehouse is now usually the cheapest place to do the work.

	-- ELT: the transform is a SQL model that runs inside Snowflake
	-- (this is a dbt model in models/marts/fct_orders.sql)

	SELECT
	    o.order_id,
	    o.customer_id,
	    o.order_timestamp,
	    c.country,
	    o.amount_cents / 100.0 AS amount,
	    -- the LEFT JOIN leaves refund_cents NULL for orders with no payment row,
	    -- so default it to 0 to keep net_amount_cents from becoming NULL
	    o.amount_cents - COALESCE(p.refund_cents, 0) AS net_amount_cents
	FROM {{ ref('stg_orders') }} o
	LEFT JOIN {{ ref('stg_payments') }} p USING (order_id)
	LEFT JOIN {{ ref('dim_customer') }} c USING (customer_id)
	WHERE o.order_status != 'cart_abandoned'

When ETL Still Wins

ELT is the modern default, but ETL is still the right answer in specific cases. When the transform requires logic the warehouse SQL dialect cannot express (image processing, complex graph algorithms, ML feature extraction), an external compute layer is unavoidable. When the source data contains PII that must be redacted before it enters the warehouse for compliance reasons, the transform happens before load. When the source is so large that loading raw is more expensive than transforming first (rare in 2026, common in 2014), pre-transform pays. The lesson is that ETL versus ELT is not a religion; it is a cost-and-capability comparison that depends on the warehouse, the data shape, and the transform logic.

•ETL Wins When

Transform requires non-SQL logic (ML, image, graph)
PII must be removed before data enters the warehouse
Source data is too large to land cheaply
Warehouse compute is significantly more expensive than the alternative

✓ELT Wins When

Cloud warehouse is in use (Snowflake, BigQuery, Redshift, Databricks)
Transform logic is expressible in SQL or dbt models
Warehouse compute is elastic and cost-competitive
Multiple consumers need to inspect the raw data, not just the transformed result

The Hybrid Reality

Most production environments are hybrids. Heavy joins and aggregations run as ELT in Snowflake using dbt. Image and unstructured-text processing runs as ETL on a Spark cluster. PII redaction runs as ETL before load. The right framing is not 'this company is ETL' or 'this company is ELT' but 'this transform is ETL and that transform is ELT, and the architecture documents which is which.' Treating ETL versus ELT as a per-transform decision rather than a per-company decision keeps the architecture pragmatic.
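
The per-transform framing can be sketched as a small decision function. The `Transform` fields and the rule order are assumptions distilled from the cases above, not a standard; a real team would add cost estimates and its own exceptions.

```python
from dataclasses import dataclass


@dataclass
class Transform:
    name: str
    expressible_in_sql: bool
    must_redact_before_load: bool = False
    raw_too_large_to_land: bool = False


def choose_model(transform: Transform) -> str:
    """Return 'ETL' or 'ELT' for one transform, defaulting to ELT."""
    if transform.must_redact_before_load:
        return "ETL"  # compliance: PII must not reach the warehouse
    if not transform.expressible_in_sql:
        return "ETL"  # image, graph, or ML work needs external compute
    if transform.raw_too_large_to_land:
        return "ETL"  # landing raw would cost more than transforming first
    return "ELT"


# Hypothetical inventory of transforms for one company.
inventory = [
    Transform("join_orders_payments", expressible_in_sql=True),
    Transform("redact_customer_pii", expressible_in_sql=True, must_redact_before_load=True),
    Transform("image_embeddings", expressible_in_sql=False),
]
decisions = {t.name: choose_model(t) for t in inventory}
print(decisions)
```

Keeping the inventory and its decisions in one place is one way to document which transforms run where.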

Signals that ETL is the wrong choice for a particular transform:

▸The transform is a SQL join that has been rewritten in PySpark
▸Engineers maintain a separate compute cluster only for transforms
▸The transformed table is loaded back into the warehouse anyway
▸Adding a column requires a code change in two systems

✓Do

Default to ELT when the warehouse can do the work
Use ETL for genuinely non-SQL workloads (ML, image, graph, redaction)
Document which transforms are ETL and which are ELT in the architecture diagram

✗Don't

Treat ETL versus ELT as a one-time architectural decision; revisit per-transform
Maintain a Spark cluster for SQL-shaped work because of historical inertia
Hide the choice in tooling; future maintainers need to see where transforms run

[Figure: app DB (source) -> extract -> clean/shape (transform) -> warehouse -> dashboard (consumer)]

The four roles every pipeline has: a source produces data, transforms reshape it, storage holds it, and a consumer reads it. Data flows left to right.

The DAG: Why Dependencies Form

Read a pipeline as a DAG, identify nodes and edges, and explain why cycles cannot exist in a valid pipeline graph.

A pipeline with one transform is a line: source, transform, destination. A pipeline with several transforms that depend on each other is a graph. The data engineering term for the structure is a directed acyclic graph, abbreviated DAG. Directed because data flows one way. Acyclic because no transform may depend, directly or indirectly, on its own output. Orchestration tools such as Airflow, Dagster, and Prefect represent task dependencies as a graph of this kind because the structure has the right properties: it is computable, it is debuggable, and a valid DAG cannot deadlock on its own dependencies.

Anatomy of a DAG

Term	Meaning	In a Pipeline
Node	A unit of work	An extract, a transform, a load, a quality check
Edge	A dependency between nodes	B runs after A; B reads what A produced
Directed	Edges have direction	Data and dependency flow one way; no read-back
Acyclic	No cycles allowed	Cannot have A depends on B depends on C depends on A

Why Cycles Are Forbidden

A cycle in a dependency graph is a deadlock waiting to happen. If A depends on B, B depends on C, and C depends on A, the orchestrator cannot start any of them. Each is waiting for the others. Even if the orchestrator could start one (by ignoring the cycle), the result would be undefined: the output of A on this run depends on the output of A on a previous run, which depends on its previous run, and so on backward in time. The acyclic constraint is not a stylistic preference; it is what makes the graph computable in finite time with a defined result.

Ways a cycle can accidentally creep in:

▸A transform reads from a table that another transform overwrites later in the same run
▸A circular foreign key in the data model leaks into the pipeline as a circular dependency
▸Two teams add an edge each, neither aware that the new edges close a loop
▸A backfill script is added to the production DAG and creates a self-reference
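
Accidental cycles like these can be caught before a run starts. The sketch below is one common approach, a depth-first search that reports the first loop it finds; it is not taken from any particular orchestrator, and the task names are placeholders.

```python
def find_cycle(deps):
    """deps maps each task to the list of tasks it depends on.

    Returns one cycle as a path such as ['A', 'B', 'C', 'A'], or None.
    """
    unvisited, in_progress, finished = 0, 1, 2
    state = {}
    path = []

    def visit(node):
        state[node] = in_progress
        path.append(node)
        for parent in deps.get(node, []):
            parent_state = state.get(parent, unvisited)
            if parent_state == in_progress:
                # parent is already on the current path: the loop is closed
                return path[path.index(parent):] + [parent]
            if parent_state == unvisited:
                found = visit(parent)
                if found:
                    return found
        path.pop()
        state[node] = finished
        return None

    for node in deps:
        if state.get(node, unvisited) == unvisited:
            found = visit(node)
            if found:
                return found
    return None


looped = find_cycle({"A": ["B"], "B": ["C"], "C": ["A"]})
self_ref = find_cycle({"backfill": ["backfill"]})
clean = find_cycle({"publish": ["quality"], "quality": ["stage_join"], "stage_join": []})
print(looped, self_ref, clean)
```

Running a check of this kind when a DAG definition is merged surfaces the loop at review time rather than at 3am.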

What a DAG Is Not

DAG is sometimes used loosely to mean 'a pipeline.' That is imprecise. A DAG is a structure: nodes and edges with the directed and acyclic properties. A pipeline can be modeled as a DAG, but a DAG is not a pipeline by itself. The distinction matters because some pipelines are not pure DAGs. Streaming pipelines that process unbounded data have feedback loops in some senses; ML pipelines that retrain on their own outputs have learning dynamics that look cyclic. The orchestration layer for those workloads still uses a DAG to schedule work, but the workload itself can have time-shifted self-references that the DAG hides by treating each run as separate.

Reading a DAG

[Diagram: extract_orders, extract_customers, and extract_payments (sources) -> stage_join (transform) -> quality check (quality) -> publish (publish)]

Three extract tasks run in parallel. The stage_join task waits for all three. A quality check runs after the join. Publishing to consumer-facing tables runs only after the quality check passes. The graph reads as 'do these three things, then this, then this, then this.' Spoken out loud it is the same as a recipe with prerequisites: gather ingredients first, then mix, then taste, then serve. The DAG encodes the prerequisites in a form a machine can schedule.

Topological Order: How the Orchestrator Knows What to Run

An orchestrator schedules a DAG by computing a topological order: a sequence in which every node appears after all of its dependencies. There may be many valid topological orders for the same DAG, which is why parallelism works (any task whose dependencies are satisfied can start). The acyclic constraint is what guarantees that a topological order exists at all. A graph with a cycle has no topological order, which is the formal way of saying the orchestrator cannot run it.

	# Sketch of how an orchestrator schedules tasks
	ready = set(node for node in dag.nodes if not dag.deps_of(node))
	done = set()
	while ready:
	    task = ready.pop()
	    run(task)
	    done.add(task)
	    for child in dag.children_of(task):
	        if all(parent in done for parent in dag.deps_of(child)):
	            ready.add(child)
	# If `done` does not contain every node, the DAG had a cycle
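
For a version that runs as written, Python's standard library includes `graphlib.TopologicalSorter` (Python 3.9 and later). The sketch below applies it to the extract, join, quality, publish example; the task names `quality_check` and `publish` are assumed labels, and grouping tasks into waves is only one way to show which tasks could run in parallel.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of tasks it depends on.
dag = {
    "extract_orders": set(),
    "extract_customers": set(),
    "extract_payments": set(),
    "stage_join": {"extract_orders", "extract_customers", "extract_payments"},
    "quality_check": {"stage_join"},
    "publish": {"quality_check"},
}

# One valid topological order (several exist because the extracts are independent).
order = list(TopologicalSorter(dag).static_order())


def schedule_in_waves(graph):
    """Group tasks into waves; every task in a wave has all dependencies met."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = schedule_in_waves(dag)
print(waves)

try:
    schedule_in_waves({"A": {"C"}, "B": {"A"}, "C": {"B"}})
    cycle_rejected = False
except CycleError:
    cycle_rejected = True
print(cycle_rejected)
```

The first wave contains the three extracts, which is the parallelism the text describes; the cyclic graph is rejected before any task runs.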

DAG Boundaries

A real production environment has many DAGs, not one. A DAG is a unit of scheduling and ownership; a typical company has a DAG per source, per business domain, or per consumer. Cross-DAG dependencies are common and are handled with sensors or asset-based dependencies (the next DAG starts when the previous DAG's output table updates). The boundary between DAGs is itself an architectural decision: too many small DAGs create coordination overhead, and too few large DAGs create blast-radius problems where one task's failure stalls unrelated work.
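
The idea behind an asset-based dependency can be sketched in a few lines. This is a simplification under the assumption that each output table records when it was last updated; real orchestrators implement sensors and asset triggers with their own APIs, which are not shown here.

```python
from datetime import datetime
from typing import Optional


def should_trigger(upstream_updated_at: datetime,
                   downstream_last_run_at: Optional[datetime]) -> bool:
    """Start the downstream DAG only if the upstream table changed since its last run."""
    if downstream_last_run_at is None:
        return True  # the downstream DAG has never run
    return upstream_updated_at > downstream_last_run_at


# Hypothetical timestamps for an upstream table and the downstream DAG's last run.
table_updated = datetime(2026, 4, 25, 3, 0)
first_check = should_trigger(table_updated, datetime(2026, 4, 24, 3, 5))
second_check = should_trigger(table_updated, datetime(2026, 4, 25, 3, 5))
print(first_check, second_check)
```

The downstream DAG never needs to know how the upstream DAG is built; it only watches the table that sits on the boundary.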

Every dependency between transforms is an edge in the DAG.

Cycles deadlock the orchestrator and produce undefined results; the acyclic constraint is what makes scheduling possible.

DAG boundaries follow ownership; cross-DAG dependencies use sensors or asset-based triggers.

Reading a Real Pipeline Diagram

Read a real production pipeline diagram with multiple sources, multiple consumers, branches, and joins, and identify cadence, failure behavior, and ownership.

A diagram from a real production environment is denser than the toy diagrams of the beginner tier. It has multiple sources, multiple consumers, branches, joins, and a layered middle. The same reading skills apply, but the eye has to be trained to find the structure. The exercise below walks through a real-shaped diagram and names every element.

The Diagram

[Diagram: five columns, left to right: Sources, Raw Layer (S3), Curated Layer (Snowflake), Serving, Consumers. The sources are Postgres orders, Stripe API, Salesforce CRM, and Mobile events (Kafka -> raw.events/hr=...).]

Four sources on the left. Four raw landing zones in S3, partitioned by date or hour. Four curated tables in Snowflake (fct_orders, fct_sessions, dim_customer, and the joined mart_revenue). Two serving destinations: a feature store and a reverse-ETL push. Three consumers: a Looker dashboard, an ML training job, and Salesforce as a reverse-ETL target. Read left to right, the system has the same structure as the beginner-tier example, only larger.

Annotation: Cadence

The diagram is incomplete without timing. Postgres orders is pulled hourly. The Stripe API is pulled every fifteen minutes because the rate limit allows it and finance wants near-real-time revenue. Salesforce CRM is pulled once a day because the data does not change faster. Mobile events stream continuously through Kafka into S3, micro-batched every five minutes. Cadence is not visible in a static diagram, but a real architecture document annotates every edge with how often the work runs.

Edge	Cadence	Why
Postgres orders -> raw.orders	Hourly	App writes continuously; hourly is the freshness bar set by finance
Stripe API -> raw.payments	Every 15 minutes	Finance needs near-real-time revenue; rate limit allows this cadence
Salesforce CRM -> raw.accounts	Daily at 2am	CRM data changes slowly; once a day is more than enough
Mobile events Kafka -> raw.events	Continuous (micro-batch every 5 min)	Volume is too high for hourly; streaming consumers expect sub-hour freshness

Annotation: Failure Behavior

The diagram is also incomplete without failure semantics. What happens if Stripe is down at the moment the 15-minute pull starts? The pipeline retries with exponential backoff up to a limit. If the limit is hit, the run is marked failed and an alert fires. The next run, fifteen minutes later, will pull both batches together because the high-water mark advances only on success. What happens if the dbt transform that builds fct_orders fails? The published mart_revenue table is not updated; downstream Looker users see the previous run's data with a freshness warning. None of this is visible in the boxes-and-arrows view. All of it is critical to operating the system.
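
The retry and high-water-mark behavior described above can be sketched as follows. The `fetch` callable, the integer mark, and the default limits are assumptions for illustration; the `sleep` parameter is injectable so the example runs without waiting.

```python
import time


def pull_with_retry(fetch, high_water_mark, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(since) with exponential backoff.

    Returns (rows, new_mark). The mark advances only when a fetch succeeds;
    if every attempt fails, the error propagates and the caller keeps the old mark.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(high_water_mark)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # limit hit: the run is marked failed
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


# Hypothetical source that is down for the first two attempts.
attempts = {"count": 0}
delays = []


def flaky_fetch(since):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("source unavailable")
    return [since + 1, since + 2], since + 2


rows, new_mark = pull_with_retry(flaky_fetch, 10, sleep=delays.append)
print(rows, new_mark, delays)
```

Because a failed run leaves the mark untouched, the next scheduled run asks for everything since the last success, which is how the two batches end up pulled together.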

Annotation: Ownership

Every node in the diagram has an owner. The data engineering team owns the extracts and the curated layer. The analytics engineering team owns the dbt models. The marketing analytics team owns the Looker dashboards. The ML team owns the feature store and the training job. Reverse-ETL is jointly owned by data engineering (the mechanism) and the GTM team (the business logic). When something breaks, the owner of the broken node is the first responder. A diagram that does not encode ownership produces incidents where everyone assumes someone else is responsible.

Five things a complete pipeline diagram must encode:

▸The role of every box (source, transform, storage, or consumer)
▸The direction of data flow on every edge
▸The cadence of each edge (hourly, daily, continuous, event-triggered)
▸The failure behavior at each transform (retry, fail-fast, fail-open)
▸The owner of every node so on-call knows who is paged
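
One way to keep these annotations from going missing is to store the diagram as data and check it. The sketch below simplifies by attaching cadence, failure behavior, and owner to each edge; the `Edge` fields and the example values are assumptions based on the cadence table and ownership notes above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Edge:
    source: str
    target: str
    cadence: Optional[str] = None
    on_failure: Optional[str] = None
    owner: Optional[str] = None


def missing_annotations(edges):
    """List every edge annotation that has not been filled in."""
    problems = []
    for edge in edges:
        for field_name in ("cadence", "on_failure", "owner"):
            if getattr(edge, field_name) is None:
                problems.append(f"{edge.source} -> {edge.target}: missing {field_name}")
    return problems


edges = [
    Edge("Postgres orders", "raw.orders", "hourly", "retry", "data engineering"),
    Edge("Stripe API", "raw.payments", "every 15 minutes", "retry", "data engineering"),
    Edge("Salesforce CRM", "raw.accounts", "daily at 2am"),  # failure behavior and owner not recorded
]
problems = missing_annotations(edges)
print(problems)
```

A check like this can run in CI so an undocumented edge fails review instead of surfacing during an incident.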

How to Draw One

When drawing a pipeline diagram for a system that does not yet have one, start from the consumer. Name the dashboard, model, or application that needs the data. Walk backward, asking 'what does this consumer read from' until the answer is a source the team does not own. The walk produces the boxes; the arrows are obvious from the walk. Annotate cadence and ownership last; they are the easiest pieces to get wrong if drawn first because the structure has not yet been validated. The discipline is that diagrams are written for readers, not for writers, and the consumer-first walk produces diagrams that read in the natural direction.
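
The consumer-first walk is a breadth-first traversal over a "reads from" mapping. In the sketch below the mapping is a hypothetical slice of the example system; which tables feed mart_revenue is an assumption made for illustration.

```python
from collections import deque


def walk_back(consumer, reads_from):
    """Walk from a consumer to everything it depends on.

    Returns (boxes, edges): boxes in discovery order, edges as (upstream, downstream).
    """
    boxes, edges = [consumer], []
    queue = deque([consumer])
    while queue:
        node = queue.popleft()
        for upstream in reads_from.get(node, []):
            edges.append((upstream, node))
            if upstream not in boxes:
                boxes.append(upstream)
                queue.append(upstream)
    return boxes, edges


# Hypothetical answers to "what does this read from?"
reads_from = {
    "looker_dashboard": ["mart_revenue"],
    "mart_revenue": ["fct_orders", "dim_customer"],
    "fct_orders": ["raw.orders", "raw.payments"],
    "dim_customer": ["raw.accounts"],
    "raw.orders": ["postgres_orders"],
    "raw.payments": ["stripe_api"],
    "raw.accounts": ["salesforce_crm"],
}
boxes, edges = walk_back("looker_dashboard", reads_from)
# A box with no entry in reads_from is a source the team does not own.
sources = [box for box in boxes if box not in reads_from]
print(sources)
```

The walk stops on its own at the external sources, which gives the left edge of the diagram without anyone having to decide where to start drawing.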

TIP

Treat the architecture diagram as documentation that lives with the code, not a slide. A diagram that is regenerated from the orchestrator (Airflow's UI, Dagster's asset graph) cannot drift from reality the way a hand-drawn slide can.

One Source, Two Different Consumers

Design a pipeline that serves a dashboard and a feature store from the same source by splitting at the curated layer rather than at ingestion.

A common architecture problem is one rich source feeding two consumers with different needs. The example here is a single Kafka topic of user activity events being read by two consumers: a daily executive dashboard and a machine learning feature store that powers churn prediction. The same event stream, two completely different shapes at the edge.

The Source

	{
	"event_id": "evt_8a91b",
	"user_id": "u_4f12c",
	"event_type": "page_view",
	"event_timestamp": "2026-04-25T14:33:08.412Z",
	"properties": {
	"page": "/pricing",
	"referrer": "google",
	"device": "mobile"
	}
	}

Each event is small, semi-structured, and produced at a rate of roughly five thousand per second at peak. The Kafka topic has thirty-day retention and twelve partitions. The pipeline reads continuously and lands raw events in S3, partitioned by hour. From there, the same raw data feeds two very different transforms.
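
Landing events by hour comes down to deriving a partition key for each event. The sketch below keys on the event's own timestamp in UTC, which is an assumption (a landing job may key on ingestion time instead), and the `YYYY-MM-DD/HH` key format is hypothetical rather than the layout used in this system.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hour_partition(event_timestamp: str) -> str:
    """Return a hypothetical 'YYYY-MM-DD/HH' partition key in UTC."""
    parsed = datetime.fromisoformat(event_timestamp.replace("Z", "+00:00"))
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%d/%H")


def group_by_hour(events):
    """Bucket raw events by partition key without changing their contents."""
    partitions = defaultdict(list)
    for event in events:
        partitions[hour_partition(event["event_timestamp"])].append(event)
    return dict(partitions)


events = [
    {"event_id": "evt_8a91b", "event_timestamp": "2026-04-25T14:33:08.412Z"},
    {"event_id": "evt_hypothetical_2", "event_timestamp": "2026-04-25T14:59:59.000Z"},
    {"event_id": "evt_hypothetical_3", "event_timestamp": "2026-04-25T15:00:00.000Z"},
]
partitions = group_by_hour(events)
print({key: len(rows) for key, rows in partitions.items()})
```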

Consumer 1: The Executive Dashboard

The executive dashboard wants daily active users, broken out by acquisition channel and device type, with weekly and monthly aggregates available on click. The freshness bar is 'updated by 7am Pacific each morning.' The data shape is a small fact table: one row per user per day. The scale is millions of rows over the trailing year, easily queried in Snowflake.

	-- One row per user per day. SELECT DISTINCT collapses the per-event rows;
	-- the window functions cannot be combined with a GROUP BY on other columns.
	INSERT INTO mart.daily_active_users
	SELECT DISTINCT
	    DATE(event_timestamp) AS activity_date,
	    user_id,
	    FIRST_VALUE(properties:referrer::STRING) OVER (PARTITION BY user_id, DATE(event_timestamp) ORDER BY event_timestamp) AS acquisition_channel,
	    FIRST_VALUE(properties:device::STRING) OVER (PARTITION BY user_id, DATE(event_timestamp) ORDER BY event_timestamp) AS first_device
	FROM raw.events
	WHERE DATE(event_timestamp) = :run_date;

Consumer 2: The Feature Store

The churn prediction model needs a different shape entirely. For every user, it needs counts and rates over rolling windows: page views in the last 7 days, last 30 days, and last 90 days; sessions per week; the time since the most recent activity; the ratio of mobile to desktop usage. The shape is wide, with dozens of features per user. The freshness bar is 'updated daily, before the 6am model training run.' The output is a feature table in the feature store, indexed by user_id, that the training pipeline reads as a flat set of features.

	-- COUNT_IF stands in for COUNT(*) FILTER (WHERE ...), assuming the transform
	-- runs in Snowflake, which does not accept the FILTER clause.
	INSERT INTO feature_store.user_activity_features
	SELECT
	    user_id,
	    COUNT_IF(event_timestamp >= :run_date - INTERVAL '7 days') AS views_last_7d,
	    COUNT_IF(event_timestamp >= :run_date - INTERVAL '30 days') AS views_last_30d,
	    COUNT_IF(event_timestamp >= :run_date - INTERVAL '90 days') AS views_last_90d,
	    MAX(event_timestamp) AS last_active_at,
	    -- share of events from mobile within the 90-day scan, despite the "lifetime" name
	    AVG(CASE WHEN properties:device::STRING = 'mobile' THEN 1 ELSE 0 END) AS mobile_ratio_lifetime
	FROM raw.events
	WHERE event_timestamp >= :run_date - INTERVAL '90 days'
	GROUP BY user_id;

What the Two Consumers Share

Both transforms read from the same raw layer. Both depend on the same ingestion job that lands events from Kafka into S3 in hourly partitions. Both run as scheduled DAGs in Airflow. Both write to durable storage that downstream consumers query. The shared raw layer is the architectural pivot. Without it, the dashboard team would build its own Kafka consumer and the ML team would build a different one, and the two would diverge in ways that produce different answers to the same question.

What the Two Consumers Need Differently

Property	Dashboard	Feature Store
Output shape	One row per user per day	One row per user, with dozens of feature columns
Freshness bar	Daily by 7am Pacific	Daily by 6am, before model training
Lookback window	Most recent day; weekly and monthly aggregates pre-computed	Rolling 7, 30, 90 days for every user
Compute cost shape	Small daily aggregation	Large rolling-window scan over 90 days
Failure tolerance	Stale dashboard for one morning is acceptable	Stale features cause model decisions on outdated data; tighter SLA

Why the Split Belongs in the Curated Layer, Not Earlier

A naive design would build two Kafka consumers, one for the dashboard and one for the feature store, each maintaining its own state and writing to its own destination. That design is fragile. Schema changes have to be coordinated in two places. Backfills have to be coordinated in two places. The two consumers can drift in subtle ways: one filters bot traffic and the other does not, leading to numbers that disagree. The shared raw layer pushes the split downstream, where the consumer-specific shapes are computed from a common source of truth. Drift becomes a transform problem, not a consumption problem, and transform problems are easier to fix than consumption problems because the raw data is still there to recompute from.

•Two Direct Consumers

Each consumer maintains its own Kafka offsets
Schema changes coordinated across two consumer teams
Bot filtering implemented twice, possibly differently
Backfill requires both consumers to be replayed

✓Shared Raw Layer, Two Transforms

One ingestion job manages Kafka offsets
Schema changes handled in the single raw landing job
Bot filtering implemented once in the curated layer
Backfill replays raw, then both transforms compute from the same data
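
The shape of "one extract, two transforms" can be sketched in memory. The sample events, the bot rule (`device == "bot"`), and the function names are hypothetical; the point is that both consumer-specific outputs are computed from the same filtered events.

```python
from collections import Counter
from datetime import date, datetime, timedelta

# Shared raw layer: one copy of the events (hypothetical sample data).
raw_events = [
    {"user_id": "u1", "event_type": "page_view", "ts": datetime(2026, 4, 25, 14, 33), "device": "mobile"},
    {"user_id": "u1", "event_type": "page_view", "ts": datetime(2026, 4, 20, 9, 0), "device": "desktop"},
    {"user_id": "u2", "event_type": "page_view", "ts": datetime(2026, 4, 1, 8, 0), "device": "mobile"},
    {"user_id": "bot1", "event_type": "page_view", "ts": datetime(2026, 4, 25, 1, 0), "device": "bot"},
]


def curated_events(events):
    """Bot filtering happens once, here, so both consumers agree."""
    return [e for e in events if e["device"] != "bot"]


def daily_active_users(events, run_date):
    """Dashboard shape: the set of users active on run_date."""
    return {e["user_id"] for e in events if e["ts"].date() == run_date}


def views_last_n_days(events, run_date, days):
    """Feature shape: page views per user over a rolling window ending on run_date."""
    cutoff = run_date - timedelta(days=days)
    return Counter(
        e["user_id"]
        for e in events
        if e["event_type"] == "page_view" and cutoff <= e["ts"].date() <= run_date
    )


run_date = date(2026, 4, 25)
clean = curated_events(raw_events)
dau = daily_active_users(clean, run_date)
views_7d = views_last_n_days(clean, run_date, 7)
views_30d = views_last_n_days(clean, run_date, 30)
print(dau, dict(views_7d), dict(views_30d))
```

If the bot rule changes, only `curated_events` changes, and a backfill reruns both transforms over the same raw events.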

TIP

When two consumers ask for the same source data in different shapes, the answer is almost never two extracts. The answer is one extract and two transforms downstream of a shared raw layer.

PUTTING IT ALL TOGETHER

> A growth-stage company has eight sources, fifty-three transforms, and one hundred dashboards. The architecture diagram on the wall looks like a circuit board. Two engineers have quit because the on-call rotation is unpredictable. The new head of data engineering is asked: 'What is the smallest set of changes that would make this system operable again?'

Step one: introduce a shared raw layer so every source lands once, partitioned by ingestion date. The raw layer is the foundation that lets later layers exist.

Step two: define a curated layer with one canonical fact table per business domain (orders, payments, sessions, accounts). All consumer-facing reports read from these. Drift between teams becomes much harder because there is one definition of each entity.

Step three: pick ELT as the default for the curated layer because the warehouse is in use and SQL transforms are testable, versioned, and elastic. Hold ETL in reserve for the genuinely non-SQL workloads (image processing, ML feature extraction outside SQL).

Step four: model every transform graph as a DAG. Cycles fail at validation time, ownership is per node, and the orchestrator schedules in topological order. Cross-DAG dependencies use sensors or asset triggers.

Step five: for any source that feeds two consumers, split at the curated layer rather than at ingestion. One extract, two transforms downstream. The shared raw layer is the architectural pivot that prevents the system from drifting back into a hundred unrelated pipelines.

KEY TAKEAWAYS

A shared middle layer is the pivot: raw, curated, and serving layers turn N×M point-to-point pipelines into N+M layered pipelines.

ETL versus ELT is per-transform: default to ELT in cloud warehouses; reach for ETL only when the work is genuinely non-SQL.

DAGs make scheduling possible: directed for one-way flow, acyclic so the orchestrator can compute a topological order.

A complete diagram encodes more than boxes and arrows: cadence, failure behavior, and ownership turn a picture into operable documentation.

When two consumers need the same data in different shapes: one extract, two transforms downstream of a shared raw layer; never two ingestion jobs that drift apart.

When one pipeline becomes many, the question is not what to build but how the pieces fit

Category: Pipeline Architecture
Difficulty: intermediate
Duration: 30 minutes
Challenges: 0 hands-on challenges

Topics covered: Many Sources, One Curated Layer, ETL vs ELT, The DAG: Why Dependencies Form, Reading a Real Pipeline Diagram, One Source, Two Different Consumers
