What a Data Pipeline Is: Intermediate
When one pipeline becomes many, the question is not what to build but how the pieces fit
- Category
- Pipeline Architecture
- Difficulty
- intermediate
- Duration
- 30 minutes
- Challenges
- 0 hands-on challenges
Topics covered: Many Sources, One Curated Layer, ETL vs ELT, The DAG: Why Dependencies Form, Reading a Real Pipeline Diagram, One Source, Two Different Consumers
Lesson Sections
- Many Sources, One Curated Layer (concepts: paLayeredArchitecture, paMedallion)
A first pipeline is one source, one transform, one consumer. The vocabulary is small enough to fit on a napkin. A real production environment has many of each, and the question changes from 'what should this pipeline do' to 'how do these pipelines fit together so each one does not solve the same problem in a slightly different way.' The answer is almost always a shared middle layer that every pipeline writes to and reads from. Without that shared layer, the same data ends up extracted three time
- ETL vs ELT (concepts: paEltVsEtl)
The two acronyms ETL and ELT differ by a single letter, but the architectural implications are large. ETL extracts data from sources, transforms it on a separate compute layer, and loads the transformed result into the destination. ELT extracts the data, loads it into the destination warehouse first, and runs the transforms inside that warehouse. The order is the entire difference, and that order changes which system bears the cost of the transform work.
Why ETL Was the Default
Before cloud ware
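The difference in ordering can be sketched in a few lines. The example below is a minimal illustration, not a production pattern: an in-memory SQLite database stands in for the warehouse, and the table names (`orders_etl`, `raw_orders`, `orders_elt`), the sample rows, and the cleaning rule are all assumptions made up for the sketch.

```python
import sqlite3

# Hypothetical source rows: (order_id, amount as text). One row is malformed.
raw_rows = [("o1", "12.50"), ("o2", "7.25"), ("o3", "bad")]


def is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def etl(conn, rows):
    # ETL: transform outside the destination, then load only the clean result.
    cleaned = [(oid, float(amt)) for oid, amt in rows if is_number(amt)]
    conn.execute("CREATE TABLE orders_etl (order_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?)", cleaned)


def elt(conn, rows):
    # ELT: load the raw rows untouched, then transform with SQL in the destination.
    conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", rows)
    conn.execute(
        """
        CREATE TABLE orders_elt AS
        SELECT order_id, CAST(amount AS REAL) AS amount
        FROM raw_orders
        WHERE amount GLOB '[0-9]*'  -- keep rows whose amount starts with a digit
        """
    )


conn = sqlite3.connect(":memory:")  # stand-in for a warehouse
etl(conn, raw_rows)
elt(conn, raw_rows)

etl_total = conn.execute("SELECT SUM(amount) FROM orders_etl").fetchone()[0]
elt_total = conn.execute("SELECT SUM(amount) FROM orders_elt").fetchone()[0]
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
```

Both paths arrive at the same curated totals. The ELT path also leaves the malformed row in `raw_orders`, so the transform can be rewritten and rerun later without going back to the source.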
- The DAG: Why Dependencies Form (concepts: paDagOrchestration, paTopologicalOrder)
A pipeline with one transform is a line: source, transform, destination. A pipeline with several transforms that depend on each other is a graph. The data engineering term for the structure is a directed acyclic graph, abbreviated DAG. Directed because data flows one way. Acyclic because no transform may depend, directly or indirectly, on its own output. Every modern orchestration tool, from Airflow to Dagster to Prefect, models pipelines as DAGs because the structure has the right properties: i
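The two words in the name map directly onto code. The sketch below uses Python's standard-library `graphlib` module (Python 3.9+) to derive a valid run order from declared dependencies and to reject a cycle. The task names are hypothetical, and real orchestrators build their own graph structures; this only illustrates the underlying idea.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks it depends on.
dependencies = {
    "clean_orders": {"extract_orders"},
    "clean_sessions": {"extract_sessions"},
    "join_orders_sessions": {"clean_orders", "clean_sessions"},
    "daily_report": {"join_orders_sessions"},
}


def run_order(deps):
    """Return one valid execution order: every task appears after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def has_cycle(deps):
    """Return True if the dependency graph contains a cycle (so it is not a DAG)."""
    try:
        run_order(deps)
    except CycleError:
        return True
    return False


order = run_order(dependencies)
```

Several orders can be valid for the same graph. For example, the two extract tasks do not depend on each other, so a scheduler is free to run them in parallel.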
- Reading a Real Pipeline Diagram (concepts: paDiagramAnnotations)
A diagram from a real production environment is denser than the toy diagrams of the beginner tier. It has multiple sources, multiple consumers, branches, joins, and a layered middle. The same reading skills apply, but the eye has to be trained to find the structure. The exercise below walks through a real-shaped diagram and names every element.
The Diagram
Four sources on the left. Four raw landing zones in S3, partitioned by date or hour. Four curated tables in Snowflake (fct_orders, fct_sessio
- One Source, Two Different Consumers (concepts: paSharedRawLayer, paConsumerSpecificTransforms)
A common architecture problem is one rich source feeding two consumers with different needs. The example here is a single Kafka topic of user activity events being read by two consumers: a daily executive dashboard and a machine learning feature store that powers churn prediction. The same event stream, two completely different shapes at the edge.
The Source
Each event is small, semi-structured, and produced at a rate of roughly five thousand per second at peak. The Kafka topic has thirty-day re
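To make "two different shapes at the edge" concrete, here is a minimal sketch in which both consumers read the same list of events and each applies its own transform. The event fields (`user_id`, `event_type`, `day`), the sample data, and both function names are assumptions for illustration; the lesson does not specify the event schema, and a real implementation would read from Kafka rather than a Python list.

```python
from collections import Counter, defaultdict

# Hypothetical events; the real topic's schema is not given in the lesson.
events = [
    {"user_id": "u1", "event_type": "login", "day": "2024-01-01"},
    {"user_id": "u1", "event_type": "click", "day": "2024-01-01"},
    {"user_id": "u2", "event_type": "login", "day": "2024-01-01"},
    {"user_id": "u1", "event_type": "login", "day": "2024-01-02"},
]


def daily_active_users(evts):
    """Dashboard shape: one row per day with a distinct-user count."""
    users_by_day = defaultdict(set)
    for event in evts:
        users_by_day[event["day"]].add(event["user_id"])
    return {day: len(users) for day, users in users_by_day.items()}


def user_features(evts):
    """Feature-store shape: one row per user with per-event-type counts."""
    counts = defaultdict(Counter)
    for event in evts:
        counts[event["user_id"]][event["event_type"]] += 1
    return {user: dict(counter) for user, counter in counts.items()}


dashboard_rows = daily_active_users(events)
feature_rows = user_features(events)
```

The dashboard output is keyed by day and the feature output is keyed by user, yet neither transform changes the shared input, which is what lets one raw layer serve both.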