A subscription company at Series B scale collected three streams of data: app events from a mobile SDK, payment events from Stripe, and customer support tickets from Zendesk. The product manager wanted one chart: weekly active subscribers who had opened a support ticket in the last 30 days. The data existed. None of it lived in one place. App events were in a Kafka topic, payments were in Stripe's API behind paginated rate limits, and tickets were in a Postgres replica nobody owned. The chart that should have taken an afternoon took six weeks. The reason was that no pipeline existed. Building one is not glamorous work. It is the work that makes every chart, model, and decision downstream possible. This lesson draws the picture of what a pipeline is before any tool, framework, or vocabulary gets in the way.
Why Pipelines Exist
Recognize the structural reason pipelines exist and name the three gaps a pipeline closes.
Every company that runs software produces data in one shape and needs it in a different shape, in a different place, on a different schedule. That gap is the entire reason data engineering exists. The gap is not a bug. It is structural. Operational systems are built to serve one request at a time, touching a few rows, fast, with strict consistency. Analytical systems are built to scan billions of rows per query, with higher latency per query and relaxed consistency. The two are different machines optimized for different jobs.
Three Gaps That Force a Pipeline
Gap | What It Means | Concrete Example
Location | Data is created in one system, needed in another | App writes to Postgres; analyst reads from Snowflake
Shape | The shape that is fast for the writer is slow for the reader | Normalized rows for transactions; wide denormalized columns for dashboards
Time | Data is produced continuously; reports want a daily or hourly snapshot | Click events stream all day; the marketing team wants a 9am summary
A pipeline closes all three gaps at once. It moves data from where it lives to where it is needed. It reshapes the data along the way. It coordinates the timing so the consumer sees a consistent picture rather than a constantly shifting one. Anything that closes those gaps is, in some sense, a pipeline. A nightly bash script that copies a CSV file from one server to another is a tiny pipeline. A platform like Airbnb's that runs tens of thousands of orchestrated jobs is a very large pipeline. The mechanics scale; the underlying job does not change.
The defining property of a pipeline:
▸Data flows in one direction from sources to consumers
▸Each step transforms or moves the data toward the consumer's needed shape
▸The pipeline runs on a schedule or in response to events, not on every individual request
Without a Pipeline
Consider a startup with no pipeline. The CEO asks for revenue by region. An engineer SSHes into the production database and runs a SELECT. The query takes the production database to its knees, the app slows to a crawl, and customers notice. The next time the question comes up, the engineer extracts the data manually to a spreadsheet, where it becomes stale within hours. The third time, someone writes a script that runs at 3am. The script is a pipeline, even if nobody calls it that yet.
•Without a Pipeline
Analytics queries hit production and slow the app
Every report is a one-off manual extract
Data is stale by the time it is read
No one knows which numbers are authoritative
✓With a Pipeline
Production stays fast; analytical work runs on a copy
Reports run on demand from a prepared dataset
Freshness is explicit and known (last hour, last day)
One pipeline produces the canonical numbers; debates end
The Smallest Possible Pipeline
    # A real pipeline in three lines of bash:
    pg_dump --table=orders production_db > orders.sql
    psql analytics_db < orders.sql
    echo "Extract complete at $(date)" >> /var/log/pipeline.log
This script does the three things a pipeline does. It reads from a source (production_db). It moves data toward a destination (analytics_db). It records that the run happened. It is missing scheduling, error handling, transformation, and any concept of incremental updates, but it is a pipeline. Everything in the rest of this lesson, and everything in the rest of this curriculum, is a refinement of what those three lines are trying to do.
If a question can be answered by a single SQL query against the existing database, a pipeline is overkill. Pipelines exist because most real questions cannot.
A pipeline closes the gap between where data is produced and where it is needed.
Three gaps drive the need: location, shape, and time.
Every script that copies and reshapes data on a schedule is a pipeline, no matter how small.
TIP
Before building anything, name the three gaps for the specific problem at hand. If only one of the three gaps exists, a pipeline may be more machinery than the problem requires.
The Four Roles in Any Pipeline
Identify the four roles in any pipeline diagram: source, transform, storage, consumer.
Every pipeline, no matter how complex, can be described in terms of four roles. A source produces data. A transform reshapes it. Storage holds it for later. A consumer reads it for some purpose. Real pipelines often have many of each, chained together, but the roles do not change. Naming the four roles is the single most useful skill a new data engineer can develop, because once they are named, every architecture diagram becomes legible.
Role 1: Source
A source is wherever data originates. It is the system that produced the data in the first place. The source is upstream of everything else. A pipeline does not own its sources; it consumes from them. This distinction matters because the source can change without warning, and the pipeline must absorb that change. Common sources include operational databases (Postgres, MySQL), event streams (Kafka, Kinesis), third-party APIs (Stripe, Salesforce), application logs (CloudWatch, Datadog), and file drops (an FTP server, an S3 bucket where a partner deposits CSVs).
Source Type | What It Looks Like | Typical Cadence
Operational database | Tables in Postgres, MySQL, or DynamoDB the app writes to | Continuous writes; pipeline pulls every N minutes
Event stream | Kafka topic, Kinesis stream, Pub/Sub topic of individual events | Continuous; pipeline consumes as events arrive
Third-party API | REST or GraphQL endpoint owned by a vendor | Pipeline polls on a schedule, respects rate limits
File drop | A directory or bucket where a partner deposits CSV, JSON, or Parquet | Hourly, daily, or whenever the partner uploads
Role 2: Transform
A transform takes data in one shape and produces data in another. The work spans a wide range. Cleaning a phone number into a standard format is a transform. Joining two tables to produce a denormalized fact table is a transform. Aggregating a billion events into a thousand daily summaries is a transform. The defining feature of a transform is that the output shape differs from the input shape. Transforms can be written in SQL, Python, Spark, dbt, or anything else that can read data and write data. The language is a tool choice; the role does not change.
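The three kinds of transform named above can be sketched in a few lines of plain Python. This is an illustration only: the function names, the sample rows, and the assumption that phone numbers are 10-digit US numbers are all hypothetical and not part of the lesson's own material.

```python
import re
from collections import Counter

def clean_phone(raw: str) -> str:
    """Cleaning: strip punctuation and return a 10-digit US number, or '' if invalid."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the US country code
    return digits if len(digits) == 10 else ""

def join_orders_to_users(orders, users_by_id):
    """Joining: attach each order's user country to produce one wide row per order."""
    return [
        {**order, "country": users_by_id[order["user_id"]]["country"]}
        for order in orders
        if order["user_id"] in users_by_id
    ]

def daily_counts(events):
    """Aggregating: collapse individual events into one count per day."""
    return Counter(event["day"] for event in events)

# Hypothetical sample data.
users = {1: {"country": "US"}, 2: {"country": "DE"}}
orders = [{"order_id": 10, "user_id": 1}, {"order_id": 11, "user_id": 2}]
events = [{"day": "2024-01-01"}, {"day": "2024-01-01"}, {"day": "2024-01-02"}]

wide_orders = join_orders_to_users(orders, users)
counts = daily_counts(events)
```

In each case the output shape differs from the input shape: a messy string becomes a standard one, narrow rows become wide rows, and many events become a few totals.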
Role 3: Storage
Storage is the layer that holds data between steps. It is the resting place. Storage is what makes pipelines durable: if the next step fails, the data does not have to be re-fetched from the source, because it is sitting safely in storage. Common storage layers include data warehouses (Snowflake, BigQuery, Redshift), data lakes (S3, GCS, ADLS), and operational databases when used as a destination rather than a source. Storage and source can be the same physical system in different roles. A Postgres database is a source for the pipeline that pulls from it and a storage layer for the pipeline that writes to it.
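The durability point can be made concrete with a small sketch. It assumes a hypothetical source function and uses a plain dictionary as the storage layer; a real pipeline would use a warehouse or object store. The raw rows are landed once, the first transform attempt fails, and the retry reads from storage instead of going back to the source.

```python
fetch_calls = 0

def fetch_from_source():
    """Stand-in for an expensive or rate-limited source read (hypothetical)."""
    global fetch_calls
    fetch_calls += 1
    return [{"amount": "10"}, {"amount": "oops"}, {"amount": "5"}]

def strict_transform(rows):
    """Fails on bad input, like a transform that meets an unexpected value."""
    return sum(int(row["amount"]) for row in rows)

def tolerant_transform(rows):
    """The fixed transform: skips values that are not plain integers."""
    return sum(int(row["amount"]) for row in rows if row["amount"].isdigit())

storage = {}                                   # hypothetical storage layer
storage["raw/payments"] = fetch_from_source()  # land the raw data first

try:
    total = strict_transform(storage["raw/payments"])
except ValueError:
    # Retry with the fixed transform. The raw rows come from storage,
    # so the source is not contacted a second time.
    total = tolerant_transform(storage["raw/payments"])
```

After the retry, `fetch_calls` is still 1: the failure cost a second transform run, not a second extract.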
Role 4: Consumer
A consumer is anything downstream that reads the prepared data and uses it. Consumers include dashboards (Looker, Tableau, Mode), machine learning training jobs, reverse-ETL tools (which push curated data back into operational systems like Salesforce or HubSpot), internal applications that show data to users, and humans running ad-hoc SQL queries. The consumer is the reason the pipeline exists. A pipeline with no consumer is, by definition, dead code. Designing the consumer-facing shape first and working backward is the more common pattern; designing the source-facing shape first and hoping it works for consumers is the more common mistake.
Source → Transform → Storage → Consumer
Source
Where data is produced
Owned by another team or vendor. Cannot be controlled, only consumed. Postgres, Kafka, Stripe API, S3 file drops.
Transform
Where data is reshaped
The work the pipeline does. SQL, Python, Spark, dbt. Cleaning, joining, aggregating, deduplicating.
Storage
Where data rests
Durable layer between steps. Snowflake, BigQuery, S3. Survives transform failures so retries do not re-fetch.
Consumer
Why the pipeline exists
Dashboards, ML jobs, reverse-ETL, applications. The shape consumers need drives the design.
All Four Roles in One Sentence
A pipeline reads from one or more sources, applies one or more transforms, lands the result in storage, and serves consumers. That sentence describes a script someone wrote in 2008 and a modern lakehouse running on Databricks. The roles are invariant; only the tools change.
Stripe API (source) → transform → storage → consumer
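The one-sentence description maps directly onto code. The sketch below is a toy under stated assumptions: the records are hard-coded, the `Storage` class is an in-memory dictionary, and every name is hypothetical. What it shows is that each role is a separate piece with a single job.

```python
def source():
    """Source: yields raw records (hard-coded here; a real one is an API or database)."""
    yield {"customer": "a", "amount_cents": 1200}
    yield {"customer": "b", "amount_cents": 800}
    yield {"customer": "a", "amount_cents": 300}

def transform(records):
    """Transform: total the amount per customer, converting cents to dollars."""
    totals = {}
    for record in records:
        customer = record["customer"]
        totals[customer] = totals.get(customer, 0) + record["amount_cents"]
    return {customer: cents / 100 for customer, cents in totals.items()}

class Storage:
    """Storage: holds the prepared dataset between the transform and the consumer."""

    def __init__(self):
        self.tables = {}

    def write(self, name, data):
        self.tables[name] = data

    def read(self, name):
        return self.tables[name]

def consumer(storage):
    """Consumer: reads the prepared table and formats it for a person."""
    table = storage.read("revenue_by_customer")
    return [f"{customer}: ${total:.2f}" for customer, total in sorted(table.items())]

storage = Storage()
storage.write("revenue_by_customer", transform(source()))
report = consumer(storage)
```

Swapping the hard-coded source for an API client, or the dictionary for a warehouse table, changes the tools but leaves the four roles where they are.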
✓Do
Name the four roles for any pipeline before adding detail
Treat sources as untrusted: their schema and timing can change
Pick the storage layer based on how the consumer will read it
✗Don't
Confuse storage with source (the same Postgres table can be either, depending on the pipeline's role)
Skip storage between transforms in long pipelines (failures lose work)
Build pipelines without a named consumer in mind
Reading a Pipeline Left to Right
Read a pipeline diagram, name the role of every box, and trace the direction of data flow.
Architecture diagrams are the lingua franca of data engineering. Reading one fluently is more useful than knowing any specific tool. The convention is left-to-right, sources on the left, consumers on the right, with arrows showing the direction data flows. The arrows are not optional decoration; they encode the most important fact about the system, which is which way data moves.
The Reading Convention
Diagram Element | What It Means | What It Does Not Mean
Box on the left | A source: data originates here | Not necessarily a database; could be an API or file drop
Box in the middle | A transform or a storage layer (or both, in modern lakehouses) | Order matters; left-to-right is the temporal sequence
Box on the right | A consumer: someone or something reads the data here | Not always a dashboard; can be an ML pipeline or reverse-ETL
Arrow from A to B | Data flows from A to B | Not bidirectional; pipelines have direction
Dashed arrow | Often a control dependency, not a data flow | B waits for A, but data may not actually transfer between them
A Real Diagram, Read Out Loud
Read top to bottom or left to right; both work. Spoken aloud: 'A daily job extracts from Postgres orders, lands raw files in S3, dbt transforms those files into a fact_orders table in Snowflake, and Looker reads from fact_orders.' One sentence, five named components, one direction of flow. That description is enough to ask intelligent questions about the system: how often does the daily job run, what does dbt do to the raw files, what is the freshness SLA on fact_orders, who owns the dashboard.
What the Arrows Hide
Diagrams are abstractions. They hide a lot. An arrow from Postgres to a daily extract job hides the question of how the job authenticates, whether it pulls all rows or only changed rows, what happens if Postgres is down at the moment the job starts. An arrow from S3 to dbt hides which files are read, in what order, and what happens to files dbt has already processed. None of this nuance is missing because the diagram is bad. It is missing because a diagram that included it would be illegible. The skill is knowing which questions to ask once a diagram has oriented the reader.
Questions to ask of any pipeline diagram:
▸How often does each step run? Continuous, hourly, daily?
▸What happens if a source is unavailable when the pipeline tries to read?
▸Where is data durable, and where is it in flight?
▸Who is the consumer at the end, and what is their freshness expectation?
▸What runs first, what runs after, and how does the system know?
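One way a system can answer the last question is to record which step depends on which and derive the run order from that. The sketch below uses Python's standard-library `graphlib` (Python 3.9+). The step names are hypothetical labels for the Postgres, S3, dbt, and Looker example and do not come from any real orchestrator.

```python
from graphlib import TopologicalSorter

# Each key lists the steps it depends on (the steps whose arrows point at it).
depends_on = {
    "extract_postgres_orders": set(),
    "land_raw_files_in_s3": {"extract_postgres_orders"},
    "dbt_build_fact_orders": {"land_raw_files_in_s3"},
    "looker_dashboard": {"dbt_build_fact_orders"},
}

# static_order() yields every step after all of the steps it depends on.
run_order = list(TopologicalSorter(depends_on).static_order())
```

Orchestration tools keep a dependency graph of this kind, which is how a scheduler knows that the dbt step must wait for the raw files to land.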
Branching and Joining
Real pipelines branch and join. A source can feed multiple consumers; a single dataset can be assembled from multiple sources. Branching shows up as one box with arrows leaving to several destinations. Joining shows up as several boxes with arrows arriving at the same destination. Both are common and both are legible if the convention is followed.
    Stripe payments \                       / dashboard
                     >  joined fact table  <
    Salesforce CRM  /                       \ dashboard
Two sources join into one fact table; that fact table branches to two consumers. The pipeline has one transform in the middle and four endpoints at the edges. Reading left to right tells the whole story. Stripe and Salesforce are sources, the joined fact table is both a transform and a storage layer, and the two dashboards are consumers.
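The same reading can be done mechanically. In the sketch below the diagram is a list of directed edges, and the role of each box falls out of counting arrows in and arrows out. The node names, including the two dashboard names, are hypothetical.

```python
from collections import defaultdict

# Directed edges (from, to); the dashboard names are made up for illustration.
edges = [
    ("stripe_payments", "fact_table"),
    ("salesforce_crm", "fact_table"),
    ("fact_table", "finance_dashboard"),
    ("fact_table", "sales_dashboard"),
]

incoming = defaultdict(set)
outgoing = defaultdict(set)
for upstream, downstream in edges:
    outgoing[upstream].add(downstream)
    incoming[downstream].add(upstream)

nodes = set(incoming) | set(outgoing)
sources = sorted(n for n in nodes if not incoming[n])     # nothing flows in
consumers = sorted(n for n in nodes if not outgoing[n])   # nothing flows out
joins = sorted(n for n in nodes if len(incoming[n]) > 1)  # several arrows arrive
branches = sorted(n for n in nodes if len(outgoing[n]) > 1)  # several arrows leave
```

Here `fact_table` appears in both `joins` and `branches`, which matches the description of the middle box above.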
TIP
When a pipeline diagram is hard to read, draw it again with strict left-to-right flow and rename every box with the role it plays. Most architecture confusion is diagram confusion.
A First End-to-End Pipeline
Walk through a one-source, one-transform, one-destination pipeline end to end and describe what each step produces.
Vocabulary becomes useful when applied to a concrete case. Take a small subscription product that wants a daily report of new signups by country. The data exists. The app records every signup to a Postgres table. The marketing team wants a chart on Monday morning showing last week's daily numbers, broken out by country. There is no pipeline. The work below builds one, end to end, with each role visible.
Step 1: Identify the Source
The source is the Postgres signups table. It has many columns; the pipeline needs only three: signup_timestamp, country_code, and user_id. The pipeline must not query Postgres at peak traffic, so it runs at 2am Pacific when load is lowest. It must not download the whole table every day, so it pulls only signups since the last successful run. That last constraint introduces the idea of a high-water mark: a single saved value (typically the last successful run's max signup_timestamp) that lets the next run pick up where this one left off. The pattern appears in nearly every pipeline that pulls from a database.
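A minimal sketch of the high-water mark pattern follows. It assumes an in-memory SQLite database standing in for the production Postgres table, ISO-8601 timestamp strings (which sort correctly as text), and a hypothetical function name; where the mark is saved between runs is left out.

```python
import sqlite3

def extract_new_signups(conn, high_water_mark):
    """Return rows newer than the mark, plus the mark to save for the next run."""
    rows = conn.execute(
        "SELECT signup_timestamp, country_code, user_id FROM signups "
        "WHERE signup_timestamp > ? ORDER BY signup_timestamp",
        (high_water_mark,),
    ).fetchall()
    # Advance the mark only if something was pulled.
    new_mark = rows[-1][0] if rows else high_water_mark
    return rows, new_mark

# In-memory SQLite stands in for the production Postgres table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE signups (signup_timestamp TEXT, country_code TEXT, user_id INTEGER)"
)
conn.executemany(
    "INSERT INTO signups VALUES (?, ?, ?)",
    [("2024-01-01T09:00:00", "US", 1), ("2024-01-01T17:30:00", "DE", 2)],
)

# First run: the mark starts far in the past, so everything is pulled.
first_rows, mark = extract_new_signups(conn, "1970-01-01T00:00:00")

# A new signup arrives; the second run pulls only that row.
conn.execute("INSERT INTO signups VALUES ('2024-01-02T08:15:00', 'FR', 3)")
second_rows, mark = extract_new_signups(conn, mark)
```

A strict `>` comparison can miss a row that shares the mark's exact timestamp but is committed after the run. A common mitigation is to re-read a small overlap window and deduplicate on `user_id`.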
Step 2: Land the Raw Data
The pulled rows do not go directly into the dashboard. They go into a raw storage layer first, in this case an S3 bucket organized by date. A file written today contains today's signups; a file written tomorrow contains tomorrow's. This pattern is called partitioning by ingestion date, and it makes everything that comes later easier. If a transform breaks, the raw data is still safe. If the pipeline needs to be re-run for last Tuesday, the file for last Tuesday is right there.
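Partitioning by ingestion date is mostly a naming convention for where each run writes its file. The sketch below assumes a local temporary directory standing in for the S3 bucket, and a hypothetical `ingestion_date=YYYY-MM-DD` folder layout; the lesson does not prescribe a specific path scheme.

```python
import csv
import tempfile
from datetime import date
from pathlib import Path

def land_raw(rows, root: Path, ingestion_date: date) -> Path:
    """Write one file per run under root/signups/ingestion_date=YYYY-MM-DD/."""
    partition = root / "signups" / f"ingestion_date={ingestion_date.isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    path = partition / "signups.csv"
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["signup_timestamp", "country_code", "user_id"])
        writer.writerows(rows)
    return path

raw_root = Path(tempfile.mkdtemp())  # a local directory stands in for the bucket
tuesday = land_raw([("2024-01-02T08:15:00", "FR", 3)], raw_root, date(2024, 1, 2))
wednesday = land_raw([("2024-01-03T10:00:00", "US", 4)], raw_root, date(2024, 1, 3))
```

Re-running the transform for one day then means reading exactly one folder, without touching the source or any other day's data.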
Step 3: Transform
The transform takes the raw rows and produces the daily aggregate the dashboard needs. It groups by date and country and counts. It filters out test accounts. It joins to a country dimension table to translate ISO codes into human-readable names. The output is a small table, possibly only a few hundred rows per day, that the dashboard can query instantly.
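The transform described above fits in a single SQL statement. The sketch runs it against an in-memory SQLite database standing in for the warehouse. The table names, the `is_test` flag used to mark test accounts, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for the warehouse
conn.executescript("""
CREATE TABLE raw_signups (
    signup_timestamp TEXT, country_code TEXT, user_id INTEGER, is_test INTEGER
);
CREATE TABLE dim_country (country_code TEXT PRIMARY KEY, country_name TEXT);
INSERT INTO raw_signups VALUES
    ('2024-01-01T09:00:00', 'US', 1, 0),
    ('2024-01-01T17:30:00', 'US', 2, 0),
    ('2024-01-01T18:00:00', 'DE', 3, 0),
    ('2024-01-01T19:00:00', 'DE', 4, 1),
    ('2024-01-02T08:15:00', 'FR', 5, 0);
INSERT INTO dim_country VALUES
    ('US', 'United States'), ('DE', 'Germany'), ('FR', 'France');
""")

# Filter test accounts, join to the dimension, group by date and country, count.
daily_signups_by_country = conn.execute("""
SELECT date(s.signup_timestamp) AS signup_date,
       c.country_name,
       COUNT(*) AS signups
FROM raw_signups AS s
JOIN dim_country AS c ON c.country_code = s.country_code
WHERE s.is_test = 0
GROUP BY signup_date, c.country_name
ORDER BY signup_date, c.country_name
""").fetchall()
```

Five raw rows become three summary rows, and the test account in Germany is excluded from the count.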
Step 4: Serve the Consumer
The dashboard reads from analytics.daily_signups_by_country. It does not read from raw.signups, and it does not read from Postgres. The consumer-facing layer is small, fast, and shaped exactly the way the dashboard wants it. Any change to the dashboard's needs becomes a change to the transform; any change to the upstream Postgres schema becomes a change to the extract. The two changes are decoupled because the raw zone sits between them. That decoupling is most of the value of having a pipeline at all.
Five boxes, four arrows. One direction. The marketing team gets the chart they wanted, the production database is not affected, and the pipeline runs unattended every night. This is the smallest example that shows every role doing real work. Every more complex pipeline in the rest of this curriculum is an elaboration of one or more of these four steps.
Step | Role | What It Produces
1. Extract from Postgres | Source consumption | Raw rows for the day
2. Land in S3 | Storage (raw zone) | Durable file partitioned by date
3. Aggregate in Snowflake | Transform + storage (curated) | Daily summary table
4. Looker reads | Consumer | The chart the marketing team wanted
Every pipeline can be described in four steps: extract, land, transform, serve.
The raw zone decouples upstream changes from downstream consumers.
The high-water mark is the small piece of state that turns 'pull everything' into 'pull what changed'.
When a Pipeline Is Not Needed
Decide whether a problem warrants a pipeline or whether a query, replica, or cache solves it more cheaply.
Building a pipeline is engineering work. It carries cost: the code itself, the orchestration that runs it, the storage it consumes, the alerts that fire when it fails, the on-call rotation that responds to those alerts. Engineers reach for pipelines reflexively, but a pipeline is the wrong answer to many problems. Knowing when to skip the pipeline is a more senior skill than knowing how to build one.
Three Cases Where a Direct Query Is Better
Situation | Why a Pipeline Is Overkill | What to Do Instead
One-time question | The cost of building exceeds the value of the answer | Run a SQL query, save the result to a doc, move on
Tiny dataset, infrequent reads | The data fits in a spreadsheet and changes once a quarter | Use a Google Sheet or a static CSV in version control
Dataset already shaped for the consumer | The source already produces what the consumer needs | Point the consumer at the source directly
When the Read Replica Is the Right Answer
Many companies need only one thing from analytics: someone to query the production data without slowing down the app. The right answer here is often not a pipeline. It is a read replica, a copy of the production database that absorbs read traffic. Read replicas are a database feature, not a data engineering build. They keep the data in its operational shape, which is unfriendly for analytics, but for a small company with simple needs they suffice for years. The signal that a read replica is no longer enough is when the operational schema becomes painful for analysts. At that point the pipeline pays for itself.
When an Application Should Just Cache
Sometimes the apparent need for a pipeline is really a need for caching. A user-facing feature that displays 'top 10 most popular products in the last hour' does not require a pipeline. It requires a Redis key that the application updates as products are viewed. The trap is reaching for the data engineering toolkit when the product engineering toolkit is closer to hand. The clue: if the consumer is the application itself, not a dashboard or a model, the answer is more often caching, materialized views, or a denormalized read model than a pipeline.
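To show how little machinery the caching answer needs, here is an in-process sketch of a 'most viewed in the last hour' counter. It is a stand-in for illustration and not a Redis client: the class name is hypothetical, state lives in one Python process, and views are assumed to be recorded in time order. A real deployment would keep this state in a shared cache.

```python
import time
from collections import Counter, deque

class RecentTopProducts:
    """In-process stand-in for a cache entry the application updates on each view."""

    def __init__(self, window_seconds=3600):
        self.window_seconds = window_seconds
        self.views = deque()      # (timestamp, product_id), oldest first
        self.counts = Counter()   # views per product inside the window

    def record_view(self, product_id, now=None):
        now = time.time() if now is None else now
        self.views.append((now, product_id))
        self.counts[product_id] += 1

    def top(self, n=10, now=None):
        now = time.time() if now is None else now
        # Drop views that have aged out of the window before ranking.
        while self.views and self.views[0][0] <= now - self.window_seconds:
            _, expired = self.views.popleft()
            self.counts[expired] -= 1
            if self.counts[expired] == 0:
                del self.counts[expired]
        return [product for product, _ in self.counts.most_common(n)]

cache = RecentTopProducts()
cache.record_view("mug", now=0)
cache.record_view("lamp", now=1800)
cache.record_view("lamp", now=2000)
```

The application writes and reads this structure directly on the request path. There is no extract, no raw zone, and no schedule, which is the sign that the problem is a caching problem.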
✓Build a Pipeline
Multiple consumers will read the same prepared dataset
The transform logic is non-trivial and changes over time
Source data is large enough that ad-hoc queries hurt production
Freshness, lineage, and monitoring matter to the business
•Skip the Pipeline
A single one-off question that may never be asked again
A read replica solves the slowness problem alone
The dataset is small and changes only when a human edits it
The consumer is the application, and a cache fits the access pattern
The Pipeline Test
A simple test names the situations where a pipeline earns its cost. If the answer to all four questions below is yes, build a pipeline. If any answer is no, consider a simpler tool first. The test is not perfect, but it filters out most false starts.
The four-question pipeline test:
▸Will the same prepared data be read more than once?
▸Does the transform involve more than a single SELECT?
▸Is the consumer separate from the source (different system, different team, different freshness)?
▸Will the work need to be re-run on a schedule?
Four yeses earn a pipeline. Three yeses earn a discussion. Fewer than three is a sign that the work belongs somewhere else: a dashboard query, a notebook, a cache, or a one-time export. Engineers who answer 'yes, always, every time' to all four questions out of habit end up maintaining many small pipelines that should never have been built. The cost shows up later, in operations and in attention, not in the day the pipeline was written.
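The scoring rule is simple enough to write down as a function. The function name and the wording of the returned verdicts are hypothetical; the thresholds are the ones stated above.

```python
def pipeline_verdict(read_more_than_once, transform_beyond_single_select,
                     consumer_separate_from_source, rerun_on_schedule):
    """Count the yes answers and map them to the three outcomes described above."""
    yeses = sum([
        read_more_than_once,
        transform_beyond_single_select,
        consumer_separate_from_source,
        rerun_on_schedule,
    ])
    if yeses == 4:
        return "build a pipeline"
    if yeses == 3:
        return "discuss"
    return "use a simpler tool"

# A recurring dashboard versus a question asked once.
retention_dashboard = pipeline_verdict(True, True, True, True)
one_off_question = pipeline_verdict(False, False, True, False)
```

The value of writing it down is that each answer has to be given explicitly, which makes it harder to say yes to all four out of habit.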
✓Do
Apply the four-question test before reaching for orchestration tools
Use read replicas for analytics on small datasets with simple needs
Prefer caches and materialized views when the consumer is the application
✗Don't
Build a pipeline for a question that has been asked exactly once
Reach for Airflow when a daily SQL query in a scheduled job suffices
Confuse 'we need data faster' with 'we need a pipeline'; sometimes the right fix is upstream
PUTTING IT ALL TOGETHER
> A startup CTO has just hired their first data engineer. The CTO says: 'We have Postgres for the app, Stripe for payments, and Zendesk for support tickets. The product team wants a weekly retention dashboard, the finance team wants monthly revenue by plan, and customer support wants to know which users opened tickets in their first week. Where do we start?'
The four roles are visible immediately. Sources are Postgres, Stripe, and Zendesk. Consumers are three dashboards, each owned by a different team. The work in the middle is the pipeline.
A raw zone in S3 sits between sources and transforms so that schema changes in any of the three sources do not break consumer dashboards directly. The raw zone is the decoupling layer.
The four-question test passes for all three dashboards: the same data will be read repeatedly, transforms are non-trivial joins, consumers are separate from sources, and the work runs on a schedule. A pipeline earns its cost here.
The smallest first build is one source, one transform, one destination, one consumer. Pick the highest-value dashboard, build that pipeline end to end, then extend. Avoid building three pipelines in parallel before any one of them works.
KEY TAKEAWAYS
Pipelines exist to close three gaps: location, shape, and time. If only one gap exists, a simpler tool may be the right answer.
Every pipeline contains four roles: source, transform, storage, consumer. Naming them turns any architecture diagram into something legible.
Read diagrams left to right: data flows in one direction, sources on the left, consumers on the right. Arrows are the most important element.
A raw zone decouples upstream from downstream: raw data lands first, then transforms run. Source schema changes do not directly break consumer dashboards.
Not every problem needs a pipeline: the four-question test (repeated reads, non-trivial transform, decoupled consumer, scheduled cadence) filters out false starts.
Data lives where it is created, not where it is needed; pipelines move and reshape it
Category: Pipeline Architecture
Difficulty: beginner
Duration: 25 minutes
Challenges: 0 hands-on challenges
Topics covered: Why Pipelines Exist, The Four Roles in Any Pipeline, Reading a Pipeline Left to Right, A First End-to-End Pipeline, When a Pipeline Is Not Needed