What a Data Pipeline Is: Beginner

A subscription company at Series B scale collected three streams of data: app events from a mobile SDK, payment events from Stripe, and customer support tickets from Zendesk. The product manager wanted one chart: weekly active subscribers who had opened a support ticket in the last 30 days. The data existed. None of it lived in one place. App events were in a Kafka topic, payments were in Stripe's API behind paginated rate limits, and tickets were in a Postgres replica nobody owned. The chart that should have taken an afternoon took six weeks. The reason was that no pipeline existed. Building one is not glamorous work. It is the work that makes every chart, model, and decision downstream possible. This lesson is the picture of what a pipeline is before any tool, framework, or vocabulary gets in the way.

Why Pipelines Exist

Recognize the structural reason pipelines exist and name the three gaps a pipeline closes.

Every company that runs software produces data in one shape and needs it in a different shape, in a different place, on a different schedule. That gap is the entire reason data engineering exists. The gap is not a bug. It is structural. Operational systems are built to handle one user at a time, fast, with strict consistency. Analytical systems are built to scan billions of rows per query, accepting higher latency on each query in exchange for throughput, with relaxed consistency. The two are different machines optimized for different jobs.

Three Gaps That Force a Pipeline

Gap | What It Means | Concrete Example
Location | Data is created in one system, needed in another | App writes to Postgres; analyst reads from Snowflake
Shape | The shape that is fast for the writer is slow for the reader | Normalized rows for transactions; wide denormalized columns for dashboards
Time | Data is produced continuously; reports want a daily or hourly snapshot | Click events stream all day; the marketing team wants a 9am summary

A pipeline closes all three gaps at once. It moves data from where it lives to where it is needed. It reshapes the data along the way. It coordinates the timing so the consumer sees a consistent picture rather than a constantly shifting one. Anything that closes those gaps is, in some sense, a pipeline. A nightly bash script that copies a CSV file from one server to another is a tiny pipeline. A platform like Airbnb's that runs tens of thousands of orchestrated jobs is a very large pipeline. The mechanics scale; the underlying job does not change.
The defining property of a pipeline:
  • Data flows in one direction from sources to consumers
  • Each step transforms or moves the data toward the consumer's needed shape
  • The pipeline runs on a schedule or in response to events, not on every individual request

Without a Pipeline

Consider a startup with no pipeline. The CEO asks for revenue by region. An engineer SSHes into the production database and runs a SELECT. The query brings the production database to its knees, the app slows to a crawl, and customers notice. The next time the question comes up, the engineer extracts the data manually to a spreadsheet, where it becomes stale within hours. The third time, someone writes a script that runs at 3am. The script is a pipeline, even if nobody calls it that yet.
Without a Pipeline
  • Analytics queries hit production and slow the app
  • Every report is a one-off manual extract
  • Data is stale by the time it is read
  • No one knows which numbers are authoritative
With a Pipeline
  • Production stays fast; analytical work runs on a copy
  • Reports run on demand from a prepared dataset
  • Freshness is explicit and known (last hour, last day)
  • One pipeline produces the canonical numbers; debates end

The Smallest Possible Pipeline

    # A real pipeline in three lines of bash:
    pg_dump -t orders production_db > orders.sql
    psql analytics_db < orders.sql
    echo "Extract complete at $(date)" >> /var/log/pipeline.log
This script does the three things a pipeline does. It reads from a source (production_db). It moves data toward a destination (analytics_db). It records that the run happened. It is missing scheduling, error handling, transformation, and any concept of incremental updates, but it is a pipeline. Everything in the rest of this lesson, and everything in the rest of this curriculum, is a refinement of what those three lines are trying to do.
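The same three actions (read from a source, write toward a destination, record the run) can be sketched in Python. This is a minimal illustration rather than a replacement for the bash script. It assumes in-memory SQLite databases standing in for production_db and analytics_db, and a hypothetical orders table with id and amount columns.

```python
import sqlite3
from datetime import datetime, timezone


def copy_orders(source, dest, log):
    """Copy every row of `orders` from source to dest and record the run."""
    rows = source.execute("SELECT id, amount FROM orders").fetchall()  # read from the source
    with dest:  # one transaction: either all rows land or none do
        dest.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, amount REAL)")
        dest.execute("DELETE FROM orders")  # full refresh, like reloading a dump
        dest.executemany("INSERT INTO orders (id, amount) VALUES (?, ?)", rows)
    stamp = datetime.now(timezone.utc).isoformat()
    log.append(f"Extract complete at {stamp} ({len(rows)} rows)")  # record the run
    return len(rows)


# Hypothetical stand-ins for production_db and analytics_db.
production_db = sqlite3.connect(":memory:")
production_db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
production_db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.99), (2, 24.50)])
analytics_db = sqlite3.connect(":memory:")
run_log = []

copied = copy_orders(production_db, analytics_db, run_log)
```

Wrapping the write in a transaction is one small step toward the error handling the bash version lacks: a failure partway through leaves the destination as it was.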

If a question can be answered by a single SQL query against the existing database, a pipeline is overkill. Pipelines exist because most real questions cannot.

  • A pipeline closes the gap between where data is produced and where it is needed.
  • Three gaps drive the need: location, shape, and time.
  • Every script that copies and reshapes data on a schedule is a pipeline, no matter how small.
TIP
Before building anything, name the three gaps for the specific problem at hand. If only one of the three gaps exists, a pipeline may be more machinery than the problem requires.

The Four Roles in Any Pipeline

Identify the four roles in any pipeline diagram: source, transform, storage, consumer.

Every pipeline, no matter how complex, can be described in terms of four roles. A source produces data. A transform reshapes it. Storage holds it for later. A consumer reads it for some purpose. Real pipelines often have many of each, chained together, but the roles do not change. Naming the four roles is the single most useful skill a new data engineer can develop, because once they are named, every architecture diagram becomes legible.

Role 1: Source

A source is wherever data originates. It is the system that produced the data in the first place. The source is upstream of everything else. A pipeline does not own its sources; it consumes from them. This distinction matters because the source can change without warning, and the pipeline must absorb that change. Common sources include operational databases (Postgres, MySQL), event streams (Kafka, Kinesis), third-party APIs (Stripe, Salesforce), application logs (CloudWatch, Datadog), and file drops (an FTP server, an S3 bucket where a partner deposits CSVs).
Source Type | What It Looks Like | Typical Cadence
Operational database | Tables in Postgres, MySQL, or DynamoDB the app writes to | Continuous writes; pipeline pulls every N minutes
Event stream | Kafka topic, Kinesis stream, Pub/Sub topic of individual events | Continuous; pipeline consumes as events arrive
Third-party API | REST or GraphQL endpoint owned by a vendor | Pipeline polls on a schedule, respects rate limits
File drop | A directory or bucket where a partner deposits CSV, JSON, or Parquet | Hourly, daily, or whenever the partner uploads
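
Because a source can change without warning, one common defensive step is to check each incoming record against the fields the pipeline depends on. The sketch below is an illustration under assumptions: the field names and types in EXPECTED_FIELDS are hypothetical, and a real pipeline might use a schema library instead of hand-written checks.

```python
# Fields this hypothetical pipeline depends on, with the types it expects.
EXPECTED_FIELDS = {"user_id": int, "country_code": str, "signup_timestamp": str}


def check_record(record):
    """Return a list of problems; an empty list means the record is usable."""
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    return problems


def split_batch(records):
    """Separate usable records from ones to set aside for inspection."""
    good, quarantined = [], []
    for record in records:
        (quarantined if check_record(record) else good).append(record)
    return good, quarantined
```

Extra fields pass through untouched, so a source that adds a column does not break the run; a source that drops or retypes a required field is caught before the transform sees it.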

Role 2: Transform

A transform takes data in one shape and produces data in another. The work spans a wide range. Cleaning a phone number into a standard format is a transform. Joining two tables to produce a denormalized fact table is a transform. Aggregating a billion events into a thousand daily summaries is a transform. The defining feature of a transform is that the output shape differs from the input shape. Transforms can be written in SQL, Python, Spark, dbt, or anything else that can read data and write data. The language is a tool choice; the role does not change.
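
Two of the transforms named above, cleaning a phone number and aggregating events into daily summaries, can be sketched in a few lines of Python. The function names and the event layout (an ISO timestamp paired with a user id) are assumptions made for illustration.

```python
import re
from collections import Counter


def clean_phone(raw):
    """Keep digits only, so '(415) 555-0100' and '415.555.0100' compare equal."""
    return re.sub(r"\D", "", raw)


def daily_counts(events):
    """Aggregate (iso_timestamp, user_id) events into a count per calendar day."""
    # The first 10 characters of an ISO timestamp are the YYYY-MM-DD date.
    return Counter(timestamp[:10] for timestamp, _user_id in events)
```

In both cases the output shape differs from the input shape: free-form strings become a standard format, and many event rows become one row per day.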

Role 3: Storage

Storage is the layer that holds data between steps. It is the resting place. Storage is what makes pipelines durable: if the next step fails, the data does not have to be re-fetched from the source, because it is sitting safely in storage. Common storage layers include data warehouses (Snowflake, BigQuery, Redshift), data lakes (S3, GCS, ADLS), and operational databases when used as a destination rather than a source. Storage and source can be the same physical system in different roles. A Postgres database is a source for the pipeline that pulls from it and a storage layer for the pipeline that writes to it.
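
The durability argument can be shown with a small sketch: land the raw data once, let the transform fail, then retry without touching the source again. Everything here is hypothetical (the fetch function, the JSON file, the simulated failure); the point is only that the retry reads from storage.

```python
import json
import tempfile
from pathlib import Path

fetch_calls = 0


def fetch_from_source():
    """Hypothetical expensive extract; counts how often it is called."""
    global fetch_calls
    fetch_calls += 1
    return [{"user_id": 1, "country_code": "US"}, {"user_id": 2, "country_code": "DE"}]


def land_raw(path):
    """Fetch once and write the raw rows to durable storage."""
    path.write_text(json.dumps(fetch_from_source()))


def transform(path, fail=False):
    """Count rows per country, reading only from the landed file."""
    if fail:
        raise RuntimeError("simulated transform failure")
    counts = {}
    for row in json.loads(path.read_text()):
        counts[row["country_code"]] = counts.get(row["country_code"], 0) + 1
    return counts


raw_file = Path(tempfile.mkdtemp()) / "signups.json"
land_raw(raw_file)
try:
    transform(raw_file, fail=True)  # first attempt fails
except RuntimeError:
    pass
result = transform(raw_file)  # retry reads the landed file, not the source
```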

Role 4: Consumer

A consumer is anything downstream that reads the prepared data and uses it. Consumers include dashboards (Looker, Tableau, Mode), machine learning training jobs, reverse-ETL tools (which push curated data back into operational systems like Salesforce or HubSpot), internal applications that show data to users, and humans running ad-hoc SQL queries. The consumer is the reason the pipeline exists. A pipeline with no consumer is, by definition, dead code. Designing the consumer-facing shape first and working backward is the more reliable pattern; designing the source-facing shape first and hoping it works for consumers is the more common mistake.
  • Source (where data is produced): Owned by another team or vendor. Cannot be controlled, only consumed. Postgres, Kafka, Stripe API, S3 file drops.
  • Transform (where data is reshaped): The work the pipeline does. SQL, Python, Spark, dbt. Cleaning, joining, aggregating, deduplicating.
  • Storage (where data rests): Durable layer between steps. Snowflake, BigQuery, S3. Survives transform failures so retries do not re-fetch.
  • Consumer (why the pipeline exists): Dashboards, ML jobs, reverse-ETL, applications. The shape consumers need drives the design.

All Four Roles in One Sentence

A pipeline reads from one or more sources, applies one or more transforms, lands the result in storage, and serves consumers. That sentence describes a script someone wrote in 2008 and a modern lakehouse running on Databricks. The roles are invariant; only the tools change.
    Stripe API (source) -> transform -> storage -> consumer
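
The one-sentence description maps directly onto code. The sketch below gives each role its own function or object; the payment records, the dict used as storage, and the report format are all hypothetical stand-ins, not any vendor's API.

```python
def source():
    """Source: hypothetical payment events, standing in for an API response."""
    return [
        {"customer": "a", "amount_cents": 1200},
        {"customer": "b", "amount_cents": 800},
        {"customer": "a", "amount_cents": 500},
    ]


def transform(payments):
    """Transform: total revenue per customer, in dollars."""
    totals = {}
    for payment in payments:
        customer = payment["customer"]
        totals[customer] = totals.get(customer, 0) + payment["amount_cents"]
    return {customer: cents / 100 for customer, cents in totals.items()}


storage = {}  # Storage: a dict standing in for a warehouse table


def consumer(table):
    """Consumer: one report line per customer, sorted for stable output."""
    return [f"{customer}: ${total:.2f}" for customer, total in sorted(table.items())]


storage["revenue_by_customer"] = transform(source())
report = consumer(storage["revenue_by_customer"])
```

Swapping any one function for a real tool changes the implementation of that role and nothing about the structure.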
Do
  • Name the four roles for any pipeline before adding detail
  • Treat sources as untrusted: their schema and timing can change
  • Pick the storage layer based on how the consumer will read it
Don't
  • Confuse storage with source (the same Postgres table can be either, depending on the pipeline's role)
  • Skip storage between transforms in long pipelines (failures lose work)
  • Build pipelines without a named consumer in mind

Reading a Pipeline Left to Right

Read a pipeline diagram, name the role of every box, and trace the direction of data flow.

Architecture diagrams are the lingua franca of data engineering. Reading one fluently is more useful than knowing any specific tool. The convention is left-to-right, sources on the left, consumers on the right, with arrows showing the direction data flows. The arrows are not optional decoration; they encode the most important fact about the system, which is which way data moves.

The Reading Convention

Diagram Element | What It Means | What It Does Not Mean
Box on the left | A source: data originates here | Not necessarily a database; could be an API or file drop
Box in the middle | A transform or a storage layer (or both, in modern lakehouses) | Order matters; left-to-right is the temporal sequence
Box on the right | A consumer: someone or something reads the data here | Not always a dashboard; can be an ML pipeline or reverse-ETL
Arrow from A to B | Data flows from A to B | Not bidirectional; pipelines have direction
Dashed arrow | Often a control dependency, not a data flow | B waits for A, but data may not actually transfer between them

A Real Diagram, Read Out Loud

    Postgres orders
          |
          v
    Daily extract job
          |
          v
    dbt transform
          |
          v
    Snowflake fact_orders
          |
          v
    Looker dashboard
Read top to bottom or left to right; both work. Spoken aloud: 'A daily job extracts from Postgres orders, lands raw files in S3, dbt transforms those files into a fact_orders table in Snowflake, and Looker reads from fact_orders.' One sentence, five boxes, one direction of flow. The S3 landing step belongs to the extract job rather than appearing as a box of its own in the diagram. That description is enough to ask intelligent questions about the system: how often does the daily job run, what does dbt do to the raw files, what is the freshness SLA on fact_orders, who owns the dashboard.

What the Arrows Hide

Diagrams are abstractions. They hide a lot. An arrow from Postgres to a daily extract job hides the question of how the job authenticates, whether it pulls all rows or only changed rows, what happens if Postgres is down at the moment the job starts. An arrow from S3 to dbt hides which files are read, in what order, and what happens to files dbt has already processed. None of this nuance is missing because the diagram is bad. It is missing because a diagram that included it would be illegible. The skill is knowing which questions to ask once a diagram has oriented the reader.
Questions to ask of any pipeline diagram:
  • How often does each step run? Continuous, hourly, daily?
  • What happens if a source is unavailable when the pipeline tries to read?
  • Where is data durable, and where is it in flight?
  • Who is the consumer at the end, and what is their freshness expectation?
  • What runs first, what runs after, and how does the system know?

Branching and Joining

Real pipelines branch and join. A source can feed multiple consumers; a single dataset can be assembled from multiple sources. Branching shows up as one box with arrows leaving to several destinations. Joining shows up as several boxes with arrows arriving at the same destination. Both are common and both are legible if the convention is followed.
    Stripe payments  \                          /  [dashboard]
                      >-- [joined fact table] --<
    Salesforce CRM   /                          \  [dashboard]
Two sources join into one fact table; that fact table branches to two consumers. The pipeline has one transform in the middle and four endpoints at the edges. Reading left to right tells the whole story. Stripe and Salesforce are sources, the joined fact table is both a transform and a storage layer, and the two dashboards are consumers.
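
A branching and joining pipeline is a directed graph, and the reading rules can be checked mechanically. The sketch below uses Python's standard graphlib module; the node names are hypothetical labels for the two sources, the joined fact table, and the two dashboards described above.

```python
from graphlib import TopologicalSorter

# Each key reads from the nodes in its set. Names are hypothetical.
reads_from = {
    "fact_table": {"stripe_payments", "salesforce_crm"},
    "finance_dashboard": {"fact_table"},
    "sales_dashboard": {"fact_table"},
}

upstream_nodes = {up for ups in reads_from.values() for up in ups}
nodes = set(reads_from) | upstream_nodes

# Sources have nothing upstream; consumers have nothing downstream.
sources = sorted(n for n in nodes if not reads_from.get(n))
consumers = sorted(n for n in nodes if n not in upstream_nodes)

# A valid run order places every node after everything it reads from.
run_order = list(TopologicalSorter(reads_from).static_order())
```

The run order is what an orchestrator derives from the arrows: both sources first, the fact table next, the dashboards last.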
TIP
When a pipeline diagram is hard to read, draw it again with strict left-to-right flow and rename every box with the role it plays. Most architecture confusion is diagram confusion.

A First End-to-End Pipeline

Walk through a one-source, one-transform, one-destination pipeline end to end and describe what each step produces.

Vocabulary becomes useful when applied to a concrete case. Take a small subscription product that wants a daily report of new signups by country. The data exists. The app records every signup to a Postgres table. The marketing team wants a chart on Monday morning showing last week's daily numbers, broken out by country. There is no pipeline. The work below builds one, end to end, with each role visible.

Step 1: Identify the Source

The source is the Postgres signups table. It has many columns; the pipeline needs only three: signup_timestamp, country_code, and user_id. The pipeline must not query Postgres at peak traffic, so it runs at 2am Pacific when load is lowest. It must not download the whole table every day, so it pulls only signups since the last successful run. That last constraint introduces the idea of a high-water mark: a single saved value (typically the last successful run's max signup_timestamp) that lets the next run pick up where this one left off. The pattern appears in nearly every pipeline that pulls from a database.
    SELECT
        signup_timestamp,
        country_code,
        user_id
    FROM signups
    WHERE signup_timestamp >= :last_run_timestamp
      AND signup_timestamp < :this_run_timestamp;
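
The high-water mark can be sketched as a small piece of saved state that each run reads and then advances. This is an illustration under assumptions: SQLite stands in for Postgres, a plain dict stands in for wherever the mark is persisted, and timestamps are ISO strings so that string comparison matches time order. It follows the query above in using run timestamps as the window bounds.

```python
import sqlite3


def extract_new_signups(conn, state, this_run_timestamp):
    """Pull the half-open window [last mark, this run), then advance the mark."""
    rows = conn.execute(
        "SELECT signup_timestamp, country_code, user_id FROM signups "
        "WHERE signup_timestamp >= :last_run_timestamp "
        "AND signup_timestamp < :this_run_timestamp "
        "ORDER BY signup_timestamp",
        {
            "last_run_timestamp": state["last_run_timestamp"],
            "this_run_timestamp": this_run_timestamp,
        },
    ).fetchall()
    # Advance only after the read succeeded, so a failed run is retried in full.
    state["last_run_timestamp"] = this_run_timestamp
    return rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signups (signup_timestamp TEXT, country_code TEXT, user_id INTEGER)")
conn.executemany("INSERT INTO signups VALUES (?, ?, ?)", [
    ("2026-04-24T23:59:00", "US", 1),
    ("2026-04-25T08:00:00", "DE", 2),
    ("2026-04-26T02:00:00", "FR", 3),
])
state = {"last_run_timestamp": "2026-04-24T00:00:00"}  # hypothetical saved state

first = extract_new_signups(conn, state, "2026-04-25T02:00:00")
second = extract_new_signups(conn, state, "2026-04-26T02:00:00")
```

The half-open window matters: a row stamped exactly at the boundary belongs to the next run, so consecutive runs neither skip it nor read it twice.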

Step 2: Land Raw Data in Storage

The pulled rows do not go directly into the dashboard. They go into a raw storage layer first, in this case an S3 bucket organized by date. A file written today contains today's signups; a file written tomorrow contains tomorrow's. This pattern is called partitioning by ingestion date, and it makes everything that comes later easier. If a transform breaks, the raw data is still safe. If the pipeline needs to be re-run for last Tuesday, the file for last Tuesday is right there.
    s3://company-data-lake/raw/signups/dt=2026-04-25/signups.parquet
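
Building the partitioned path from a date keeps the layout consistent between the job that writes a file and any job that later re-reads it. A minimal sketch, with hypothetical function names and the bucket name taken from the example path:

```python
from datetime import date


def raw_signups_key(ingestion_date):
    """Object key for one day's raw signups file, partitioned by ingestion date."""
    return f"raw/signups/dt={ingestion_date.isoformat()}/signups.parquet"


def raw_signups_uri(ingestion_date, bucket="company-data-lake"):
    """Full S3 URI for the same file."""
    return f"s3://{bucket}/{raw_signups_key(ingestion_date)}"
```

Re-running the pipeline for last Tuesday then means calling the same function with last Tuesday's date.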

Step 3: Transform

The transform takes the raw rows and produces the daily aggregate the dashboard needs. It groups by date and country and counts. It filters out test accounts. It joins to a country dimension table to translate ISO codes into human-readable names. The output is a small table, possibly only a few hundred rows per day, that the dashboard can query instantly.
    -- Build one day's signup counts per country for the dashboard.
    INSERT INTO analytics.daily_signups_by_country
    SELECT
        DATE(r.signup_timestamp) AS signup_date,
        c.country_name,
        COUNT(*) AS signup_count
    FROM raw.signups r
    JOIN dim.country c
        ON r.country_code = c.iso_code          -- translate ISO code to country name
    WHERE r.user_id NOT IN (SELECT user_id FROM dim.test_accounts)  -- drop test accounts
      AND DATE(r.signup_timestamp) = :run_date  -- only the day being processed
    GROUP BY 1, 2;
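
A plain INSERT adds a second copy of the day's rows if the transform is run twice for the same date. One common way to make the step safe to rerun is to delete the day's rows and insert them again in a single transaction. The sketch below shows that pattern against SQLite, which is an assumption for illustration: table names are flattened (raw_signups instead of raw.signups), the sample rows are invented, and NOT EXISTS is used in place of NOT IN as a swapped-in variant that also behaves well if the test-account list contains NULLs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_signups (signup_timestamp TEXT, country_code TEXT, user_id INTEGER);
    CREATE TABLE dim_country (iso_code TEXT, country_name TEXT);
    CREATE TABLE dim_test_accounts (user_id INTEGER);
    CREATE TABLE daily_signups_by_country (
        signup_date TEXT, country_name TEXT, signup_count INTEGER);
    INSERT INTO raw_signups VALUES
        ('2026-04-25T08:00:00', 'US', 1),
        ('2026-04-25T09:00:00', 'US', 2),
        ('2026-04-25T10:00:00', 'DE', 3),
        ('2026-04-25T11:00:00', 'US', 99);
    INSERT INTO dim_country VALUES ('US', 'United States'), ('DE', 'Germany');
    INSERT INTO dim_test_accounts VALUES (99);
""")


def build_daily_signups(conn, run_date):
    """Rebuild one day's aggregate; delete-then-insert keeps reruns from duplicating rows."""
    with conn:  # both statements commit together or not at all
        conn.execute("DELETE FROM daily_signups_by_country WHERE signup_date = ?", (run_date,))
        conn.execute("""
            INSERT INTO daily_signups_by_country
            SELECT DATE(r.signup_timestamp), c.country_name, COUNT(*)
            FROM raw_signups r
            JOIN dim_country c ON r.country_code = c.iso_code
            WHERE NOT EXISTS (SELECT 1 FROM dim_test_accounts t WHERE t.user_id = r.user_id)
              AND DATE(r.signup_timestamp) = ?
            GROUP BY DATE(r.signup_timestamp), c.country_name
        """, (run_date,))


build_daily_signups(conn, "2026-04-25")
build_daily_signups(conn, "2026-04-25")  # rerun on purpose
rows = conn.execute(
    "SELECT country_name, signup_count FROM daily_signups_by_country ORDER BY country_name"
).fetchall()
```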

Step 4: Serve the Consumer

The dashboard reads from analytics.daily_signups_by_country. It does not read from raw.signups, and it does not read from Postgres. The consumer-facing layer is small, fast, and shaped exactly the way the dashboard wants it. Any change to the dashboard's needs becomes a change to the transform; any change to the upstream Postgres schema becomes a change to the extract. The two changes are decoupled because the raw zone sits between them. That decoupling is most of the value of having a pipeline at all.

The Whole Picture

    Postgres signups
          |  (extract at 2 am, last_run_ts -> this_run_ts)
          v
    S3 raw/signups/dt=YYYY-MM-DD
          |  (transform: clean, join, aggregate)
          v
    Snowflake analytics.daily_signups_by_country
          |  (read by Looker)
          v
    Marketing dashboard
Four boxes, three arrows. One direction. The marketing team gets the chart they wanted, the production database is not affected, and the pipeline runs unattended every night. This is the smallest example that shows every role doing real work. Every more complex pipeline in the rest of this curriculum is an elaboration of one or more of these four steps.
Step | Role | What It Produces
1. Extract from Postgres | Source consumption | Raw rows for the day
2. Land in S3 | Storage (raw zone) | Durable file partitioned by date
3. Aggregate in Snowflake | Transform + storage (curated) | Daily summary table
4. Looker reads | Consumer | The chart the marketing team wanted
  • Every pipeline can be described in four steps: extract, land, transform, serve.
  • The raw zone decouples upstream changes from downstream consumers.
  • The high-water mark is the small piece of state that turns 'pull everything' into 'pull what changed'.

When a Pipeline Is Not Needed

Decide whether a problem warrants a pipeline or whether a query, replica, or cache solves it more cheaply.

Building a pipeline is engineering work. It carries cost: the code itself, the orchestration that runs it, the storage it consumes, the alerts that fire when it fails, the on-call rotation that responds to those alerts. Engineers reach for pipelines reflexively, but a pipeline is the wrong answer to many problems. Knowing when to skip the pipeline is a more senior skill than knowing how to build one.

Three Cases Where a Direct Query Is Better

Situation | Why a Pipeline Is Overkill | What to Do Instead
One-time question | The cost of building exceeds the value of the answer | Run a SQL query, save the result to a doc, move on
Tiny dataset, infrequent reads | The data fits in a spreadsheet and changes once a quarter | Use a Google Sheet or a static CSV in version control
Dataset already shaped for the consumer | The source already produces what the consumer needs | Point the consumer at the source directly

When the Read Replica Is the Right Answer

Many companies need only one thing from analytics: someone to query the production data without slowing down the app. The right answer here is often not a pipeline. It is a read replica, a copy of the production database that absorbs read traffic. Read replicas are a database feature, not a data engineering build. They keep the data in its operational shape, which is unfriendly for analytics, but for a small company with simple needs they suffice for years. The signal that a read replica is no longer enough is when the operational schema becomes painful for analysts. At that point the pipeline pays for itself.

When an Application Should Just Cache

Sometimes the apparent need for a pipeline is really a need for caching. A user-facing feature that displays 'top 10 most popular products in the last hour' does not require a pipeline. It requires a Redis key that the application updates as products are viewed. The trap is reaching for the data engineering toolkit when the product engineering toolkit is closer to hand. The clue: if the consumer is the application itself, not a dashboard or a model, the answer is more often caching, materialized views, or a denormalized read model than a pipeline.
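
The caching alternative can be sketched without any pipeline machinery. The class below is an in-memory stand-in, not a Redis client, and it simplifies 'the last hour' to fixed hour buckets chosen by the caller; a sliding window or key expiry would need more work. All names are hypothetical.

```python
from collections import Counter, defaultdict


class HourlyTopProducts:
    """In-memory stand-in for a cache of view counts, bucketed by hour."""

    def __init__(self):
        self._views_by_hour = defaultdict(Counter)

    def record_view(self, product_id, hour):
        """Called by the application on every product view."""
        self._views_by_hour[hour][product_id] += 1

    def top(self, hour, n=10):
        """Return up to n (product_id, views) pairs for the given hour bucket."""
        return self._views_by_hour[hour].most_common(n)
```

The application both writes and reads the counts, which is the clue from the paragraph above: when the consumer is the application itself, the state can live next to it.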
Build a Pipeline
  • Multiple consumers will read the same prepared dataset
  • The transform logic is non-trivial and changes over time
  • Source data is large enough that ad-hoc queries hurt production
  • Freshness, lineage, and monitoring matter to the business
Skip the Pipeline
  • A single one-off question that may never be asked again
  • A read replica solves the slowness problem alone
  • The dataset is small and changes only when a human edits it
  • The consumer is the application, and a cache fits the access pattern

The Pipeline Test

A simple test names the situations where a pipeline earns its cost. If the answer to all four questions below is yes, build a pipeline. If any answer is no, consider a simpler tool first. The test is not perfect, but it filters out most false starts.
The four-question pipeline test:
  • Will the same prepared data be read more than once?
  • Does the transform involve more than a single SELECT?
  • Is the consumer separate from the source (different system, different team, different freshness)?
  • Will the work need to be re-run on a schedule?
Four yeses earn a pipeline. Three yeses earn a discussion. Fewer than three is a sign that the work belongs somewhere else: a dashboard query, a notebook, a cache, or a one-time export. Engineers who answer 'yes, always, every time' to all four questions out of habit end up maintaining many small pipelines that should never have been built. The cost shows up later, in operations and in attention, not in the day the pipeline was written.
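The scoring rule is simple enough to write down as a function. The function and parameter names below are hypothetical; the thresholds come directly from the test as stated.

```python
def pipeline_verdict(repeated_reads, nontrivial_transform, separate_consumer, scheduled):
    """Apply the four-question test: count the yeses and map them to a verdict."""
    yeses = sum([repeated_reads, nontrivial_transform, separate_consumer, scheduled])
    if yeses == 4:
        return "build a pipeline"
    if yeses == 3:
        return "discuss"
    return "use a simpler tool"
```

A one-off question answers no to repeated reads and no to scheduling, so it lands on the simpler tool before any orchestration is considered.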
Do
  • Apply the four-question test before reaching for orchestration tools
  • Use read replicas for analytics on small datasets with simple needs
  • Prefer caches and materialized views when the consumer is the application
Don't
  • Build a pipeline for a question that has been asked exactly once
  • Reach for Airflow when a daily SQL query in a scheduled job suffices
  • Confuse 'we need data faster' with 'we need a pipeline'; sometimes the right fix is upstream
PUTTING IT ALL TOGETHER

> A startup CTO has just hired their first data engineer. The CTO says: 'We have Postgres for the app, Stripe for payments, and Zendesk for support tickets. The product team wants a weekly retention dashboard, the finance team wants monthly revenue by plan, and customer support wants to know which users opened tickets in their first week. Where do we start?'

The four roles are visible immediately. Sources are Postgres, Stripe, and Zendesk. Consumers are three dashboards, each owned by a different team. The work in the middle is the pipeline.
A raw zone in S3 sits between sources and transforms so that schema changes in any of the three sources do not break consumer dashboards directly. The raw zone is the decoupling layer.
The four-question test passes for all three dashboards: the same data will be read repeatedly, transforms are non-trivial joins, consumers are separate from sources, and the work runs on a schedule. A pipeline earns its cost here.
The smallest first build is one source, one transform, one destination, one consumer. Pick the highest-value dashboard, build that pipeline end to end, then extend. Avoid building three pipelines in parallel before any one of them works.
KEY TAKEAWAYS
Pipelines exist to close three gaps: location, shape, and time. If only one gap exists, a simpler tool may be the right answer.
Every pipeline contains four roles: source, transform, storage, consumer. Naming them turns any architecture diagram into something legible.
Read diagrams left to right: data flows in one direction, sources on the left, consumers on the right. Arrows are the most important element.
A raw zone decouples upstream from downstream: raw data lands first, then transforms run. Source schema changes do not directly break consumer dashboards.
Not every problem needs a pipeline: the four-question test (repeated reads, non-trivial transform, decoupled consumer, scheduled cadence) filters out false starts.

Data lives where it is created, not where it is needed; pipelines move and reshape it

Category: Pipeline Architecture
Difficulty: beginner
Duration: 25 minutes
Challenges: 0 hands-on challenges

Topics covered: Why Pipelines Exist, The Four Roles in Any Pipeline, Reading a Pipeline Left to Right, A First End-to-End Pipeline, When a Pipeline Is Not Needed
