Design a Pipeline: Intermediate

"Design a pipeline" is the most common system design question in data engineering interviews. The question sounds open-ended, but interviewers are scoring you against a mental rubric with five layers: ingestion, transformation, serving, orchestration, and quality. Miss one layer and you cap yourself at "lean hire." Cover all five with clear tradeoff reasoning and you're a strong hire. This lesson gives you the framework to decompose any pipeline prompt into those layers and articulate each one with precision.

What you will be able to do

Break any "design a pipeline" prompt into 5 tractable sub-problems in under 60 seconds

Justify ELT vs ETL and medallion architecture choices with concrete tradeoffs

Articulate orchestration, data quality, and monitoring as first-class pipeline components

Decomposing the Prompt

Daily Life

Interviews

Decompose any pipeline prompt into five layers in 60 seconds

When an interviewer says "design a pipeline to ingest clickstream data from our mobile app into our analytics warehouse," they are not asking you to start writing Spark code. They're asking: can you think in layers? The single biggest mistake candidates make is diving into implementation before establishing scope.

The Five-Layer Framework

Every pipeline decomposes into five layers. Naming them explicitly in the first 60 seconds of your answer signals seniority. Say: "I'll walk through this in five parts: ingestion, transformation, serving, orchestration, and quality." Then pause. Let the interviewer redirect if they want depth on a specific layer. You've just demonstrated that you see the whole system, not just the Spark job.

Layer	Core Question	Time to Spend
Ingestion	How does data enter the system?	20%
Transformation	How is raw data cleaned and enriched?	30%
Serving	How do consumers query the output?	15%
Orchestration	What runs when, and what happens on failure?	15%
Quality	How do we know the data is correct?	20%

What Interviewers Actually Ask

▸"Design a pipeline to move clickstream events into our data lake"
▸"How would you build a daily ETL for our marketplace transaction data?"
▸"Walk me through ingesting 50M events/day from Kafka into Snowflake"

Clarifying Questions for Seniority

Before designing anything, ask three questions. First: what's the latency requirement? The answer splits your design into batch (hourly/daily), micro-batch (minutes), or streaming (seconds). Second: what's the data volume? 1GB/day and 1TB/day are fundamentally different architectures. Third: who consumes the output? Analysts writing SQL need a different serving layer than ML engineers pulling training sets.

TIP

Interviewers have reported that candidates who ask clarifying questions before designing score higher on 'scope definition' - a rubric category worth 20-25% of the overall score. Ask volume, latency, and consumer questions before touching architecture.

Model Answer: The First 60 Seconds

Here's the framework: (1) Restate the problem in your own words. (2) Ask 2-3 clarifying questions. (3) Name the five layers. (4) Propose a high-level stack. (5) Say which layer you'll dive into first and why. This takes under 90 seconds and sets up a structured 30-minute discussion where you control the pacing.

•Weak Opening

"So I'd use Kafka and Spark..."
Jumps to tools before requirements
No clarifying questions asked
Interviewer has to pull structure out of you

•Strong Opening

"Let me confirm scope first - what latency?"
Asks volume, latency, consumer questions
Names five layers explicitly
You drive the conversation structure

KEY TAKEAWAYS

Decompose every pipeline into 5 layers: ingestion, transformation, serving, orchestration, quality

Ask clarifying questions (volume, latency, consumers) before designing

Name all five layers in the first 60 seconds to signal systems thinking

Control the pacing - tell the interviewer which layer you'll dive into first

The Ingestion Layer

Daily Life

Interviews

Choose and justify the right ingestion pattern for any source

When the interviewer asks how data gets into your pipeline, they are probing for a real choice, not a list. The three source patterns (file drops, API pulls, and Change Data Capture) have different reliability, latency, and cost profiles. Name which one you chose and why before they have to ask.

File-Based Ingestion

The simplest and most common pattern: a source system drops files (CSV, JSON, Parquet) into cloud storage (S3, GCS, ADLS). Your pipeline picks them up on a schedule. This is the default for vendor data feeds, data exports from SaaS tools, and any system where you don't control the source. It's batch by nature - latency is measured in hours, not seconds.

1	s3 : / / data - lake / raw / clickstream / year = 2025 / month = 03 / day = 15 / events_20250315_001.parquet events_20250315_002.parquet _SUCCESS

TIP

Always mention idempotent file processing in interviews. If a file lands twice, your pipeline should produce the same output. Say: 'I'd use a file manifest table to track processed files and skip duplicates.' This is a senior signal that interviewers explicitly look for.

API-Based Ingestion

When the source is a SaaS API (Salesforce, Stripe, HubSpot), you pull data via REST or GraphQL endpoints. The challenge is rate limiting, pagination, and incremental extraction. Naive full-refresh pulls are simple but don't scale: pulling 10M records every hour when only 500 changed is wasteful. Use cursor-based pagination with a high-water mark - store the last modified timestamp and only pull records newer than that.

Change Data Capture (CDC)

CDC reads the database transaction log (binlog in MySQL, WAL in Postgres) and streams row-level changes. Tools like Debezium, Fivetran, and AWS DMS make this accessible. CDC is the gold standard for replicating operational databases into a warehouse: low latency, minimal source load, and complete change history. The tradeoff is operational complexity - you're reading internal database logs, which means schema changes in the source can break your CDC pipeline.

•File / API Ingestion

Simple to implement and debug
Works with any source that can export
Batch latency (minutes to hours)
Full refresh is safe but expensive at scale

•CDC Ingestion

Real-time row-level changes
Minimal load on source database
Complete change history (inserts, updates, deletes)
Schema evolution requires careful handling

What Interviewers Ask: Ingestion

▸"What happens if the source sends duplicate files?" - tests idempotency awareness
▸"The API has a 100 req/min rate limit and you need 5M records. How?" - tests pagination strategy
▸"The source Postgres DB has 200 tables. How do you replicate them?" - wants to hear CDC

KEY TAKEAWAYS

File ingestion: simple, batch, use manifest tables for idempotency

API ingestion: use high-water marks for incremental extraction, handle rate limits

CDC: gold standard for database replication, uses transaction log reading

Always state your idempotency strategy unprompted - it's a scoring rubric item

The Transformation Layer

Daily Life

Interviews

Justify ELT, medallion tiers, and partitioning with interview-ready reasoning

The transformation layer is where the interview is won or lost. This is where interviewers spend the most time probing, because it reveals whether you understand data modeling, partitioning, and the ELT vs ETL tradeoff - the single most tested concept in pipeline interviews.

ELT vs ETL: The #1 Tested Concept

ETL (Extract-Transform-Load) transforms data before loading it into the warehouse. ELT (Extract-Load-Transform) loads raw data first, then transforms it inside the warehouse. This isn't just an acronym difference - it's a fundamental architecture decision that affects cost, flexibility, and debugging.

•ETL (Transform First)

Data is cleaned before it enters the warehouse
Warehouse stores only processed data - lower storage cost
Hard to reprocess: raw data may not be retained
Common in legacy Informatica/SSIS pipelines

•ELT (Load First)

Raw data lands in the warehouse immediately
Higher storage cost, but raw data is always available
Easy to reprocess: re-run transforms on raw data
Standard in modern Snowflake/BigQuery/Databricks stacks

TIP

The correct answer in 2025 is almost always ELT for analytical pipelines. Say: 'I'd land raw data in the lake first so we can reprocess without re-extracting from the source. Storage is cheap; re-extraction is expensive and sometimes impossible.' Interviewers are testing whether you understand this shift.

Medallion: Bronze / Silver / Gold

Medallion architecture organizes your lake into three quality tiers. Bronze is raw, exactly as ingested. Silver is cleaned, deduplicated, and typed. Gold is business-ready aggregates. This isn't vendor marketing - it's a genuinely useful mental model that interviewers recognize. When you say 'bronze, silver, gold,' you're speaking a shared vocabulary that signals you've built real pipelines.

Tier	Data State	Example	Who Reads It
Bronze	Raw, as-ingested	Raw JSON clickstream events	Pipeline debugging
Silver	Cleaned, typed, deduped	Parsed events with user_id, event_type, timestamp	Data engineers, data scientists
Gold	Business aggregates	Daily active users by country	Analysts, dashboards, executives

Partitioning Strategy

Partitioning is how you avoid scanning 10TB when you only need 10GB. Partition by the most common filter column - usually date. If analysts always query by date range and country, partition by date and cluster by country. State this explicitly in interviews: 'I'd partition the gold tables by event_date so daily dashboard queries scan a single partition instead of the full table.'

1	CREATE TABLE gold.daily_active_users(event_date DATE, country STRING, active_users BIGINT, revenue DECIMAL(18, 2)) PARTITIONED BY(event_date) CLUSTERED BY(country) INTO 16 BUCKETS STORED AS PARQUET ;

✓Do

Always justify ELT over ETL by citing reprocessing capability
Use medallion terminology (bronze/silver/gold) to signal production experience
State your partitioning strategy and explain why you chose that column

✗Don't

Say 'we transform the data' without specifying where (in Spark? in the warehouse?)
Skip partitioning - interviewers treat it as a must-mention
Forget to mention how you handle late-arriving data

KEY TAKEAWAYS

ELT is the modern default: load raw, transform in-warehouse, retain raw for reprocessing

Medallion architecture (bronze/silver/gold) is the shared vocabulary - use it

Partition by the most common filter column (usually date) to avoid full scans

Late-arriving data is a common follow-up - have a strategy ready

The Serving Layer

Daily Life

Interviews

Design serving patterns that match consumer needs and SLAs

The serving layer is where most candidates go thin. They spend 25 minutes on ingestion and transformation, then say 'and then analysts query it.' That's a missed opportunity. How data is consumed drives the entire upstream design - and interviewers know it.

Consumer Archetypes

Different consumers need different data shapes. Analysts writing SQL dashboards need pre-aggregated, denormalized gold tables with low query latency. Data scientists building ML features need wide tables with historical snapshots. Reverse ETL consumers (pushing data back to Salesforce, Braze, Iterable) need narrow, frequently-refreshed tables keyed on user_id. Name the consumer explicitly in your answer - it justifies your entire transformation design.

Consumer	Data Shape	Latency Need	Serving Pattern
BI Dashboards	Denormalized, pre-aggregated	< 5 sec query	Materialized views, OLAP cubes
ML Features	Wide tables, point-in-time correct	Minutes to hours	Feature store, snapshot tables
Reverse ETL	Narrow, user-keyed	Near real-time	CDC to SaaS, event triggers
Downstream Pipelines	Clean, typed, partitioned	Hours	Silver/Gold lake tables

Materialized Views vs. Pre-Computed

When dashboards need sub-second response times, you have two options: materialized views (the warehouse auto-refreshes the pre-computation) or pre-computed tables (your pipeline writes the aggregated output). Materialized views are simpler to manage but give you less control over refresh timing. Pre-computed tables are more work but let you run data quality checks before exposing data to consumers.

TIP

A strong interview move: 'I'd build gold tables as pre-computed aggregates rather than materialized views so I can run data quality checks between transformation and serving. I don't want a dashboard showing incorrect numbers because a transform bug made it through.' This shows you think about data trust, not just data movement.

The SLA Question

Every serving layer has an implicit SLA. Dashboards need data by 7 AM so the executive standup has current numbers. ML models need features refreshed before the daily training run at midnight. Reverse ETL needs to sync user segments to the marketing platform before the 9 AM email campaign. State the SLA explicitly: 'The gold table must be refreshed by 6 AM UTC so the US team's dashboards are current by market open.'

What Interviewers Ask About Serving

▸"How would analysts query this data?" - testing whether you've thought about the end consumer
▸"What if the dashboard query is too slow?" - wants to hear materialized views or pre-aggregation
▸"How do you handle a downstream team that needs real-time data but your pipeline is batch?" - testing architectural flexibility

•Thin Serving Answer

"Analysts query the table"
No mention of latency or SLAs
No consumer-specific optimization
Treats serving as an afterthought

•Strong Serving Answer

"Analysts query a gold table partitioned by date with a materialized view for the executive dashboard"
States SLA: data ready by 6 AM UTC
Different tables for different consumers
Serving layer drives upstream design

KEY TAKEAWAYS

Name the consumer explicitly - it justifies your entire transformation design

Different consumers need different data shapes (denormalized, wide, narrow)

State the SLA: when must the data be ready, and who depends on it

Pre-computed tables give you a quality gate between transformation and serving

The Meta Layer

Daily Life

Interviews

Articulate orchestration, quality, and monitoring as first-class pipeline components

Orchestration, data quality, and monitoring are the meta layer - the infrastructure that makes a pipeline a pipeline instead of a script. Candidates who skip this layer cap themselves at 'hire.' Candidates who treat it as first-class get 'strong hire.' The meta layer is where you prove you've operated pipelines in production, not just built them.

Orchestration: DAGs, Not Scripts

A production pipeline isn't a Python script that runs on a cron job. It's a directed acyclic graph (DAG) of tasks with dependencies, retries, and alerting. Say 'Airflow' or 'Dagster' - the specific tool matters less than demonstrating you understand task dependencies. If the ingestion task fails, the transformation task shouldn't run. If the transformation task fails, the quality checks should still execute to capture what went wrong.

1	# Airflow DAG skeleton - the structure interviewers want to see
2	ingest_task >> [
3	validate_schema_task,
4	check_row_counts_task
5	] >> transform_bronze_to_silver \
6	>> transform_silver_to_gold \
7	>> [
8	run_dq_checks_task,
9	refresh_materialized_views_task
10	] >> notify_stakeholders_task

Data Quality: Trust as a Feature

Data quality is not optional. In interviews, mention at least two types of checks: row count validation (did we get roughly the expected number of records?) and schema validation (did the source change column types?). More advanced: null rate thresholds, freshness checks, and cross-table referential integrity. Tools like Great Expectations and dbt tests make this concrete, but the concept matters more than the tool.

1	/* dbt-style data quality checks */
2	/* Row count sanity: flag if daily events drop below 80% of 7-day average */
3	SELECT
4	COUNT(*) AS today_count,
5	(
6	SELECT
7	AVG(daily_count)
8	FROM event_counts
9	WHERE dt BETWEEN CURRENT_DATE - 7
10	AND CURRENT_DATE - 1
11	) AS avg_7d
12	FROM silver.clickstream_events
13	WHERE event_date = CURRENT_DATE
14	HAVING today_count < avg_7d * 0.8

TIP

Mention data quality before the interviewer asks. Say: 'Between silver and gold, I'd run row count validation and null rate checks. If today's event count drops below 80% of the 7-day average, the pipeline halts and pages on-call.' Proactive quality discussion is the strongest senior signal in pipeline interviews.

Monitoring and Alerting

Monitoring answers: is the pipeline healthy right now? Track three things: task duration (is the transform taking 3x longer than usual?), data freshness (when was the gold table last updated?), and data volume (did we process the expected number of records?). Alert on anomalies, not thresholds. A 2-hour pipeline that suddenly takes 6 hours is a problem even if it finishes before the SLA.

What Interviewers Ask: Meta Layer

▸"What happens when a task fails at 3 AM?" - tests retry strategy and alerting
▸"How do you know the data is correct?" - tests data quality awareness
▸"How do you handle backfills?" - tests idempotency and orchestration design

Backfills and Idempotency

Backfills are the hidden test of pipeline maturity. When a bug corrupts three days of data, can you reprocess those three days without affecting the rest? The answer requires idempotent pipelines: running the same date twice produces the same output. Use partition-level overwrites (OVERWRITE PARTITION) rather than appends, so re-execution replaces bad data instead of duplicating it.

✓Do

Mention orchestration (Airflow/Dagster) and describe task dependencies
State at least two data quality checks: row counts and schema validation
Design for idempotency: partition-level overwrites, not appends
Describe monitoring: task duration, data freshness, volume anomalies

✗Don't

Say 'we'd run it on a cron job' - this signals no production experience
Skip data quality entirely - interviewers notice and it caps your score
Assume backfills won't happen - they always do

KEY TAKEAWAYS

Orchestration means DAGs with dependencies, retries, and alerts - not cron jobs

Data quality checks between layers are the #1 senior signal

Monitor duration, freshness, and volume - alert on anomalies, not fixed thresholds

Idempotent design enables backfills: partition-level overwrites, not appends

The universal pipeline architecture question

Category: Pipeline Architecture
Difficulty: intermediate
Duration: 35 minutes
Challenges: 0 hands-on challenges

Topics covered: Decomposing the Prompt, The Ingestion Layer, The Transformation Layer, The Serving Layer, The Meta Layer

Lesson Sections

Decomposing the Prompt (concepts: paBatchVsStreaming)
When an interviewer says "design a pipeline to ingest clickstream data from our mobile app into our analytics warehouse," they are not asking you to start writing Spark code. They're asking: can you think in layers? The single biggest mistake candidates make is diving into implementation before establishing scope. The Five-Layer Framework Every pipeline decomposes into five layers. Naming them explicitly in the first 60 seconds of your answer signals seniority. Say: "I'll walk through this in fi
The Ingestion Layer (concepts: paCdc)
When the interviewer asks how data gets into your pipeline, they are probing for a real choice, not a list. The three source patterns (file drops, API pulls, and Change Data Capture) have different reliability, latency, and cost profiles. Name which one you chose and why before they have to ask. File-Based Ingestion The simplest and most common pattern: a source system drops files (CSV, JSON, Parquet) into cloud storage (S3, GCS, ADLS). Your pipeline picks them up on a schedule. This is the defa
The Transformation Layer (concepts: paEltVsEtl)
The transformation layer is where the interview is won or lost. This is where interviewers spend the most time probing, because it reveals whether you understand data modeling, partitioning, and the ELT vs ETL tradeoff - the single most tested concept in pipeline interviews. ELT vs ETL: The #1 Tested Concept ETL (Extract-Transform-Load) transforms data before loading it into the warehouse. ELT (Extract-Load-Transform) loads raw data first, then transforms it inside the warehouse. This isn't ju
The Serving Layer (concepts: paMedallion)
The serving layer is where most candidates go thin. They spend 25 minutes on ingestion and transformation, then say 'and then analysts query it.' That's a missed opportunity. How data is consumed drives the entire upstream design - and interviewers know it. Consumer Archetypes Different consumers need different data shapes. Analysts writing SQL dashboards need pre-aggregated, denormalized gold tables with low query latency. Data scientists building ML features need wide tables with historical
The Meta Layer (concepts: paDagOrchestration)
Orchestration, data quality, and monitoring are the meta layer - the infrastructure that makes a pipeline a pipeline instead of a script. Candidates who skip this layer cap themselves at 'hire.' Candidates who treat it as first-class get 'strong hire.' The meta layer is where you prove you've operated pipelines in production, not just built them. Orchestration: DAGs, Not Scripts A production pipeline isn't a Python script that runs on a cron job. It's a directed acyclic graph (DAG) of tasks wi