Design a Pipeline: Intermediate

"Design a pipeline" is the most common system design question in data engineering interviews. The question sounds open-ended, but interviewers are scoring you against a mental rubric with five layers: ingestion, transformation, serving, orchestration, and quality. Miss one layer and you cap yourself at "lean hire." Cover all five with clear tradeoff reasoning and you're a strong hire. This lesson gives you the framework to decompose any pipeline prompt into those layers and articulate each one with precision.

What you will be able to do

Break any "design a pipeline" prompt into 5 tractable sub-problems in under 60 seconds
Break any "design a pipeline" prompt into 5 tractable sub-problems in under 60 seconds
Justify ELT vs ETL and medallion architecture choices with concrete tradeoffs
Justify ELT vs ETL and medallion architecture choices with concrete tradeoffs
Articulate orchestration, data quality, and monitoring as first-class pipeline components
Articulate orchestration, data quality, and monitoring as first-class pipeline components

Decomposing the Prompt

Daily Life
Interviews

Decompose any pipeline prompt into five layers in 60 seconds

When an interviewer says "design a pipeline to ingest clickstream data from our mobile app into our analytics warehouse," they are not asking you to start writing Spark code. They're asking: can you think in layers? The single biggest mistake candidates make is diving into implementation before establishing scope.

The Five-Layer Framework

Every pipeline decomposes into five layers. Naming them explicitly in the first 60 seconds of your answer signals seniority. Say: "I'll walk through this in five parts: ingestion, transformation, serving, orchestration, and quality." Then pause. Let the interviewer redirect if they want depth on a specific layer. You've just demonstrated that you see the whole system, not just the Spark job.
LayerCore QuestionTime to Spend
IngestionHow does data enter the system?20%
TransformationHow is raw data cleaned and enriched?30%
ServingHow do consumers query the output?15%
OrchestrationWhat runs when, and what happens on failure?15%
QualityHow do we know the data is correct?20%
What Interviewers Actually Ask
  • "Design a pipeline to move clickstream events into our data lake"
  • "How would you build a daily ETL for our marketplace transaction data?"
  • "Walk me through ingesting 50M events/day from Kafka into Snowflake"

Clarifying Questions for Seniority

Before designing anything, ask three questions. First: what's the latency requirement? The answer splits your design into batch (hourly/daily), micro-batch (minutes), or streaming (seconds). Second: what's the data volume? 1GB/day and 1TB/day are fundamentally different architectures. Third: who consumes the output? Analysts writing SQL need a different serving layer than ML engineers pulling training sets.
TIP
Interviewers have reported that candidates who ask clarifying questions before designing score higher on 'scope definition' - a rubric category worth 20-25% of the overall score. Ask volume, latency, and consumer questions before touching architecture.

Model Answer: The First 60 Seconds

Here's the framework: (1) Restate the problem in your own words. (2) Ask 2-3 clarifying questions. (3) Name the five layers. (4) Propose a high-level stack. (5) Say which layer you'll dive into first and why. This takes under 90 seconds and sets up a structured 30-minute discussion where you control the pacing.
Weak Opening
  • "So I'd use Kafka and Spark..."
  • Jumps to tools before requirements
  • No clarifying questions asked
  • Interviewer has to pull structure out of you
Strong Opening
  • "Let me confirm scope first - what latency?"
  • Asks volume, latency, consumer questions
  • Names five layers explicitly
  • You drive the conversation structure
KEY TAKEAWAYS
Decompose every pipeline into 5 layers: ingestion, transformation, serving, orchestration, quality
Ask clarifying questions (volume, latency, consumers) before designing
Name all five layers in the first 60 seconds to signal systems thinking
Control the pacing - tell the interviewer which layer you'll dive into first

The Ingestion Layer

Daily Life
Interviews

Choose and justify the right ingestion pattern for any source

When the interviewer asks how data gets into your pipeline, they are probing for a real choice, not a list. The three source patterns (file drops, API pulls, and Change Data Capture) have different reliability, latency, and cost profiles. Name which one you chose and why before they have to ask.

File-Based Ingestion

The simplest and most common pattern: a source system drops files (CSV, JSON, Parquet) into cloud storage (S3, GCS, ADLS). Your pipeline picks them up on a schedule. This is the default for vendor data feeds, data exports from SaaS tools, and any system where you don't control the source. It's batch by nature - latency is measured in hours, not seconds.
1s3 : / / data - lake / raw / clickstream / year = 2025 / month = 03 / day = 15 / events_20250315_001.parquet events_20250315_002.parquet _SUCCESS
TIP
Always mention idempotent file processing in interviews. If a file lands twice, your pipeline should produce the same output. Say: 'I'd use a file manifest table to track processed files and skip duplicates.' This is a senior signal that interviewers explicitly look for.

API-Based Ingestion

When the source is a SaaS API (Salesforce, Stripe, HubSpot), you pull data via REST or GraphQL endpoints. The challenge is rate limiting, pagination, and incremental extraction. Naive full-refresh pulls are simple but don't scale: pulling 10M records every hour when only 500 changed is wasteful. Use cursor-based pagination with a high-water mark - store the last modified timestamp and only pull records newer than that.

Change Data Capture (CDC)

CDC reads the database transaction log (binlog in MySQL, WAL in Postgres) and streams row-level changes. Tools like Debezium, Fivetran, and AWS DMS make this accessible. CDC is the gold standard for replicating operational databases into a warehouse: low latency, minimal source load, and complete change history. The tradeoff is operational complexity - you're reading internal database logs, which means schema changes in the source can break your CDC pipeline.
File / API Ingestion
  • Simple to implement and debug
  • Works with any source that can export
  • Batch latency (minutes to hours)
  • Full refresh is safe but expensive at scale
CDC Ingestion
  • Real-time row-level changes
  • Minimal load on source database
  • Complete change history (inserts, updates, deletes)
  • Schema evolution requires careful handling
What Interviewers Ask: Ingestion
  • "What happens if the source sends duplicate files?" - tests idempotency awareness
  • "The API has a 100 req/min rate limit and you need 5M records. How?" - tests pagination strategy
  • "The source Postgres DB has 200 tables. How do you replicate them?" - wants to hear CDC
KEY TAKEAWAYS
File ingestion: simple, batch, use manifest tables for idempotency
API ingestion: use high-water marks for incremental extraction, handle rate limits
CDC: gold standard for database replication, uses transaction log reading
Always state your idempotency strategy unprompted - it's a scoring rubric item

The Transformation Layer

Daily Life
Interviews

Justify ELT, medallion tiers, and partitioning with interview-ready reasoning

The transformation layer is where the interview is won or lost. This is where interviewers spend the most time probing, because it reveals whether you understand data modeling, partitioning, and the ELT vs ETL tradeoff - the single most tested concept in pipeline interviews.

ELT vs ETL: The #1 Tested Concept

ETL (Extract-Transform-Load) transforms data before loading it into the warehouse. ELT (Extract-Load-Transform) loads raw data first, then transforms it inside the warehouse. This isn't just an acronym difference - it's a fundamental architecture decision that affects cost, flexibility, and debugging.
ETL (Transform First)
  • Data is cleaned before it enters the warehouse
  • Warehouse stores only processed data - lower storage cost
  • Hard to reprocess: raw data may not be retained
  • Common in legacy Informatica/SSIS pipelines
ELT (Load First)
  • Raw data lands in the warehouse immediately
  • Higher storage cost, but raw data is always available
  • Easy to reprocess: re-run transforms on raw data
  • Standard in modern Snowflake/BigQuery/Databricks stacks
TIP
The correct answer in 2025 is almost always ELT for analytical pipelines. Say: 'I'd land raw data in the lake first so we can reprocess without re-extracting from the source. Storage is cheap; re-extraction is expensive and sometimes impossible.' Interviewers are testing whether you understand this shift.

Medallion: Bronze / Silver / Gold

Medallion architecture organizes your lake into three quality tiers. Bronze is raw, exactly as ingested. Silver is cleaned, deduplicated, and typed. Gold is business-ready aggregates. This isn't vendor marketing - it's a genuinely useful mental model that interviewers recognize. When you say 'bronze, silver, gold,' you're speaking a shared vocabulary that signals you've built real pipelines.
TierData StateExampleWho Reads It
BronzeRaw, as-ingestedRaw JSON clickstream eventsPipeline debugging
SilverCleaned, typed, dedupedParsed events with user_id, event_type, timestampData engineers, data scientists
GoldBusiness aggregatesDaily active users by countryAnalysts, dashboards, executives

Partitioning Strategy

Partitioning is how you avoid scanning 10TB when you only need 10GB. Partition by the most common filter column - usually date. If analysts always query by date range and country, partition by date and cluster by country. State this explicitly in interviews: 'I'd partition the gold tables by event_date so daily dashboard queries scan a single partition instead of the full table.'
1CREATE TABLE gold.daily_active_users(event_date DATE, country STRING, active_users BIGINT, revenue DECIMAL(18, 2)) PARTITIONED BY(event_date) CLUSTERED BY(country) INTO 16 BUCKETS STORED AS PARQUET ;
Do
  • Always justify ELT over ETL by citing reprocessing capability
  • Use medallion terminology (bronze/silver/gold) to signal production experience
  • State your partitioning strategy and explain why you chose that column
Don't
  • Say 'we transform the data' without specifying where (in Spark? in the warehouse?)
  • Skip partitioning - interviewers treat it as a must-mention
  • Forget to mention how you handle late-arriving data
KEY TAKEAWAYS
ELT is the modern default: load raw, transform in-warehouse, retain raw for reprocessing
Medallion architecture (bronze/silver/gold) is the shared vocabulary - use it
Partition by the most common filter column (usually date) to avoid full scans
Late-arriving data is a common follow-up - have a strategy ready

The Serving Layer

Daily Life
Interviews

Design serving patterns that match consumer needs and SLAs

The serving layer is where most candidates go thin. They spend 25 minutes on ingestion and transformation, then say 'and then analysts query it.' That's a missed opportunity. How data is consumed drives the entire upstream design - and interviewers know it.

Consumer Archetypes

Different consumers need different data shapes. Analysts writing SQL dashboards need pre-aggregated, denormalized gold tables with low query latency. Data scientists building ML features need wide tables with historical snapshots. Reverse ETL consumers (pushing data back to Salesforce, Braze, Iterable) need narrow, frequently-refreshed tables keyed on user_id. Name the consumer explicitly in your answer - it justifies your entire transformation design.
ConsumerData ShapeLatency NeedServing Pattern
BI DashboardsDenormalized, pre-aggregated< 5 sec queryMaterialized views, OLAP cubes
ML FeaturesWide tables, point-in-time correctMinutes to hoursFeature store, snapshot tables
Reverse ETLNarrow, user-keyedNear real-timeCDC to SaaS, event triggers
Downstream PipelinesClean, typed, partitionedHoursSilver/Gold lake tables

Materialized Views vs. Pre-Computed

When dashboards need sub-second response times, you have two options: materialized views (the warehouse auto-refreshes the pre-computation) or pre-computed tables (your pipeline writes the aggregated output). Materialized views are simpler to manage but give you less control over refresh timing. Pre-computed tables are more work but let you run data quality checks before exposing data to consumers.
TIP
A strong interview move: 'I'd build gold tables as pre-computed aggregates rather than materialized views so I can run data quality checks between transformation and serving. I don't want a dashboard showing incorrect numbers because a transform bug made it through.' This shows you think about data trust, not just data movement.

The SLA Question

Every serving layer has an implicit SLA. Dashboards need data by 7 AM so the executive standup has current numbers. ML models need features refreshed before the daily training run at midnight. Reverse ETL needs to sync user segments to the marketing platform before the 9 AM email campaign. State the SLA explicitly: 'The gold table must be refreshed by 6 AM UTC so the US team's dashboards are current by market open.'
What Interviewers Ask About Serving
  • "How would analysts query this data?" - testing whether you've thought about the end consumer
  • "What if the dashboard query is too slow?" - wants to hear materialized views or pre-aggregation
  • "How do you handle a downstream team that needs real-time data but your pipeline is batch?" - testing architectural flexibility
Thin Serving Answer
  • "Analysts query the table"
  • No mention of latency or SLAs
  • No consumer-specific optimization
  • Treats serving as an afterthought
Strong Serving Answer
  • "Analysts query a gold table partitioned by date with a materialized view for the executive dashboard"
  • States SLA: data ready by 6 AM UTC
  • Different tables for different consumers
  • Serving layer drives upstream design
KEY TAKEAWAYS
Name the consumer explicitly - it justifies your entire transformation design
Different consumers need different data shapes (denormalized, wide, narrow)
State the SLA: when must the data be ready, and who depends on it
Pre-computed tables give you a quality gate between transformation and serving

The Meta Layer

Daily Life
Interviews

Articulate orchestration, quality, and monitoring as first-class pipeline components

Orchestration, data quality, and monitoring are the meta layer - the infrastructure that makes a pipeline a pipeline instead of a script. Candidates who skip this layer cap themselves at 'hire.' Candidates who treat it as first-class get 'strong hire.' The meta layer is where you prove you've operated pipelines in production, not just built them.

Orchestration: DAGs, Not Scripts

A production pipeline isn't a Python script that runs on a cron job. It's a directed acyclic graph (DAG) of tasks with dependencies, retries, and alerting. Say 'Airflow' or 'Dagster' - the specific tool matters less than demonstrating you understand task dependencies. If the ingestion task fails, the transformation task shouldn't run. If the transformation task fails, the quality checks should still execute to capture what went wrong.
1# Airflow DAG skeleton - the structure interviewers want to see
2ingest_task >> [
3 validate_schema_task,
4 check_row_counts_task
5] >> transform_bronze_to_silver \
6 >> transform_silver_to_gold \
7 >> [
8 run_dq_checks_task,
9 refresh_materialized_views_task
10] >> notify_stakeholders_task

Data Quality: Trust as a Feature

Data quality is not optional. In interviews, mention at least two types of checks: row count validation (did we get roughly the expected number of records?) and schema validation (did the source change column types?). More advanced: null rate thresholds, freshness checks, and cross-table referential integrity. Tools like Great Expectations and dbt tests make this concrete, but the concept matters more than the tool.
1/* dbt-style data quality checks */
2/* Row count sanity: flag if daily events drop below 80% of 7-day average */
3SELECT
4 COUNT(*) AS today_count,
5 (
6 SELECT
7 AVG(daily_count)
8 FROM event_counts
9 WHERE dt BETWEEN CURRENT_DATE - 7
10 AND CURRENT_DATE - 1
11 ) AS avg_7d
12FROM silver.clickstream_events
13WHERE event_date = CURRENT_DATE
14HAVING today_count < avg_7d * 0.8
TIP
Mention data quality before the interviewer asks. Say: 'Between silver and gold, I'd run row count validation and null rate checks. If today's event count drops below 80% of the 7-day average, the pipeline halts and pages on-call.' Proactive quality discussion is the strongest senior signal in pipeline interviews.

Monitoring and Alerting

Monitoring answers: is the pipeline healthy right now? Track three things: task duration (is the transform taking 3x longer than usual?), data freshness (when was the gold table last updated?), and data volume (did we process the expected number of records?). Alert on anomalies, not thresholds. A 2-hour pipeline that suddenly takes 6 hours is a problem even if it finishes before the SLA.
What Interviewers Ask: Meta Layer
  • "What happens when a task fails at 3 AM?" - tests retry strategy and alerting
  • "How do you know the data is correct?" - tests data quality awareness
  • "How do you handle backfills?" - tests idempotency and orchestration design

Backfills and Idempotency

Backfills are the hidden test of pipeline maturity. When a bug corrupts three days of data, can you reprocess those three days without affecting the rest? The answer requires idempotent pipelines: running the same date twice produces the same output. Use partition-level overwrites (OVERWRITE PARTITION) rather than appends, so re-execution replaces bad data instead of duplicating it.
Do
  • Mention orchestration (Airflow/Dagster) and describe task dependencies
  • State at least two data quality checks: row counts and schema validation
  • Design for idempotency: partition-level overwrites, not appends
  • Describe monitoring: task duration, data freshness, volume anomalies
Don't
  • Say 'we'd run it on a cron job' - this signals no production experience
  • Skip data quality entirely - interviewers notice and it caps your score
  • Assume backfills won't happen - they always do
KEY TAKEAWAYS
Orchestration means DAGs with dependencies, retries, and alerts - not cron jobs
Data quality checks between layers are the #1 senior signal
Monitor duration, freshness, and volume - alert on anomalies, not fixed thresholds
Idempotent design enables backfills: partition-level overwrites, not appends

The universal pipeline architecture question

Category
Pipeline Architecture
Difficulty
intermediate
Duration
35 minutes
Challenges
0 hands-on challenges

Topics covered: Decomposing the Prompt, The Ingestion Layer, The Transformation Layer, The Serving Layer, The Meta Layer

Lesson Sections

  1. Decomposing the Prompt (concepts: paBatchVsStreaming)

    When an interviewer says "design a pipeline to ingest clickstream data from our mobile app into our analytics warehouse," they are not asking you to start writing Spark code. They're asking: can you think in layers? The single biggest mistake candidates make is diving into implementation before establishing scope. The Five-Layer Framework Every pipeline decomposes into five layers. Naming them explicitly in the first 60 seconds of your answer signals seniority. Say: "I'll walk through this in fi

  2. The Ingestion Layer (concepts: paCdc)

    When the interviewer asks how data gets into your pipeline, they are probing for a real choice, not a list. The three source patterns (file drops, API pulls, and Change Data Capture) have different reliability, latency, and cost profiles. Name which one you chose and why before they have to ask. File-Based Ingestion The simplest and most common pattern: a source system drops files (CSV, JSON, Parquet) into cloud storage (S3, GCS, ADLS). Your pipeline picks them up on a schedule. This is the defa

  3. The Transformation Layer (concepts: paEltVsEtl)

    The transformation layer is where the interview is won or lost. This is where interviewers spend the most time probing, because it reveals whether you understand data modeling, partitioning, and the ELT vs ETL tradeoff - the single most tested concept in pipeline interviews. ELT vs ETL: The #1 Tested Concept ETL (Extract-Transform-Load) transforms data before loading it into the warehouse. ELT (Extract-Load-Transform) loads raw data first, then transforms it inside the warehouse. This isn't ju

  4. The Serving Layer (concepts: paMedallion)

    The serving layer is where most candidates go thin. They spend 25 minutes on ingestion and transformation, then say 'and then analysts query it.' That's a missed opportunity. How data is consumed drives the entire upstream design - and interviewers know it. Consumer Archetypes Different consumers need different data shapes. Analysts writing SQL dashboards need pre-aggregated, denormalized gold tables with low query latency. Data scientists building ML features need wide tables with historical

  5. The Meta Layer (concepts: paDagOrchestration)

    Orchestration, data quality, and monitoring are the meta layer - the infrastructure that makes a pipeline a pipeline instead of a script. Candidates who skip this layer cap themselves at 'hire.' Candidates who treat it as first-class get 'strong hire.' The meta layer is where you prove you've operated pipelines in production, not just built them. Orchestration: DAGs, Not Scripts A production pipeline isn't a Python script that runs on a cron job. It's a directed acyclic graph (DAG) of tasks wi