A medium Pipeline Design mock interview question on DataDriven. Practice with AI-powered feedback, real code execution, and a hire/no-hire decision.

Domain: Pipeline Design
Difficulty: medium

Interview Prompt

A nightly batch pipeline on the canvas reads 18 million orders per day from Postgres, joins them with a product dimension, and writes a fact_daily_orders table. After a volume increase, the runtime has stretched from 3 hours to 11 hours, and most mornings the 6am SLA slips to noon. The executive dashboard reads at tier-4 (daily) freshness; the marketing team has built a shadow streaming pipeline because they need tier-2 (under 15-minute) freshness.

Apply the diagnosis-first redesign this section just walked through. Do not migrate everything to a Flink streaming pipeline: that is the wrong instinct (20x cost, 9 months of engineering, unjustified). The right diagnosis has two parts: volume outgrew cadence (run more often, not differently), and consumers have different freshness needs (split the paths by tier).

Replace the single nightly batch with two paths:

(1) an hourly micro-batch path for the executive dashboard, using batch tools (plain Spark, PySpark, or dbt), tagged with slaFreshness < 1h on its warehouse table;
(2) a streaming micro-batch path for the marketing dashboard, using Spark Structured Streaming or Flink with a 1-minute trigger, tagged with slaFreshness < 15min on its serving store.

Both paths share the same Postgres source.
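The two-path split described in the prompt can be sketched in plain Python before choosing any engine. This is a minimal illustration under stated assumptions, not a reference solution: the `PipelinePath` descriptor, the sink names, the `orders` table, the `updated_at` column, and the `%(name)s` parameter style are all hypothetical and do not come from the prompt. Only the 18 million orders per day, the 1-hour and 15-minute freshness tags, and the 1-minute trigger are taken from it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Tuple

ORDERS_PER_DAY = 18_000_000  # volume stated in the prompt
# Average rows per hourly run, assuming (unrealistically) uniform volume.
AVG_ORDERS_PER_HOURLY_RUN = ORDERS_PER_DAY // 24


@dataclass(frozen=True)
class PipelinePath:
    """Hypothetical descriptor for one delivery path."""
    name: str
    engine: str
    trigger: timedelta        # how often the path runs
    sla_freshness: timedelta  # mirrors the slaFreshness tag in the prompt
    sink: str                 # hypothetical table or store name


HOURLY = PipelinePath(
    name="exec_hourly",
    engine="spark-batch",
    trigger=timedelta(hours=1),
    sla_freshness=timedelta(hours=1),
    sink="warehouse.fact_daily_orders",
)
STREAMING = PipelinePath(
    name="marketing_stream",
    engine="spark-structured-streaming",
    trigger=timedelta(minutes=1),
    sla_freshness=timedelta(minutes=15),
    sink="serving.orders_recent",
)


def route(required_freshness: timedelta) -> PipelinePath:
    """Pick the loosest (cheapest) path that still meets the consumer's need."""
    candidates = [
        p for p in (HOURLY, STREAMING) if p.sla_freshness <= required_freshness
    ]
    if not candidates:
        raise ValueError("no path is fresh enough for this consumer")
    return max(candidates, key=lambda p: p.sla_freshness)


def hourly_window(now: datetime) -> Tuple[datetime, datetime]:
    """Return the half-open [start, end) bounds of the last complete hour."""
    end = now.replace(minute=0, second=0, microsecond=0)
    return end - timedelta(hours=1), end


def incremental_query(now: datetime) -> Tuple[str, Dict[str, datetime]]:
    """Build a parameterized extract for one hourly run.

    The table and column names are assumptions for illustration only.
    """
    start, end = hourly_window(now)
    sql = (
        "SELECT * FROM orders "
        "WHERE updated_at >= %(start)s AND updated_at < %(end)s"
    )
    return sql, {"start": start, "end": end}
```

A short usage example: a consumer with a daily requirement is routed to the hourly batch path, while a 15-minute requirement lands on the streaming path.

```python
print(route(timedelta(days=1)).name)
print(route(timedelta(minutes=15)).name)
print(incremental_query(datetime(2024, 1, 1, 6, 42, tzinfo=timezone.utc)))
```

The point of the sketch is that the hourly path stays a bounded incremental read over the shared Postgres source, so each run handles roughly one twenty-fourth of the daily volume instead of all of it at once.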

How This Interview Works

Read the vague prompt (just like a real interview)
Ask clarifying questions to the AI interviewer
Write your pipeline design solution with real code execution
Get instant feedback and a hire/no-hire decision
