A medium Pipeline Design mock interview question on DataDriven. Practice with AI-powered feedback, real code execution, and a hire/no-hire decision.
- Domain: Pipeline Design
- Difficulty: Medium
Interview Prompt
A nightly batch pipeline on the canvas reads 18 million orders per day from Postgres, joins them with a product dimension, and writes a fact_daily_orders table. After a volume increase, the runtime has stretched from 3 hours to 11 hours, and most mornings the 6am SLA slips to noon. The executive dashboard consumes data at tier-4 (daily) freshness; the marketing team has built a shadow streaming pipeline because it needs tier-2 (under 15-minute) freshness.
Apply the diagnosis-first redesign this section just walked through. Do not migrate everything to a Flink streaming pipeline: that is the wrong instinct, at 20x the cost and nine months of engineering, and it is unjustified. The right diagnosis has two parts: volume outgrew cadence (run more often, not differently), and consumers have different freshness needs (split the paths by tier).
Replace the single nightly batch with two paths that share the same Postgres source: (1) an hourly micro-batch path for the executive dashboard, built with batch tools (plain Spark, PySpark, or dbt) and tagged with slaFreshness < 1h on its warehouse table, and (2) a streaming micro-batch path for the marketing dashboard, built with Spark Structured Streaming or Flink with a 1-minute trigger and tagged with slaFreshness < 15min on its serving store.
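One way to reason about the two-path split is to write it down as data and check the per-run volume. The sketch below is illustrative only: the `Path` dataclass, the `rows_per_run` and `hourly_window` helpers, and the path names are hypothetical and are not part of the DataDriven canvas. It also assumes, for simplicity, that orders arrive uniformly over the day, which real traffic rarely does.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

DAILY_ORDERS = 18_000_000  # volume stated in the prompt


@dataclass(frozen=True)
class Path:
    """One delivery path; field names are hypothetical, not a canvas API."""
    name: str
    consumer: str
    trigger: timedelta        # how often the path runs
    sla_freshness: timedelta  # the slaFreshness tag from the prompt
    sink: str


PATHS = [
    Path("hourly_micro_batch", "executive dashboard",
         trigger=timedelta(hours=1), sla_freshness=timedelta(hours=1),
         sink="warehouse table"),
    Path("streaming_micro_batch", "marketing dashboard",
         trigger=timedelta(minutes=1), sla_freshness=timedelta(minutes=15),
         sink="serving store"),
]


def rows_per_run(path: Path, daily_rows: int = DAILY_ORDERS) -> int:
    """Average rows per run, assuming uniform arrival over 24 hours."""
    runs_per_day = timedelta(days=1) / path.trigger
    return round(daily_rows / runs_per_day)


def hourly_window(run_time: datetime) -> tuple[datetime, datetime]:
    """Half-open [start, end) window covering the last complete hour."""
    end = run_time.replace(minute=0, second=0, microsecond=0)
    return end - timedelta(hours=1), end


if __name__ == "__main__":
    for p in PATHS:
        print(f"{p.name}: ~{rows_per_run(p):,} rows per run -> {p.sink}")
    print(hourly_window(datetime(2024, 1, 1, 6, 5)))
```

Under the uniform-arrival assumption, the hourly path handles roughly 750,000 orders per run instead of 18 million, which is the sense in which the redesign runs "more often, not differently". The half-open window is one common way to make each hourly run incremental without overlapping or skipping rows at the boundaries.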
How This Interview Works
- Read the vague prompt (just like a real interview)
- Ask clarifying questions to the AI interviewer
- Write your pipeline design solution with real code execution
- Get instant feedback and a hire/no-hire decision