A medium Pipeline Design interview practice problem on DataDriven. Write and execute real pipeline design code with instant grading.
- Domain: Pipeline Design
- Difficulty: medium
Problem
A nightly batch pipeline on the canvas reads 18 million orders per day from Postgres, joins them with a product dimension, and writes a fact_daily_orders table. After a volume increase, the runtime has stretched from 3 hours to 11 hours, and most mornings the 6am SLA slips to noon. The executive dashboard consumes data at tier-4 (daily) freshness; the marketing team has built a shadow streaming pipeline because it needs tier-2 (under 15-minute) freshness.

Apply the diagnosis-first redesign this section just walked through. Do not migrate everything to a Flink streaming pipeline: that is the wrong instinct, at 20x the cost and a nine-month engineering effort, none of it justified. The right diagnosis has two parts: volume outgrew cadence (run more often, not differently), and consumers have different freshness needs (split the paths by tier).

Replace the single nightly batch with two paths:

(1) an hourly micro-batch path for the executive dashboard, using batch tools (plain Spark, PySpark, or dbt), tagged with slaFreshness < 1h on its warehouse table;
(2) a streaming micro-batch path for the marketing dashboard, using Spark Structured Streaming or Flink with a 1-minute trigger, tagged with slaFreshness < 15min on its serving store.

Both paths share the same Postgres source.
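To make the "run more often, not differently" diagnosis concrete, the sketch below estimates how much work each run does at the two cadences. It is a back-of-the-envelope model, not part of the graded solution. It rests on two assumptions that the problem does not state: orders arrive evenly across the day, and runtime scales linearly with row count (fixed startup cost and the dimension join are ignored). The `Path` class and the helper names are hypothetical, and the engine for each path is just one of the options the problem lists.

```python
from dataclasses import dataclass

DAILY_ORDERS = 18_000_000      # from the problem statement
NIGHTLY_RUNTIME_HOURS = 11     # current runtime of the single nightly run


@dataclass(frozen=True)
class Path:
    """Hypothetical description of one redesigned path."""
    name: str
    engine: str
    trigger_minutes: int       # how often the path runs
    sink: str
    sla_freshness: str         # value of the slaFreshness tag on the sink


def rows_per_run(path: Path) -> float:
    # ASSUMPTION: orders arrive evenly across the day.
    runs_per_day = 24 * 60 / path.trigger_minutes
    return DAILY_ORDERS / runs_per_day


def estimated_run_minutes(path: Path) -> float:
    # ASSUMPTION: runtime scales linearly with rows, at the throughput
    # observed for the nightly run (18M rows in 11 hours).
    rows_per_minute = DAILY_ORDERS / (NIGHTLY_RUNTIME_HOURS * 60)
    return rows_per_run(path) / rows_per_minute


PATHS = [
    Path("executive_dashboard", "PySpark batch", 60,
         "warehouse table", "< 1h"),
    Path("marketing_dashboard", "Spark Structured Streaming", 1,
         "serving store", "< 15min"),
]

if __name__ == "__main__":
    for p in PATHS:
        print(
            f"{p.name}: {rows_per_run(p):,.0f} rows per run, "
            f"about {estimated_run_minutes(p):.2f} min per run, "
            f"slaFreshness {p.sla_freshness}"
        )
```

Under these assumptions the hourly path handles about 750,000 rows per run in roughly 27.5 minutes, and the 1-minute path handles about 12,500 rows per micro-batch in roughly 27.5 seconds. In both cases each run uses 11/24 of its interval, so total work is unchanged; only the staleness of the output shrinks. Real runs carry fixed overhead, so treat the numbers as a lower bound to validate against measurements.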