DataDriven

A growth-stage observability company ingests 50 billion log events per day from a Kafka topic

A medium Pipeline Design interview practice problem on DataDriven. Write and execute real pipeline design code with instant grading.

Domain: Pipeline Design
Difficulty: medium

Problem

A growth-stage observability company ingests 50 billion log events per day from a Kafka topic. The canvas has the source and the three consumers, each with a distinct freshness tier:

  • incident pager (tier 1, sub-30-second)
  • live ops dashboard (tier 2, rolling 5-minute aggregates)
  • billing capacity report (tier 4, daily)

Apply the entire intermediate tier:

  • (i-s0) Name which axis constrains each branch (latency for paging, throughput for billing).
  • (i-s1) Use micro-batch on the dashboard path with a 1-minute trigger.
  • (i-s2) Keep streaming only where the latency has dollar value (the pager).
  • (i-s3) Stateful transforms on the streaming branches require a state store.
  • (i-s4) Split paths by tier rather than imposing one rhythm on three different consumers.

Carve the single Kafka stream into three branches with the right engine per branch:

  • Paging branch, tagged real-time or < 1min: Flink with a state store (RocksDB or S3 checkpoint).
  • Dashboard branch, tagged < 15min: Spark Structured Streaming.
  • Billing branch, tagged < 24h: plain Spark, PySpark, or dbt nightly batch.

Add a shared bronze raw layer in object storage (S3, GCS, or ADLS) so that Kafka has one outgoing edge and all three branches read from bronze, not from Kafka directly.
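The target topology can be sketched as plain data before it is drawn on the canvas. The Python below is only an illustration under stated assumptions: the node names (`kafka`, `bronze`, `pager`, `dashboard`, `billing`), the `Branch` dataclass, and the helper functions are hypothetical and are not part of DataDriven's grader or of any engine's API. It converts the stated daily volume into an average event rate and checks that Kafka has exactly one outgoing edge while all three branches hang off bronze.

```python
from dataclasses import dataclass

EVENTS_PER_DAY = 50_000_000_000  # stated in the problem
SECONDS_PER_DAY = 24 * 60 * 60


def average_events_per_second(events_per_day: int = EVENTS_PER_DAY) -> int:
    """Average ingest rate, rounded; peak traffic will be higher than this."""
    return round(events_per_day / SECONDS_PER_DAY)


@dataclass(frozen=True)
class Branch:
    node: str    # hypothetical node name, chosen for this sketch
    tier: int    # freshness tier from the problem statement
    mode: str    # "streaming", "micro-batch", or "batch"
    engine: str  # engine named in the problem for this branch
    tag: str     # freshness tag named in the problem


BRANCHES = [
    Branch("pager", 1, "streaming", "Flink + state store", "< 1min"),
    Branch("dashboard", 2, "micro-batch", "Spark Structured Streaming", "< 15min"),
    Branch("billing", 4, "batch", "Spark / PySpark / dbt nightly", "< 24h"),
]

# Kafka writes once to bronze; every branch reads from bronze, not from Kafka.
EDGES = [("kafka", "bronze")] + [("bronze", b.node) for b in BRANCHES]


def out_degree(node: str, edges=EDGES) -> int:
    """Count the outgoing edges of a node in the topology."""
    return sum(1 for src, _ in edges if src == node)


if __name__ == "__main__":
    print(f"average rate: {average_events_per_second():,} events/s")
    print(f"kafka outgoing edges: {out_degree('kafka')}")
    for b in BRANCHES:
        print(f"tier {b.tier}: {b.node} -> {b.mode} on {b.engine} ({b.tag})")
```

At roughly 579 thousand events per second on average, only the pager branch pays for always-on streaming; the dashboard and billing branches trade latency for cheaper micro-batch and batch execution, which is the point of splitting paths by tier.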

