A growth-stage observability company ingests 50 billion log events per day from a Kafka topic
A medium Pipeline Design mock interview question on DataDriven. Practice with AI-powered feedback, real code execution, and a hire/no-hire decision.
- Domain: Pipeline Design
- Difficulty: medium
Interview Prompt
A growth-stage observability company ingests 50 billion log events per day from a Kafka topic. The canvas has the source and the three consumers, each with a distinct freshness tier:
- incident pager (tier 1, sub-30-second)
- live ops dashboard (tier 2, rolling 5-minute aggregates)
- billing capacity report (tier 4, daily)
Apply the entire intermediate tier:
- (i-s0) name which axis constrains each branch (latency for paging, throughput for billing)
- (i-s1) use micro-batch on the dashboard path with a 1-minute trigger
- (i-s2) keep streaming only where the latency has dollar value (the pager)
- (i-s3) stateful transforms on the streaming branches require a state store
- (i-s4) split paths by tier rather than imposing one rhythm on three different consumers
Carve the single Kafka stream into three branches with the right engine per branch:
- Flink with a state store (RocksDB or S3 checkpoint) for the paging branch, tagged real-time or < 1min
- Spark Structured Streaming for the dashboard branch, tagged < 15min
- plain Spark, PySpark, or dbt nightly batch for the billing branch, tagged < 24h
Add a shared bronze raw layer in object storage (S3, GCS, or ADLS) so Kafka has one outgoing edge and all three branches read from bronze, not from Kafka directly.
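One way to reason about the split is to write the tier-to-engine mapping down as data. The following is a minimal Python sketch, not a reference solution. The `Branch` dataclass, the `plan_branch` helper, and the exact cut-offs (under 60 seconds for streaming, up to 15 minutes for micro-batch) are illustrative assumptions taken from the tags in the prompt. The prompt names the constraining axis only for paging and billing, so treating the dashboard as latency-bound is also an assumption. The sketch also converts the daily volume into an average event rate, which is the number the throughput-bound branch has to sustain.

```python
from dataclasses import dataclass

EVENTS_PER_DAY = 50_000_000_000  # volume stated in the prompt
SECONDS_PER_DAY = 24 * 60 * 60

# Average ingest rate; real traffic will peak above this.
AVG_EVENTS_PER_SECOND = EVENTS_PER_DAY / SECONDS_PER_DAY


@dataclass(frozen=True)
class Branch:
    """Hypothetical record describing one branch of the pipeline."""
    consumer: str
    freshness_seconds: int
    constraint_axis: str   # "latency" or "throughput"
    mode: str              # "streaming", "micro-batch", or "batch"
    engine: str
    needs_state_store: bool


def plan_branch(consumer: str, freshness_seconds: int) -> Branch:
    """Pick a mode and engine from the freshness target.

    The thresholds mirror the tags in the prompt (< 1min, < 15min, < 24h)
    and are assumptions for illustration only.
    """
    if freshness_seconds < 60:
        # Latency has dollar value here, so this branch stays streaming.
        return Branch(consumer, freshness_seconds, "latency",
                      "streaming", "Flink", True)
    if freshness_seconds <= 15 * 60:
        # Rolling aggregates are stateful, so a state store is assumed.
        return Branch(consumer, freshness_seconds, "latency",
                      "micro-batch", "Spark Structured Streaming", True)
    # Daily reporting is bound by volume, not by delay.
    return Branch(consumer, freshness_seconds, "throughput",
                  "batch", "Spark or dbt nightly batch", False)


PLAN = [
    plan_branch("incident pager", 30),
    plan_branch("live ops dashboard", 5 * 60),
    plan_branch("billing capacity report", SECONDS_PER_DAY),
]

if __name__ == "__main__":
    print(f"average rate: {AVG_EVENTS_PER_SECOND:,.0f} events/s")
    for branch in PLAN:
        print(f"{branch.consumer}: {branch.mode} on {branch.engine} "
              f"({branch.constraint_axis}-bound)")
```

Writing the plan as data makes the (i-s4) point concrete: each consumer gets its own rhythm, and only the pager pays for a continuously running streaming job.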
How This Interview Works
- Read the vague prompt (just like a real interview)
- Ask clarifying questions to the AI interviewer
- Write your pipeline design solution with real code execution
- Get instant feedback and a hire/no-hire decision