How Data Moves: Advanced
The interviewer has moved past definitions and into system design territory. Now they want to see how you handle the ugly parts: 'What happens if this pipeline runs twice?' 'What if the consumer cannot keep up?' 'What about late-arriving data?' These questions test whether you have operated a production pipeline or just read about one.
Idempotent Pipelines
Answer the idempotency question with production patterns
When you hear these in an interview, this is the concept being tested
- "What happens if this pipeline runs twice?"
- "How do you make retries safe?"
- "Define idempotency in the context of data pipelines."
What They Want to Hear
Three Idempotent Patterns
- MERGE/UPSERT: Insert new rows, update existing ones based on a key. Best for slowly changing data.
- Partition REPLACE: Delete the entire partition for a date, then insert fresh data. Best for daily batch pipelines.
- Tombstone + append: Append every run's rows tagged with a unique run_id, then deduplicate on read by keeping only the latest run per key. Best when you cannot modify existing data.
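The first two patterns can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a warehouse implementation: the `customers` and `daily_sales` tables, their columns, and both helper functions are hypothetical, and SQLite's `INSERT ... ON CONFLICT` stands in for a warehouse MERGE statement.

```python
import sqlite3


def upsert_customers(conn, rows):
    """MERGE/UPSERT: insert new keys, update rows whose key already exists."""
    with conn:  # one transaction: commits on success, rolls back on error
        conn.executemany(
            """
            INSERT INTO customers (customer_id, email)
            VALUES (?, ?)
            ON CONFLICT(customer_id) DO UPDATE SET email = excluded.email
            """,
            rows,
        )


def replace_partition(conn, event_date, rows):
    """Partition REPLACE: delete one date's rows and reload them atomically."""
    with conn:
        conn.execute("DELETE FROM daily_sales WHERE event_date = ?", (event_date,))
        conn.executemany(
            "INSERT INTO daily_sales (event_date, sku, amount) VALUES (?, ?, ?)",
            [(event_date, sku, amount) for sku, amount in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE TABLE daily_sales (event_date TEXT, sku TEXT, amount REAL)")

# Simulate a retry by running the same load twice.
for _ in range(2):
    upsert_customers(conn, [(1, "a@example.com"), (2, "b@example.com")])
    replace_partition(conn, "2024-01-01", [("sku-1", 10.0), ("sku-2", 5.5)])

customer_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
sales_count = conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
```

Both counts stay at 2 after the second run. Wrapping the delete and the insert in one transaction matters: a failure between the two steps would otherwise leave the partition empty.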
After your initial answer, expect these probes
- "MERGE is slow on a billion-row table. What do you do?" Partition REPLACE is faster: delete the partition, write fresh data. Or stage the delta into a temp table and MERGE only against the relevant partition.
- "What if your pipeline has side effects, like sending emails?" Side effects break idempotency. Move side effects to a separate step with its own dedup logic (check if the email was already sent before sending).
- "How do you test idempotency?" Run the pipeline twice on the same input and assert the output tables are identical: same rows and same counts, ignoring audit columns such as load timestamps. If they differ, the pipeline is not idempotent.
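The side-effect probe can be illustrated with a small dedup ledger. This is a sketch under assumptions: the `sent_emails` table, the `send_once` helper, and the key format are all hypothetical, and the `send` callable stands in for a real email client.

```python
import sqlite3


def send_once(conn, dedup_key, send):
    """Call send() only if dedup_key has not been recorded before.

    The key is recorded in the same transaction that wraps send(), so a
    failed send rolls the key back and a retry will try again.
    """
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO sent_emails (dedup_key) VALUES (?)",
            (dedup_key,),
        )
        if cur.rowcount == 0:  # key already present: this is a re-run
            return False
        send()
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sent_emails (dedup_key TEXT PRIMARY KEY)")

outbox = []  # stands in for the email provider
results = []
for _ in range(2):  # the pipeline runs twice
    results.append(
        send_once(conn, "order-42:receipt", lambda: outbox.append("receipt for order 42"))
    )
```

The second run finds the key and skips the send. One caveat: if the send succeeds but the commit fails, a retry can still send twice, so this narrows the duplicate window rather than eliminating it.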
Backpressure
Explain backpressure strategies with confidence
When you hear these in an interview, this is the concept being tested
- "What if the consumer cannot keep up?"
- "How do you prevent overwhelming downstream systems?"
- "Explain backpressure."
What They Want to Hear
| Strategy | How It Works | Use When |
|---|---|---|
| Buffer | Queue events until consumer catches up | Short spikes, temporary slowdowns |
| Drop oldest | Discard the oldest events in queue | Metrics where only latest matters |
| Throttle producer | Signal upstream to slow down | Sustained overload where the producer can slow down (TCP flow control, blocking on a bounded queue) |
| Scale consumer | Add more consumer instances | Sustained high throughput, elastic infra |
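Three of the four strategies in the table can be shown in-process with the Python standard library. This is a toy sketch: the queue sizes, the timeout, and the `try_enqueue` helper are illustrative choices, and a real pipeline would apply the same ideas at the broker or network layer.

```python
import queue
from collections import deque

# Buffer + throttle: a bounded queue absorbs short spikes. Once it is full,
# put() blocks the producer, which is how the slowdown propagates upstream.
buffer = queue.Queue(maxsize=3)


def try_enqueue(q, event, timeout=0.01):
    """Wait briefly for space; report failure instead of blocking forever."""
    try:
        q.put(event, timeout=timeout)
        return True
    except queue.Full:
        return False


# No consumer is draining the queue, so only the first three events fit.
accepted = [try_enqueue(buffer, n) for n in range(5)]

# Drop oldest: a deque with maxlen evicts from the left on append.
latest_metrics = deque(maxlen=3)
dropped = 0
for n in range(5):
    if len(latest_metrics) == latest_metrics.maxlen:
        dropped += 1  # count every drop so it can be logged and alerted on
    latest_metrics.append(n)
```

Here `accepted` ends as three `True` values followed by two `False`, and `latest_metrics` keeps only the newest three readings. Scaling the consumer, the fourth strategy, is an infrastructure decision rather than something a single process can demonstrate.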
After your initial answer, expect these probes
- "You are buffering but the buffer is full. Now what?" Spill to disk as a last resort, or start dropping the oldest events if the use case tolerates it. Never silently drop data without logging and alerting.
- "How does Kafka handle backpressure?" Kafka consumers pull at their own pace, so a slow consumer falls behind (consumer lag grows) rather than being overwhelmed, and the log acts as the buffer until retention expires. To catch up, scale the consumer group: each consumer processes a subset of partitions, so adding consumers (up to the number of partitions) raises throughput roughly linearly.
- "What about batch pipelines? Do they have backpressure?" Yes. If a batch job writes faster than the warehouse can handle, you get query timeouts and locks. Throttle the write rate or stage writes in batches.
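The "up to the number of partitions" limit is easier to see with a toy assignment function. This is a simplified round-robin sketch, not Kafka's actual assignor; `assign_partitions` and the consumer names are hypothetical.

```python
def assign_partitions(num_partitions, consumers):
    """Spread partitions over consumers round-robin (a simplification)."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Four partitions, two consumers: each consumer owns two partitions.
two_consumers = assign_partitions(4, ["c1", "c2"])

# Four partitions, six consumers: the extra two have nothing to read.
six_consumers = assign_partitions(4, ["c1", "c2", "c3", "c4", "c5", "c6"])
idle = [consumer for consumer, parts in six_consumers.items() if not parts]
```

With six consumers on four partitions, `c5` and `c6` sit idle, which is why scaling past the partition count requires repartitioning the topic first.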
Late-Arriving Data
Handle the late-arriving data question
When you hear these in an interview, this is the concept being tested
- "What happens when events arrive out of order?"
- "How do you handle late data?"
- "What is a watermark in stream processing?"
What They Want to Hear
After your initial answer, expect these probes
- "How do you set the watermark?" Base it on observed lateness. If 99% of events arrive within 10 minutes of their event time, allow 10 minutes of lateness before the watermark closes a window. A generous allowance (1 hour) delays results but catches more late data; a tight one (1 minute) gives fresher results but drops more.
- "What about the events that get dropped?" Route them to a side output that writes to a late partition. A daily batch reconciliation job merges late events into the canonical table, so nothing is permanently lost.
- "Does this affect downstream aggregations?" Yes. A count reported yesterday might change tomorrow once reconciliation merges the late events. Communicate to consumers that aggregations are preliminary until the reconciliation window closes.
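A minimal sketch of the watermark logic from these answers, assuming a single in-process stream. The `percentile_delay` helper, the `WatermarkRouter` class, and the sample delays are hypothetical; stream processors such as Flink or Spark provide their own watermark mechanisms with different APIs.

```python
import math
from datetime import datetime, timedelta


def percentile_delay(delays, pct):
    """Nearest-rank percentile of observed delays (event time to arrival)."""
    ordered = sorted(delays)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


class WatermarkRouter:
    """Route events behind the watermark to a side output."""

    def __init__(self, allowed_delay):
        self.allowed_delay = allowed_delay
        self.max_event_time = None
        self.on_time = []
        self.late = []  # side output, picked up later by reconciliation

    def watermark(self):
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.allowed_delay

    def process(self, event_time, payload):
        wm = self.watermark()
        if wm is not None and event_time < wm:
            self.late.append(payload)
            return
        self.on_time.append(payload)
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time


observed = [timedelta(minutes=m) for m in (1, 2, 2, 3, 10)]
allowed = percentile_delay(observed, 99)  # 10 minutes for this sample

router = WatermarkRouter(allowed)
t0 = datetime(2024, 1, 1, 12, 0)
router.process(t0, "a")
router.process(t0 + timedelta(minutes=30), "b")  # watermark moves to 12:20
router.process(t0 + timedelta(minutes=25), "c")  # out of order but within delay
router.process(t0 + timedelta(minutes=5), "d")   # behind the watermark: late
```

Event `c` arrives out of order yet still counts, because it is ahead of the watermark. Event `d` lands in `router.late`, where a reconciliation job would pick it up.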
Dead Letter Queues
Design a DLQ strategy for production pipelines
When you hear these in an interview, this is the concept being tested
- "What happens to events that fail processing?"
- "How do you handle poison messages?"
- "What is a dead letter queue?"
What They Want to Hear
After your initial answer, expect these probes
- "What if nobody monitors the DLQ?" Then you have silent data loss with extra steps. DLQs require an operational commitment: alert when depth exceeds a threshold, review weekly, and have a runbook for common failure types.
- "How do you replay events from the DLQ after fixing the bug?" Read from the DLQ, reprocess through the fixed pipeline, and write to the target. The pipeline must be idempotent so replayed events do not create duplicates.
- "What is the difference between a DLQ and error logging?" Logs tell you what went wrong. A DLQ preserves the original event so you can reprocess it. You need both.
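The routing and replay steps can be sketched with plain lists standing in for the target and the DLQ. Everything here is illustrative: the `run` loop, both parsers, the event shapes, and the "fix" (defaulting a missing amount to 0.0) are hypothetical.

```python
import json
from datetime import datetime, timezone


def parse_order(raw):
    """Original handler: fails on malformed JSON or a missing field."""
    order = json.loads(raw)
    return {"order_id": int(order["order_id"]), "amount": float(order["amount"])}


def parse_order_fixed(raw):
    """Hypothetical fix: tolerate a missing amount."""
    order = json.loads(raw)
    return {"order_id": int(order["order_id"]), "amount": float(order.get("amount", 0.0))}


def run(raw_events, handler, sink, dlq):
    """Process each event; on failure keep the original payload plus error metadata."""
    for raw in raw_events:
        try:
            sink.append(handler(raw))
        except Exception as exc:  # broad on purpose: the pipeline must keep running
            dlq.append({
                "payload": raw,
                "error_type": type(exc).__name__,
                "error": str(exc),
                "failed_at": datetime.now(timezone.utc).isoformat(),
            })


events = ['{"order_id": 1, "amount": "9.50"}', "not json", '{"order_id": 2}']
sink, dlq = [], []
run(events, parse_order, sink, dlq)
first_pass_errors = [entry["error_type"] for entry in dlq]

# Replay after the fix: feed the preserved payloads through the new handler.
still_failing = []
run([entry["payload"] for entry in dlq], parse_order_fixed, sink, still_failing)
```

After the replay, order 2 reaches the sink and only the malformed payload remains in `still_failing`. The list-backed sink is not idempotent; a real target would need an upsert or dedup key so that replaying the same payload twice is harmless.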
Cost of Freshness
Answer cost-of-freshness questions with authority
When you hear these in an interview, this is the concept being tested
- "How do you think about pipeline cost?"
- "Is real-time always better?"
- "Walk me through the cost model of batch vs streaming."
What They Want to Hear
After your initial answer, expect these probes
- "How do you decide if the freshness is worth the cost?" Ask the business: 'If this data is 1 hour old instead of 1 second old, how much revenue do we lose?' If nobody can quantify it, batch is the right choice.
- "What is the biggest cost lever?" Whether compute runs continuously or on demand. A daily batch job that finishes in 1 hour pays for 1 hour of compute per day; an always-on streaming job pays for 24. Most other costs are small by comparison.
- "How do you reduce streaming costs?" Use micro-batches instead of true streaming (spin up and shut down every 5 minutes), and auto-scale consumer groups so they scale to zero during low-traffic windows.
Do
- Quantify the business value of freshness before building streaming
- Start batch, add streaming only when the business case is proven
- Track cost per pipeline-hour as a metric
Don't
- Build streaming because it is technically interesting
- Assume real-time is always better
- Ignore on-call burden when comparing architectures
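The compute-hours arithmetic behind the "biggest cost lever" answer can be made explicit. The numbers are assumptions for illustration only: a 60-minute daily batch, an always-on streaming job, a micro-batch that runs 1 minute out of every 5, and a hypothetical rate of 2.00 per compute hour. Operational costs such as on-call are not modelled.

```python
def daily_compute_hours(runs_per_day, minutes_per_run):
    """Compute hours billed per day for a job that only pays while running."""
    return runs_per_day * minutes_per_run / 60


RATE_PER_HOUR = 2.00  # hypothetical price per compute hour

batch_hours = daily_compute_hours(runs_per_day=1, minutes_per_run=60)
streaming_hours = daily_compute_hours(runs_per_day=1, minutes_per_run=24 * 60)
# Micro-batch every 5 minutes, assuming each run takes 1 minute of compute.
micro_batch_hours = daily_compute_hours(runs_per_day=24 * 60 // 5, minutes_per_run=1)

daily_cost = {
    "batch": batch_hours * RATE_PER_HOUR,
    "micro_batch": micro_batch_hours * RATE_PER_HOUR,
    "streaming": streaming_hours * RATE_PER_HOUR,
}
```

Under these assumptions batch bills 1 hour a day, micro-batch 4.8 hours, and streaming 24 hours. The ratio between them is what to bring to the freshness conversation, not the absolute prices.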
Handle the depth probes: idempotency, backpressure, and cost
- Category: Pipeline Architecture
- Difficulty: advanced
- Duration: 30 minutes
- Challenges: 0 hands-on challenges
Topics covered: Idempotent Pipelines, Backpressure, Late-Arriving Data, Dead Letter Queues, Cost of Freshness
Lesson Sections
- Idempotent Pipelines (concepts: paBatchProcessing)
What They Want to Hear: 'An idempotent pipeline produces the same result whether it runs once or five times on the same input. I achieve this with MERGE statements that upsert on a primary key, or by replacing entire partitions on each run. This means every retry, every backfill, and every re-run is safe.' This is the answer that shows production experience. Candidates who say 'just make it transactional' are missing the point: a transaction makes a single run atomic, but it does not stop a second run from inserting the same rows again.
- Backpressure (concepts: paStreamProcessing)
What They Want to Hear: 'Backpressure happens when downstream cannot process data as fast as upstream produces it. Without handling it, you either drop data, run out of memory, or queue indefinitely. My approach: buffer short spikes, throttle the producer for sustained overload, and auto-scale the consumer if the infrastructure supports it.' Then name the four strategies.
- Late-Arriving Data (concepts: paStreamProcessing)
What They Want to Hear: 'Late data is normal, not exceptional. A mobile device loses connectivity, reconnects, and sends a burst of events from 30 minutes ago. I handle this with watermarks: the system's estimate of how far event time has progressed, meaning it expects no more events older than that point. Events that arrive with timestamps behind the watermark go to a side output. A daily batch reconciliation job picks up anything the streaming layer dropped.'
- Dead Letter Queues (concepts: paApiIngestion)
What They Want to Hear: 'A DLQ is where events go when they cannot be processed: malformed JSON, schema violations, unhandled exceptions. Instead of crashing the pipeline or silently dropping the event, I route it to a separate queue with the original payload plus the error metadata. Someone reviews the DLQ and either fixes the root cause or discards the events.'
- Cost of Freshness (concepts: paBatchVsStreaming)
What They Want to Hear: 'As a rough rule of thumb, real-time costs 3-5x more than batch for the same throughput. The cost is not just infrastructure: it is operational complexity, debugging difficulty, and on-call burden. A streaming pipeline that fails at 3 AM requires someone to fix it at 3 AM. A batch pipeline that fails at 3 AM can wait until morning.' This answer shows you think about total cost of ownership, not just compute bills.