How Data Moves: Advanced

The interviewer has moved past definitions and into system design territory. Now they want to see how you handle the ugly parts: 'What happens if this pipeline runs twice?' 'What if the consumer cannot keep up?' 'What about late-arriving data?' These questions test whether you have operated a production pipeline or just read about one.

Idempotent Pipelines

Daily Life
Interviews

Answer the idempotency question with production patterns

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "What happens if this pipeline runs twice?"
  • "How do you make retries safe?"
  • "Define idempotency in the context of data pipelines."

What They Want to Hear

'An idempotent pipeline produces the same result whether it runs once or five times on the same input. I achieve this with MERGE statements that upsert on a primary key, or by replacing entire partitions on each run. This means every retry, every backfill, and every re-run is safe.' This is the answer that shows production experience. Candidates who say 'just make it transactional' are missing the point.
What to Whiteboard
MERGE or REPLACEpost-write check
Read Source Data
Deterministic input
Transform
No side effects, no randomness
Write with MERGE
Upsert: insert new, update existing
Verify
Row count and checksum match

Three Idempotent Patterns

Pick the right one for the scenario
  • MERGE/UPSERT: Insert new rows, update existing ones based on a key. Best for slowly changing data.
  • Partition REPLACE: Delete the entire partition for a date, then insert fresh data. Best for daily batch pipelines.
  • Tombstone + append: Write with a unique run_id, deduplicate on read. Best when you cannot modify existing data.
The Curveball Follow-ups

After your initial answer, expect these probes

  • "MERGE is slow on a billion-row table. What do you do?" Partition REPLACE is faster: delete the partition, write fresh data. Or stage the delta into a temp table and MERGE only against the relevant partition.
  • "What if your pipeline has side effects, like sending emails?" Side effects break idempotency. Move side effects to a separate step with its own dedup logic (check if the email was already sent before sending).
  • "How do you test idempotency?" Run the pipeline twice on the same input and assert the output tables are byte-for-byte identical. If they differ, the pipeline is not idempotent.
KEY TAKEAWAYS
Say: 'Running twice produces the same result as running once. I use MERGE or partition REPLACE.'
This is THE question that tests production experience. Lead with idempotency when asked 'what makes a pipeline production-ready?'
Testing idempotency: run twice on the same input, assert identical output

Backpressure

Daily Life
Interviews

Explain backpressure strategies with confidence

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "What if the consumer cannot keep up?"
  • "How do you prevent overwhelming downstream systems?"
  • "Explain backpressure."

What They Want to Hear

'Backpressure happens when downstream cannot process data as fast as upstream produces it. Without handling it, you either drop data, run out of memory, or queue indefinitely. My approach: buffer short spikes, throttle the producer for sustained overload, and auto-scale the consumer if the infrastructure supports it.' Then name the four strategies.
Backpressure States
producer speeds upconsumer catches upbuffer limit hitshed load or scale
Healthy
Consumer keeps up
Under Pressure
Buffer growing, latency rising
Critical
Buffer full, must act now
StrategyHow It WorksUse When
BufferQueue events until consumer catches upShort spikes, temporary slowdowns
Drop oldestDiscard the oldest events in queueMetrics where only latest matters
Throttle producerSignal upstream to slow downKafka consumer groups, TCP flow control
Scale consumerAdd more consumer instancesSustained high throughput, elastic infra
The Curveball Follow-ups

After your initial answer, expect these probes

  • "You are buffering but the buffer is full. Now what?" Spill to disk as a last resort, or start dropping the oldest events if the use case tolerates it. Never silently drop data without logging and alerting.
  • "How does Kafka handle backpressure?" Consumer groups. Each consumer processes a subset of partitions. If you add more consumers (up to the number of partitions), throughput scales linearly.
  • "What about batch pipelines? Do they have backpressure?" Yes. If a batch job writes faster than the warehouse can handle, you get query timeouts and locks. Throttle the write rate or stage writes in batches.
KEY TAKEAWAYS
Say: 'Buffer short spikes, throttle the producer for sustained overload, scale the consumer if possible.'
Name the four strategies: buffer, drop oldest, throttle, scale
The key insight: every backpressure strategy sacrifices either freshness, completeness, or cost

Late-Arriving Data

Daily Life
Interviews

Handle the late-arriving data question

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "What happens when events arrive out of order?"
  • "How do you handle late data?"
  • "What is a watermark in stream processing?"

What They Want to Hear

'Late data is normal, not exceptional. A mobile device loses connectivity, reconnects, and sends a burst of events from 30 minutes ago. I handle this with watermarks: the system's estimate of how far behind reality the data might be. Events arriving after the watermark go to a side output. A daily batch reconciliation job picks up anything the streaming layer dropped.'
What to Whiteboard
compare timestampspast allowed lateness
Event Arrives
event_time: 14:30
Check Watermark
Current watermark: 15:00
Late?
Yes: 30 minutes past watermark
Side Output
Route to late buffer
The Curveball Follow-ups

After your initial answer, expect these probes

  • "How do you set the watermark?" Based on observed data lateness. If 99% of events arrive within 10 minutes, set the watermark to 10 minutes. Generous watermarks (1 hour) delay results but catch more late data. Tight watermarks (1 minute) give fresh results but drop more.
  • "What about the events that get dropped?" Side output to a late partition. A daily batch reconciliation job merges late events into the canonical table. Nothing is permanently lost.
  • "Does this affect downstream aggregations?" Yes. A count that included late events yesterday might change tomorrow after reconciliation. Communicate to consumers that aggregations are preliminary until the reconciliation window closes.
KEY TAKEAWAYS
Say: 'Watermarks define how late is too late. Events past the watermark go to a side output. A batch reconciliation job catches what streaming dropped.'
The money phrase: 'Late data is normal, not exceptional. I design for it from the start.'
Generous watermarks = accurate but delayed. Tight watermarks = fresh but drops more late events.

Dead Letter Queues

Daily Life
Interviews

Design a DLQ strategy for production pipelines

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "What happens to events that fail processing?"
  • "How do you handle poison messages?"
  • "What is a dead letter queue?"

What They Want to Hear

'A DLQ is where events go when they cannot be processed: malformed JSON, schema violations, unhandled exceptions. Instead of crashing the pipeline or silently dropping the event, I route it to a separate queue with the original payload plus the error metadata. Someone reviews the DLQ and either fixes the root cause or discards the events.'
What to Whiteboard
validinvalid or error
Ingest Event
Process
Transform, validate, enrich
Write to Target
Happy path
Dead Letter Queue
Failed events + error metadata
The Curveball Follow-ups

After your initial answer, expect these probes

  • "What if nobody monitors the DLQ?" Then you have silent data loss with extra steps. DLQs require an operational commitment: alert when depth exceeds a threshold, review weekly, and have a runbook for common failure types.
  • "How do you replay events from the DLQ after fixing the bug?" Read from the DLQ, reprocess through the fixed pipeline, and write to the target. The pipeline must be idempotent so replayed events do not create duplicates.
  • "What is the difference between a DLQ and error logging?" Logs tell you what went wrong. A DLQ preserves the original event so you can reprocess it. You need both.
KEY TAKEAWAYS
Say: 'Route failed events to a DLQ with the original payload and error metadata. Alert on depth.'
DLQ without monitoring = silent data loss. Always mention the operational commitment.
DLQ + idempotent pipeline = safe event replay after bug fixes

Cost of Freshness

Daily Life
Interviews

Answer cost-of-freshness questions with authority

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "How do you think about pipeline cost?"
  • "Is real-time always better?"
  • "Walk me through the cost model of batch vs streaming."

What They Want to Hear

'Real-time costs 3-5x more than batch for the same throughput. The cost is not just infrastructure: it is operational complexity, debugging difficulty, and on-call burden. A streaming pipeline that fails at 3 AM requires someone to fix it at 3 AM. A batch pipeline that fails at 3 AM can wait until morning.' This answer shows you think about total cost of ownership, not just compute bills.
01
3-5x more
Streaming vs batch compute cost
02
24/7 running
Streaming never releases compute
03
10x debug time
Streaming failures are harder to reproduce
The Curveball Follow-ups

After your initial answer, expect these probes

  • "How do you decide if the freshness is worth the cost?" Ask the business: 'If this data is 1 hour old instead of 1 second old, how much revenue do we lose?' If nobody can quantify it, batch is the right choice.
  • "What is the biggest cost lever?" Whether compute runs continuously or on-demand. A daily batch job uses 1 hour of compute per day. Streaming uses 24 hours. Everything else is rounding error.
  • "How do you reduce streaming costs?" Micro-batch instead of true streaming (spin up and shut down every 5 minutes). Auto-scaling consumer groups that scale to zero during low-traffic windows.
Do
  • Quantify the business value of freshness before building streaming
  • Start batch, add streaming only when the business case is proven
  • Track cost per pipeline-hour as a metric
Don't
  • Build streaming because it is technically interesting
  • Assume real-time is always better
  • Ignore on-call burden when comparing architectures
KEY TAKEAWAYS
Say: 'Streaming costs 3-5x more and 10x more debugging time. The freshness must justify that.'
The power move: 'My first question would be: what revenue impact does stale data have?'
Total cost = compute + storage + operations + on-call + debugging time

Handle the depth probes: idempotency, backpressure, and cost

Category
Pipeline Architecture
Difficulty
advanced
Duration
30 minutes
Challenges
0 hands-on challenges

Topics covered: Idempotent Pipelines, Backpressure, Late-Arriving Data, Dead Letter Queues, Cost of Freshness

Lesson Sections

  1. Idempotent Pipelines (concepts: paBatchProcessing)

    What They Want to Hear 'An idempotent pipeline produces the same result whether it runs once or five times on the same input. I achieve this with MERGE statements that upsert on a primary key, or by replacing entire partitions on each run. This means every retry, every backfill, and every re-run is safe.' This is the answer that shows production experience. Candidates who say 'just make it transactional' are missing the point. Three Idempotent Patterns

  2. Backpressure (concepts: paStreamProcessing)

    What They Want to Hear 'Backpressure happens when downstream cannot process data as fast as upstream produces it. Without handling it, you either drop data, run out of memory, or queue indefinitely. My approach: buffer short spikes, throttle the producer for sustained overload, and auto-scale the consumer if the infrastructure supports it.' Then name the four strategies.

  3. Late-Arriving Data (concepts: paStreamProcessing)

    What They Want to Hear 'Late data is normal, not exceptional. A mobile device loses connectivity, reconnects, and sends a burst of events from 30 minutes ago. I handle this with watermarks: the system's estimate of how far behind reality the data might be. Events arriving after the watermark go to a side output. A daily batch reconciliation job picks up anything the streaming layer dropped.'

  4. Dead Letter Queues (concepts: paApiIngestion)

    What They Want to Hear 'A DLQ is where events go when they cannot be processed: malformed JSON, schema violations, unhandled exceptions. Instead of crashing the pipeline or silently dropping the event, I route it to a separate queue with the original payload plus the error metadata. Someone reviews the DLQ and either fixes the root cause or discards the events.'

  5. Cost of Freshness (concepts: paBatchVsStreaming)

    What They Want to Hear 'Real-time costs 3-5x more than batch for the same throughput. The cost is not just infrastructure: it is operational complexity, debugging difficulty, and on-call burden. A streaming pipeline that fails at 3 AM requires someone to fix it at 3 AM. A batch pipeline that fails at 3 AM can wait until morning.' This answer shows you think about total cost of ownership, not just compute bills.