How Data Moves: Advanced

The interviewer has moved past definitions and into system design territory. Now they want to see how you handle the ugly parts: 'What happens if this pipeline runs twice?' 'What if the consumer cannot keep up?' 'What about late-arriving data?' These questions test whether you have operated a production pipeline or just read about one.

What you will be able to do

Answer 'What if it runs twice?' with the idempotency pattern

Explain backpressure handling without blanking

Design a late-data strategy that does not corrupt downstream tables

Idempotent Pipelines

Daily Life

Interviews

Answer the idempotency question with production patterns

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"What happens if this pipeline runs twice?"
▸"How do you make retries safe?"
▸"Define idempotency in the context of data pipelines."

What They Want to Hear

'An idempotent pipeline produces the same result whether it runs once or five times on the same input. I achieve this with MERGE statements that upsert on a primary key, or by replacing entire partitions on each run. This means every retry, every backfill, and every re-run is safe.' This is the answer that shows production experience. Candidates who say 'just make it transactional' are missing the point.

Source

Read Source Data

Transform

Write with MERGE

Consumer

Verify

What to Whiteboard

Three Idempotent Patterns

Pick the right one for the scenario

MERGE/UPSERT: Insert new rows, update existing ones based on a key. Best for slowly changing data.
Partition REPLACE: Delete the entire partition for a date, then insert fresh data. Best for daily batch pipelines.
Tombstone + append: Write with a unique run_id, deduplicate on read. Best when you cannot modify existing data.

The Curveball Follow-ups

After your initial answer, expect these probes

▸"MERGE is slow on a billion-row table. What do you do?" Partition REPLACE is faster: delete the partition, write fresh data. Or stage the delta into a temp table and MERGE only against the relevant partition.
▸"What if your pipeline has side effects, like sending emails?" Side effects break idempotency. Move side effects to a separate step with its own dedup logic (check if the email was already sent before sending).
▸"How do you test idempotency?" Run the pipeline twice on the same input and assert the output tables are byte-for-byte identical. If they differ, the pipeline is not idempotent.

KEY TAKEAWAYS

Say: 'Running twice produces the same result as running once. I use MERGE or partition REPLACE.'

This is THE question that tests production experience. Lead with idempotency when asked 'what makes a pipeline production-ready?'

Testing idempotency: run twice on the same input, assert identical output

Backpressure

Daily Life

Interviews

Explain backpressure strategies with confidence

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"What if the consumer cannot keep up?"
▸"How do you prevent overwhelming downstream systems?"
▸"Explain backpressure."

What They Want to Hear

'Backpressure happens when downstream cannot process data as fast as upstream produces it. Without handling it, you either drop data, run out of memory, or queue indefinitely. My approach: buffer short spikes, throttle the producer for sustained overload, and auto-scale the consumer if the infrastructure supports it.' Then name the four strategies.

Transform

Healthy

Transform

Under Pressure

Transform

Critical

Backpressure States

Strategy	How It Works	Use When
Buffer	Queue events until consumer catches up	Short spikes, temporary slowdowns
Drop oldest	Discard the oldest events in queue	Metrics where only latest matters
Throttle producer	Signal upstream to slow down	Kafka consumer groups, TCP flow control
Scale consumer	Add more consumer instances	Sustained high throughput, elastic infra

The Curveball Follow-ups

After your initial answer, expect these probes

▸"You are buffering but the buffer is full. Now what?" Spill to disk as a last resort, or start dropping the oldest events if the use case tolerates it. Never silently drop data without logging and alerting.
▸"How does Kafka handle backpressure?" Consumer groups. Each consumer processes a subset of partitions. If you add more consumers (up to the number of partitions), throughput scales linearly.
▸"What about batch pipelines? Do they have backpressure?" Yes. If a batch job writes faster than the warehouse can handle, you get query timeouts and locks. Throttle the write rate or stage writes in batches.

KEY TAKEAWAYS

Say: 'Buffer short spikes, throttle the producer for sustained overload, scale the consumer if possible.'

Name the four strategies: buffer, drop oldest, throttle, scale

The key insight: every backpressure strategy sacrifices either freshness, completeness, or cost

Late-Arriving Data

Daily Life

Interviews

Handle the late-arriving data question

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"What happens when events arrive out of order?"
▸"How do you handle late data?"
▸"What is a watermark in stream processing?"

What They Want to Hear

'Late data is normal, not exceptional. A mobile device loses connectivity, reconnects, and sends a burst of events from 30 minutes ago. I handle this with watermarks: the system's estimate of how far behind reality the data might be. Events arriving after the watermark go to a side output. A daily batch reconciliation job picks up anything the streaming layer dropped.'

Source

Event Arrives

Quality

Check Watermark

Quality

Late?

Consumer

Side Output

What to Whiteboard

The Curveball Follow-ups

After your initial answer, expect these probes

▸"How do you set the watermark?" Based on observed data lateness. If 99% of events arrive within 10 minutes, set the watermark to 10 minutes. Generous watermarks (1 hour) delay results but catch more late data. Tight watermarks (1 minute) give fresh results but drop more.
▸"What about the events that get dropped?" Side output to a late partition. A daily batch reconciliation job merges late events into the canonical table. Nothing is permanently lost.
▸"Does this affect downstream aggregations?" Yes. A count that included late events yesterday might change tomorrow after reconciliation. Communicate to consumers that aggregations are preliminary until the reconciliation window closes.

KEY TAKEAWAYS

Say: 'Watermarks define how late is too late. Events past the watermark go to a side output. A batch reconciliation job catches what streaming dropped.'

The money phrase: 'Late data is normal, not exceptional. I design for it from the start.'

Generous watermarks = accurate but delayed. Tight watermarks = fresh but drops more late events.

Dead Letter Queues

Daily Life

Interviews

Design a DLQ strategy for production pipelines

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"What happens to events that fail processing?"
▸"How do you handle poison messages?"
▸"What is a dead letter queue?"

What They Want to Hear

'A DLQ is where events go when they cannot be processed: malformed JSON, schema violations, unhandled exceptions. Instead of crashing the pipeline or silently dropping the event, I route it to a separate queue with the original payload plus the error metadata. Someone reviews the DLQ and either fixes the root cause or discards the events.'

Source

Ingest Event

Transform

Process

Consumer

Write to Target

Queue

Dead Letter Queue

What to Whiteboard

The Curveball Follow-ups

After your initial answer, expect these probes

▸"What if nobody monitors the DLQ?" Then you have silent data loss with extra steps. DLQs require an operational commitment: alert when depth exceeds a threshold, review weekly, and have a runbook for common failure types.
▸"How do you replay events from the DLQ after fixing the bug?" Read from the DLQ, reprocess through the fixed pipeline, and write to the target. The pipeline must be idempotent so replayed events do not create duplicates.
▸"What is the difference between a DLQ and error logging?" Logs tell you what went wrong. A DLQ preserves the original event so you can reprocess it. You need both.

KEY TAKEAWAYS

Say: 'Route failed events to a DLQ with the original payload and error metadata. Alert on depth.'

DLQ without monitoring = silent data loss. Always mention the operational commitment.

DLQ + idempotent pipeline = safe event replay after bug fixes

Cost of Freshness

Daily Life

Interviews

Answer cost-of-freshness questions with authority

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"How do you think about pipeline cost?"
▸"Is real-time always better?"
▸"Walk me through the cost model of batch vs streaming."

What They Want to Hear

'Real-time costs 3-5x more than batch for the same throughput. The cost is not just infrastructure: it is operational complexity, debugging difficulty, and on-call burden. A streaming pipeline that fails at 3 AM requires someone to fix it at 3 AM. A batch pipeline that fails at 3 AM can wait until morning.' This answer shows you think about total cost of ownership, not just compute bills.

3-5x more

Streaming vs batch compute cost

24/7 running

Streaming never releases compute

10x debug time

Streaming failures are harder to reproduce

The Curveball Follow-ups

After your initial answer, expect these probes

▸"How do you decide if the freshness is worth the cost?" Ask the business: 'If this data is 1 hour old instead of 1 second old, how much revenue do we lose?' If nobody can quantify it, batch is the right choice.
▸"What is the biggest cost lever?" Whether compute runs continuously or on-demand. A daily batch job uses 1 hour of compute per day. Streaming uses 24 hours. Everything else is rounding error.
▸"How do you reduce streaming costs?" Micro-batch instead of true streaming (spin up and shut down every 5 minutes). Auto-scaling consumer groups that scale to zero during low-traffic windows.

✓Do

Quantify the business value of freshness before building streaming
Start batch, add streaming only when the business case is proven
Track cost per pipeline-hour as a metric

✗Don't

Build streaming because it is technically interesting
Assume real-time is always better
Ignore on-call burden when comparing architectures

KEY TAKEAWAYS

Say: 'Streaming costs 3-5x more and 10x more debugging time. The freshness must justify that.'

The power move: 'My first question would be: what revenue impact does stale data have?'

Total cost = compute + storage + operations + on-call + debugging time

Handle the depth probes: idempotency, backpressure, and cost

Category: Pipeline Architecture
Difficulty: advanced
Duration: 30 minutes
Challenges: 0 hands-on challenges

Topics covered: Idempotent Pipelines, Backpressure, Late-Arriving Data, Dead Letter Queues, Cost of Freshness

Lesson Sections

Idempotent Pipelines (concepts: paIdempotency)
What They Want to Hear 'An idempotent pipeline produces the same result whether it runs once or five times on the same input. I achieve this with MERGE statements that upsert on a primary key, or by replacing entire partitions on each run. This means every retry, every backfill, and every re-run is safe.' This is the answer that shows production experience. Candidates who say 'just make it transactional' are missing the point. Three Idempotent Patterns
Backpressure (concepts: paStreamProcessing)
What They Want to Hear 'Backpressure happens when downstream cannot process data as fast as upstream produces it. Without handling it, you either drop data, run out of memory, or queue indefinitely. My approach: buffer short spikes, throttle the producer for sustained overload, and auto-scale the consumer if the infrastructure supports it.' Then name the four strategies.
Late-Arriving Data (concepts: paLateData)
What They Want to Hear 'Late data is normal, not exceptional. A mobile device loses connectivity, reconnects, and sends a burst of events from 30 minutes ago. I handle this with watermarks: the system's estimate of how far behind reality the data might be. Events arriving after the watermark go to a side output. A daily batch reconciliation job picks up anything the streaming layer dropped.'
Dead Letter Queues (concepts: paDeadLetterQueue)
What They Want to Hear 'A DLQ is where events go when they cannot be processed: malformed JSON, schema violations, unhandled exceptions. Instead of crashing the pipeline or silently dropping the event, I route it to a separate queue with the original payload plus the error metadata. Someone reviews the DLQ and either fixes the root cause or discards the events.'
Cost of Freshness (concepts: paBatchVsStreaming)
What They Want to Hear 'Real-time costs 3-5x more than batch for the same throughput. The cost is not just infrastructure: it is operational complexity, debugging difficulty, and on-call burden. A streaming pipeline that fails at 3 AM requires someone to fix it at 3 AM. A batch pipeline that fails at 3 AM can wait until morning.' This answer shows you think about total cost of ownership, not just compute bills.