How Data Moves: Intermediate

You answered 'batch vs streaming.' The interviewer nods. Now comes the real test: 'How does your batch job handle incremental loads?' or 'What does exactly-once mean and how do you achieve it?' These follow-ups are where most candidates fall apart. Here is how to survive them.

What you will be able to do

Answer 'How do you handle incremental loads?' without hesitation

Explain exactly-once delivery without hand-waving

Compare Lambda vs Kappa when the interviewer pushes for hybrid

Batch Mechanics

Answer incremental loading questions with the high-water mark pattern

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"How do you handle incremental loads?"
▸"Full reload vs incremental: when do you use each?"
▸"What happens when a batch job fails mid-run?"

What They Want to Hear

'I use a high-water mark pattern. The pipeline records the maximum timestamp from the last successful run. On the next run, it only reads rows with a timestamp after that mark. This means we process 50,000 changed rows instead of re-reading 500 million.' That is the core answer. Then add depth: 'I run incremental daily and a full reload weekly as a safety net to catch anything the incremental logic missed.'

What to Whiteboard: Read High-Water Mark (source) → Extract WHERE updated_at > HWM (source) → Transform + Merge (transform) → Update High-Water Mark (consumer)
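If the interviewer asks you to make the high-water mark pattern concrete, a minimal sketch along these lines can help. It uses in-memory SQLite purely for illustration; the table names (orders, orders_copy, pipeline_state) and the run_incremental helper are assumptions, not part of any real pipeline. The data write and the mark update share one transaction so they succeed or fail together.

```python
import sqlite3

def run_incremental(src, dst):
    """Copy rows changed since the last successful run, then advance the mark."""
    # An empty string sorts before any ISO-8601 timestamp, so the first run reads everything.
    row = dst.execute("SELECT value FROM pipeline_state WHERE key = 'hwm'").fetchone()
    hwm = row[0] if row else ""

    changed = src.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?", (hwm,)
    ).fetchall()
    if not changed:
        return 0

    # One transaction: the upsert and the new mark commit together or not at all.
    with dst:
        # INSERT OR REPLACE stands in for MERGE/UPSERT; re-running it is harmless.
        dst.executemany("INSERT OR REPLACE INTO orders_copy VALUES (?, ?, ?)", changed)
        new_hwm = max(r[2] for r in changed)
        dst.execute("INSERT OR REPLACE INTO pipeline_state VALUES ('hwm', ?)", (new_hwm,))
    return len(changed)

# Hypothetical source and destination, both in memory.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10.0, "2024-01-01T00:00:00"),
    (2, 20.0, "2024-01-02T00:00:00"),
])
src.commit()

dst = sqlite3.connect(":memory:")
dst.execute("CREATE TABLE orders_copy (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)")
dst.execute("CREATE TABLE pipeline_state (key TEXT PRIMARY KEY, value TEXT)")
dst.commit()

first = run_incremental(src, dst)     # initial run picks up both rows
src.execute("UPDATE orders SET amount = 25.0, updated_at = '2024-01-03T00:00:00' WHERE id = 2")
src.commit()
second = run_incremental(src, dst)    # only the changed row is read
third = run_incremental(src, dst)     # nothing newer than the mark
```

One caveat worth mentioning if pushed: a strict greater-than comparison can miss a row that commits late with a timestamp equal to the mark, so some teams re-read a small overlap window and rely on the idempotent write to absorb the repeats.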

The Curveball Follow-ups

After your initial answer, expect these probes

▸"What about deleted rows? Incremental loading misses those." Correct. Incremental loading with timestamps cannot detect hard deletes, because a deleted row has no updated_at left to read. You need CDC (Change Data Capture) from the source, which captures inserts, updates, AND deletes. Or use soft deletes with a deleted_at flag.
▸"The source table has no updated_at column. Now what?" Full reload is the safest option. Alternatively, push the source team to add CDC. Some teams hash rows and compare, but that still means reading every row, so it is expensive.
▸"What if the high-water mark update fails after the data writes?" The pipeline re-processes some rows on the next run. That is why idempotent writes (MERGE/UPSERT) are mandatory: re-processing the same rows produces the same result.
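The "hash rows and compare" fallback mentioned above can be sketched in a few lines. This is an illustration under assumptions: row_hash and diff_snapshots are hypothetical helpers, and the snapshots are small dictionaries keyed by primary key. Note that comparing key sets also surfaces hard deletes, at the cost of reading the full table each run.

```python
import hashlib

def row_hash(row):
    """Stable hash of a row's values; the primary key is tracked separately."""
    # repr() keeps None and "" distinct; the unit separator avoids accidental joins.
    payload = "\x1f".join(repr(v) for v in row)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def diff_snapshots(previous, current):
    """Compare two {key: row_hash} maps taken on consecutive runs."""
    inserted = sorted(k for k in current if k not in previous)
    deleted = sorted(k for k in previous if k not in current)
    updated = sorted(k for k in current if k in previous and current[k] != previous[k])
    return inserted, updated, deleted

# Hypothetical snapshots of the same table on two days.
yesterday_rows = {1: ("alice", "NY"), 2: ("bob", "LA"), 3: ("carol", "SF")}
today_rows = {1: ("alice", "NY"), 2: ("bob", "TX"), 4: ("dave", "WA")}

previous = {k: row_hash(v) for k, v in yesterday_rows.items()}
current = {k: row_hash(v) for k, v in today_rows.items()}
inserted, updated, deleted = diff_snapshots(previous, current)
```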

KEY TAKEAWAYS

Say: 'High-water mark pattern. Read only rows newer than the last successful run timestamp.'

Always mention the weakness: incremental misses hard deletes. CDC solves that.

Weekly full reload as a safety net catches anything incremental missed

Stream Guarantees

Explain delivery guarantees without hand-waving

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"What does exactly-once mean?"
▸"How do you prevent duplicate processing?"
▸"At-least-once vs exactly-once: what is the difference?"

What They Want to Hear

'Exactly-once means every event is processed exactly one time, with no loss and no duplicates. In practice, most production systems use at-least-once delivery with idempotent writes downstream, because true exactly-once is expensive.' This is the answer that shows real-world experience. Candidates who say 'just use exactly-once' without mentioning the cost raise a red flag.

Delivery Guarantees

At-most-once: Fire and forget; may lose events; fastest and cheapest
At-least-once: Retry until confirmed; may duplicate events; most common default
Exactly-once: Checkpoints + transactions; no loss, no duplicates; slowest and most expensive

The Curveball Follow-ups

After your initial answer, expect these probes

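▸"How does Flink achieve exactly-once?" Checkpointing. Flink periodically snapshots operator state together with the input offsets. On failure, it restores the latest snapshot and replays input from those offsets. Events after the checkpoint are read again, but because state was rolled back to the same point, each event affects state exactly once. For end-to-end exactly-once, the sink must also be transactional or idempotent.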
▸"If at-least-once duplicates events, how do you deduplicate?" Idempotent writes downstream. MERGE/UPSERT with a unique event ID. The consumer handles duplicates, not the message broker.
▸"When would you actually pay for exactly-once?" Financial transactions, billing systems, anything where double-counting costs real money. For metrics dashboards, at-least-once with dedup is fine and much cheaper.
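To see why checkpoint-and-replay does not double count, a toy simulation can help. This is not Flink's API; run_with_checkpoints is a hypothetical function that keeps a running total per key, snapshots the offset and state together, and crashes once at a chosen offset.

```python
import copy

def run_with_checkpoints(events, checkpoint_every, fail_at=None):
    """Sum values per key; optionally crash once when reaching offset `fail_at`."""
    state = {}                # operator state: key -> running total
    offset = 0                # next position to read from the log
    checkpoint = (0, {})      # offset and state are always saved together
    failed = False
    while offset < len(events):
        if fail_at is not None and offset == fail_at and not failed:
            failed = True
            # Crash: in-memory progress is lost, so restore the last snapshot.
            offset, state = checkpoint[0], copy.deepcopy(checkpoint[1])
            continue
        key, value = events[offset]
        state[key] = state.get(key, 0) + value
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (offset, copy.deepcopy(state))
    return state

events = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
clean = run_with_checkpoints(events, checkpoint_every=2)
recovered = run_with_checkpoints(events, checkpoint_every=2, fail_at=3)
```

In the recovered run the event at offset 2 is read twice, yet the totals match the clean run because the state was rolled back to the same point as the offset.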

Guarantee	Cost	Use When
At-most-once	Cheapest	Metrics that tolerate gaps
At-least-once	Moderate	Most pipelines (dedup downstream)
Exactly-once	Expensive	Financial transactions, billing
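The practical default in the table, at-least-once with downstream dedup, can be sketched as follows. The delivery function and the IdempotentSink class are hypothetical stand-ins for a broker and a MERGE/UPSERT target; the lost acknowledgement is simulated.

```python
def deliver_at_least_once(events, ack_lost_for):
    """Yield each event, re-sending those whose ack was lost (simulated)."""
    for event in events:
        yield event
        if event["event_id"] in ack_lost_for:
            yield event   # the broker saw no ack, so it retries

class IdempotentSink:
    """Upsert keyed by event_id: a repeated event overwrites itself."""
    def __init__(self):
        self.rows = {}
        self.writes = 0

    def upsert(self, event):
        self.writes += 1
        self.rows[event["event_id"]] = event["amount"]

events = [
    {"event_id": "e1", "amount": 100},
    {"event_id": "e2", "amount": 250},
    {"event_id": "e3", "amount": 75},
]
sink = IdempotentSink()
for event in deliver_at_least_once(events, ack_lost_for={"e2"}):
    sink.upsert(event)
total = sum(sink.rows.values())
```

The sink receives four writes but ends with three rows, so the total is unaffected by the duplicate. A plain append would have counted e2 twice.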

KEY TAKEAWAYS

Say: 'At-least-once with idempotent writes is the practical default. Exactly-once for financial data.'

Checkpointing is how Flink achieves exactly-once. Know the one-liner.

The money quote: 'True exactly-once is expensive. Most systems achieve the same result with at-least-once plus downstream dedup.'

File Format Depth

Nail the 'Why Parquet?' screening question

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"Why Parquet over CSV?"
▸"What compression do you use?"
▸"Row-oriented vs columnar: explain the difference."

The #1 Screening Question

'Why Parquet?' is asked in over half of pipeline interviews as a screening question. Here is the three-word answer: columnar, compressed, predicate-pushdown. Then expand: 'Parquet stores each column separately, so analytical queries that only need 3 out of 100 columns read 97% less data. Same-type values grouped together compress 10x better than mixed-type rows. And row group statistics let the engine skip entire chunks without reading them.'

10x smaller: Parquet vs CSV for typical analytics data
30x faster: Columnar scans on analytical queries
97% less I/O: Column pruning on a 100-column table reading 3
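Column pruning and row-group skipping are easier to explain with a toy model. The sketch below is not the Parquet format or any library's API; build_row_groups and scan are hypothetical helpers that store each column separately with min/max statistics per group. The 97% figure assumes the 100 columns are roughly equal in size.

```python
def build_row_groups(rows, columns, group_size):
    """Split rows into groups; store each column separately with min/max stats."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        data = {c: [r[i] for r in chunk] for i, c in enumerate(columns)}
        stats = {c: (min(v), max(v)) for c, v in data.items()}
        groups.append({"data": data, "stats": stats})
    return groups

def scan(groups, select, filter_col, min_value):
    """Return `select` values where filter_col >= min_value, skipping groups by stats."""
    out, groups_read = [], 0
    for g in groups:
        if g["stats"][filter_col][1] < min_value:
            continue   # the group's max is below the threshold: skip it unread
        groups_read += 1
        # Only the two needed columns are touched; "region" is never read.
        for i, v in enumerate(g["data"][filter_col]):
            if v >= min_value:
                out.append(g["data"][select][i])
    return out, groups_read

columns = ["order_id", "amount", "region"]
rows = [(i, i * 10, "us" if i % 2 else "eu") for i in range(1, 9)]   # order_id 1..8
groups = build_row_groups(rows, columns, group_size=4)
ids, groups_read = scan(groups, select="order_id", filter_col="amount", min_value=60)

io_saved = 1 - 3 / 100   # reading 3 of 100 equally sized columns
```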

The Curveball Follow-ups

After your initial answer, expect these probes

▸"When would you NOT use Parquet?" Small files exchanged between teams (CSV is human-readable). Streaming ingestion (Avro with a schema registry is better). Data under 10 MB (overhead of columnar format outweighs benefits).
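▸"What about write performance?" Parquet is slower to write because rows must be buffered into row groups, and each column chunk is then encoded and compressed separately before it is flushed. For high-throughput streaming writes, a row-oriented format such as Avro is faster because records can be appended one at a time.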
▸"Snappy or Gzip?" Snappy for interactive queries (fast decompression). Gzip for archival (better compression, slower reads). Zstd is the emerging best-of-both.

KEY TAKEAWAYS

The three-word answer: columnar, compressed, predicate-pushdown

Expand: 'Stores columns separately, compresses 10x, skips irrelevant data without reading it'

Know the exception: Avro for streaming, CSV for small human-readable exchanges

API Patterns

Answer advanced API ingestion probes

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"How do you handle paginated APIs at scale?"
▸"What happens when the API changes its schema?"
▸"How do you backfill historical data from an API?"

What They Want to Hear

'I paginate with cursors, not page numbers, because data can shift between pages during extraction. I store the raw API response before transforming so schema changes do not destroy historical data. For backfill, I request a bulk export from the vendor rather than crawling months of paginated responses through a rate-limited API.'

Production API Ingestion: Configuration (source) → Cursor-based Fetch (transform) → Schema Validation (quality) → Stage Raw Response (source)
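A minimal sketch of cursor-based fetching with raw response staging, assuming a hypothetical API. The fetch_page function, its response shape (items, next_cursor), and the in-memory RECORDS list are all invented for illustration. A record is deleted between page fetches to show that the cursor stays stable.

```python
import json

RECORDS = [{"id": i, "name": f"user{i}"} for i in range(1, 8)]   # ids 1..7

def fetch_page(cursor=None, limit=3):
    """Hypothetical API: return records with id > cursor, plus the next cursor."""
    after = cursor or 0
    items = [r for r in RECORDS if r["id"] > after][:limit]
    next_cursor = items[-1]["id"] if len(items) == limit else None
    return {"items": items, "next_cursor": next_cursor}

def ingest(on_page=None):
    raw_stage, parsed_ids, cursor = [], [], None
    while True:
        response = fetch_page(cursor)
        raw_stage.append(json.dumps(response))   # stage the raw payload before parsing
        for item in response["items"]:
            if "id" not in item:                  # minimal schema validation
                raise ValueError(f"unexpected record shape: {item}")
            parsed_ids.append(item["id"])
        cursor = response["next_cursor"]
        if on_page:
            on_page()
        if cursor is None:
            break
    return raw_stage, parsed_ids

def delete_first_record_once():
    # Simulate a concurrent delete of an already-fetched record.
    if RECORDS and RECORDS[0]["id"] == 1:
        RECORDS.pop(0)

raw_stage, parsed_ids = ingest(on_page=delete_first_record_once)
```

With page-number pagination over the same data, deleting record 1 after the first page would shift everything left by one, and the second page (offset 3) would start at id 5, silently skipping id 4.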

The Curveball Follow-ups

After your initial answer, expect these probes

▸"Why cursors over page numbers?" If records are inserted or deleted between page fetches, page-number pagination can skip records or return duplicates. Cursors are stable.
▸"The API returns 429 (rate limited). What does your pipeline do?" Exponential backoff with jitter. Read the Retry-After header. Never hammer a rate-limited endpoint in a tight loop.
▸"Your vendor's API goes down for 6 hours. What happens?" The pipeline should retry with backoff, then alert on-call after N failures. On recovery, backfill the gap using the last successful timestamp.
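The 429 answer can be backed with a short sketch of exponential backoff with jitter. This is one common variant (full jitter), not the only one; call_with_retries and the fake responses are hypothetical, and the Retry-After header is assumed to carry a number of seconds rather than an HTTP date.

```python
import random
import time

def call_with_retries(request, max_attempts=5, base=1.0, cap=60.0, sleep=time.sleep):
    """Retry on 429, preferring the server's Retry-After when it is present."""
    waited = []
    for attempt in range(max_attempts):
        status, headers, body = request()
        if status != 429:
            return body, waited
        retry_after = headers.get("Retry-After")
        if retry_after is not None:
            delay = float(retry_after)   # assumes the delta-seconds form
        else:
            # Full jitter: a random wait up to an exponentially growing ceiling.
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
        waited.append(delay)
        sleep(delay)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")

# Fake endpoint: two rate-limited responses, then success.
responses = iter([
    (429, {"Retry-After": "2"}, None),
    (429, {}, None),
    (200, {}, {"ok": True}),
])
# The demo passes a no-op sleep so it finishes instantly.
body, waited = call_with_retries(lambda: next(responses), sleep=lambda seconds: None)
```

After the retry budget is exhausted the function raises, which is the natural place to hook in the on-call alert described above.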

KEY TAKEAWAYS

Say: 'Cursor-based pagination, raw response storage, exponential backoff on rate limits.'

The raw response storage point is what separates juniors from seniors in interviews

For backfill: bulk export for history, API for incremental daily pulls

Hybrid Architectures

Navigate the hybrid architecture follow-up question

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"What is the Lambda architecture?"
▸"Micro-batch vs true streaming: when do you use each?"
▸"Can you combine batch and streaming?"

What They Want to Hear

'Lambda runs batch and streaming in parallel: a batch layer for accurate historical data and a speed layer for fresh approximate data. The serving layer merges both. The problem is maintaining two separate codebases. Kappa eliminates the batch layer entirely by replaying the event log when you need to reprocess. The tradeoff is that Kappa requires a durable event log (Kafka) large enough to hold your full history.'

Lambda

Batch layer for accuracy
Speed layer for freshness
Two codebases to maintain
Most complex, most flexible

Kappa

Streaming only, replay for reprocessing
One codebase, simpler ops
Requires durable event log (Kafka)
Simpler but needs replayable history
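The difference between the two is easier to show than to describe. In this toy sketch the event log, the batch_cutoff, and count_clicks are all hypothetical. Lambda merges a batch view and a speed view at query time, while Kappa recomputes by replaying the full log through a single job.

```python
from collections import Counter

def count_clicks(events):
    """The business logic: clicks per page."""
    return Counter(e["page"] for e in events)

log = [
    {"ts": 1, "page": "home"}, {"ts": 2, "page": "cart"},
    {"ts": 3, "page": "home"}, {"ts": 4, "page": "home"},
]
batch_cutoff = 2   # the last batch run covered events with ts <= 2

# Lambda: a batch view over history plus a speed view over recent events,
# merged by the serving layer. In practice these are two separately maintained jobs.
batch_view = count_clicks(e for e in log if e["ts"] <= batch_cutoff)
speed_view = count_clicks(e for e in log if e["ts"] > batch_cutoff)
lambda_result = batch_view + speed_view

# Kappa: one streaming job; reprocessing means replaying the whole log.
kappa_result = count_clicks(log)
```

Here both layers call the same function, which hides the real cost of Lambda: in production the batch and speed layers usually run on different engines, so the logic is written and tested twice.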

The Curveball Follow-ups

After your initial answer, expect these probes

▸"Which would you choose?" For most teams, neither. Micro-batch (Spark Structured Streaming) gives near-real-time freshness with batch simplicity. Lambda and Kappa are for when you truly need both sub-second freshness AND accurate historical reprocessing.
▸"What is micro-batch?" Spark Structured Streaming processes data in small time windows (seconds to minutes) rather than event-by-event. You give up sub-second latency but gain batch-like simplicity.
▸"Is Lambda outdated?" The original pattern is, but the concept (batch for accuracy + streaming for speed) lives on in every modern platform that runs both.

Pattern	Freshness	Complexity	Best For
Pure Batch	Hours	Low	Daily reports, ML training
Micro-batch	Seconds to minutes	Medium	Near-real-time dashboards
Lambda	Seconds + accurate	High	Both freshness and accuracy needed
Kappa	Seconds	Medium	Event-sourced systems with replay
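Micro-batching itself is a simple idea, sketched below in plain Python rather than with Spark's API. The micro_batches helper is hypothetical and assumes events arrive sorted by timestamp; each fixed window is handed to ordinary batch logic.

```python
from itertools import groupby

def micro_batches(events, interval_seconds):
    """Group time-ordered events into fixed windows of `interval_seconds`."""
    for window, group in groupby(events, key=lambda e: e["ts"] // interval_seconds):
        yield window * interval_seconds, list(group)

events = [
    {"ts": t, "value": v}
    for t, v in [(0, 1), (3, 2), (9, 3), (10, 4), (14, 5), (21, 6)]
]
# Each window is processed with ordinary batch code (here, a sum).
totals = {start: sum(e["value"] for e in batch) for start, batch in micro_batches(events, 10)}
```

Results only appear once a window closes, which is where the seconds-to-minutes freshness in the table comes from.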

KEY TAKEAWAYS

Say: 'Lambda is batch + streaming in parallel. Kappa is streaming-only with replay. Micro-batch is the pragmatic middle.'

The smart answer when asked to choose: 'Micro-batch handles 90% of use cases at a fraction of the complexity.'

Know why Lambda is dying: maintaining two codebases for the same logic is operationally expensive

Survive the follow-up probes on batch, streaming, and hybrid

Category: Pipeline Architecture
Difficulty: intermediate
Duration: 25 minutes
Challenges: 0 hands-on challenges

Topics covered: Batch Mechanics, Stream Guarantees, File Format Depth, API Patterns, Hybrid Architectures
