How Data Moves: Intermediate

You answered 'batch vs streaming.' The interviewer nods. Now comes the real test: 'How does your batch job handle incremental loads?' or 'What does exactly-once mean and how do you achieve it?' These follow-ups are where most candidates fall apart. Here is how to survive them.

Batch Mechanics

Answer incremental loading questions with the high-water mark pattern

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "How do you handle incremental loads?"
  • "Full reload vs incremental: when do you use each?"
  • "What happens when a batch job fails mid-run?"

What They Want to Hear

'I use a high-water mark pattern. The pipeline records the maximum timestamp from the last successful run. On the next run, it only reads rows with a timestamp after that mark. This means we process 50,000 changed rows instead of re-reading 500 million.' That is the core answer. Then add depth: 'I run incremental daily and a full reload weekly as a safety net to catch anything the incremental logic missed.'

What to Whiteboard

  1. Read High-Water Mark (last run: 2024-03-19 07:00)
     → filter predicate
  2. Extract WHERE updated_at > HWM (only new/changed rows)
     → delta rows
  3. Transform + Merge (upsert into target table)
     → on success
  4. Update High-Water Mark (set to max(updated_at))

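The four whiteboard steps can be sketched in code. This is a minimal illustration using Python's built-in sqlite3 module, not a production pipeline. The table names (source_orders, target_orders, etl_state) and the demo data are assumptions made up for the example, and timestamps are stored as sortable text.

```python
import sqlite3


def run_incremental(conn):
    """One incremental run: read HWM, extract delta, upsert, then advance HWM."""
    cur = conn.cursor()
    # 1. Read the high-water mark left by the last successful run.
    hwm = cur.execute("SELECT value FROM etl_state WHERE key = 'hwm'").fetchone()[0]
    # 2. Extract only rows changed after the mark.
    delta = cur.execute(
        "SELECT id, amount, updated_at FROM source_orders WHERE updated_at > ?",
        (hwm,),
    ).fetchall()
    if not delta:
        return 0
    # 3. Idempotent merge: re-running with the same rows leaves the same target.
    cur.executemany(
        "INSERT INTO target_orders (id, amount, updated_at) VALUES (?, ?, ?) "
        "ON CONFLICT(id) DO UPDATE SET amount = excluded.amount, "
        "updated_at = excluded.updated_at",
        delta,
    )
    # 4. Advance the mark to max(updated_at) of the extracted rows.
    new_hwm = max(row[2] for row in delta)
    cur.execute("UPDATE etl_state SET value = ? WHERE key = 'hwm'", (new_hwm,))
    conn.commit()  # data and mark are committed together in this toy setup
    return len(delta)


def demo():
    """Hypothetical setup: three source rows, one older than the stored mark."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE source_orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);
        CREATE TABLE target_orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);
        CREATE TABLE etl_state (key TEXT PRIMARY KEY, value TEXT);
        INSERT INTO etl_state VALUES ('hwm', '2024-03-19 07:00');
        INSERT INTO source_orders VALUES
            (1, 10.0, '2024-03-18 12:00'),
            (2, 20.0, '2024-03-19 09:30'),
            (3, 30.0, '2024-03-19 11:15');
        """
    )
    first = run_incremental(conn)   # picks up rows 2 and 3
    second = run_incremental(conn)  # nothing new after the mark
    return conn, first, second
```

In this sketch the source and target live in one database, so a single commit covers both the merge and the mark. In a real pipeline they are usually separate systems, which is why the upsert in step 3 matters.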
The Curveball Follow-ups

After your initial answer, expect these probes

  • "What about deleted rows? Incremental loading misses those." Correct. Incremental loading with timestamps cannot detect hard deletes. You need either CDC (Change Data Capture) from the source, which captures inserts, updates, AND deletes, or soft deletes with a deleted_at flag.
  • "The source table has no updated_at column. Now what?" Full reload is the only safe option. Or push the source team to add CDC. Some teams hash rows and compare, but that is expensive.
  • "What if the high-water mark update fails after the data writes?" The pipeline re-processes some rows on the next run. That is why idempotent writes (MERGE/UPSERT) are mandatory: re-processing the same rows produces the same result.
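The hash-and-compare fallback mentioned above can be sketched as follows. This is an illustrative example, assuming both the previous and the current full extract fit in memory as lists of dicts with a unique id column. The function names are hypothetical. It also shows why the approach is expensive: every row of both snapshots has to be read and hashed.

```python
import hashlib


def row_hash(row):
    """Stable fingerprint of a row's values (a dict with a fixed set of columns)."""
    payload = "|".join(f"{k}={row[k]!r}" for k in sorted(row))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def diff_snapshots(previous, current, key="id"):
    """Compare two full extracts and classify keys as inserted, updated, or deleted."""
    prev = {r[key]: row_hash(r) for r in previous}
    curr = {r[key]: row_hash(r) for r in current}
    inserted = sorted(k for k in curr if k not in prev)
    deleted = sorted(k for k in prev if k not in curr)
    updated = sorted(k for k in curr if k in prev and curr[k] != prev[k])
    return {"inserted": inserted, "updated": updated, "deleted": deleted}
```

Unlike a timestamp filter, this comparison does surface hard deletes, because a key that disappears from the current snapshot shows up in the deleted list.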

KEY TAKEAWAYS

  • Say: 'High-water mark pattern. Read only rows newer than the last successful run timestamp.'
  • Always mention the weakness: incremental misses hard deletes. CDC solves that.
  • A weekly full reload as a safety net catches anything incremental missed.

Stream Guarantees

Explain delivery guarantees without hand-waving

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "What does exactly-once mean?"
  • "How do you prevent duplicate processing?"
  • "At-least-once vs exactly-once: what is the difference?"

What They Want to Hear

'Exactly-once means every event is processed exactly one time, with no loss and no duplicates. In practice, most production systems use at-least-once delivery with idempotent writes downstream, because true exactly-once is expensive.' This is the answer that shows real-world experience. Candidates who say 'just use exactly-once' without mentioning the cost are waving a red flag.

Delivery Guarantees

  • At-most-once: fire and forget. May lose events. Fastest and cheapest.
  • At-least-once: retry until confirmed. May duplicate events. Most common default.
  • Exactly-once: checkpoints + transactions. No loss, no duplicates. Slowest and most expensive.

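To make the at-least-once row concrete, here is a small simulation of a sender that resends an event when its acknowledgement is lost, paired with a sink that upserts by event ID. Everything here is a toy model. The event shape, the deliver_at_least_once generator, and the IdempotentSink class are assumptions for illustration, not any broker's API.

```python
def deliver_at_least_once(events, ack_lost_for):
    """Yield events as an at-least-once channel would: resend when an ack is lost."""
    for event in events:
        yield event
        if event["event_id"] in ack_lost_for:
            # The consumer already saw the event, but the ack never arrived,
            # so the sender retries and the consumer sees it a second time.
            yield event


class IdempotentSink:
    """Sink keyed by event_id: a duplicate delivery overwrites identical data."""

    def __init__(self):
        self.rows = {}

    def write(self, event):
        self.rows[event["event_id"]] = event["amount"]

    def total(self):
        return sum(self.rows.values())


def run_demo():
    events = [
        {"event_id": "e1", "amount": 10},
        {"event_id": "e2", "amount": 20},
        {"event_id": "e3", "amount": 30},
    ]
    delivered = list(deliver_at_least_once(events, ack_lost_for={"e2"}))
    naive_total = sum(e["amount"] for e in delivered)  # double-counts e2
    sink = IdempotentSink()
    for event in delivered:
        sink.write(event)
    return len(delivered), naive_total, sink.total()
```

The naive sum counts the retried event twice, while the keyed sink ends at the correct total. That difference is the practical meaning of "at-least-once plus idempotent writes."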
The Curveball Follow-ups

After your initial answer, expect these probes

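  • "How does Flink achieve exactly-once?" Checkpointing. Flink periodically snapshots operator state together with the source offsets. On failure, it restores the latest snapshot and replays from those offsets. Replayed events are read again, but each one is reflected in state exactly once because state and offsets roll back together. For end-to-end exactly-once, the sink must also be transactional or idempotent.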
  • "If at-least-once duplicates events, how do you deduplicate?" Idempotent writes downstream. MERGE/UPSERT with a unique event ID. The consumer handles duplicates, not the message broker.
  • "When would you actually pay for exactly-once?" Financial transactions, billing systems, anything where double-counting costs real money. For metrics dashboards, at-least-once with dedup is fine and much cheaper.

Guarantee     | Cost      | Use When
At-most-once  | Cheapest  | Metrics that tolerate gaps
At-least-once | Moderate  | Most pipelines (dedup downstream)
Exactly-once  | Expensive | Financial transactions, billing

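The checkpoint-and-replay idea can be shown with a toy operator. This sketch is not Flink's API. The CheckpointedSum class, its checkpoint interval, and the simulated crash are assumptions chosen to show one thing: state and source offset are snapshotted together, so a replay after a crash does not double-count.

```python
import copy


class CheckpointedSum:
    """Toy operator: running total per key, snapshotted with the source offset."""

    def __init__(self, checkpoint_every=3):
        self.state = {}
        self.offset = 0                  # next log position to read
        self.checkpoint_every = checkpoint_every
        self.snapshot = (0, {})          # (offset, state) of the last checkpoint

    def process(self, log, crash_at=None):
        while self.offset < len(log):
            if crash_at is not None and self.offset == crash_at:
                raise RuntimeError("simulated crash")
            key, value = log[self.offset]
            self.state[key] = self.state.get(key, 0) + value
            self.offset += 1
            if self.offset % self.checkpoint_every == 0:
                # Offset and state are captured together, as one consistent unit.
                self.snapshot = (self.offset, copy.deepcopy(self.state))

    def restore(self):
        """Roll both the offset and the state back to the last checkpoint."""
        self.offset, state = self.snapshot
        self.state = copy.deepcopy(state)


def run_with_crash(log, crash_at):
    op = CheckpointedSum()
    try:
        op.process(log, crash_at=crash_at)
    except RuntimeError:
        op.restore()      # discard work done since the checkpoint
        op.process(log)   # replay from the checkpointed offset
    return op.state
```

With a crash at offset 4 and a checkpoint at offset 3, the event at offset 3 is physically processed twice, yet its effect appears in the final state once, because the first attempt was rolled back.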
KEY TAKEAWAYS

  • Say: 'At-least-once with idempotent writes is the practical default. Exactly-once for financial data.'
  • Checkpointing is how Flink achieves exactly-once. Know the one-liner.
  • The money quote: 'True exactly-once is expensive. Most systems achieve the same result with at-least-once plus downstream dedup.'

File Format Depth

Nail the 'Why Parquet?' screening question

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "Why Parquet over CSV?"
  • "What compression do you use?"
  • "Row-oriented vs columnar: explain the difference."

The #1 Screening Question

'Why Parquet?' is asked in over half of pipeline interviews as a screening question. Here is the three-word answer: columnar, compressed, predicate-pushdown. Then expand: 'Parquet stores each column separately, so an analytical query that needs only 3 out of 100 columns reads roughly 97% less data, assuming the columns are similar in size. Same-type values grouped together compress far better than mixed-type rows, often around 10x. And row group statistics (min and max per column) let the engine skip entire chunks without reading them.'

  • 10x smaller: Parquet vs CSV for typical analytics data
  • 30x faster: columnar scans on analytical queries
  • 97% less I/O: column pruning on a 100-column table reading 3

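The 97% figure is simple arithmetic: reading 3 of 100 equally sized columns touches 3/100 of the data, so 1 - 3/100 = 0.97 is skipped. The two mechanisms behind it, column pruning and row-group skipping, can be imitated in plain Python. This is a toy model, not the Parquet format or any Parquet library. The build_row_groups and scan functions and the group layout are assumptions for illustration.

```python
def build_row_groups(rows, group_size):
    """Store rows column by column in fixed-size groups, with min/max per column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        columns = {name: [r[name] for r in chunk] for name in chunk[0]}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        groups.append({"columns": columns, "stats": stats})
    return groups


def scan(groups, select, filter_col, min_value):
    """Return the selected columns for rows where filter_col >= min_value."""
    out, groups_read = [], 0
    for group in groups:
        if group["stats"][filter_col][1] < min_value:
            continue  # predicate pushdown: the group's max rules it out entirely
        groups_read += 1
        # Column pruning: touch only the filter column and the selected columns.
        needed = {c: group["columns"][c] for c in set(select) | {filter_col}}
        for i, value in enumerate(needed[filter_col]):
            if value >= min_value:
                out.append({c: needed[c][i] for c in select})
    return out, groups_read


def fraction_skipped(columns_read, columns_total):
    """Share of data skipped by column pruning, assuming equally sized columns."""
    return 1 - columns_read / columns_total
```

With nine rows split into three groups of three and a filter on day >= 7, only the last group is opened, and the unselected note column is never touched.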
The Curveball Follow-ups

After your initial answer, expect these probes

  • "When would you NOT use Parquet?" Small files exchanged between teams (CSV is human-readable). Streaming ingestion (Avro with a schema registry is better). Data under 10 MB (overhead of columnar format outweighs benefits).
  • "What about write performance?" Parquet is slower to write because the writer buffers a whole row group in memory, then encodes and compresses each column chunk separately. For high-throughput streaming writes, a row-oriented format such as Avro is faster.
  • "Snappy or Gzip?" Snappy for interactive queries (fast decompression). Gzip for archival (better compression, slower reads). Zstd is the emerging best-of-both.

KEY TAKEAWAYS

  • The three-word answer: columnar, compressed, predicate-pushdown.
  • Expand: 'Stores columns separately, compresses 10x, skips irrelevant data without reading it.'
  • Know the exception: Avro for streaming, CSV for small human-readable exchanges.

API Patterns

Answer advanced API ingestion probes

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "How do you handle paginated APIs at scale?"
  • "What happens when the API changes its schema?"
  • "How do you backfill historical data from an API?"

What They Want to Hear

'I paginate with cursors, not page numbers, because data can shift between pages during extraction. I store the raw API response before transforming so schema changes do not destroy historical data. For backfill, I request a bulk export from the vendor rather than crawling months of paginated responses through a rate-limited API.'

Production API Ingestion

  1. Configuration (endpoint, auth, rate limits)
     → initialize
  2. Cursor-based Fetch (loop through all pages)
     → each page
  3. Schema Validation (alert if schema drifted)
     → store raw
  4. Stage Raw Response (store full JSON)

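The ingestion flow above can be sketched as a loop that follows a cursor, stages each raw response before touching it, and flags schema drift. This is a minimal sketch against a fake in-memory API. The response shape ("data", "next_cursor"), the EXPECTED_FIELDS set, and the callback names are all hypothetical and would differ for any real vendor.

```python
import json

EXPECTED_FIELDS = {"id", "email"}  # hypothetical schema for the example


def make_fake_api(items, page_size):
    """Fake keyset-style API: the cursor is the last id the client has seen."""
    ordered = sorted(items, key=lambda r: r["id"])

    def fetch_page(cursor):
        after = cursor if cursor is not None else float("-inf")
        page = [r for r in ordered if r["id"] > after][:page_size]
        has_more = bool(page) and page[-1]["id"] < ordered[-1]["id"]
        return {"data": page, "next_cursor": page[-1]["id"] if has_more else None}

    return fetch_page


def fetch_all(fetch_page, stage_raw, on_drift):
    """Follow the cursor to the end, staging raw JSON before any processing."""
    cursor, records = None, []
    while True:
        response = fetch_page(cursor)
        stage_raw(json.dumps(response))  # keep the full payload, untransformed
        for item in response["data"]:
            drifted = set(item) ^ EXPECTED_FIELDS  # missing or unexpected fields
            if drifted:
                on_drift(sorted(drifted))
            records.append(item)
        cursor = response["next_cursor"]
        if cursor is None:
            return records
```

Because the cursor is "everything after the last id seen" rather than an offset, rows inserted or deleted earlier in the ordering do not shift the next page.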
The Curveball Follow-ups

After your initial answer, expect these probes

  • "Why cursors over page numbers?" If records are inserted or deleted between page fetches, page-number pagination can skip records or return duplicates. Cursors are stable.
  • "The API returns 429 (rate limited). What does your pipeline do?" Exponential backoff with jitter. Read the Retry-After header. Never hammer a rate-limited endpoint in a tight loop.
  • "Your vendor's API goes down for 6 hours. What happens?" The pipeline should retry with backoff, then alert on-call after N failures. On recovery, backfill the gap using the last successful timestamp.
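The 429 handling described above can be sketched as a retry wrapper. This is an illustrative example under stated assumptions. The request callable returning (status, headers, body) is a stand-in for a real HTTP client. Retry-After is assumed to be given in seconds (the header can also carry an HTTP date, which this sketch does not parse). The jitter scheme is the "full jitter" variant, one of several common choices.

```python
import random
import time


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0, rng=random.random):
    """Seconds to wait before the next try; `attempt` is 0-based."""
    if retry_after is not None:
        return float(retry_after)       # the server's hint takes priority
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling              # full jitter: uniform in [0, ceiling)


def call_with_retries(request, max_attempts=5, sleep=time.sleep, rng=random.random):
    """Call `request()` until it succeeds; retry only on 429 and 5xx."""
    for attempt in range(max_attempts):
        status, headers, body = request()
        if status < 400:
            return body
        if status != 429 and status < 500:
            raise RuntimeError(f"non-retryable status {status}")
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt, headers.get("Retry-After"), rng=rng))
    # Out of attempts: this is the point where a pipeline would alert on-call.
    raise RuntimeError("gave up after retries")
```

Passing sleep and rng as parameters keeps the wrapper testable without real waiting. In production the defaults (time.sleep and random.random) apply.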

KEY TAKEAWAYS

  • Say: 'Cursor-based pagination, raw response storage, exponential backoff on rate limits.'
  • The raw response storage point is what separates juniors from seniors in interviews.
  • For backfill: bulk export for history, API for incremental daily pulls.

Hybrid Architectures

Navigate the hybrid architecture follow-up question

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "What is the Lambda architecture?"
  • "Micro-batch vs true streaming: when do you use each?"
  • "Can you combine batch and streaming?"

What They Want to Hear

'Lambda runs batch and streaming in parallel: a batch layer for accurate historical data and a speed layer for fresh approximate data. The serving layer merges both. The problem is maintaining two separate codebases. Kappa eliminates the batch layer entirely by replaying the event log when you need to reprocess. The tradeoff is that Kappa requires a durable event log (Kafka) large enough to hold your full history.'

Lambda vs Kappa

Lambda
  • Batch layer for accuracy
  • Speed layer for freshness
  • Two codebases to maintain
  • Most complex, most flexible

Kappa
  • Streaming only, replay for reprocessing
  • One codebase, simpler ops
  • Requires durable event log (Kafka)
  • Simpler but needs replayable history

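The difference between the two patterns can be shown with a tiny in-memory model. This is a conceptual sketch, not how either architecture is deployed. The event shape and function names are assumptions, and for brevity the batch and speed layers share one count_by_user function, whereas in a real Lambda setup they are typically separate codebases (which is the maintenance problem).

```python
def count_by_user(events):
    """The business logic: number of events per user."""
    counts = {}
    for event in events:
        counts[event["user"]] = counts.get(event["user"], 0) + 1
    return counts


def lambda_serving_view(batch_view, speed_view):
    """Lambda serving layer: batch totals plus counts from events since the last batch."""
    merged = dict(batch_view)
    for user, n in speed_view.items():
        merged[user] = merged.get(user, 0) + n
    return merged


def kappa_reprocess(event_log, job):
    """Kappa: rebuild the view by replaying the retained log through one job."""
    return job(event_log)


def run_demo():
    log = [{"user": u} for u in ["ann", "bob", "ann", "cat", "bob"]]
    cutoff = 3  # events before this index were covered by the last batch run
    batch_view = count_by_user(log[:cutoff])
    speed_view = count_by_user(log[cutoff:])
    return lambda_serving_view(batch_view, speed_view), kappa_reprocess(log, count_by_user)
```

Both paths arrive at the same view. Lambda gets there by merging two results, Kappa by replaying the full log, which only works if the log still holds that history.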
The Curveball Follow-ups

After your initial answer, expect these probes

  • "Which would you choose?" For most teams, neither. Micro-batch (Spark Structured Streaming) gives near-real-time freshness with batch simplicity. Lambda and Kappa are for when you truly need both sub-second freshness AND accurate historical reprocessing.
  • "What is micro-batch?" Spark Structured Streaming processes data in small time windows (seconds to minutes) rather than event-by-event. You give up sub-second latency but gain batch-like simplicity.
  • "Is Lambda outdated?" The original pattern is, but the concept (batch for accuracy + streaming for speed) lives on in every modern platform that runs both.

Pattern     | Freshness          | Complexity | Best For
Pure Batch  | Hours              | Low        | Daily reports, ML training
Micro-batch | Seconds to minutes | Medium     | Near-real-time dashboards
Lambda      | Seconds + accurate | High       | Both freshness and accuracy needed
Kappa       | Seconds            | Medium     | Event-sourced systems with replay

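Micro-batching can be illustrated by grouping events into fixed intervals and handing each group to ordinary batch logic. This is a simplified sketch, not Spark Structured Streaming. Real micro-batch engines cut batches on a processing-time trigger, while this toy buckets by the timestamp carried in each event so the result is deterministic. The function name and event shape are assumptions.

```python
from collections import defaultdict


def micro_batches(events, interval_seconds):
    """Group (timestamp, payload) events into fixed intervals, ordered by start time."""
    buckets = defaultdict(list)
    for ts, payload in events:
        window_start = ts // interval_seconds * interval_seconds
        buckets[window_start].append(payload)
    return [(start, buckets[start]) for start in sorted(buckets)]
```

For example, with a 5-second interval, events at seconds 0, 4, 5, and 12 land in three batches starting at 0, 5, and 10. Each batch is processed as a unit, which is where the batch-like simplicity comes from and why latency cannot drop below the interval.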
KEY TAKEAWAYS

  • Say: 'Lambda is batch + streaming in parallel. Kappa is streaming-only with replay. Micro-batch is the pragmatic middle.'
  • The smart answer when asked to choose: 'Micro-batch handles 90% of use cases at a fraction of the complexity.'
  • Know why Lambda is dying: maintaining two codebases for the same logic is operationally expensive.

Survive the follow-up probes on batch, streaming, and hybrid

Category: Pipeline Architecture
Difficulty: intermediate
Duration: 25 minutes
Challenges: 0 hands-on challenges

Topics covered: Batch Mechanics, Stream Guarantees, File Format Depth, API Patterns, Hybrid Architectures

Lesson Sections

  1. Batch Mechanics (concepts: paBatchProcessing)

  2. Stream Guarantees (concepts: paStreamProcessing)

  3. File Format Depth (concepts: paFileIngestion)

  4. API Patterns (concepts: paApiIngestion)

  5. Hybrid Architectures (concepts: paBatchVsStreaming)