How Data Moves: Intermediate
You answered 'batch vs streaming.' The interviewer nods. Now comes the real test: 'How does your batch job handle incremental loads?' or 'What does exactly-once mean and how do you achieve it?' These follow-ups are where most candidates fall apart. Here is how to survive them.
Batch Mechanics
Answer incremental loading questions with the high-water mark pattern
When you hear these in an interview, this is the concept being tested
- ▸"How do you handle incremental loads?"
- ▸"Full reload vs incremental: when do you use each?"
- ▸"What happens when a batch job fails mid-run?"
What They Want to Hear
After your initial answer, expect these probes
- ▸"What about deleted rows? Incremental loading misses those." Correct. Timestamp-based incremental loading cannot detect hard deletes, because a deleted row is simply absent from the next read. You need CDC (Change Data Capture) from the source, which captures inserts, updates, AND deletes. Or use soft deletes with a deleted_at column.
- ▸"The source table has no updated_at column. Now what?" Full reload is the simplest safe option. Alternatively, push the source team to add CDC. Some teams hash each row and compare against the previous load, but that still means reading the whole source table, so it is expensive.
- ▸"What if the high-water mark update fails after the data writes?" The pipeline re-processes some rows on the next run. That is why idempotent writes (MERGE/UPSERT) are mandatory: re-processing the same rows produces the same result.
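If the interviewer asks you to whiteboard the high-water mark pattern with idempotent writes, a minimal sketch could look like the one below. It uses SQLite's `INSERT ... ON CONFLICT` as a stand-in for a warehouse MERGE. The table names (`source`, `target`, `watermark`) and the function `incremental_load` are hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical schema: a source table, a target table, and a one-row-per-pipeline
# watermark table. SQLite's upsert stands in for a warehouse MERGE.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);
    CREATE TABLE target (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);
    CREATE TABLE watermark (pipeline TEXT PRIMARY KEY, high_water_mark TEXT);
""")

def incremental_load(conn, pipeline="orders"):
    """Copy rows changed since the last successful run; return rows read."""
    row = conn.execute(
        "SELECT high_water_mark FROM watermark WHERE pipeline = ?", (pipeline,)
    ).fetchone()
    mark = row[0] if row else ""  # empty string sorts before any ISO timestamp

    changed = conn.execute(
        "SELECT id, amount, updated_at FROM source WHERE updated_at > ?", (mark,)
    ).fetchall()
    if not changed:
        return 0

    # Idempotent write: replaying the same rows leaves the target unchanged.
    conn.executemany(
        """INSERT INTO target (id, amount, updated_at) VALUES (?, ?, ?)
           ON CONFLICT(id) DO UPDATE SET amount = excluded.amount,
                                         updated_at = excluded.updated_at""",
        changed,
    )
    # Advance the mark only after the data write has succeeded.
    conn.execute(
        """INSERT INTO watermark (pipeline, high_water_mark) VALUES (?, ?)
           ON CONFLICT(pipeline) DO UPDATE
           SET high_water_mark = excluded.high_water_mark""",
        (pipeline, max(r[2] for r in changed)),
    )
    conn.commit()
    return len(changed)

conn.executemany(
    "INSERT INTO source VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01T00:00:00"), (2, 20.0, "2024-01-02T00:00:00")],
)
first = incremental_load(conn)    # initial run reads every row
second = incremental_load(conn)   # nothing changed since the mark
conn.execute(
    "UPDATE source SET amount = 25.0, updated_at = '2024-01-03T00:00:00' WHERE id = 2"
)
third = incremental_load(conn)    # only the updated row is read

# Simulate a lost watermark update: the next run re-reads rows it already loaded.
conn.execute("DELETE FROM watermark")
replayed = incremental_load(conn)
target_rows = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
print(first, second, third, replayed, target_rows)
```

The replay reads rows a second time, but the target still holds one row per id. That is the property the interviewer is probing for. One caveat worth saying out loud: a strict `>` comparison can miss rows that arrive later with the same timestamp as the mark, so some teams re-read a small overlap window and rely on the upsert to absorb it.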
Stream Guarantees
Explain delivery guarantees without hand-waving
When you hear these in an interview, this is the concept being tested
- ▸"What does exactly-once mean?"
- ▸"How do you prevent duplicate processing?"
- ▸"At-least-once vs exactly-once: what is the difference?"
What They Want to Hear
After your initial answer, expect these probes
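- ▸"How does Flink achieve exactly-once?" Checkpointing. Flink periodically snapshots operator state together with the source read positions. On failure, it restores the latest snapshot and replays input from those positions. Some events are re-read, but each one affects the restored state exactly once. For results written to external systems, the sink must also be transactional or idempotent, otherwise a replay can produce duplicate output.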
- ▸"If at-least-once duplicates events, how do you deduplicate?" Idempotent writes downstream. MERGE/UPSERT with a unique event ID. The consumer handles duplicates, not the message broker.
- ▸"When would you actually pay for exactly-once?" Financial transactions, billing systems, anything where double-counting costs real money. For metrics dashboards, at-least-once with dedup is fine and much cheaper.
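To make the checkpoint-and-replay idea concrete, here is a toy model in plain Python. It is not Flink's API. The function `run_with_checkpoints` and its `fail_at` parameter are hypothetical, and the "source" is just a list indexed by offset. What it shows is that snapshotting the state together with the read position lets a crashed run converge to the same result as a clean one.

```python
import copy

def run_with_checkpoints(events, checkpoint_every, fail_at=None):
    """Count events per key, snapshotting (offset, state) together.

    fail_at is a hypothetical knob: crash once, just before processing
    the event at that offset. Returns (final_state, process_calls).
    """
    state = {}
    checkpoint = (0, {})   # (next offset to read, state snapshot)
    offset = 0
    failed = False
    process_calls = 0      # counts every processing step, including replays
    while offset < len(events):
        if fail_at is not None and offset == fail_at and not failed:
            failed = True
            # Crash: drop in-memory state, restore the last snapshot, and
            # rewind the source to the offset stored with that snapshot.
            offset, state = checkpoint[0], copy.deepcopy(checkpoint[1])
            continue
        key = events[offset]
        state[key] = state.get(key, 0) + 1
        process_calls += 1
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (offset, copy.deepcopy(state))
    return state, process_calls

events = ["a", "b", "a", "c", "a", "b", "c"]
clean_state, clean_calls = run_with_checkpoints(events, checkpoint_every=3)
crash_state, crash_calls = run_with_checkpoints(events, checkpoint_every=3, fail_at=5)
print(clean_state == crash_state, clean_calls, crash_calls)
```

The crashed run does more processing work than the clean run because two events are replayed, yet the final counts match. The guarantee is about the effect on state, not about how many times an event is physically read.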
| Guarantee | Cost | Use When |
|---|---|---|
| At-most-once | Cheapest | Metrics that tolerate gaps |
| At-least-once | Moderate | Most pipelines (dedup downstream) |
| Exactly-once | Expensive | Financial transactions, billing |
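The "at-least-once plus dedup downstream" row can be sketched as a consumer that writes with a unique event ID as the key. This is a minimal illustration using SQLite. The `payments` table, the `apply_events` function, and the event IDs are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sink table: the unique event ID is the primary key.
conn.execute(
    "CREATE TABLE payments (event_id TEXT PRIMARY KEY, account TEXT, amount REAL)"
)

def apply_events(conn, events):
    """Write events keyed by event_id; redelivered events are ignored."""
    conn.executemany(
        "INSERT INTO payments (event_id, account, amount) VALUES (?, ?, ?) "
        "ON CONFLICT(event_id) DO NOTHING",
        events,
    )
    conn.commit()

batch = [
    ("evt-1", "acct-1", 100.0),
    ("evt-2", "acct-1", 50.0),
    ("evt-1", "acct-1", 100.0),  # duplicate delivery inside the same batch
]
apply_events(conn, batch)
apply_events(conn, batch)        # the broker redelivers the whole batch

row_count, total = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM payments"
).fetchone()
print(row_count, total)
```

The broker delivered evt-1 four times, but the balance only counts it once. The dedup lives in the write, which is why the consumer, not the broker, owns correctness here.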
File Format Depth
Nail the 'Why Parquet?' screening question
When you hear these in an interview, this is the concept being tested
- ▸"Why Parquet over CSV?"
- ▸"What compression do you use?"
- ▸"Row-oriented vs columnar: explain the difference."
The #1 Screening Question
After your initial answer, expect these probes
- ▸"When would you NOT use Parquet?" Small files exchanged between teams (CSV is human-readable). Streaming ingestion (Avro with a schema registry is better). Data under 10 MB (overhead of columnar format outweighs benefits).
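- ▸"What about write performance?" Parquet is slower to write because the writer buffers a whole row group in memory, then encodes and compresses each column separately before flushing. For high-throughput streaming writes, a row-oriented format like Avro is faster because records are appended as they arrive.
- ▸"Snappy or Gzip?" Snappy for interactive queries (fast decompression). Gzip for archival (better compression, slower reads). Zstd is increasingly the best-of-both choice: compression ratios close to Gzip with much faster decompression.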
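If you are asked to explain how a columnar layout with row-group statistics saves work, a toy model helps. The sketch below is plain Python and is not the Parquet file format. The functions `build_row_groups` and `scan` are hypothetical. It only illustrates two ideas: reading just the columns a query selects, and skipping a group when its min/max range cannot match the filter.

```python
def build_row_groups(rows, columns, group_size):
    """Split rows into groups; store each column separately with min/max stats."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        data = {c: [r[i] for r in chunk] for i, c in enumerate(columns)}
        stats = {c: (min(v), max(v)) for c, v in data.items()}
        groups.append({"data": data, "stats": stats})
    return groups

def scan(groups, select, where_col, lo, hi):
    """Return the selected columns for rows where lo <= where_col <= hi."""
    out, groups_read = [], 0
    for g in groups:
        gmin, gmax = g["stats"][where_col]
        if gmax < lo or gmin > hi:
            continue  # statistics prove no row can match: skip the whole group
        groups_read += 1
        keep = [i for i, v in enumerate(g["data"][where_col]) if lo <= v <= hi]
        # Only the selected columns are touched; the others are never read.
        out.extend(tuple(g["data"][c][i] for c in select) for i in keep)
    return out, groups_read

columns = ["day", "user", "amount"]
rows = [(day, f"user{day % 3}", day * 1.5) for day in range(1, 13)]
groups = build_row_groups(rows, columns, group_size=4)

result, groups_read = scan(groups, select=["amount"], where_col="day", lo=6, hi=7)
print(result, groups_read, len(groups))
```

Here the filter on `day` touches one of three groups, and the `user` column is never read at all. Skipping works best when the data is roughly ordered by the filter column, as it is in this example.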
API Patterns
Answer advanced API ingestion probes
When you hear these in an interview, this is the concept being tested
- ▸"How do you handle paginated APIs at scale?"
- ▸"What happens when the API changes its schema?"
- ▸"How do you backfill historical data from an API?"
What They Want to Hear
After your initial answer, expect these probes
- ▸"Why cursors over page numbers?" If records are inserted or deleted between page fetches, page-number pagination can skip records or return duplicates. Cursors are stable.
- ▸"The API returns 429 (rate limited). What does your pipeline do?" Exponential backoff with jitter. Read the Retry-After header. Never hammer a rate-limited endpoint in a tight loop.
- ▸"Your vendor's API goes down for 6 hours. What happens?" The pipeline should retry with backoff, then alert on-call after N failures. On recovery, backfill the gap using the last successful timestamp.
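The cursor and rate-limit answers above can be combined into one extraction loop. What follows is a sketch against an in-memory fake, not a real vendor API. `FakeApi`, `fetch_all`, `backoff_delay`, and the `next_cursor` and `items` field names are assumptions made for the example. It also assumes `Retry-After` carries a number of seconds, although real servers may send an HTTP date instead.

```python
import random
import time

def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0, rng=random):
    """Seconds to wait before retrying; attempt is 0-based."""
    if retry_after is not None:
        return float(retry_after)           # the server said how long to wait
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)          # exponential backoff with full jitter

def fetch_all(fetch_page, max_retries=5, sleep=time.sleep):
    """Follow cursors until exhausted, backing off whenever the API returns 429."""
    records, cursor = [], None
    while True:
        for attempt in range(max_retries + 1):
            status, headers, body = fetch_page(cursor)
            if status != 429:
                break
            if attempt == max_retries:
                raise RuntimeError("still rate limited after retries; alert on-call")
            sleep(backoff_delay(attempt, headers.get("Retry-After")))
        if status != 200:
            raise RuntimeError(f"unexpected status {status}")
        records.extend(body["items"])
        cursor = body.get("next_cursor")
        if cursor is None:
            return records

class FakeApi:
    """Hypothetical vendor API: rate-limits the first request for each cursor."""
    def __init__(self, items, page_size):
        self.items, self.page_size = items, page_size
        self.throttled = set()

    def __call__(self, cursor):
        if cursor not in self.throttled:
            self.throttled.add(cursor)
            return 429, {"Retry-After": "2"}, {}
        after = -1 if cursor is None else int(cursor)
        # Keyset cursor: "everything after this id", stable under inserts/deletes.
        page = [r for r in self.items if r["id"] > after][: self.page_size]
        more = bool(page) and page[-1]["id"] < self.items[-1]["id"]
        return 200, {}, {"items": page, "next_cursor": str(page[-1]["id"]) if more else None}

waits = []  # record sleeps instead of actually sleeping
api = FakeApi([{"id": i} for i in range(1, 6)], page_size=2)
records = fetch_all(api, sleep=waits.append)
print([r["id"] for r in records], waits)
```

Passing `sleep` as a parameter keeps the loop testable without real delays. The cursor in the fake is the last id seen rather than a page number, which is what makes it stable when rows shift between requests.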
Hybrid Architectures
Navigate the hybrid architecture follow-up question
When you hear these in an interview, this is the concept being tested
- ▸"What is the Lambda architecture?"
- ▸"Micro-batch vs true streaming: when do you use each?"
- ▸"Can you combine batch and streaming?"
What They Want to Hear
After your initial answer, expect these probes
- ▸"Which would you choose?" For most teams, neither. Micro-batch (Spark Structured Streaming) gives near-real-time freshness with batch simplicity. Lambda and Kappa are for when you truly need both sub-second freshness AND accurate historical reprocessing.
- ▸"What is micro-batch?" Spark Structured Streaming processes data in small time windows (seconds to minutes) rather than event-by-event. You give up sub-second latency but gain batch-like simplicity.
- ▸"Is Lambda outdated?" The original pattern is, but the concept (batch for accuracy + streaming for speed) lives on in every modern platform that runs both.
| Pattern | Freshness | Complexity | Best For |
|---|---|---|---|
| Pure Batch | Hours | Low | Daily reports, ML training |
| Micro-batch | Seconds to minutes | Medium | Near-real-time dashboards |
| Lambda | Seconds + accurate | High | Both freshness and accuracy needed |
| Kappa | Seconds | Medium | Event-sourced systems with replay |
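The micro-batch row is easy to picture with a few lines of code. This sketch is plain Python, not Spark. The function `micro_batches` is hypothetical, and the timestamps stand for arrival times in seconds. It groups events by trigger interval and then runs ordinary batch logic on each group.

```python
from collections import defaultdict

def micro_batches(events, interval_s):
    """Group (arrival_time_s, value) events into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for ts, value in events:
        batches[int(ts // interval_s)].append(value)
    # Each entry is (batch start time, values that arrived in that interval).
    return [(k * interval_s, batches[k]) for k in sorted(batches)]

events = [(0.2, 1), (1.7, 2), (4.9, 3), (5.0, 4), (11.3, 5)]
batches = micro_batches(events, interval_s=5)

# Per-batch work is plain batch code: here, a sum per trigger interval.
totals = [(start, sum(values)) for start, values in batches]
print(batches)
print(totals)
```

An event that arrives at 0.2 seconds is not visible until its 5-second batch closes. That wait is the latency you trade for being able to reuse batch-style code.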
Survive the follow-up probes on batch, streaming, and hybrid
- Category: Pipeline Architecture
- Difficulty: intermediate
- Duration: 25 minutes
- Challenges: 0 hands-on challenges
Topics covered: Batch Mechanics, Stream Guarantees, File Format Depth, API Patterns, Hybrid Architectures
Lesson Sections
- Batch Mechanics (concepts: paBatchProcessing)
What They Want to Hear: 'I use a high-water mark pattern. The pipeline records the maximum timestamp from the last successful run. On the next run, it only reads rows with a timestamp after that mark. This means we process 50,000 changed rows instead of re-reading 500 million.' That is the core answer. Then add depth: 'I run incremental daily and a full reload weekly as a safety net to catch anything the incremental logic missed.'
- Stream Guarantees (concepts: paStreamProcessing)
What They Want to Hear: 'Exactly-once means every event is processed exactly one time, with no loss and no duplicates. In practice, most production systems use at-least-once delivery with idempotent writes downstream, because true exactly-once is expensive.' This is the answer that shows real-world experience. Candidates who say 'just use exactly-once' without mentioning the cost are raising a red flag.
- File Format Depth (concepts: paFileIngestion)
The #1 Screening Question: 'Why Parquet?' comes up as a screening question in over half of pipeline interviews. Here is the three-word answer: columnar, compressed, predicate-pushdown. Then expand: 'Parquet stores each column separately, so an analytical query that needs only 3 of 100 similarly sized columns reads roughly 97% less data. Values of the same type stored together compress far better than mixed-type rows, often several times smaller. And row group statistics (min and max per column) let the engine skip entire chunks without reading them.'
- API Patterns (concepts: paApiIngestion)
What They Want to Hear: 'I paginate with cursors, not page numbers, because data can shift between pages during extraction. I store the raw API response before transforming so schema changes do not destroy historical data. For backfill, I request a bulk export from the vendor rather than crawling months of paginated responses through a rate-limited API.'
- Hybrid Architectures (concepts: paBatchVsStreaming)
What They Want to Hear: 'Lambda runs batch and streaming in parallel: a batch layer for accurate historical data and a speed layer for fresh approximate data. The serving layer merges both. The problem is maintaining two separate codebases. Kappa eliminates the batch layer entirely by replaying the event log when you need to reprocess. The tradeoff is that Kappa requires a durable event log (such as Kafka) large enough to hold your full history.'
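The Lambda versus Kappa contrast can be shown in a few lines. This is a deliberately tiny model. The functions `batch_view`, `speed_view`, `serve`, and `kappa_view` are hypothetical names, and events are just `(timestamp, key)` pairs being counted per key.

```python
from collections import Counter

def batch_view(events, cutoff):
    """Lambda batch layer: accurate counts for everything up to the cutoff."""
    return Counter(key for ts, key in events if ts <= cutoff)

def speed_view(events, cutoff):
    """Lambda speed layer: fresh counts for events after the cutoff."""
    return Counter(key for ts, key in events if ts > cutoff)

def serve(batch, speed):
    """Lambda serving layer: merge the two views at query time."""
    return batch + speed

def kappa_view(log):
    """Kappa: one code path; reprocessing means replaying the whole log."""
    return Counter(key for _, key in log)

events = [(1, "a"), (2, "b"), (3, "a"), (4, "a"), (5, "b")]
lambda_result = serve(batch_view(events, cutoff=3), speed_view(events, cutoff=3))
kappa_result = kappa_view(events)
print(lambda_result == kappa_result, dict(lambda_result))
```

Both approaches produce the same counts. Lambda needed two counting functions plus a merge step, while Kappa needed one function and access to the full log. In a real system those two Lambda functions are separate batch and streaming codebases, which is where the maintenance cost comes from.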