Survive the follow-up probes on batch, streaming, and hybrid
Topics covered: Batch Mechanics, Stream Guarantees, File Format Depth, API Patterns, Hybrid Architectures
What They Want to Hear
'I use a high-water mark pattern. The pipeline records the maximum timestamp from the last successful run. On the next run, it only reads rows with a timestamp after that mark. This means we process 50,000 changed rows instead of re-reading 500 million.' That is the core answer. Then add depth: 'I run incremental daily and a full reload weekly as a safety net to catch anything the incremental logic missed.'
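The high-water mark pattern above can be sketched in a few lines. This is a minimal illustration using SQLite; the table and column names (`events`, `watermarks`, `updated_at`) and the pipeline name are assumptions for the example, not part of any specific system.

```python
import sqlite3

def incremental_extract(conn, pipeline="daily_events"):
    # Hypothetical watermark table: one row per pipeline, storing the
    # max timestamp seen on the last successful run.
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE IF NOT EXISTS watermarks (pipeline TEXT PRIMARY KEY, mark TEXT)"
    )
    row = cur.execute(
        "SELECT mark FROM watermarks WHERE pipeline = ?", (pipeline,)
    ).fetchone()
    mark = row[0] if row else "1970-01-01T00:00:00"

    # Read only rows changed since the last successful run.
    rows = cur.execute(
        "SELECT id, updated_at FROM events WHERE updated_at > ? ORDER BY updated_at",
        (mark,),
    ).fetchall()

    if rows:
        # Advance the mark only after the batch is safely in hand, so a
        # failed run simply re-reads the same window next time.
        new_mark = rows[-1][1]
        cur.execute(
            "INSERT INTO watermarks (pipeline, mark) VALUES (?, ?) "
            "ON CONFLICT(pipeline) DO UPDATE SET mark = excluded.mark",
            (pipeline, new_mark),
        )
        conn.commit()
    return rows
```

Note that the watermark advances only after a successful extract, which is what makes the weekly full reload a safety net rather than a necessity.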
What They Want to Hear
'Exactly-once means every event is processed exactly one time, with no loss and no duplicates. In practice, most production systems use at-least-once delivery with idempotent writes downstream, because true exactly-once is expensive.' This is the answer that shows real-world experience. Candidates who say 'just use exactly-once' without mentioning the cost are waving a red flag.
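A minimal sketch of the idempotent-write half of that answer: if the event id is the primary key of the target table, a redelivered event becomes a no-op, so at-least-once delivery upstream still yields exactly-once effects. The `ledger` table and column names are illustrative assumptions.

```python
import sqlite3

def apply_event(conn, event_id, amount):
    # Idempotent write: event_id is the primary key, so a duplicate
    # delivery of the same event inserts nothing the second time.
    conn.execute(
        "INSERT INTO ledger (event_id, amount) VALUES (?, ?) "
        "ON CONFLICT(event_id) DO NOTHING",
        (event_id, amount),
    )
    conn.commit()
```

The design choice to highlight in an interview: deduplication lives at the write boundary, keyed by a stable event id, rather than in the delivery layer.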
The #1 Screening Question
'Why Parquet?' is asked in over half of pipeline interviews as a screening question. Here is the three-word answer: columnar, compressed, predicate-pushdown. Then expand: 'Parquet stores each column separately, so analytical queries that only need 3 out of 100 columns read 97% less data. Same-type values grouped together compress 10x better than mixed-type rows. And row group statistics let the engine skip entire chunks without reading them.'
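The row-group skipping idea can be shown in pure Python without the Parquet format itself. This sketch models each row group as (statistics, rows); the filter consults the min/max statistics first and skips whole groups whose values cannot match, which is the essence of predicate pushdown. The data structures here are illustrative, not Parquet's actual layout.

```python
def scan_with_pushdown(row_groups, threshold):
    """Return values > threshold, reading as few row groups as possible.

    row_groups: list of ({"min": x, "max": y}, [values]) pairs, a toy
    stand-in for Parquet row groups with column statistics.
    """
    matched, groups_read = [], 0
    for stats, rows in row_groups:
        # Predicate pushdown: if the group's max is at or below the
        # threshold, no row inside can match, so skip it unread.
        if stats["max"] <= threshold:
            continue
        groups_read += 1
        matched.extend(v for v in rows if v > threshold)
    return matched, groups_read
```

In a real engine the same check runs against footer statistics before any column data is decompressed, which is why selective queries touch only a fraction of the file.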
What They Want to Hear
'I paginate with cursors, not page numbers, because data can shift between pages during extraction. I store the raw API response before transforming so schema changes do not destroy historical data. For backfill, I request a bulk export from the vendor rather than crawling months of paginated responses through a rate-limited API.'
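Cursor pagination reduces to a simple loop: follow the opaque cursor the server returns until it stops handing one back. The `fetch_page` callable and the `items` / `next_cursor` field names below are assumptions for the sketch, not any particular vendor's API.

```python
def extract_all(fetch_page):
    """Drain a cursor-paginated endpoint.

    fetch_page(cursor) is assumed to return a dict like
    {"items": [...], "next_cursor": <opaque token or None>}.
    """
    cursor, records = None, []
    while True:
        page = fetch_page(cursor)
        records.extend(page["items"])
        # The server-issued cursor pins our position in the result set,
        # so inserts or deletes between requests cannot shift pages.
        cursor = page.get("next_cursor")
        if cursor is None:
            return records
```

The key contrast with page-number pagination is that the cursor is a position in the data, not an offset into it, so concurrent writes cannot cause skipped or duplicated rows.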
What They Want to Hear
'Lambda runs batch and streaming in parallel: a batch layer for accurate historical data and a speed layer for fresh approximate data. The serving layer merges both. The problem is maintaining two separate codebases. Kappa eliminates the batch layer entirely by replaying the event log when you need to reprocess. The tradeoff is that Kappa requires a durable event log (Kafka) large enough to hold your full history.'
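The serving-layer merge in a Lambda architecture can be sketched as a simple precedence rule: where the batch view has an answer, it wins; the speed layer only fills in keys the batch run has not covered yet. Representing each view as a dict of key to count is an illustrative assumption.

```python
def serve(batch_view, speed_view):
    # Start from the speed layer: fresh but approximate results.
    merged = dict(speed_view)
    # Overlay the batch layer: accurate results take precedence
    # wherever the batch run has already covered a key.
    merged.update(batch_view)
    return merged
```

For example, `serve({"a": 100}, {"a": 90, "b": 5})` keeps the accurate batch count for `a` and the fresh speed-layer count for `b`. Kappa avoids this merge entirely, at the cost of keeping the full event history replayable.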