Answer the Lambda vs Kappa and cost optimization questions
What They Want to Hear
"Lambda architecture runs two parallel paths. The batch layer processes all historical data for accuracy. The speed layer processes real-time events for freshness. A serving layer merges both views so consumers get fast results that are eventually corrected by the batch layer."
That is the answer. Two paths, one for accuracy and one for speed, merged at the serving layer.
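To make the two paths concrete, here is a minimal sketch of a Lambda-style page-view counter. The event shape, the `BATCH_CUTOFF` timestamp, and all function names are assumptions made up for this illustration, not part of any framework.

```python
from collections import Counter

def batch_view(events, cutoff):
    """Batch layer: recompute exact counts from all events up to the cutoff."""
    return Counter(e["page"] for e in events if e["ts"] <= cutoff)

def speed_view(events, cutoff):
    """Speed layer: count only the events that arrived after the last batch run."""
    return Counter(e["page"] for e in events if e["ts"] > cutoff)

def serve(batch, speed):
    """Serving layer: merge both views into the answer consumers see."""
    return batch + speed

# Hypothetical event log; ts is a logical timestamp.
events = [
    {"ts": 1, "page": "home"},
    {"ts": 2, "page": "home"},
    {"ts": 3, "page": "pricing"},
    {"ts": 4, "page": "home"},
]
BATCH_CUTOFF = 2  # assumed: the last batch run covered events with ts <= 2

merged = serve(batch_view(events, BATCH_CUTOFF), speed_view(events, BATCH_CUTOFF))
print(dict(merged))
```

Note that `batch_view` and `speed_view` are separate functions that must agree on the counting logic. In a real system they are often written in different engines, which is the dual-codebase cost usually raised against Lambda.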
What They Want to Hear
"Kappa uses a single streaming pipeline for everything. Real-time events are processed as they arrive. Historical reprocessing is done by replaying the event log through the same pipeline. This eliminates the dual-codebase problem of Lambda: one codebase, one path, one version of the truth."
That is the answer. Single path, replay for reprocessing, no duplicate logic.
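A minimal sketch of the Kappa idea, assuming the event log is a plain in-memory list standing in for a durable, replayable log. The names `process` and `replay` are hypothetical.

```python
def process(state, event):
    """The single processing function, used for live and replayed events alike."""
    page = event["page"]
    state[page] = state.get(page, 0) + 1
    return state

def replay(log, from_offset=0):
    """Reprocess history by feeding the log through the same function."""
    state = {}
    for event in log[from_offset:]:
        process(state, event)
    return state

# Hypothetical append-only event log.
event_log = [
    {"page": "home"},
    {"page": "pricing"},
    {"page": "home"},
]

# Live path: events are handled one at a time as they arrive.
live_state = {}
for event in event_log:
    process(live_state, event)

# Reprocessing path: rebuild the state from offset 0 with the same code.
rebuilt_state = replay(event_log)
print(live_state == rebuilt_state)
```

Because both paths call `process`, a logic change is made once and applied to history by replaying, rather than being ported to a second batch codebase.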
What They Want to Hear
"Predicate pushdown moves filter conditions as close to the storage layer as possible so that irrelevant data is never read. If my table is partitioned by date and I query WHERE date = '2025-03-15', the engine skips all other date partitions entirely. Within a Parquet file, row group statistics (min/max values) let the engine skip entire row groups without reading individual rows. The result: a query that would otherwise scan 1 TB might read only 10 GB."
That is the answer. Push filters to
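The row-group skipping described in the answer can be sketched in plain Python. This is a toy model, not how any Parquet reader is implemented: the `RowGroup` class and `scan` function are assumptions for illustration, and dates are ISO strings so that string comparison matches date order.

```python
from dataclasses import dataclass

@dataclass
class RowGroup:
    min_date: str  # smallest date value in the group (the stored statistic)
    max_date: str  # largest date value in the group
    rows: list

def scan(row_groups, target_date):
    """Return matching rows and how many row groups actually had to be read."""
    matched, groups_read = [], 0
    for rg in row_groups:
        # Pushdown: if the target lies outside [min, max], skip the whole group.
        if target_date < rg.min_date or target_date > rg.max_date:
            continue
        groups_read += 1
        matched.extend(r for r in rg.rows if r["date"] == target_date)
    return matched, groups_read

# Hypothetical file with three row groups, sorted by date.
row_groups = [
    RowGroup("2025-03-13", "2025-03-14", [{"date": "2025-03-13"}, {"date": "2025-03-14"}]),
    RowGroup("2025-03-15", "2025-03-15", [{"date": "2025-03-15"}, {"date": "2025-03-15"}]),
    RowGroup("2025-03-16", "2025-03-17", [{"date": "2025-03-16"}, {"date": "2025-03-17"}]),
]

matched, groups_read = scan(row_groups, "2025-03-15")
print(len(matched), groups_read)
```

The statistics only help when the data is sorted or clustered on the filter column. If every row group spans the full date range, no group can be skipped.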
What They Want to Hear
"I optimize cost at three levels. Storage: use columnar formats (Parquet) and compression to reduce data size, and lifecycle policies to move cold data to cheaper storage tiers. Compute: right-size clusters based on actual usage, not peak capacity, and use spot instances for non-critical batch jobs, often at a 60-70% discount. Query: predicate pushdown and partition pruning avoid reading unnecessary data, which reduces both compute time and I/O costs."
That is the answer. Three levels: storage, compute, query.
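A back-of-the-envelope sketch of how two of these levers show up in a bill. Every price here is a hypothetical placeholder chosen for round numbers, not a quote from any provider; only the 1 TB versus 10 GB scan sizes and the 60-70% discount range come from the answers above.

```python
TB = 10**12  # bytes, decimal terabyte
GB = 10**9   # bytes, decimal gigabyte

def scan_cost(bytes_scanned, price_per_tb):
    """Query cost under an assumed pay-per-byte-scanned pricing model."""
    return bytes_scanned / TB * price_per_tb

def spot_cost(on_demand_cost, discount):
    """Compute cost after a spot discount given as a fraction (0.6 = 60%)."""
    return on_demand_cost * (1 - discount)

PRICE_PER_TB = 5.00         # hypothetical price per TB scanned
ON_DEMAND_JOB_COST = 100.0  # hypothetical on-demand cost of one batch job

full_scan = scan_cost(1 * TB, PRICE_PER_TB)     # no pruning
pruned_scan = scan_cost(10 * GB, PRICE_PER_TB)  # with pushdown and pruning
spot_low = spot_cost(ON_DEMAND_JOB_COST, 0.70)  # best case in the quoted range
spot_high = spot_cost(ON_DEMAND_JOB_COST, 0.60) # worst case in the quoted range

print(f"scan: {full_scan:.2f} -> {pruned_scan:.2f}")
print(f"batch job on spot: {spot_low:.2f} to {spot_high:.2f}")
```

In an interview, walking through arithmetic like this with the team's real numbers is usually more convincing than listing the techniques alone.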
What They Want to Hear
"An idempotent pipeline produces the same result whether it runs once or ten times on the same input. I achieve this by using MERGE/UPSERT on a unique key for row-level updates, or partition REPLACE for partition-level writes. This makes retries, backfills, and re-runs safe. It is the single most important property of a production pipeline."
That is the answer. Same result on re-run. MERGE or REPLACE. Safety for operations.
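A minimal sketch of the partition REPLACE pattern, using an in-memory SQLite database as a stand-in for a warehouse table. The `sales` table, its columns, and the `load_partition` function are assumptions for illustration. A real warehouse would typically use its own partition overwrite or MERGE statement instead of DELETE plus INSERT.

```python
import sqlite3

def load_partition(conn, day, rows):
    """Replace the whole partition for `day` inside one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM sales WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO sales (day, order_id, amount) VALUES (?, ?, ?)",
            [(day, order_id, amount) for order_id, amount in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, order_id INTEGER, amount REAL)")

# Hypothetical input for one day: (order_id, amount) pairs.
rows = [(1, 10.0), (2, 25.5)]

# Simulate a retry storm: the same load runs three times on the same input.
for _ in range(3):
    load_partition(conn, "2025-03-15", rows)

count, total = conn.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone()
print(count, total)
```

A plain INSERT without the DELETE would leave six rows after three runs. Replacing the partition is what makes the re-run safe.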