Kafka system design interview questions for data engineer roles. Partition strategy for high-throughput ingest. Consumer groups and rebalance protocol. Exactly-once with transactional writes and read_committed. Replication factor and ISR. Log retention and compaction. Kafka Streams for in-Kafka processing.
Kafka system design questions in data engineer interviews test seven recurring concerns. Partition strategy: how many partitions, which key. Target throughput drives partition count: plan for roughly 10-20 MB/sec per partition to leave safe headroom, so a 580 MB/sec peak ingest needs 29-58 partitions. Key choice determines ordering guarantees: keyed by user_id, all events for one user land on one partition, and ordering is preserved only within a partition. Partition count is hard to change post-creation: it can be increased but never decreased, and increasing it remaps keys to different partitions, which breaks per-key ordering across the change. Size for 2-3x growth.
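The sizing arithmetic above can be sketched in a few lines. This is a minimal illustration, not Kafka code: `partition_range` and `partition_for` are hypothetical helper names, and `zlib.crc32` is only a stand-in for the murmur2 hash that Kafka's default partitioner applies to the key bytes.

```python
import math
import zlib


def partition_range(peak_mb_per_sec, low_mb=10.0, high_mb=20.0, growth=1.0):
    """Return (fewest, most) partitions for a peak throughput.

    low_mb / high_mb are the assumed safe per-partition throughputs;
    growth is a multiplier for expected traffic growth.
    """
    target = peak_mb_per_sec * growth
    fewest = math.ceil(target / high_mb)
    most = math.ceil(target / low_mb)
    return fewest, most


def partition_for(key, num_partitions):
    """Map a key to a partition: hash(key) mod partition count.

    crc32 is a stand-in hash for illustration; Kafka itself uses murmur2.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


print(partition_range(580))              # (29, 58)
print(partition_range(580, growth=2.0))  # (58, 116)

# The same key always lands on the same partition for a fixed count...
assert partition_for("user-42", 58) == partition_for("user-42", 58)

# ...but growing the topic changes the modulus, so many keys move.
keys = [f"user-{i}" for i in range(1000)]
remapped = sum(1 for k in keys if partition_for(k, 29) != partition_for(k, 58))
print(f"{remapped} of {len(keys)} keys change partition when going from 29 to 58")
```

The last loop shows why resizing is painful for keyed topics: after the change, new events for a remapped key go to a different partition than its older events, so a consumer can no longer rely on seeing that key's history in order.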
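Consumer groups and rebalance protocol. A consumer group is the unit of parallelism for downstream processing; each partition is consumed by exactly one consumer in the group, so consumers beyond the partition count sit idle. A rebalance happens when consumers join or leave and redistributes partitions among the members. Under the older eager protocol, every consumer gives up all its partitions and stops processing until the new assignment arrives. Cooperative rebalancing (Kafka 2.4+) revokes only the partitions that actually change owner, so the other consumers keep processing during the transition. For data engineer interview rounds, the rebalance story matters because it determines processing latency during scale events. Sticky partition assignment keeps as many partitions as possible with their current owner, which minimizes partition movement and the state reloading that goes with it.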
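To make the value of sticky assignment concrete, the toy simulation below compares a from-scratch round-robin reassignment with a sticky one when a consumer leaves the group. It is a sketch under simplifying assumptions: the function names are hypothetical, only consumer departures are handled, and the logic is far simpler than Kafka's real assignors.

```python
def round_robin_assign(partitions, consumers):
    """Assign partitions from scratch, ignoring any previous ownership."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


def sticky_reassign(previous, consumers):
    """Keep surviving consumers' partitions; hand out only orphaned ones.

    Toy version: handles consumers leaving, not consumers joining.
    """
    assignment = {c: list(previous.get(c, [])) for c in consumers}
    orphaned = [p for c, ps in previous.items() if c not in consumers for p in ps]
    for p in orphaned:
        # Give each orphan to whichever consumer currently owns the fewest.
        target = min(consumers, key=lambda c: len(assignment[c]))
        assignment[target].append(p)
    return assignment


def moved(before, after):
    """Count partitions whose owner changed between two assignments."""
    owner_before = {p: c for c, ps in before.items() for p in ps}
    owner_after = {p: c for c, ps in after.items() for p in ps}
    return sum(1 for p, c in owner_after.items() if owner_before.get(p) != c)


partitions = list(range(6))
before = round_robin_assign(partitions, ["a", "b", "c"])

# Consumer "b" leaves the group.
from_scratch = round_robin_assign(partitions, ["a", "c"])
sticky = sticky_reassign(before, ["a", "c"])

print(moved(before, from_scratch))  # 4
print(moved(before, sticky))        # 2
```

Only the two partitions that belonged to the departed consumer move in the sticky case; the from-scratch reassignment also shuffles partitions between the two survivors, which forces them to reload state they already had.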
Exactly-once semantics. The Kafka transactional API plus the consumer setting isolation.level = read_committed gives end-to-end exactly-once within a Kafka cluster. The producer writes to multiple topics in a single transaction; in a consume-transform-produce loop it also commits the consumed offsets inside that transaction, so outputs and progress succeed or fail together. The consumer reads only committed messages. The idempotent producer (enable.idempotence = true, which transactions require) lets the broker discard duplicate writes caused by producer retries, so at-least-once delivery yields an exactly-once effect. Limitation: this works only within the Kafka cluster boundary; sinks outside Kafka need their own idempotency (MERGE INTO, dedup on composite key).
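The sink-side half of that story, dedup on a composite key, can be sketched with an in-memory SQLite table standing in for the warehouse. The table layout and the `write_batch` helper are assumptions made for illustration; a real warehouse would more likely use MERGE INTO, but the idea is the same: the record's Kafka coordinates form a primary key, so a redelivered batch changes nothing.

```python
import sqlite3


def write_batch(conn, records):
    """Insert records; rows with an already-seen (topic, partition, offset) are skipped."""
    with conn:  # commits the batch as one database transaction
        conn.executemany(
            "INSERT OR IGNORE INTO events "
            "(kafka_topic, kafka_partition, kafka_offset, payload) "
            "VALUES (?, ?, ?, ?)",
            records,
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    " kafka_topic TEXT, kafka_partition INTEGER, kafka_offset INTEGER,"
    " payload TEXT,"
    " PRIMARY KEY (kafka_topic, kafka_partition, kafka_offset))"
)

batch = [("payments", 0, 100, "charge:42"), ("payments", 0, 101, "refund:42")]
write_batch(conn, batch)
write_batch(conn, batch)  # simulated redelivery after a consumer restart

row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(row_count)  # 2
```

Because the write is idempotent, the consumer can fall back to plain at-least-once delivery into the sink and still end up with each event stored once.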
Replication and ISR. Replication factor 3 is the standard choice for production data engineer pipelines (the broker setting default.replication.factor itself defaults to 1). ISR (in-sync replicas) is the set of replicas currently caught up to the leader. Producer acks: acks = 0 (fire and forget), acks = 1 (leader confirms), acks = all (all ISR confirm). For data engineer pipelines, acks = all with min.insync.replicas = 2 provides durability against single-broker failure: every acknowledged write exists on at least two brokers, and writes continue with one of three replicas down. Unclean leader election (allowing out-of-sync replicas to become leader) trades durability for availability; default is false in modern Kafka.
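The interaction between acks, the ISR, and min.insync.replicas reduces to a small rule, sketched below. The function names are hypothetical and the model is deliberately simplified: it treats a write as accepted or rejected based only on the current ISR size, which is how the broker decides whether to reject an acks = all write for not having enough in-sync replicas.

```python
def write_accepted(acks, isr_size, min_insync_replicas):
    """Return True if the partition leader would accept a produce request.

    min.insync.replicas is enforced only for acks="all"; for acks "0" or "1"
    a live leader (ISR of at least 1) is enough.
    """
    if acks == "all":
        return isr_size >= min_insync_replicas
    return isr_size >= 1


def tolerated_broker_failures(replication_factor, min_insync_replicas):
    """Replicas that can be lost while acks=all writes are still accepted."""
    return replication_factor - min_insync_replicas


print(tolerated_broker_failures(3, 2))  # 1

# Replication factor 3, min.insync.replicas=2:
print(write_accepted("all", isr_size=3, min_insync_replicas=2))  # True: healthy
print(write_accepted("all", isr_size=2, min_insync_replicas=2))  # True: one broker down
print(write_accepted("all", isr_size=1, min_insync_replicas=2))  # False: writes rejected
print(write_accepted("1", isr_size=1, min_insync_replicas=2))    # True: but not durable
```

The last two lines show the trade-off: with two replicas down, acks = all refuses writes rather than acknowledging data that lives on a single broker, while acks = 1 keeps accepting writes that a further failure could lose.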
Log retention and compaction. Two retention modes. Time-based (retention.ms): events older than the configured age are deleted. Kafka's default of 7 days suits most data engineer use cases because it leaves a replay window. Compaction (cleanup.policy = compact): at least the latest value per key is retained; older values are garbage-collected in the background. Useful for change-log topics, where the latest state per primary key is what matters. The two modes compose: cleanup.policy = compact,delete compacts the topic and also expires records older than the retention window.
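What compaction leaves behind can be shown with a short simulation. This is a sketch of the end result only, with a hypothetical `compact` function: real compaction runs in the background, skips the active segment, and keeps delete markers (records with a null value) for a while before dropping them, whereas this version removes them immediately.

```python
def compact(log):
    """Keep the latest record per key, in offset order.

    log is a list of (key, value) pairs; the list index plays the role of
    the offset. A value of None is a tombstone that deletes the key.
    """
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later offsets overwrite earlier ones
    survivors = sorted((off, key, val) for key, (off, val) in latest.items())
    return [(off, key, val) for off, key, val in survivors if val is not None]


log = [
    ("user-1", "email=a@example.com"),
    ("user-2", "email=b@example.com"),
    ("user-1", "email=c@example.com"),  # supersedes offset 0
    ("user-3", "email=d@example.com"),
    ("user-2", None),                   # tombstone: user-2 deleted
]

compacted = compact(log)
print(compacted)
# [(2, 'user-1', 'email=c@example.com'), (3, 'user-3', 'email=d@example.com')]
```

Surviving records keep their original offsets, which is why consumers of a compacted topic see gaps in the offset sequence.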
Kafka Streams for in-Kafka processing. Kafka Streams is an embedded Java library for transformations and aggregations over Kafka topics, with no external compute cluster required. Useful for lightweight processing (enrichment, filtering, simple aggregation). State is stored locally in RocksDB on each instance and backed up to changelog topics in Kafka for fault tolerance. Trade-off: Kafka Streams reads from and writes to Kafka only and runs only on the JVM (Java or Scala). Larger workloads use Flink or Spark Structured Streaming.
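The local-state-plus-changelog pattern can be illustrated without the library itself. The sketch below is plain Python, not the Kafka Streams API: `CountingStore` is a hypothetical class in which a dict stands in for the RocksDB store and a list stands in for the changelog topic.

```python
class CountingStore:
    """Per-key counter whose every update is also appended to a changelog."""

    def __init__(self, changelog):
        self.changelog = changelog  # stand-in for a changelog topic in Kafka
        self.state = {}             # stand-in for the local RocksDB store

    def process(self, key):
        """Count one event for key and record the new value in the changelog."""
        new_count = self.state.get(key, 0) + 1
        self.state[key] = new_count
        self.changelog.append((key, new_count))
        return new_count

    @classmethod
    def restore(cls, changelog):
        """Rebuild local state on a fresh instance by replaying the changelog."""
        store = cls(changelog)
        for key, value in changelog:
            store.state[key] = value  # the last value per key wins
        return store


changelog = []
store = CountingStore(changelog)
for event_key in ["page-a", "page-b", "page-a"]:
    store.process(event_key)

# The instance crashes and its local disk is lost; a replacement replays the changelog.
recovered = CountingStore.restore(changelog)
print(recovered.state)  # {'page-a': 2, 'page-b': 1}
```

Since only the last value per key matters during the replay, a changelog like this is a natural fit for a compacted topic.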
Companies that emphasize Kafka heavily in data engineer interviews: Stripe (financial-data pipelines with transactional Kafka), Netflix (Kafka for streaming ingest, Mantis on top), Uber (Kafka for ride-dispatch analytics, Kafka Streams for some flows), LinkedIn (Kafka was created at LinkedIn; deep Kafka expertise expected). Confluent Cloud and Amazon MSK are common managed offerings.
Kafka System Design Interview Questions
Kafka design problems for data engineer interview prep.
123 practice problems matching this filter. Difficulty: medium (57), hard (66).
Pipeline Architecture (123)
- 45 Minutes Turned Into 3.5 Hours - medium - Spark jobs are running. Just not fast enough.
- 600 Million Events a Day - hard - 600 million events a day. Two years of retention.
- A Clean Number for Every Merchant - hard - Raw payment logs in. Clean merchant summaries out.
- A Million Cars Phoning Home - hard - Every vehicle is a sensor. Deploy the pipeline to catch it all.
- Analysts Are Slowing the Store Down - medium - Orders placed. Data warehouse hungry.
- A New Column on a Billion Rows - hard - Add and backfill a new column to a billion-row production table with zero downtime.
- A Shared Drive Full of Contracts - medium - Buried in PDFs. The data is in there somewhere.
- A Stream All Day and a File at Midnight - hard - Real-time and batch. Same pipeline. No compromises.
- Badging Items That Already Sold Out - hard - Same-day delivery. The features have to be faster.
- Basel, CCAR, and Monday Morning - medium - The regulator does not accept 'eventually consistent.'
Common questions
- How many Kafka partitions does a data engineer pipeline need?
- Roughly peak throughput in MB/sec divided by 10-20 MB/sec per partition for safe headroom. 580 MB/sec peak ingest = 29-58 partitions. Partition count is hard to change post-creation: it can be increased but not decreased, and adding partitions changes which partition each key hashes to, so per-key ordering breaks across the change. Size for 2-3x expected growth. More partitions also mean more producer-side memory and more controller load; do not over-provision.
- How does Kafka achieve exactly-once semantics?
- The Kafka transactional API plus the consumer setting isolation.level = read_committed. The producer writes to multiple topics in a single transaction; the consumer reads only committed messages. Combined with the idempotent producer (enable.idempotence = true), this provides exactly-once within the Kafka cluster. Sinks outside Kafka (warehouses, databases) need their own idempotency: MERGE INTO with run_id, dedup on composite key.
- What is the difference between Kafka acks 0, 1, and all?
- acks=0: the producer does not wait for any acknowledgment. Fire and forget. Risk of data loss on broker failure. acks=1: the producer waits for the leader broker to confirm. Loss is possible if the leader fails before followers replicate the write. acks=all (also acks=-1): the producer waits for all in-sync replicas to confirm. Combined with min.insync.replicas=2 and replication factor 3, this survives a single-broker failure without losing acknowledged writes. It is the standard choice for data engineer production pipelines and the producer default since Kafka 3.0.
- What is a Kafka consumer group and how does rebalance work?
- A consumer group is the unit of parallelism. Each partition is consumed by exactly one consumer in the group. When consumers join or leave, Kafka rebalances by reassigning partitions. Cooperative rebalancing (Kafka 2.4+) does not stop all consumers: only the partitions that change owner are revoked. Sticky assignment minimizes partition movement. Rebalance latency matters for SLA during scale events; faster autoscale means more frequent rebalances.
- When does a data engineer use Kafka log compaction?
- When the topic represents the latest state per key rather than a sequence of events, as in change-log topics (CDC, materialized views). Compaction garbage-collects older values for the same key, retaining only the latest. It can be combined with time-based retention (cleanup.policy = compact,delete) so that the compacted state also expires after the retention window. Useful for state-rebuild logic in Kafka Streams or Flink, which replays the topic to restore the latest value per key.
- What is the difference between Kafka and Kinesis from a data engineer interview perspective?
- Functionally similar: both are distributed, partitioned, durable, append-only logs. Kafka is open-source with the largest ecosystem (Kafka Connect, Schema Registry, ksqlDB, Confluent). Kinesis is AWS-managed with simpler ops and tighter AWS integration. Kafka has lower latency, higher throughput per partition (a Kinesis shard accepts at most 1 MB/sec or 1,000 records/sec of writes), and more flexibility. Kinesis has lower operational overhead. Choose based on company stack.
- What is Kafka Streams and when does a data engineer use it?
- Kafka Streams is an embedded Java library for transformations over Kafka topics, with no external compute cluster required. Useful for lightweight processing: enrichment, filtering, simple aggregation. State lives locally in RocksDB, with changelog topics for fault tolerance. Trade-off: it reads from and writes to Kafka only and runs only on the JVM. Larger workloads use Flink or Spark Structured Streaming. For pipelines that need cross-system joins or write to non-Kafka sinks, pick Flink or Spark.