Apache Kafka shows up in most data engineering interviews involving real-time data. Interviewers test how partitions distribute load, how consumer groups coordinate, what happens when brokers fail, and how exactly-once delivery works in production.
TL;DR: Kafka Interview Questions
Apache Kafka is a distributed event streaming platform built around an append-only partitioned log. Data engineering interviews focus on six areas: brokers and partitions, consumer groups and offsets, replication and ISR, delivery semantics (acks, exactly-once), Kafka Connect and Schema Registry, and operational scenarios like consumer lag, broker failures, and capacity planning.
For most roles you need to explain how partitions are assigned to consumers, when to use acks=all vs acks=1, what the in-sync replica set guarantees, and how exactly-once differs from at-least-once with idempotent writes. Senior roles add cross-datacenter replication, schema evolution strategy, and capacity planning for high-volume topics.
What Interviewers Expect by Seniority
| Level | Expected Coverage |
|---|---|
| Junior | Topics, partitions, consumer groups, offsets. What happens when a consumer joins or leaves a group. |
| Mid | Replication, ISR, delivery guarantees, Schema Registry. Designing a Kafka pipeline and explaining your partition key strategy. |
| Senior | Production scenarios: consumer lag diagnosis, broker failures, cross-datacenter replication, schema evolution strategies, capacity planning for millions of events per second. |
| Staff+ | Tradeoff arguments at the architectural level: Kafka vs Pulsar vs Kinesis, when to introduce Flink vs Kafka Streams, multi-region replication strategy, cost-driven partition design. |
Kafka Architecture: Core Concepts Interviewers Test
Most Kafka interviews lean on these six concepts.
Brokers, Topics, and Partitions
A Kafka cluster is a set of brokers. Each topic is split into partitions, and each partition lives on one broker (with replicas on others). Partitions are the unit of parallelism: more partitions means more consumers can read concurrently. Interviewers test whether you understand that ordering is guaranteed only within a single partition, not across partitions.
Consumer Groups and Offsets
A consumer group is a set of consumers that cooperate to consume a topic. Each partition is assigned to exactly one consumer in the group. Offsets track how far each consumer has read. Committing offsets too early risks data loss; committing too late causes duplicates. This tradeoff is the core of Kafka reliability questions.
Replication, ISR, and Fault Tolerance
Each partition has a leader and zero or more follower replicas. The in-sync replica set (ISR) is the set of replicas, leader included, that are caught up with the leader. If the leader fails, a new leader is elected from the ISR. Setting min.insync.replicas to 2 with acks=all requires at least two replicas (the leader plus one follower) to have every write before it is acknowledged, so an acknowledged write survives the loss of a single broker.
Exactly-Once Semantics
Kafka supports exactly-once delivery through idempotent producers (dedup by sequence number) and transactional APIs (atomic writes across partitions). For end-to-end exactly-once in stream processing, you need the consume-transform-produce pattern with transactions. Interviewers want you to distinguish between broker-level guarantees and application-level guarantees.
Kafka Connect
Kafka Connect is a framework for streaming data between Kafka and external systems. Source connectors ingest data into Kafka; sink connectors write data out. It handles offset management, serialization, and fault tolerance. Interviewers test whether you have used Connect in production versus rolling custom consumers.
Schema Registry
Schema Registry stores Avro, Protobuf, or JSON schemas and enforces compatibility rules (backward, forward, full). Producers register schemas; consumers look them up by ID embedded in the message. This prevents breaking changes from taking down downstream consumers. Interviewers ask about compatibility modes and what happens when a producer tries to register an incompatible schema.
Producer acks Decision Matrix
One of the most-asked configuration questions in Kafka interviews. Worth knowing cold.
| acks | Latency | Durability | Use case |
|---|---|---|---|
| acks=0 | Lowest | No guarantee | Metrics, click events, anything tolerant of loss |
| acks=1 | Medium | Loss if leader dies before replication | Logs, intermediate streaming jobs |
| acks=all | Highest | No loss with min.insync.replicas>=2 | Financial events, orders, anything where loss is unacceptable |
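The matrix can be written down as producer settings. This is a minimal sketch, assuming librdkafka-style configuration keys (the naming used by the confluent-kafka Python client); the tier names and the `producer_config` helper are hypothetical. Note that min.insync.replicas is a topic or broker setting, so it does not appear in the producer config.

```python
# Hypothetical durability tiers mapped to the acks values from the table.
# Config keys follow librdkafka naming (an assumption about the client in use).
ACKS_BY_TIER = {
    "lossy": "0",      # fire and forget: metrics, click events
    "balanced": "1",   # leader-only ack: logs, intermediate streaming jobs
    "durable": "all",  # all in-sync replicas: orders, financial events
}


def producer_config(tier, bootstrap_servers="localhost:9092"):
    """Build a producer config dict for one of the tiers above."""
    config = {
        "bootstrap.servers": bootstrap_servers,  # placeholder address
        "acks": ACKS_BY_TIER[tier],
    }
    if tier == "durable":
        # Idempotence requires acks=all; it removes duplicates caused by retries.
        config["enable.idempotence"] = True
    return config


print(producer_config("durable"))
```

The resulting dict would be passed to the client's producer constructor; durability still depends on the topic's replication factor and min.insync.replicas.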
15 Kafka Interview Questions with Model Answers
Drawn from data engineer interviews at companies like Stripe, Airbnb, Netflix, Meta, Uber, and Snowflake. Answers cover what an interviewer expects from a strong candidate.
A producer sends messages to a topic with 12 partitions. You have 4 consumers in a consumer group. How are partitions assigned, and what happens if a consumer dies?
Each consumer gets 3 partitions (12 / 4). When a consumer dies, a rebalance is triggered. The remaining 3 consumers split the 12 partitions evenly: each gets 4 (under round-robin or range assignment). During the rebalance, consumption pauses briefly. A strong answer mentions the rebalance protocol (eager vs cooperative), partition assignment strategies (range, round-robin, sticky), and the impact of session.timeout.ms and heartbeat.interval.ms on how quickly the dead consumer is detected.
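The arithmetic in this answer can be checked with a small simulation. This is a simplified model of round-robin assignment, not the client's actual assignor; the consumer names are hypothetical, and it ignores stickiness and the rebalance protocol itself.

```python
def round_robin_assign(partitions, consumers):
    """Deal partitions out to consumers one at a time, in sorted order."""
    ordered = sorted(consumers)
    assignment = {consumer: [] for consumer in ordered}
    for index, partition in enumerate(sorted(partitions)):
        owner = ordered[index % len(ordered)]
        assignment[owner].append(partition)
    return assignment


partitions = list(range(12))

# Four healthy consumers: 12 / 4 = 3 partitions each.
before = round_robin_assign(partitions, ["c1", "c2", "c3", "c4"])
print({c: len(ps) for c, ps in before.items()})  # {'c1': 3, 'c2': 3, 'c3': 3, 'c4': 3}

# c4 dies; the rebalance spreads all 12 partitions over the survivors.
after = round_robin_assign(partitions, ["c1", "c2", "c3"])
print({c: len(ps) for c, ps in after.items()})   # {'c1': 4, 'c2': 4, 'c3': 4}
```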
Explain the difference between acks=0, acks=1, and acks=all. When would you use each?
acks=0: the producer does not wait for any acknowledgment. Fastest, but a message can be lost if the send fails or the broker crashes before persisting it. acks=1: the producer waits for the leader to write the message. If the leader crashes before replication, the message is lost. acks=all: the producer waits for all in-sync replicas to write. Slowest, but no data loss as long as min.insync.replicas is set correctly. Use acks=all for financial data, orders, and anything where loss is unacceptable. Use acks=1 for logs or intermediate streaming jobs where occasional loss is tolerable, and acks=0 for metrics or click events where some loss is acceptable.
How does Kafka guarantee message ordering? What are the limitations?
Kafka guarantees ordering within a single partition. Messages with the same key always go to the same partition (via hash partitioning), so per-key ordering is guaranteed. There is no ordering guarantee across partitions. If you need global ordering, you need a single partition, which limits throughput to one consumer. A strong answer mentions that enabling retries with max.in.flight.requests.per.connection > 1 can break ordering unless you enable idempotent producers.
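Per-key ordering follows from a deterministic key-to-partition mapping. The sketch below uses CRC32 as a stand-in hash purely for illustration (Kafka's Java default partitioner uses murmur2), and the key format is hypothetical.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash.

    CRC32 is an illustrative stand-in, not Kafka's actual hash function.
    """
    return zlib.crc32(key) % num_partitions


# The same key always lands on the same partition, so its events stay ordered.
first = partition_for(b"customer-42", 12)
second = partition_for(b"customer-42", 12)
print(first == second)  # True

# The mapping depends on the partition count, so adding partitions to a topic
# can move a key to a different partition and break per-key ordering.
print(partition_for(b"customer-42", 12), partition_for(b"customer-42", 24))
```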
What is consumer lag and how do you monitor it? At what point should you be concerned?
Consumer lag is the difference between the latest offset produced and the latest offset consumed. It is measured per partition. Monitor it with Kafka's consumer group describe command, Burrow, or the __consumer_offsets topic. Be concerned when lag is growing over time (consumers cannot keep up) or when lag spikes correlate with processing errors. A strong answer distinguishes between transient lag (a consumer restart) and sustained lag (throughput problem), and mentions scaling consumers or increasing partition count as solutions.
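As a worked example of the definition, lag is a per-partition subtraction. This sketch assumes the offsets have already been fetched (for example, from the consumer group describe output); the numbers are made up, and treating a partition with no committed offset as offset 0 is an assumption.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus the group's committed offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


def lag_is_growing(total_lag_samples):
    """Sustained lag: every sample is larger than the one before it."""
    return all(a < b for a, b in zip(total_lag_samples, total_lag_samples[1:]))


log_end = {0: 1500, 1: 900, 2: 1200}    # illustrative offsets
committed = {0: 1450, 1: 900, 2: 700}

lag = consumer_lag(log_end, committed)
print(lag)                # {0: 50, 1: 0, 2: 500}
print(sum(lag.values()))  # 550

print(lag_is_growing([100, 250, 550]))  # True: consumers are falling behind
print(lag_is_growing([400, 120, 30]))   # False: a transient spike draining
```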
You need to replay all messages from a Kafka topic from three days ago. How do you do it?
Reset the consumer group offsets to a timestamp three days ago using kafka-consumer-groups.sh with the --reset-offsets flag and the --to-datetime option (add --execute to apply the change rather than preview it). Alternatively, look up the offsets for that timestamp in the consumer API (offsetsForTimes() in the Java client) and call seek() per partition. The topic must have a retention period of at least three days (retention.ms). A strong answer mentions that the consumer application must be stopped before resetting offsets (or use a new consumer group), and discusses how downstream systems handle reprocessed (duplicate) data.
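The consumer-API route can be sketched as follows. It assumes a consumer object exposing `offsets_for_times()` and `seek()` in the style of kafka-python's `KafkaConsumer`, with the partitions already assigned; the helper names are hypothetical and the code does not import a Kafka client itself.

```python
import time

THREE_DAYS_MS = 3 * 24 * 60 * 60 * 1000


def three_days_ago_ms(now_s=None):
    """Epoch milliseconds for the moment three days before `now_s`."""
    now_s = time.time() if now_s is None else now_s
    return int(now_s * 1000) - THREE_DAYS_MS


def rewind_to_timestamp(consumer, partitions, timestamp_ms):
    """Seek each partition to the first offset at or after timestamp_ms.

    Returns the offsets that were sought to, keyed by partition.
    """
    found = consumer.offsets_for_times({tp: timestamp_ms for tp in partitions})
    rewound = {}
    for tp, offset_and_timestamp in found.items():
        if offset_and_timestamp is None:
            continue  # no message at or after the timestamp in this partition
        consumer.seek(tp, offset_and_timestamp.offset)
        rewound[tp] = offset_and_timestamp.offset
    return rewound
```

With a real client this would be called as `rewind_to_timestamp(consumer, consumer.assignment(), three_days_ago_ms())` once partitions are assigned.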
Explain compacted topics. How do they differ from regular topics and when would you use them?
Compacted topics retain only the latest message per key. Kafka periodically removes older messages with the same key, keeping only the most recent value. Regular topics retain messages based on time or size. Use compacted topics for changelog/CDC data where you only need the current state of each entity (for example, user profile updates, product catalog). A strong answer mentions that tombstone messages (null value) delete a key, and that compaction is not immediate: old duplicates may be visible until the log cleaner runs.
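The effect of compaction can be modeled in a few lines. This is a simplified simulation, not how the log cleaner is implemented: it assumes one full cleaning pass and drops tombstones immediately, whereas real brokers keep tombstones for a configurable period first. The keys and values are made up.

```python
def compact(log):
    """Return the records that survive a full compaction pass.

    `log` is a list of (key, value) tuples in offset order; a value of None
    is a tombstone that deletes the key.
    """
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later offsets overwrite earlier ones
    survivors = sorted(
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None  # tombstoned keys disappear entirely
    )
    return [(key, value) for _, key, value in survivors]


log = [
    ("user-1", "a"),
    ("user-2", "b"),
    ("user-1", "c"),   # supersedes ("user-1", "a")
    ("user-2", None),  # tombstone: deletes user-2
    ("user-3", "d"),
]
print(compact(log))  # [('user-1', 'c'), ('user-3', 'd')]
```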
How would you design a Kafka-based pipeline that needs exactly-once processing from source database to target data warehouse?
Use a CDC connector (Debezium) as a Kafka Connect source to capture database changes. Enable idempotent producers on the connector. Use a stream processor (Kafka Streams or Flink) with the exactly-once configuration for any transformations. For the sink, use a Kafka Connect sink connector with upsert semantics (insert-or-update by primary key). A strong answer notes that exactly-once between Kafka and an external system requires idempotent writes to the target: either upserts, deduplication tables, or transactional commits with offset tracking.
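The reason upserts make redelivery harmless can be shown with an in-memory stand-in for the warehouse table. This is a sketch only; the `order_id` and `status` fields are hypothetical, and a real sink would issue a MERGE or insert-or-update statement instead.

```python
def upsert_batch(table, records):
    """Insert-or-update by primary key; replaying a batch changes nothing."""
    for record in records:
        table[record["order_id"]] = record
    return table


def append_batch(rows, records):
    """Naive append sink, for contrast: redelivery creates duplicates."""
    rows.extend(records)
    return rows


batch = [
    {"order_id": 1, "status": "paid"},
    {"order_id": 2, "status": "shipped"},
]

table = {}
upsert_batch(table, batch)
upsert_batch(table, batch)  # the same batch redelivered after a crash
print(len(table))           # 2: state is unchanged by the replay

rows = []
append_batch(rows, batch)
append_batch(rows, batch)
print(len(rows))            # 4: the append sink double-counts
```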
What happens when a Kafka broker runs out of disk space? How do you prevent this?
The broker stops accepting writes to partitions on the full disk. If the broker is a partition leader, producers get errors and may fail or retry. Prevention: set retention.ms and retention.bytes to control topic size. Monitor disk usage with alerts at 70% and 80% thresholds. Use log.retention.check.interval.ms to control how frequently Kafka checks for expired segments. A strong answer mentions that different topics can have different retention settings, and that compacted topics need enough disk for both active and compacted segments during the cleaning process.
Describe the difference between Kafka Streams and a standalone Kafka consumer for stream processing. When would you choose each?
Kafka Streams is a library that provides stateful stream processing (windowing, joins, aggregations) with exactly-once semantics, built on top of the consumer API. A standalone consumer is simpler and appropriate for stateless transformations (filtering, enrichment from a cache, routing). Choose Kafka Streams when you need state management, event-time processing, or stream-table joins. Choose a standalone consumer when you just need to read, transform, and write without maintaining state. A strong answer mentions that Kafka Streams applications are just JVM processes (no separate cluster needed), unlike Flink or Spark Streaming.
How do you choose the partition count for a new Kafka topic?
Start with target throughput. If the topic needs to handle 100 MB/s and a single partition can handle 10 MB/s on your hardware, you need at least 10 partitions. Then double that for headroom, giving 20. Constraints: total partitions per broker should stay under ~4,000 (controller overhead), and you cannot decrease the partition count without recreating the topic. A strong answer notes that more partitions can hurt end-to-end latency (more files, more file handles), and that consumer parallelism is capped at the partition count.
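The sizing rule is simple arithmetic. In this sketch the function name and the default headroom factor of 2 are assumptions taken from the rule of thumb in the answer; the per-partition throughput should come from benchmarks on your own hardware.

```python
import math


def partition_count(target_mb_s, per_partition_mb_s, headroom=2.0):
    """Partitions needed for the target throughput, scaled by a headroom factor."""
    base = math.ceil(target_mb_s / per_partition_mb_s)
    return math.ceil(base * headroom)


print(partition_count(100, 10))              # 20
print(partition_count(100, 10, headroom=1))  # 10
```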
Why is Kafka often described as 'log-structured'? What are the implications for performance?
Each partition is an append-only log on disk. Writes are sequential (extremely fast on both HDDs and SSDs). Reads are also sequential, leveraging the OS page cache. There is no random access by message ID; consumers read by offset (which maps to a position in the log). Implications: Kafka throughput is bounded by network and disk-write speed, not CPU. A strong answer notes that this is why Kafka can saturate a 10 Gbps NIC on commodity hardware.
Walk me through a Kafka architecture diagram for an e-commerce order pipeline. What runs where?
Producers in the checkout service publish to an 'orders' topic, partitioned by customer_id (per-customer ordering). The Kafka cluster has 3+ brokers with replication factor 3. A Kafka Connect cluster runs JDBC source connectors capturing inventory changes, and S3 sink connectors archiving orders. Schema Registry stores Avro schemas for orders. Consumer groups: the inventory service updates stock, the email service sends confirmations, and a Flink job aggregates revenue. A strong answer covers data flow, replication, separation of compute (Connect) from storage (brokers), and which subsystems are stateful.
How does Kafka handle a network partition between two brokers in the ISR?
If a follower stops fetching from the leader for replica.lag.time.max.ms (default 30 seconds), it falls out of the ISR. The leader continues serving writes from the remaining ISR. If the partitioned follower had been the only ISR member besides the leader, and acks=all is required with min.insync.replicas=2, the producer will see NotEnoughReplicasException and writes pause. When the network heals, the follower fetches missed messages and rejoins the ISR. A strong answer covers the unclean.leader.election.enable tradeoff: false guarantees no data loss but may cause unavailability; true allows a stale follower to become leader and lose data.
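The interaction between acks, ISR size, and min.insync.replicas can be captured in a small decision function. This is a toy model of the rule described in the answer, not broker code; the function name and return values are hypothetical.

```python
def write_outcome(acks, isr_size, min_insync_replicas):
    """Return 'accepted' or 'rejected' for a produce request.

    min.insync.replicas is only enforced when the producer uses acks=all.
    """
    if isr_size < 1:
        return "rejected"  # no leader is available for the partition
    if acks == "all" and isr_size < min_insync_replicas:
        return "rejected"  # the producer sees a NotEnoughReplicas error
    return "accepted"


# Replication factor 3, min.insync.replicas=2.
print(write_outcome("all", isr_size=3, min_insync_replicas=2))  # accepted
print(write_outcome("all", isr_size=2, min_insync_replicas=2))  # accepted
print(write_outcome("all", isr_size=1, min_insync_replicas=2))  # rejected
print(write_outcome("1", isr_size=1, min_insync_replicas=2))    # accepted
```

The last case is the one interviewers probe: with acks=1 the write succeeds even though only the leader holds it, so it can be lost if that leader fails.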
What is the difference between Kafka and a message queue like RabbitMQ?
Kafka is a distributed log: messages persist for a configured retention, and multiple consumer groups can read the same topic independently. Each consumer group tracks its own offsets. RabbitMQ is a traditional broker: messages are removed from the queue once consumed and acknowledged. Kafka excels at high-throughput event streaming, replay, and fan-out (many consumers reading the same data). RabbitMQ excels at task queues, priority routing, and complex dead-letter handling. A strong answer notes that the right choice depends on whether you need replay (Kafka) or per-message acknowledgment with routing logic (RabbitMQ).
Explain how schema evolution works with Schema Registry and what happens when a producer changes a field type.
Schema Registry checks compatibility before allowing a new schema version. Backward compatible: the new schema can read data written with the old one (adding a field with a default or removing a field is fine; adding a required field is not). Forward compatible: the old schema can read data written with the new one (adding a field is fine; removing a field is fine only if it had a default). Full compatibility: both. Changing a field type is generally NOT compatible, apart from the few type promotions a format allows (Avro, for example, can promote int to long): when the producer tries to register the new schema, Schema Registry rejects it under backward or full compatibility. A strong answer mentions compatibility levels (NONE, BACKWARD, FORWARD, FULL, plus _TRANSITIVE variants) and the deployment dance: bump consumers first under backward, producers first under forward.
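The compatibility rules reduce to one question: can a reader schema decode data written with a writer schema? The toy checker below models schemas as plain dicts and treats any type change as incompatible; it is an illustration of the reasoning, not Schema Registry's algorithm, and it ignores Avro's type promotions. The field names are hypothetical.

```python
def can_read(reader, writer):
    """True if data written with `writer` can be decoded using `reader`."""
    for name, spec in reader.items():
        if name in writer:
            if writer[name]["type"] != spec["type"]:
                return False  # type changed (promotions ignored in this toy)
        elif "default" not in spec:
            return False      # reader expects a field the data lacks
    return True               # writer-only fields are simply skipped


def is_backward_compatible(new, old):
    return can_read(reader=new, writer=old)  # new consumers, old data


def is_forward_compatible(new, old):
    return can_read(reader=old, writer=new)  # old consumers, new data


old = {"id": {"type": "long"}, "amount": {"type": "double"}}
add_optional = {**old, "currency": {"type": "string", "default": "USD"}}
add_required = {**old, "currency": {"type": "string"}}
retyped = {"id": {"type": "string"}, "amount": {"type": "double"}}

print(is_backward_compatible(add_optional, old))  # True
print(is_backward_compatible(add_required, old))  # False
print(is_forward_compatible(add_required, old))   # True
print(is_backward_compatible(retyped, old))       # False
```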
Worked Example: Consumer Group Offset Management
    # At-least-once: commit AFTER processing
    # If the consumer crashes after processing but before commit,
    # the message will be reprocessed on restart.
    for message in consumer:
        process(message)   # Step 1: process
        consumer.commit()  # Step 2: commit offset

    # At-most-once: commit BEFORE processing
    # If the consumer crashes after commit but before processing,
    # the message is lost.
    for message in consumer:
        consumer.commit()  # Step 1: commit offset
        process(message)   # Step 2: process

    # Exactly-once: use transactions (Kafka Streams pattern)
    # Atomically commit the output AND the consumer offset.
    producer.init_transactions()
    producer.begin_transaction()
    producer.send(output_topic, result)
    producer.send_offsets_to_transaction(offsets, consumer_group)
    producer.commit_transaction()

Most production systems use at-least-once with idempotent writes to the target. Exactly-once adds latency and complexity. Knowing when each pattern is appropriate is the senior-level signal.
8 Common Mistakes in Kafka Interviews
- Saying Kafka guarantees exactly-once delivery out of the box without mentioning the configuration requirements (idempotent producers, transactional API, consumer isolation level)
- Confusing partition count with replication factor: partitions control parallelism, replication controls durability
- Not understanding that consumer group rebalances pause consumption (for the whole group under the eager protocol), which affects latency-sensitive applications
- Claiming you can decrease the partition count of a topic (you cannot without recreating it)
- Ignoring Schema Registry in a production architecture, leading to serialization failures when schemas evolve
- Treating Kafka as a database: it is a log, not a query engine, and random access by key is not its purpose
- Mixing up at-least-once and exactly-once: at-least-once is the default and requires idempotent downstream writes
- Forgetting that more partitions = more file handles, more controller load, and worse end-to-end latency
Kafka Interview Questions: FAQ
How many Kafka questions should I expect in a data engineering interview?
Do I need hands-on Kafka experience to answer these questions?
Should I learn Kafka Streams or Apache Flink for interviews?
Is Kafka being replaced by newer technologies in 2026?
What is the difference between a Kafka topic and a partition?
How does Kafka achieve fault tolerance?
What is a Kafka consumer group in simple terms?
Can I have more consumers than partitions in a Kafka consumer group?