Data Engineering Interview Prep
Kafka appears in every data engineering interview that involves real-time data. Interviewers test your understanding of distributed messaging: how data is partitioned, how consumers coordinate, what happens when things fail, and how you guarantee delivery semantics.
This guide covers Kafka 3.x, Kafka Connect, Schema Registry, and the exactly-once semantics that interviewers love to ask about.
Kafka questions test whether you understand distributed messaging as a concept, not just whether you can start a consumer. The strongest candidates explain the tradeoffs behind every configuration decision.
Junior candidates should explain topics, partitions, consumer groups, and offsets. You should know what happens when a consumer joins or leaves a group.
Mid-level candidates need to discuss replication, ISR, delivery guarantees, and Schema Registry. You should be able to design a Kafka-based pipeline and explain your partition key strategy.
Senior candidates must handle production scenarios: consumer lag diagnosis, broker failures, cross-datacenter replication, schema evolution strategies, and capacity planning for topics that receive millions of events per second.
A Kafka cluster is a set of brokers. Each topic is split into partitions, and each partition lives on one broker (with replicas on others). Partitions are the unit of parallelism: more partitions means more consumers can read concurrently. Interviewers test whether you understand that ordering is guaranteed only within a single partition, not across partitions.
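The per-partition ordering guarantee follows from deterministic key hashing. A minimal sketch of the idea (the real Java client uses murmur2; md5 here is just an illustration, and `partition_for` is a hypothetical helper):

```python
import hashlib

# Deterministic key -> partition mapping: same key, same partition, always.
# This is why per-key ordering holds but cross-partition ordering does not.
def partition_for(key: bytes, num_partitions: int) -> int:
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All events for user-42 land on one partition and stay ordered there.
p = partition_for(b"user-42", 12)
assert p == partition_for(b"user-42", 12)
assert 0 <= p < 12
```

Note that changing the partition count changes the key-to-partition mapping, which is why adding partitions to a keyed topic breaks per-key ordering for in-flight data.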
A consumer group is a set of consumers that cooperate to consume a topic. Each partition is assigned to exactly one consumer in the group. Offsets track how far each consumer has read. Committing offsets too early risks data loss; committing too late causes duplicates. This tradeoff is the core of Kafka reliability questions.
Each partition has a leader and zero or more follower replicas. The in-sync replica set (ISR) is the set of replicas that are caught up with the leader. If the leader fails, a new leader is elected from the ISR. Setting min.insync.replicas to 2 with acks=all requires at least two replicas to confirm every write, preventing data loss on broker failure.
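The durability recipe above can be sketched as a topic creation command (topic name and broker address are placeholders):

```shell
# Hypothetical topic: 3 replicas, writes acknowledged only after 2 of them
# have the record. Combined with acks=all on the producer, a single broker
# failure cannot lose acknowledged data; if the ISR shrinks below 2,
# producers get NotEnoughReplicas errors instead of silent data loss.
kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic payments \
  --partitions 12 \
  --replication-factor 3 \
  --config min.insync.replicas=2
```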
Kafka supports exactly-once delivery through idempotent producers (dedup by sequence number) and transactional APIs (atomic writes across partitions). For end-to-end exactly-once in stream processing, you need the consume-transform-produce pattern with transactions. Interviewers want you to distinguish between broker-level guarantees and application-level guarantees.
Kafka Connect is a framework for streaming data between Kafka and external systems. Source connectors ingest data into Kafka; sink connectors write data out. It handles offset management, serialization, and fault tolerance. Interviewers test whether you have used Connect in production versus rolling custom consumers.
Schema Registry stores Avro, Protobuf, or JSON schemas and enforces compatibility rules (backward, forward, full). Producers register schemas; consumers look them up by ID embedded in the message. This prevents breaking changes from taking down downstream consumers. Interviewers ask about compatibility modes and what happens when a producer tries to register an incompatible schema.
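Backward compatibility reduces to one question: can a reader using the new schema decode data written with the old one? A toy sketch of that check (real Avro resolution has more rules; the field names, defaults, and `backward_compatible` helper are illustrative only):

```python
# Each schema maps field name -> default value (None means "no default").
old = {"id": None, "email": None}
new = {"id": None, "email": None, "phone": "unknown"}  # added WITH a default

def backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    # A new-schema reader can decode old data only if every field it
    # requires (i.e., has no default) already existed in the old schema.
    return all(name in old_schema
               for name, default in new_schema.items() if default is None)

assert backward_compatible(old, new)  # adding a field with a default: OK
assert not backward_compatible(old, {"id": None, "ssn": None})  # new required field: rejected
```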
A producer sends messages to a topic with 12 partitions. You have 4 consumers in a consumer group. How are partitions assigned, and what happens if a consumer dies?
Each consumer gets 3 partitions (12 / 4). When a consumer dies, a rebalance is triggered and the remaining 3 consumers split the 12 partitions evenly, 4 each (under range or round-robin assignment). During the rebalance, consumption pauses briefly. A strong answer mentions the rebalance protocol (eager vs. cooperative), partition assignment strategies (range, round-robin, sticky), and the impact of session.timeout.ms and heartbeat.interval.ms on how quickly the failure is detected.
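The arithmetic behind the answer can be sketched with a simple range-style assignment (a hypothetical helper, not the client's actual assignor):

```python
def range_assign(num_partitions: int, consumers: list[str]) -> dict:
    # Sort consumers, hand each a contiguous block of partitions;
    # any remainder goes to the first few consumers, one extra each.
    per, extra = divmod(num_partitions, len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(sorted(consumers)):
        n = per + (1 if i < extra else 0)
        assignment[consumer] = list(range(start, start + n))
        start += n
    return assignment

# 4 consumers -> 3 partitions each; after one dies, 3 consumers -> 4 each.
before = range_assign(12, ["c1", "c2", "c3", "c4"])
after = range_assign(12, ["c1", "c2", "c3"])
assert all(len(parts) == 3 for parts in before.values())
assert all(len(parts) == 4 for parts in after.values())
```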
Explain the difference between acks=0, acks=1, and acks=all. When would you use each?
acks=0: producer does not wait for any acknowledgment. Fastest, but messages can be lost if the broker crashes before writing to disk. acks=1: producer waits for the leader to write the message. If the leader crashes before replication, the message is lost. acks=all: producer waits for all in-sync replicas to write. Slowest, but no data loss as long as min.insync.replicas is set correctly. Use acks=all for financial data, user events, and anything where loss is unacceptable. Use acks=1 for metrics or logs where occasional loss is tolerable.
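A hedged sketch of the three producer profiles, using confluent-kafka-style option names (assumed; check your client's documentation for the exact keys):

```python
# Fire-and-forget: no acknowledgment, lowest latency, loss possible.
METRICS_PRODUCER = {"acks": 0}

# Leader-only ack: lost only if the leader dies before replication.
LOGS_PRODUCER = {"acks": 1}

# Full durability: pair with min.insync.replicas=2 on the topic.
PAYMENTS_PRODUCER = {
    "acks": "all",
    "enable.idempotence": True,  # dedup on retry; preserves ordering
}
```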
How does Kafka guarantee message ordering? What are the limitations?
Kafka guarantees ordering within a single partition. Messages with the same key always go to the same partition (via hash partitioning), so per-key ordering is guaranteed. There is no ordering guarantee across partitions. If you need global ordering, you need a single partition, which limits throughput to one consumer. A strong answer mentions that enabling retries with max.in.flight.requests.per.connection > 1 can break ordering unless you enable idempotent producers.
What is consumer lag and how do you monitor it? At what point should you be concerned?
Consumer lag is the difference between the latest offset produced and the latest offset consumed. It is measured per partition. Monitor it with Kafka's consumer group describe command, Burrow, or the __consumer_offsets topic. Be concerned when lag is growing over time (consumers cannot keep up) or when lag spikes correlate with processing errors. A strong answer distinguishes between transient lag (a consumer restart) and sustained lag (throughput problem), and mentions scaling consumers or increasing partition count as solutions.
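Lag is per-partition arithmetic: log-end offset minus committed offset. A minimal sketch (the offset values are made up):

```python
def consumer_lag(end_offsets: dict, committed: dict) -> dict:
    # A partition missing from `committed` has never been committed: full lag.
    return {p: end - committed.get(p, 0) for p, end in end_offsets.items()}

lag = consumer_lag({0: 1_000, 1: 1_200}, {0: 950, 1: 1_200})
assert lag == {0: 50, 1: 0}  # partition 0 is 50 behind; partition 1 caught up
```

Alerting on the trend of this number (growing vs. flat) matters more than its absolute value.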
You need to replay all messages from a Kafka topic from three days ago. How do you do it?
Reset the consumer group offsets to a timestamp three days ago using kafka-consumer-groups.sh with the reset-offsets flag and the to-datetime option. Alternatively, use the seek() method in the consumer API to set offsets per partition. The topic must have a retention period of at least three days (retention.ms). A strong answer mentions that the consumer application must be stopped before resetting offsets (or use a new consumer group), and discusses how downstream systems handle reprocessed (duplicate) data.
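The reset looks roughly like this (group, topic, and timestamp are placeholders; the group must be inactive when you run it):

```shell
# Preview the new offsets first...
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group my-pipeline --topic events \
  --reset-offsets --to-datetime 2024-01-12T00:00:00.000 --dry-run

# ...then apply them.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group my-pipeline --topic events \
  --reset-offsets --to-datetime 2024-01-12T00:00:00.000 --execute
```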
Explain compacted topics. How do they differ from regular topics and when would you use them?
Compacted topics retain only the latest message per key. Kafka periodically removes older messages with the same key, keeping only the most recent value. Regular topics retain messages based on time or size. Use compacted topics for changelog/CDC data where you only need the current state of each entity (e.g., user profile updates, product catalog). A strong answer mentions that tombstone messages (null value) delete a key, and that compaction is not immediate: old duplicates may be visible until the log cleaner runs.
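A compacted changelog topic might be created like this (topic name and sizing are placeholders):

```shell
# cleanup.policy=compact keeps only the latest value per key.
# min.cleanable.dirty.ratio controls how eagerly the log cleaner runs;
# until it runs, older values for a key remain visible to consumers.
kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic user-profiles \
  --partitions 6 --replication-factor 3 \
  --config cleanup.policy=compact \
  --config min.cleanable.dirty.ratio=0.5
```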
How would you design a Kafka-based pipeline that needs exactly-once processing from source database to target data warehouse?
Use a CDC connector (Debezium) as a Kafka Connect source to capture database changes. Enable idempotent producers on the connector. Use a stream processor (Kafka Streams or Flink) with the exactly-once configuration for any transformations. For the sink, use a Kafka Connect sink connector with upsert semantics (insert-or-update by primary key). A strong answer notes that exactly-once between Kafka and an external system requires idempotent writes to the target: either upserts, deduplication tables, or transactional commits with offset tracking.
What happens when a Kafka broker runs out of disk space? How do you prevent this?
The broker stops accepting writes to partitions on the full disk. If the broker is a partition leader, producers get errors and may fail or retry. Prevention: set retention.ms and retention.bytes to control topic size. Monitor disk usage with alerts at 70% and 80% thresholds. Use log.retention.check.interval.ms to control how frequently Kafka checks for expired segments. A strong answer mentions that different topics can have different retention settings, and that compacted topics need enough disk for both active and compacted segments during the cleaning process.
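The per-topic retention overrides from the answer above, as a sketch (topic name and limits are placeholders; note that retention.bytes applies per partition, not per topic):

```shell
# Cap the metrics topic at 1 day OR ~1 GiB per partition, whichever is hit first.
kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name metrics \
  --add-config retention.ms=86400000,retention.bytes=1073741824
```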
Describe the difference between Kafka Streams and a standalone Kafka consumer for stream processing. When would you choose each?
Kafka Streams is a library that provides stateful stream processing (windowing, joins, aggregations) with exactly-once semantics, built on top of the consumer API. A standalone consumer is simpler and appropriate for stateless transformations (filtering, enrichment from a cache, routing). Choose Kafka Streams when you need state management, event-time processing, or stream-table joins. Choose a standalone consumer when you just need to read, transform, and write without maintaining state. A strong answer mentions that Kafka Streams applications are just JVM processes (no separate cluster needed), unlike Flink or Spark Streaming.
This example shows the tradeoff between at-least-once and at-most-once delivery, which is the most common Kafka interview question pattern.
# At-least-once: commit AFTER processing.
# If the consumer crashes after processing but before the commit,
# the message will be reprocessed on restart.
for message in consumer:
    process(message)       # Step 1: process
    consumer.commit()      # Step 2: commit offset

# At-most-once: commit BEFORE processing.
# If the consumer crashes after the commit but before processing,
# the message is lost.
for message in consumer:
    consumer.commit()      # Step 1: commit offset
    process(message)       # Step 2: process

# Exactly-once: use transactions (Kafka Streams pattern).
# Atomically commit the output AND the consumer offset.
# Downstream readers must use isolation.level=read_committed.
producer.init_transactions()
producer.begin_transaction()
producer.send(output_topic, result)
producer.send_offsets_to_transaction(offsets, consumer_group)
producer.commit_transaction()

Most production systems use at-least-once with idempotent writes to the target. Exactly-once adds latency and complexity. Knowing when each pattern is appropriate is what separates strong candidates from average ones.
Saying Kafka guarantees exactly-once delivery out of the box without mentioning the configuration requirements (idempotent producers, transactional API, consumer isolation level)
Confusing partition count with replication factor: partitions control parallelism, replication controls durability
Not understanding that consumer group rebalances pause all consumption, which affects latency-sensitive applications
Claiming you can decrease the partition count of a topic (you cannot without recreating it)
Ignoring Schema Registry in a production architecture, leading to serialization failures when schemas evolve
Treating Kafka as a database: it is a log, not a query engine, and random access by key is not its purpose
Understand distributed messaging deeply enough to answer any question an interviewer throws at you. Practice with real scenarios.