Data Engineering Interview Prep
Kafka appears in every data engineering interview that involves real-time data. Interviewers test your understanding of distributed messaging: how data is partitioned, how consumers coordinate, what happens when things fail, and how you guarantee delivery semantics.
This guide covers Kafka 3.x, Kafka Connect, Schema Registry, and the exactly-once semantics that interviewers love to ask about.
Kafka questions test whether you understand distributed messaging as a concept, not just whether you can start a consumer. The strongest candidates explain the tradeoffs behind every configuration decision.
Junior candidates should explain topics, partitions, consumer groups, and offsets. You should know what happens when a consumer joins or leaves a group.
Mid-level candidates need to discuss replication, ISR, delivery guarantees, and Schema Registry. You should be able to design a Kafka-based pipeline and explain your partition key strategy.
Senior candidates must handle production scenarios: consumer lag diagnosis, broker failures, cross-datacenter replication, schema evolution strategies, and capacity planning for topics that receive millions of events per second.
A Kafka cluster is a set of brokers. Each topic is split into partitions, and each partition lives on one broker (with replicas on others). Partitions are the unit of parallelism: more partitions means more consumers can read concurrently. Interviewers test whether you understand that ordering is guaranteed only within a single partition, not across partitions.
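The per-partition ordering guarantee follows from deterministic key hashing. A minimal sketch of the idea (the real Java client uses murmur2; md5 here is just an illustration, and `partition_for` is a hypothetical helper):

```python
import hashlib

# Deterministic key -> partition mapping: same key, same partition, always.
# This is why per-key ordering holds but cross-partition ordering does not.
def partition_for(key: bytes, num_partitions: int) -> int:
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All events for user-42 land on one partition and stay ordered there.
p = partition_for(b"user-42", 12)
assert p == partition_for(b"user-42", 12)
assert 0 <= p < 12
```

Note that changing the partition count changes the key-to-partition mapping, which is why adding partitions to a keyed topic breaks per-key ordering for in-flight data.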
A consumer group is a set of consumers that cooperate to consume a topic. Each partition is assigned to exactly one consumer in the group. Offsets track how far each consumer has read. Committing offsets too early risks data loss; committing too late causes duplicates. This tradeoff is the core of Kafka reliability questions.
Each partition has a leader and zero or more follower replicas. The in-sync replica set (ISR) is the set of replicas that are caught up with the leader. If the leader fails, a new leader is elected from the ISR. Setting min.insync.replicas to 2 with acks=all requires at least two replicas to confirm every write, preventing data loss on broker failure.
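The durability recipe above can be sketched as a topic creation command (topic name and broker address are placeholders):

```shell
# Hypothetical topic: 3 replicas, writes acknowledged only after 2 of them
# have the record. Combined with acks=all on the producer, a single broker
# failure cannot lose acknowledged data; if the ISR shrinks below 2,
# producers get NotEnoughReplicas errors instead of silent data loss.
kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic payments \
  --partitions 12 \
  --replication-factor 3 \
  --config min.insync.replicas=2
```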
Kafka supports exactly-once delivery through idempotent producers (dedup by sequence number) and transactional APIs (atomic writes across partitions). For end-to-end exactly-once in stream processing, you need the consume-transform-produce pattern with transactions. Interviewers want you to distinguish between broker-level guarantees and application-level guarantees.
Kafka Connect is a framework for streaming data between Kafka and external systems. Source connectors ingest data into Kafka; sink connectors write data out. It handles offset management, serialization, and fault tolerance. Interviewers test whether you have used Connect in production versus rolling custom consumers.
Schema Registry stores Avro, Protobuf, or JSON schemas and enforces compatibility rules (backward, forward, full). Producers register schemas; consumers look them up by ID embedded in the message. This prevents breaking changes from taking down downstream consumers. Interviewers ask about compatibility modes and what happens when a producer tries to register an incompatible schema.
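Backward compatibility reduces to one question: can a reader using the new schema decode data written with the old one? A toy sketch of that check (real Avro resolution has more rules; the field names, defaults, and `backward_compatible` helper are illustrative only):

```python
# Each schema maps field name -> default value (None means "no default").
old = {"id": None, "email": None}
new = {"id": None, "email": None, "phone": "unknown"}  # added WITH a default

def backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    # A new-schema reader can decode old data only if every field it
    # requires (i.e., has no default) already existed in the old schema.
    return all(name in old_schema
               for name, default in new_schema.items() if default is None)

assert backward_compatible(old, new)  # adding a field with a default: OK
assert not backward_compatible(old, {"id": None, "ssn": None})  # new required field: rejected
```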
A producer sends messages to a topic with 12 partitions. You have 4 consumers in a consumer group. How are partitions assigned, and what happens if a consumer dies?
Each consumer gets 3 partitions (12 / 4). When a consumer dies, a rebalance is triggered and the remaining 3 consumers split the 12 partitions evenly, 4 each (under range or round-robin assignment). During the rebalance, consumption pauses briefly. A strong answer mentions the rebalance protocol (eager vs. cooperative), partition assignment strategies (range, round-robin, sticky), and the impact of session.timeout.ms and heartbeat.interval.ms on how quickly the failure is detected.
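The arithmetic behind the answer can be sketched with a simple range-style assignment (a hypothetical helper, not the client's actual assignor):

```python
def range_assign(num_partitions: int, consumers: list[str]) -> dict:
    # Sort consumers, hand each a contiguous block of partitions;
    # any remainder goes to the first few consumers, one extra each.
    per, extra = divmod(num_partitions, len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(sorted(consumers)):
        n = per + (1 if i < extra else 0)
        assignment[consumer] = list(range(start, start + n))
        start += n
    return assignment

# 4 consumers -> 3 partitions each; after one dies, 3 consumers -> 4 each.
before = range_assign(12, ["c1", "c2", "c3", "c4"])
after = range_assign(12, ["c1", "c2", "c3"])
assert all(len(parts) == 3 for parts in before.values())
assert all(len(parts) == 4 for parts in after.values())
```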
Explain the difference between acks=0, acks=1, and acks=all. When would you use each?
acks=0: producer does not wait for any acknowledgment. Fastest, but messages can be lost if the broker crashes before writing to disk. acks=1: producer waits for the leader to write the message. If the leader crashes before replication, the message is lost. acks=all: producer waits for all in-sync replicas to write. Slowest, but no data loss as long as min.insync.replicas is set correctly. Use acks=all for financial data, user events, and anything where loss is unacceptable. Use acks=1 for metrics or logs where occasional loss is tolerable.
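A hedged sketch of the three producer profiles, using confluent-kafka-style option names (assumed; check your client's documentation for the exact keys):

```python
# Fire-and-forget: no acknowledgment, lowest latency, loss possible.
METRICS_PRODUCER = {"acks": 0}

# Leader-only ack: lost only if the leader dies before replication.
LOGS_PRODUCER = {"acks": 1}

# Full durability: pair with min.insync.replicas=2 on the topic.
PAYMENTS_PRODUCER = {
    "acks": "all",
    "enable.idempotence": True,  # dedup on retry; preserves ordering
}
```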
How does Kafka guarantee message ordering? What are the limitations?
Kafka guarantees ordering within a single partition. Messages with the same key always go to the same partition (via hash partitioning), so per-key ordering is guaranteed. There is no ordering guarantee across partitions. If you need global ordering, you need a single partition, which limits throughput to one consumer. A strong answer mentions that enabling retries with max.in.flight.requests.per.connection > 1 can break ordering unless you enable idempotent producers.
What is consumer lag and how do you monitor it? At what point should you be concerned?
Consumer lag is the difference between the latest offset produced and the latest offset consumed. It is measured per partition. Monitor it with Kafka's consumer group describe command, Burrow, or the __consumer_offsets topic. Be concerned when lag is growing over time (consumers cannot keep up) or when lag spikes correlate with processing errors. A strong answer distinguishes between transient lag (a consumer restart) and sustained lag (throughput problem), and mentions scaling consumers or increasing partition count as solutions.
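Lag is per-partition arithmetic: log-end offset minus committed offset. A minimal sketch (the offset values are made up):

```python
def consumer_lag(end_offsets: dict, committed: dict) -> dict:
    # A partition missing from `committed` has never been committed: full lag.
    return {p: end - committed.get(p, 0) for p, end in end_offsets.items()}

lag = consumer_lag({0: 1_000, 1: 1_200}, {0: 950, 1: 1_200})
assert lag == {0: 50, 1: 0}  # partition 0 is 50 behind; partition 1 caught up
```

Alerting on the trend of this number (growing vs. flat) matters more than its absolute value.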
You need to replay all messages from a Kafka topic from three days ago. How do you do it?
Reset the consumer group offsets to a timestamp three days ago using kafka-consumer-groups.sh with the reset-offsets flag and the to-datetime option. Alternatively, use the seek() method in the consumer API to set offsets per partition. The topic must have a retention period of at least three days (retention.ms). A strong answer mentions that the consumer application must be stopped before resetting offsets (or use a new consumer group), and discusses how downstream systems handle reprocessed (duplicate) data.
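The reset looks roughly like this (group, topic, and timestamp are placeholders; the group must be inactive when you run it):

```shell
# Preview the new offsets first...
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group my-pipeline --topic events \
  --reset-offsets --to-datetime 2024-01-12T00:00:00.000 --dry-run

# ...then apply them.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group my-pipeline --topic events \
  --reset-offsets --to-datetime 2024-01-12T00:00:00.000 --execute
```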
Explain compacted topics. How do they differ from regular topics and when would you use them?
Compacted topics retain only the latest message per key. Kafka periodically removes older messages with the same key, keeping only the most recent value. Regular topics retain messages based on time or size. Use compacted topics for changelog/CDC data where you only need the current state of each entity (e.g., user profile updates, product catalog). A strong answer mentions that tombstone messages (null value) delete a key, and that compaction is not immediate: old duplicates may be visible until the log cleaner runs.
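A compacted changelog topic might be created like this (topic name and sizing are placeholders):

```shell
# cleanup.policy=compact keeps only the latest value per key.
# min.cleanable.dirty.ratio controls how eagerly the log cleaner runs;
# until it runs, older values for a key remain visible to consumers.
kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic user-profiles \
  --partitions 6 --replication-factor 3 \
  --config cleanup.policy=compact \
  --config min.cleanable.dirty.ratio=0.5
```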
How would you design a Kafka-based pipeline that needs exactly-once processing from source database to target data warehouse?
Use a CDC connector (Debezium) as a Kafka Connect source to capture database changes. Enable idempotent producers on the connector. Use a stream processor (Kafka Streams or Flink) with the exactly-once configuration for any transformations. For the sink, use a Kafka Connect sink connector with upsert semantics (insert-or-update by primary key). A strong answer notes that exactly-once between Kafka and an external system requires idempotent writes to the target: either upserts, deduplication tables, or transactional commits with offset tracking.
What happens when a Kafka broker runs out of disk space? How do you prevent this?
The broker stops accepting writes to partitions on the full disk. If the broker is a partition leader, producers get errors and may fail or retry. Prevention: set retention.ms and retention.bytes to control topic size. Monitor disk usage with alerts at 70% and 80% thresholds. Use log.retention.check.interval.ms to control how frequently Kafka checks for expired segments. A strong answer mentions that different topics can have different retention settings, and that compacted topics need enough disk for both active and compacted segments during the cleaning process.
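The per-topic retention overrides from the answer above, as a sketch (topic name and limits are placeholders; note that retention.bytes applies per partition, not per topic):

```shell
# Cap the metrics topic at 1 day OR ~1 GiB per partition, whichever is hit first.
kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name metrics \
  --add-config retention.ms=86400000,retention.bytes=1073741824
```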
Describe the difference between Kafka Streams and a standalone Kafka consumer for stream processing. When would you choose each?
Kafka Streams is a library that provides stateful stream processing (windowing, joins, aggregations) with exactly-once semantics, built on top of the consumer API. A standalone consumer is simpler and appropriate for stateless transformations (filtering, enrichment from a cache, routing). Choose Kafka Streams when you need state management, event-time processing, or stream-table joins. Choose a standalone consumer when you just need to read, transform, and write without maintaining state. A strong answer mentions that Kafka Streams applications are just JVM processes (no separate cluster needed), unlike Flink or Spark Streaming.
This example shows the tradeoff between at-least-once and at-most-once delivery, which is the most common Kafka interview question pattern.
# At-least-once: commit AFTER processing.
# If the consumer crashes after processing but before the commit,
# the message will be reprocessed on restart.
for message in consumer:
    process(message)       # Step 1: process
    consumer.commit()      # Step 2: commit offset

# At-most-once: commit BEFORE processing.
# If the consumer crashes after the commit but before processing,
# the message is lost.
for message in consumer:
    consumer.commit()      # Step 1: commit offset
    process(message)       # Step 2: process

# Exactly-once: use transactions (Kafka Streams pattern).
# Atomically commit the output AND the consumer offset.
# Downstream readers must use isolation.level=read_committed.
producer.init_transactions()
producer.begin_transaction()
producer.send(output_topic, result)
producer.send_offsets_to_transaction(offsets, consumer_group)
producer.commit_transaction()

Most production systems use at-least-once with idempotent writes to the target. Exactly-once adds latency and complexity. Knowing when each pattern is appropriate is what separates strong candidates from average ones.
Saying Kafka guarantees exactly-once delivery out of the box without mentioning the configuration requirements (idempotent producers, transactional API, consumer isolation level)
Confusing partition count with replication factor: partitions control parallelism, replication controls durability
Not understanding that consumer group rebalances pause all consumption, which affects latency-sensitive applications
Claiming you can decrease the partition count of a topic (you cannot without recreating it)
Ignoring Schema Registry in a production architecture, leading to serialization failures when schemas evolve
Treating Kafka as a database: it is a log, not a query engine, and random access by key is not its purpose
Understand distributed messaging deeply enough to answer any question an interviewer throws at you. Practice with real scenarios.