Kafka vs Kinesis, from someone who has paged for both
- The two platforms solve the same problem (durable, ordered, partitioned streams) but ship with different operational contracts. Kafka gives you primitives. Kinesis gives you a service. The right answer in an interview is to articulate which contract fits the team and the workload, not to pick a winner.
- Per-unit throughput is wildly different. A Kinesis shard caps at 1 MB/sec write and 2 MB/sec read. A Kafka partition is bounded by the broker, not the partition itself, and routinely sustains 50 to 100 MB/sec. That single fact drives most of the cost and design differences downstream.
- Exactly-once processing is achievable on both, but the implementations are different beasts. Kafka uses transactional producers and idempotent writes. Kinesis delivers at-least-once and leans on the consumer side: KCL checkpointing against sequence numbers plus idempotent application logic. Naming the mechanism is the L5 ceiling.
- The cost picture inverts around 50 MB/sec sustained. Below that, Kinesis (and especially Firehose) is cheaper TCO once you price ops time. Above that, Kafka (and at the high end, self-hosted Kafka) wins decisively. MSK splits the difference in AWS-native shops.
- Replay, backfill, and schema evolution are where teams actually get burned. Kinesis caps retention at 365 days, and resharding ends the KCL leases on the parent shards. Kafka retains as long as you pay for disk but requires you to operate Schema Registry yourself. Real designs commit to one of these worlds.
- Interviewers do not grade you on which is better. They grade you on whether you can name the trade-offs out loud under pressure: ordering, rebalance behavior, hot-shard mitigation, exactly-once mechanics, cross-region replication, and what happens at 3 AM.
What interviewers actually grade on
They do not grade you on which platform is better. They grade you on whether you can articulate the trade-offs out loud under pressure. These are the five surfaces that separate L4 from L5 answers.
- Per-partition vs per-shard ordering
- Exactly-once is not the same on both
- Consumer rebalance behavior
- Replay and backfill
- Schema evolution discipline
Self-hosted Kafka vs Kinesis Data Streams vs MSK
Three options, eleven dimensions. The MSK column inherits most of Kafka and that is the point: you get the API, AWS handles the brokers.
| Dimension | Self-hosted Kafka | AWS Kinesis Data Streams | AWS MSK |
|---|---|---|---|
| Operational model | You manage everything (brokers, KRaft/ZK, disk, security, mirroring) | AWS-managed end-to-end (sharding is your job in Provisioned mode) | AWS manages brokers; you manage topics, ACLs, Schema Registry, Connect |
| API and ecosystem | Native Kafka API; full Connect, Streams, ksqlDB, Schema Registry | Kinesis-specific; KCL/KPL or AWS SDK; Glue Schema Registry | Native Kafka API; Connect and Streams compatible; Glue or Confluent SR |
| Multi-cloud | Yes | AWS only | AWS only (but the API is portable, so app code is) |
| Per-unit write ceiling | Bounded by broker (typically 50-100 MB/sec per partition) | 1 MB/sec per shard (hard cap) | Same as Kafka |
| Per-unit read ceiling | Bounded by broker; many consumers per partition with no extra cost | 2 MB/sec per shard, 5 GetRecords/sec/shard; Enhanced Fan-Out is extra | Same as Kafka |
| Retention | Configurable; days to forever (you pay for disk) | 1 to 365 days | Configurable; days to forever |
| Ordering guarantee | Per-partition, sticky to partition assignment | Per-shard, with hash-range remapping on resharding | Per-partition (same as Kafka) |
| Exactly-once | Producer-side: idempotent + transactions + read-committed | Consumer-side: KCL checkpoints in DynamoDB + idempotent app logic | Same as Kafka |
| Schema evolution | Confluent Schema Registry (de-facto standard) | AWS Glue Schema Registry (thinner client integration) | Either Confluent SR or Glue SR |
| Cross-region replication | MirrorMaker 2 (mature) | Custom via Lambda or Firehose (brittle at scale) | MirrorMaker 2 between MSK clusters (mature) |
| Pricing model | EC2 + EBS + ops headcount | Per shard hour + per PUT payload + Enhanced Fan-Out | Per broker hour + storage + data in/out |
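The read-ceiling row is easier to defend with numbers. The sketch below divides the shared per-shard limits quoted in the table (2 MB/sec and 5 GetRecords/sec) across N polling consumers. The function name `shared_read_budget` and the even split between consumers are assumptions for illustration, not documented behavior.

```python
# Shared-throughput limits per shard, as quoted in the table above.
SHARD_READ_MB_PER_SEC = 2.0
SHARD_GET_RECORDS_PER_SEC = 5


def shared_read_budget(consumers):
    """Per-consumer budget when `consumers` pollers split one shard evenly.

    The even split is an assumption: real pollers contend for the limit
    rather than sharing it fairly.
    """
    if consumers < 1:
        raise ValueError("need at least one consumer")
    return {
        "mb_per_sec": SHARD_READ_MB_PER_SEC / consumers,
        "get_records_per_sec": SHARD_GET_RECORDS_PER_SEC / consumers,
        # Fastest each consumer can poll without exceeding the shard limit.
        "min_poll_interval_ms": 1000 * consumers / SHARD_GET_RECORDS_PER_SEC,
    }


for n in (1, 2, 5):
    print(n, shared_read_budget(n))
```

With five pollers on one shard, each gets roughly one call per second. That is the point where Enhanced Fan-Out, with its own stream per consumer per shard, starts to pay for itself.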
One sentence to remember
“Kafka is a database that pretends to be a queue. Kinesis is a queue that pretends to be a database. Pick which lie matches your operational tolerance.”
What you will actually be asked
These five appear in the streaming portion of cloud-native Data Engineer loops. Each entry is a sample answer outline at L5 ceiling.
- When would you pick MSK over self-hosted Kafka?
- How do you guarantee exactly-once from Kinesis to S3?
- Your producer is dropping messages during a broker restart. Diagnose.
- Walk me through scaling Kinesis from 100 to 10,000 events/sec.
- Compare Kafka Connect to Kinesis Firehose for a CDC pipeline.
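For the Kinesis scaling question, the shard arithmetic is worth doing out loud. A minimal sketch, using the per-shard limits quoted in this guide (1 MB/sec and 1000 records/sec on writes, 2 MB/sec on reads). The function name `required_shards`, the 1 KB average record size, and the `headroom` knob are assumptions for illustration.

```python
import math

# Per-shard limits as quoted in this guide.
WRITE_MB_PER_SEC = 1.0
WRITE_RECORDS_PER_SEC = 1000
READ_MB_PER_SEC = 2.0


def required_shards(events_per_sec, avg_record_kb, consumers=1, headroom=1.0):
    """Minimum shard count for shared-throughput (polling) consumers.

    headroom > 1.0 over-provisions for bursts (hypothetical knob).
    """
    write_mb = events_per_sec * avg_record_kb / 1024 * headroom
    by_bytes = math.ceil(write_mb / WRITE_MB_PER_SEC)
    by_records = math.ceil(events_per_sec * headroom / WRITE_RECORDS_PER_SEC)
    # Every shared-throughput consumer re-reads the full stream.
    by_reads = math.ceil(write_mb * consumers / READ_MB_PER_SEC)
    return max(1, by_bytes, by_records, by_reads)


print(required_shards(100, avg_record_kb=1))                  # low end of the question
print(required_shards(10_000, avg_record_kb=1))               # high end, one consumer
print(required_shards(10_000, avg_record_kb=1, consumers=3))  # reads become the limit
```

The binding constraint moves as you scale: at 1 KB records the write limits set the count with one consumer, and the shared read limit takes over once three consumers poll the same stream. The count also assumes an even key distribution, so a hot key still throttles a single shard.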
What you actually pay at three scales
2026 list prices in us-east-1. Numbers exclude data transfer between AZs (which on Kafka can dominate). Treat as order-of-magnitude, not bid-quality estimates.
| Scale | Kinesis Data Streams | MSK Provisioned | Self-hosted Kafka (EC2) | Reading |
|---|---|---|---|---|
| 1 MB/sec sustained | ~$22/mo (1 shard) + $14/mo PUTs | n/a (below 3-broker minimum effective scale) | ~$430/mo (3x m7g.large brokers) + storage | Self-hosted noise; AZ-spread EBS dominates |
| 100 MB/sec sustained | ~$2.2K/mo (100 shards) + ~$1.4K/mo PUTs | ~$3.0K/mo (3x m7g.xlarge) + ~$700/mo storage | ~$1.7K/mo (3x m7g.2xlarge on-demand) + EBS + ops | MSK and self-hosted converge; Kinesis pays the managed premium |
| 1 GB/sec sustained | ~$22K/mo (1000 shards) + ~$14K/mo PUTs | ~$11K/mo (6x m7g.4xlarge) + storage | ~$6K/mo (6x m7g.4xlarge reserved) + EBS + ops | Kafka wins decisively; Kinesis economics break on per-shard pricing |
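One way to read the table is cost per MB/sec. The sketch below uses only the monthly figures the table actually quantifies. Storage, EBS, and ops time are left out where the table gives no number, so the Kafka columns are lower bounds. The `QUOTED` dict and `cost_per_mb_sec` are illustrative names, and 1 GB/sec is treated as 1000 MB/sec to match the 1000-shard row.

```python
# Monthly USD figures copied from the table above. Where the table says
# "+ storage" or "+ EBS + ops" without a number, that component is omitted,
# so the MSK and self-hosted values are lower bounds, not totals.
QUOTED = {
    1: {"kinesis": 22 + 14, "self_hosted": 430},
    100: {"kinesis": 2_200 + 1_400, "msk": 3_000 + 700, "self_hosted": 1_700},
    1000: {"kinesis": 22_000 + 14_000, "msk": 11_000, "self_hosted": 6_000},
}


def cost_per_mb_sec(scale_mb_sec, platform):
    """Quoted monthly cost divided by sustained throughput in MB/sec."""
    return QUOTED[scale_mb_sec][platform] / scale_mb_sec


for scale, row in QUOTED.items():
    units = {p: round(cost_per_mb_sec(scale, p), 2) for p in row}
    print(f"{scale:>5} MB/sec -> {units}")
```

Kinesis stays flat at about $36 per MB/sec per month because shards and PUT charges scale linearly. The quoted Kafka compute cost per MB/sec falls as brokers fill up, which is why the curves cross.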
Decision matrix
If your situation looks like the left column, the middle column is the default and the right column is the reason you would defend in an interview.
A streaming pipeline problem to chew on
Architecture interviews put real numbers on the table and ask you to defend a design. This one is the closest analog to the Kafka vs Kinesis discussion.
Billions of clicks. One tiny code. Two very different clocks.
What people think vs what actually happens
Five myths that show up in interview answers and hallway debates. The reality column is what a senior interviewer expects you to say instead.
What goes wrong when no one is watching
The honest test of any streaming platform is what it looks like under failure. These are the runbooks you wish you had read before you needed them.
Five failure modes you have to know cold to defend a Kafka design.
- ZooKeeper quorum loss on pre-KRaft clusters. One AZ blips, the ZK ensemble loses majority, Kafka brokers survive but cannot accept metadata changes. Producers keep writing for a while, then start failing on leader-not-available. Fix: KRaft. If you cannot move yet, run ZK across three AZs with its own observability and treat ZK incidents as P1 even when Kafka looks fine.
- Rebalance storms from consumer churn. A pod restart loop in a stateful consumer group triggers eager rebalances every cycle; the group stops processing each time. Fix: cooperative-sticky assignor (KIP-429), longer session.timeout.ms, and stop deploying with rolling restarts that bounce too fast.
- ISR shrinkage under load. Replicas fall behind because broker disk I/O is saturated; ISR drops to 1; min.insync.replicas=2 with acks=all blocks producers. The page is “producers are stuck”; the cause is replication fetcher lag. Fix: more provisioned EBS IOPS, segment.bytes tuning, or fewer partitions per broker.
- Log compaction lag on compacted topics with high write rate. The cleaner cannot keep up; your changelog topic grows unbounded. Fix: log.cleaner.threads, log.cleaner.io.max.bytes.per.second, and accepting that compacted topics are not free.
- Controller failover after a broker JVM pause. The new controller has to refresh metadata for every partition; producers see leader-not-available for the duration. On a 5K-partition cluster, that is minutes of degraded write availability. Mitigation: smaller clusters, or KRaft (faster controller failover).
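The ISR shrinkage case is the one candidates most often get backwards, so here is a toy model of the rule. It is not broker code: `produce_outcome` is a hypothetical function that only encodes how `acks` and `min.insync.replicas` interact.

```python
def produce_outcome(acks, isr_size, min_insync_replicas):
    """Simplified model of how a partition leader treats a produce request.

    min.insync.replicas is only enforced for acks=all; with weaker acks
    the leader accepts the write even when the ISR has shrunk.
    """
    if acks == "all" and isr_size < min_insync_replicas:
        return "rejected: not enough in-sync replicas"
    if isr_size < min_insync_replicas:
        return "accepted, but under-replicated (durability at risk)"
    return "accepted"


# Replication factor 3, min.insync.replicas=2, as in the failure mode above.
for isr in (3, 2, 1):
    print(isr, produce_outcome("all", isr, 2), "|", produce_outcome("1", isr, 2))
```

The stuck producer is the safe outcome. The acks=1 producer that keeps writing into an ISR of one is the one that loses data if that broker dies.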
Five throttling and lease pathologies that page Kinesis-on-call.
- ProvisionedThroughputExceededException on writes. The producer hits the per-shard 1 MB/sec or 1000 records/sec cap. The math was right on average but a hot key (a single high-traffic user_id) lands all writes on one shard. Fix: salt the partition key with a hash mod-N suffix, re-aggregate downstream. On-Demand mode does not save you; it scales total capacity, not per-key fanout.
- GetRecords throttling on reads. The 5 GetRecords/sec/shard ceiling kicks in when too many consumers share a shard. Fix: Enhanced Fan-Out (one HTTP/2 stream per consumer per shard, billed separately) or fewer consumers.
- Expired iterator (ExpiredIteratorException). A consumer holds a shard iterator longer than 5 minutes. KCL consumers that lag past their lease get this and have to recover. Fix: faster polling, or accept the iterator refresh churn.
- Resharding mid-incident. You split a hot shard to recover from throttling; KCL leases for the parent shard are invalidated for the affected hash range; the application restarts processing from a new sequence number. If your downstream is not idempotent, you double-process. Plan for this; do not discover it during the page.
- On-Demand ramp lag. A promo launch multiplies your traffic 10x in 30 seconds. On-Demand mode ramps over roughly 15 minutes per doubling, so the first few minutes throttle. Fix: pre-warm by running synthetic load, or use Provisioned with explicit pre-splits ahead of the launch.
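The hot-key fix from the first pathology can be sketched in a few lines. Kinesis maps a partition key to a shard by taking the MD5 hash of the key as a 128-bit integer. The even split of the hash range across shards, the `#` separator, the per-record integer id, and the helper names are all assumptions for illustration.

```python
import hashlib
from collections import Counter

HASH_SPACE = 2 ** 128  # partition keys map to 128-bit MD5 hash values


def shard_for(partition_key, shard_count):
    """Shard index, assuming the hash range is split evenly across shards."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") * shard_count // HASH_SPACE


def salted_key(key, record_id, buckets):
    """Append a deterministic mod-N suffix so one hot key becomes N keys."""
    return f"{key}#{record_id % buckets}"


def original_key(salted):
    """Strip the salt so downstream consumers can re-aggregate per key."""
    return salted.rsplit("#", 1)[0]


# 1000 writes for one hot user on an 8-shard stream.
hot = Counter(shard_for("user-42", 8) for _ in range(1000))
spread = Counter(shard_for(salted_key("user-42", i, 16), 8) for i in range(1000))
print("unsalted shards hit:", len(hot))
print("salted shards hit:  ", len(spread))
```

The trade-off to say out loud: ordering now holds only within each salted key, so anything that needs per-user order has to restore it during the downstream re-aggregation.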
How each platform actually delivers exactly-once
The phrase is meaningless without naming the mechanism. Here is the mechanism.
On Kafka, the guarantee is built on the producer side:
- Idempotent producer (enable.idempotence=true). Each producer gets a producer ID and sequence numbers per partition; the broker dedupes on retry within the same session. Default since 3.0.
- Transactions wrap multi-partition writes (and consumed offsets) in an atomic commit via the transaction coordinator. A consumer reading with isolation.level=read_committed only sees committed data, never aborted batches.
- Read-process-write loops (the Kafka Streams pattern) commit input offsets and output records in the same transaction, so a failure between reading and writing is rolled back atomically.
- Cost: latency from coordinator round-trips, plus the operational weight of running a transaction coordinator. Worth it for stream processing; overkill for fire-and-forget event ingest.
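The idempotent-producer bullet is easier to explain with a toy model of the broker-side check. This is a deliberate simplification, not Kafka's implementation: real brokers track sequence numbers per batch and keep a short history, while the hypothetical `PartitionLog` below tracks one sequence number per record.

```python
class PartitionLog:
    """Toy model of broker-side dedupe for a single partition."""

    def __init__(self):
        self.records = []
        self._last_seq = {}  # producer_id -> last sequence number appended

    def append(self, producer_id, sequence, value):
        last = self._last_seq.get(producer_id, -1)
        if sequence <= last:
            return "duplicate"      # retry of something already written
        if sequence != last + 1:
            return "out_of_order"   # gap: an earlier write is missing
        self.records.append(value)
        self._last_seq[producer_id] = sequence
        return "appended"


log = PartitionLog()
print(log.append(producer_id=7, sequence=0, value="a"))  # first write
print(log.append(producer_id=7, sequence=0, value="a"))  # retry after a lost ack
print(log.append(producer_id=7, sequence=2, value="c"))  # skipped sequence 1
print(log.append(producer_id=7, sequence=1, value="b"))  # fills the gap
print(log.records)
```

Note what this does not cover: a producer that restarts gets a new producer ID, so dedupe across sessions needs transactions, which is exactly the split the bullets above describe.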
On Kinesis, the guarantee is assembled on the consumer side:
- KCL checkpointing stores the last-processed sequence number per shard lease in DynamoDB. On restart, the consumer resumes from the checkpoint, not from where it died.
- Idempotent sinks are your responsibility. KCL guarantees at-least-once; exactly-once requires the application to dedupe on sequence number or a producer-supplied id.
- Resharding interaction. When you split or merge shards, the parent shard lease ends and KCL takes leases on the children. The checkpoint boundary is preserved, but the application has to handle the new sequence number space cleanly.
- Flink as the alternative. If you do not want to manage KCL state, a Flink job with the Kinesis source connector and exactly-once sinks moves the guarantee back into the framework.
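The checkpoint-plus-idempotent-sink pattern can be simulated without AWS at all. The sketch below is a stand-in, not KCL: a dict plays the DynamoDB checkpoint table, integers stand in for Kinesis sequence numbers (which are large numeric strings in practice), and `IdempotentSink`, `process`, and `crash_after` are hypothetical names.

```python
class IdempotentSink:
    """Sink that dedupes on (shard_id, sequence_number)."""

    def __init__(self):
        self.rows = {}
        self.deliveries = 0  # counts every write attempt, including replays

    def write(self, shard_id, sequence_number, payload):
        self.deliveries += 1
        self.rows.setdefault((shard_id, sequence_number), payload)


def process(records, sink, checkpoints, shard_id, checkpoint_every=2, crash_after=None):
    """Resume after the last checkpoint; return False on a simulated crash."""
    start = checkpoints.get(shard_id, -1)
    handled = 0
    for seq, payload in records:
        if seq <= start:
            continue  # already covered by the checkpoint
        if crash_after is not None and handled == crash_after:
            return False  # died with un-checkpointed work in the sink
        sink.write(shard_id, seq, payload)
        handled += 1
        if handled % checkpoint_every == 0:
            checkpoints[shard_id] = seq
    if records:
        checkpoints[shard_id] = max(start, records[-1][0])
    return True


records = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
sink, checkpoints = IdempotentSink(), {}
first = process(records, sink, checkpoints, "shard-0", crash_after=3)
after_crash = dict(checkpoints)
second = process(records, sink, checkpoints, "shard-0")
print(first, after_crash, second, checkpoints)
print("deliveries:", sink.deliveries, "rows:", len(sink.rows))
```

Record 3 is delivered twice because the crash happened after the write but before the next checkpoint. That is the at-least-once behavior, and the sink key is what turns it into exactly-once results.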
MSK and Confluent Cloud: when each fits
The third option that wins more interviews than naming Kafka or Kinesis alone.
MSK Provisioned is the AWS-native default once your throughput crosses where Kinesis economics break (roughly 50 MB/sec sustained) and you need the Kafka API. AWS manages broker patching and placement; you keep ownership of topics, partitions, consumer groups, ACLs, MirrorMaker, and Schema Registry. If you already run Connect or Streams, MSK is the lift-and-shift option.
MSK Serverless goes one step further and absorbs broker capacity planning. You pay per partition-hour and per GB ingested. The trade-off is fewer dials: you do not pick instance types, and some advanced configs are not exposed. Good fit for variable workloads where the alternative was over-provisioning MSK Provisioned.
Confluent Cloud is the right answer for multi-cloud shops, for teams that need Confluent-specific features (Stream Designer, fully managed Connect with hundreds of source/sink connectors, Stream Lineage), or for organizations where the Kafka commit log is foundational and the team will not own broker ops. Available on AWS, GCP, and Azure. The cost is higher than MSK at equivalent throughput; the value is portability and ecosystem.
In an interview, naming MSK Serverless and Confluent Cloud as known options signals that you have looked past the two-option framing the question often comes wrapped in. That is a senior cue.
How this decision fits the rest of the cluster
Streaming platform choice is upstream of stream processing, downstream of pipeline architecture, and adjacent to cloud platform choice.
Streaming platform choice is foundational for Kafka and Flink interview prep and for the system design framework for data engineers. The Flink stateful streaming interview prep guide covers the most-tested stream processor that works with both Kafka and Kinesis as sources.
For cloud platform decisions, see Glue, Redshift, Kinesis, EMR interview prep (Kinesis is AWS-only; MSK is the AWS-native Kafka), BigQuery and Dataflow interview prep (GCP equivalent: Pub/Sub, with similar managed-queue semantics to Kinesis), and Synapse, Data Factory, Fabric interview prep (Azure equivalent: Event Hubs, which speaks the Kafka protocol and is the rare cloud-native service that does).
Kafka vs Kinesis FAQ
- Is Kafka or Kinesis more popular in 2026?
- Which is harder to operate, self-hosted Kafka or Kinesis?
- Can I switch from Kinesis to Kafka later?
- Is Kafka Streams or KCL better for stream processing?
- Does the Data Engineer interview always ask about Kafka vs Kinesis?
- Are there other streaming alternatives worth naming?
- What is the cost difference between Kafka and Kinesis at 100 MB/sec?
- How important is exactly-once semantics in the interview?
- Should I name Confluent Cloud in an interview?
- What is the right shard count for X events/sec on Kinesis?