Kafka vs Kinesis, from someone who has paged for both
Apache Kafka vs AWS Kinesis is one of the most-asked system design decisions in streaming DE interviews. Same job (durable, ordered, partitioned streams), different operational contracts. This guide unpacks the throughput math, real cost at three scales, ordering and exactly-once mechanics, on-call failure modes, and what interviewers grade on.
What this guide actually says
Both platforms solve the same problem (durable, ordered, partitioned streams) with different operational contracts. Kafka gives you primitives; Kinesis gives you a service. The right interview answer articulates which contract fits the team and the workload. Per-unit throughput differs wildly: a Kinesis shard caps at 1 MB/sec write, while a Kafka partition routinely sustains 50-100 MB/sec. That single fact drives most of the cost and design differences. Exactly-once exists on both, but the implementations differ: Kafka transactional producers vs Kinesis consumer-side KCL checkpointing. Cost inverts around 50 MB/sec sustained.
What interviewers actually grade on
They don't grade you on which is better. They grade on whether you can articulate trade-offs out loud under pressure. Five surfaces that separate L4 from L5 answers.
Per-partition vs per-shard ordering
Both order within a unit (Kafka partition, Kinesis shard), not across them. L5 nuance: in Kinesis, resharding (split or merge) ends ordering for affected hash ranges and forces consumers to recover from a new sequence number. In Kafka, partition count is sticky once set; increasing it changes which key lands on which partition for hash-mod producers, silently breaking ordering for the same key across the boundary. Articulating this without prompting is what interviewers grade on.
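The partition-count trap is easy to show with a toy hash-mod partitioner. This is a minimal sketch: `partition_for` and `moved_keys` are hypothetical helpers, and CRC32 stands in for whatever hash the real client uses, so the exact fraction of keys that move will differ on a real cluster.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Hash-mod placement. CRC32 is a stand-in for the client's real hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def moved_keys(keys, old_count: int, new_count: int):
    """Keys whose partition changes when the partition count changes."""
    return [
        k for k in keys
        if partition_for(k, old_count) != partition_for(k, new_count)
    ]


keys = [f"user-{i}" for i in range(10_000)]
moved = moved_keys(keys, old_count=6, new_count=8)
print(f"{len(moved) / len(keys):.0%} of keys changed partition going from 6 to 8")
```

Every key in `moved` has older events on one partition and newer events on another, so a consumer can see them out of order across the boundary.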
Exactly-once isn't the same on both
Kafka exactly-once is a producer story: idempotent writes + transactional semantics across partitions and topics, paired with read-committed consumers. Kinesis exactly-once is a consumer story: KCL checkpoints sequence numbers in DynamoDB and the application is responsible for keeping side-effects idempotent. If you say exactly-once without specifying which side, an L5 interviewer will follow up.
Consumer rebalance behavior
Kafka group rebalances historically stop the world for the entire group (eager protocol); cooperative-sticky assignors (KIP-429) reduce that to incremental moves but only if all consumers in the group support it. Kinesis has no consumer group protocol in the Kafka sense; KCL leases are renewed in DynamoDB and a worker death triggers a single lease takeover, not a fleet-wide rebalance. Matters for any design that talks about graceful deploys.
Replay and backfill
Kafka lets you reset a consumer group to any offset within retention and replay arbitrarily. Kinesis lets you start consumers from TRIM_HORIZON, AT_TIMESTAMP, or AFTER_SEQUENCE_NUMBER, but is bounded by the 365-day retention ceiling and resharding history. For multi-year backfill, neither is the right answer; you store raw events in S3 and replay from there.
Schema evolution discipline
Kafka has Confluent Schema Registry as a de facto standard for Avro/Protobuf/JSON Schema with compatibility checks. Kinesis Data Streams has AWS Glue Schema Registry, which works but has thinner client integration. Most production Kinesis pipelines bolt schema validation in at the producer or downstream Lambda rather than at the broker. Naming this gap honestly is a senior signal.
Self-hosted Kafka vs Kinesis Data Streams vs MSK
Three options, eleven dimensions. MSK inherits most of Kafka; you get the API, AWS handles brokers.
| Dimension | Self-hosted Kafka | AWS Kinesis Data Streams | AWS MSK |
|---|---|---|---|
| Operational model | You manage everything (brokers, KRaft/ZK, disk, security, mirroring) | AWS-managed end-to-end (sharding is your job in Provisioned mode) | AWS manages brokers; you manage topics, ACLs, Schema Registry, Connect |
| API and ecosystem | Native Kafka API; full Connect, Streams, ksqlDB, Schema Registry | Kinesis-specific; KCL/KPL or AWS SDK; Glue Schema Registry | Native Kafka API; Connect and Streams compatible; Glue or Confluent SR |
| Multi-cloud | Yes | AWS only | AWS only (but the API is portable, so app code is) |
| Per-unit write ceiling | Bounded by broker (typically 50-100 MB/sec per partition) | 1 MB/sec per shard (hard cap) | Same as Kafka |
| Per-unit read ceiling | Bounded by broker; many consumers per partition with no extra cost | 2 MB/sec per shard, 5 GetRecords/sec/shard; Enhanced Fan-Out is extra | Same as Kafka |
| Retention | Configurable; days to forever (you pay for disk) | 1 to 365 days | Configurable; days to forever |
| Ordering guarantee | Per-partition, sticky to partition assignment | Per-shard, with hash-range remapping on resharding | Per-partition (same as Kafka) |
| Exactly-once | Producer-side: idempotent + transactions + read-committed | Consumer-side: KCL checkpoints in DynamoDB + idempotent app logic | Same as Kafka |
| Schema evolution | Confluent Schema Registry (de facto standard) | AWS Glue Schema Registry (thinner client integration) | Either Confluent SR or Glue SR |
| Cross-region replication | MirrorMaker 2 (mature) | Custom via Lambda or Firehose (brittle at scale) | MirrorMaker 2 between MSK clusters (mature) |
| Pricing model | EC2 + EBS + ops headcount | Per shard hour + per PUT payload + Enhanced Fan-Out | Per broker hour + storage + data in/out |
What you'll actually be asked
Five questions from the streaming portion of cloud-native DE loops.
When would you pick MSK over self-hosted Kafka?
Strong: when the team needs the Kafka API and ecosystem but doesn't have headcount to operate brokers, ZooKeeper or KRaft, and disk. MSK absorbs broker patching, AZ-aware placement, encryption-at-rest, and basic monitoring. You still own topics, partitions, ACLs, mirroring, Schema Registry, and Connect clusters. Mention MSK Serverless as the further step that absorbs broker capacity planning. Trap: claiming MSK is fully managed Kafka. It's not. AWS manages the bottom half; your topics, partitions, and consumer groups are still yours.
How do you guarantee exactly-once from Kinesis to S3?
Two real paths. (1) Firehose, which is at-least-once on its own, paired with a deduplication key derived from the Kinesis sequence number, idempotent writes to a partitioned S3 prefix, and a downstream consumer that treats S3 object keys as the dedupe boundary. Firehose buffers and writes batches, so dedupe lives at the object level. (2) A Flink job sourced from Kinesis with the checkpoint-driven S3 file sink (StreamingFileSink, or FileSink in newer Flink releases), which moves part files from in-progress to pending to finished as checkpoints complete. Path 1 is simpler; choose path 2 if you also need stateful processing.
Your producer is dropping messages during a broker restart. Diagnose.
Walk the stack. (1) Producer config: acks=0 silently drops on any send failure, and acks=1 loses acknowledged writes when the leader fails before followers replicate. Fix: acks=all paired with min.insync.replicas=2. (2) retries and delivery.timeout.ms: the defaults are high, but delivery.timeout.ms caps total time including retries, so a restart that outlasts it fails the send. (3) If the restarted broker was the controller, failover delays metadata refresh and the producer times out resolving the new leader. (4) ISR shrinkage: if the restart drops the in-sync replica set below min.insync.replicas, acks=all producers are rejected until a replica catches up. L5 reads as: name the config, name the failure mode, name what you'd change.
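The config walk can be turned into a small pre-flight check. This is a sketch, not part of any Kafka client: `durability_warnings` is a hypothetical helper, the `replication.factor` entry is an illustrative field in the topic dict, and the fallback values for `linger.ms`, `request.timeout.ms`, and `delivery.timeout.ms` are assumed defaults that you should confirm for your client version.

```python
def durability_warnings(producer: dict, topic: dict) -> list[str]:
    """Flag producer/topic settings that can lose or block writes on a broker restart."""
    warnings = []

    acks = str(producer.get("acks", "unset"))
    if acks not in ("all", "-1"):
        warnings.append(f"acks={acks}: acknowledged writes can be lost on leader failure")

    min_isr = int(topic.get("min.insync.replicas", 1))
    replicas = int(topic.get("replication.factor", 1))
    if min_isr < 2:
        warnings.append("min.insync.replicas < 2: the leader alone can satisfy acks=all")
    if min_isr >= replicas:
        warnings.append("min.insync.replicas >= replication factor: one restart blocks producers")

    # Assumed defaults below; check your client's documentation.
    linger = int(producer.get("linger.ms", 0))
    request_timeout = int(producer.get("request.timeout.ms", 30_000))
    delivery_timeout = int(producer.get("delivery.timeout.ms", 120_000))
    if delivery_timeout < linger + request_timeout:
        warnings.append("delivery.timeout.ms is shorter than linger.ms + request.timeout.ms")

    return warnings


risky = durability_warnings({"acks": "1"}, {"min.insync.replicas": 1, "replication.factor": 1})
safe = durability_warnings({"acks": "all"}, {"min.insync.replicas": 2, "replication.factor": 3})
for line in risky:
    print("WARN", line)
```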
Walk me through scaling Kinesis from 100 to 10,000 events/sec.
Sizing first. At 1 KB average, 10K events/sec is 10 MB/sec write. The per-shard ceiling is 1 MB/sec, so the floor is 10 shards. Add headroom for hot keys and for the per-shard 1000 records/sec ceiling: 12-15 shards. Then Provisioned vs On-Demand: On-Demand auto-scales but ramps over roughly 15 minutes per doubling, so a synthetic spike will throttle with ProvisionedThroughputExceededException. Provisioned with explicit splits reacts faster if you can predict the ramp. Mention that resharding ends KCL leases for the affected hash range, so you plan deploys around it. Closing signal: name a non-uniform partition key as the actual risk, not total throughput.
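The sizing arithmetic can be sketched as a pair of functions. The per-shard limits come from the text above; `min_shards`, `sized_shards`, and the 30% headroom figure are illustrative assumptions, and 1 KB is treated as 1,000 bytes to keep the numbers round.

```python
import math

SHARD_WRITE_BYTES_PER_SEC = 1_000_000   # 1 MB/sec write ceiling per shard
SHARD_WRITE_RECORDS_PER_SEC = 1_000     # 1000 records/sec write ceiling per shard


def min_shards(events_per_sec: int, avg_event_bytes: int) -> int:
    """Floor on shard count: whichever per-shard ceiling binds first."""
    by_bytes = math.ceil(events_per_sec * avg_event_bytes / SHARD_WRITE_BYTES_PER_SEC)
    by_records = math.ceil(events_per_sec / SHARD_WRITE_RECORDS_PER_SEC)
    return max(by_bytes, by_records, 1)


def sized_shards(events_per_sec: int, avg_event_bytes: int, headroom_pct: int = 30) -> int:
    """Floor plus headroom for hot keys, rounded up with integer math."""
    base = min_shards(events_per_sec, avg_event_bytes)
    return (base * (100 + headroom_pct) + 99) // 100


print(min_shards(10_000, 1_000))    # floor for 10K events/sec at 1 KB
print(sized_shards(10_000, 1_000))  # with 30% headroom
```

Headroom only helps if keys are spread evenly; a single hot key still lands on one shard regardless of the shard count.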
Compare Kafka Connect to Kinesis Firehose for a CDC pipeline.
Kafka Connect with the Debezium Postgres connector reads the WAL, emits change events to Kafka topics with full before/after images, and supports schema evolution via Schema Registry. Firehose is not a CDC tool; it's a delivery layer. The AWS-native CDC story is Database Migration Service (DMS) writing to Kinesis Data Streams, optionally feeding Firehose for landing in S3. Honest comparison: Debezium is the mature, open-source, multi-database CDC standard. DMS + Kinesis works but has thinner schema-evolution semantics and weaker handling of long-running transactions. For non-trivial CDC, Kafka Connect + Debezium is the default; Firehose enters only as a sink.
What you actually pay at three scales
2026 list prices in us-east-1. Excludes inter-AZ data transfer (which on Kafka can dominate). Order-of-magnitude, not bid-quality.
| Scale | Kinesis Data Streams | MSK Provisioned | Self-hosted Kafka (EC2) | Reading |
|---|---|---|---|---|
| 1 MB/sec sustained | ~$22/mo (1 shard) + $14/mo PUTs | n/a (below 3-broker minimum effective scale) | ~$430/mo (3x m7g.large brokers) + storage | Kinesis wins; self-hosted cost is fixed overhead (three-broker minimum, AZ-spread EBS), not throughput |
| 100 MB/sec sustained | ~$2.2K/mo (100 shards) + ~$1.4K/mo PUTs | ~$3.0K/mo (3x m7g.xlarge) + ~$700/mo storage | ~$1.7K/mo (3x m7g.2xlarge on-demand) + EBS + ops | MSK and self-hosted converge; Kinesis pays the managed premium |
| 1 GB/sec sustained | ~$22K/mo (1000 shards) + ~$14K/mo PUTs | ~$11K/mo (6x m7g.4xlarge) + storage | ~$6K/mo (6x m7g.4xlarge reserved) + EBS + ops | Kafka wins decisively; Kinesis economics break on per-shard pricing |
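The Kinesis column scales linearly with throughput, which is why it loses at the high end. The sketch below reproduces the three Kinesis cells; the per-shard and per-MB/sec rates are back-derived from the table itself (roughly $22 per shard-month and $14 of PUT charges per sustained MB/sec), not taken from an AWS price list, and `kinesis_monthly_usd` is a hypothetical helper.

```python
import math

SHARD_USD_PER_MONTH = 22            # back-derived from the table above
PUT_USD_PER_MONTH_PER_MBPS = 14     # back-derived from the table above
SHARD_WRITE_MBPS = 1                # 1 MB/sec write ceiling per shard


def kinesis_monthly_usd(sustained_mbps: int) -> int:
    """Order-of-magnitude Kinesis cost: shard-hours plus PUT payload charges."""
    shards = math.ceil(sustained_mbps / SHARD_WRITE_MBPS)
    return shards * SHARD_USD_PER_MONTH + sustained_mbps * PUT_USD_PER_MONTH_PER_MBPS


for mbps in (1, 100, 1000):
    print(f"{mbps} MB/sec -> ~${kinesis_monthly_usd(mbps):,}/mo")
```

Broker-based options step up in cost only when you add or resize brokers, so their curve flattens where this one keeps climbing.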
Decision matrix
| Situation | Pick | Reason |
|---|---|---|
| AWS-only, under 100 MB/sec, fewer than 5 streaming engineers | Kinesis Data Streams | Lowest ops; fits the team and the throughput budget. |
| AWS-only, just landing data in S3, no streaming compute | Kinesis Firehose | Zero-code delivery with built-in batching and Parquet conversion. |
| AWS-only, over 100 MB/sec, need Kafka Connect or Schema Registry | MSK | Kafka API and ecosystem with AWS handling broker plumbing. |
| Multi-cloud or already running Kafka somewhere | Self-hosted Kafka or Confluent Cloud | Portability; avoid Kinesis re-platform later. |
| Need Kafka Streams or ksqlDB for stateful processing | MSK or self-hosted Kafka | Kinesis has no native equivalent; Flink is the closest. |
| Need exactly-once with stateful stream processing | Either source + Apache Flink | Flink's checkpoint barriers work over both Kafka and Kinesis sources. |
| Multi-region active-active replication required | Kafka with MirrorMaker 2 or Confluent Cloud | Kinesis cross-region replication is custom and brittle. |
| Lambda-triggered processing of every event | Kinesis Data Streams | Native Lambda event source with built-in batch and retry. |
| Sustained throughput above 1 GB/sec at lowest cost | Self-hosted Kafka on EC2 with reserved instances | Per-shard Kinesis economics break; managed-service premium dominates. |
Myth vs reality
Myth: Kinesis is fully managed, so you don't have to worry about scaling
Reality: Provisioned mode is manual sharding; you split and merge shards explicitly, and resharding ends the KCL leases on the parent shards in the affected hash range. On-Demand auto-scales but ramps slowly (roughly doubling every 15 minutes), so a synthetic 10x spike will throttle. Either way you own a runbook.
Myth: Kafka beats Kinesis on cost at any scale
Reality: only past roughly 50 MB/sec sustained. Below that, the three-broker minimum and the engineer-hour tax on self-hosted Kafka make Kinesis cheaper TCO. The cost crossover is real and belongs in your interview answer when someone says 'cheaper' without naming a scale.
Myth: MSK means AWS manages your Kafka
Reality: AWS manages broker hosts, patching, and AZ-aware placement. AWS does not manage your topics, partitions, consumer groups, ACLs, MirrorMaker, Schema Registry, or Connect clusters. The delta vs self-hosted is real but smaller than the marketing suggests. MSK Serverless absorbs broker capacity, which is meaningful.
Myth: Exactly-once is a checkbox both platforms tick the same way
Reality: Kafka exactly-once is producer-side (idempotent producers + transactions + read-committed consumers). Kinesis exactly-once is consumer-side (KCL checkpoints sequence numbers; the application keeps side-effects idempotent). Different mechanics, different failure modes. Saying 'exactly-once' without naming the side is an L4 ceiling.
Myth: Firehose handles ordering and exactly-once into S3 for free
Reality: Firehose buffers by size or time and writes batched objects. Within an object, records preserve arrival order from Kinesis; across objects, ordering is best-effort. Firehose retries on failure and can produce duplicate objects on retry boundaries. Real exactly-once requires deduplication on the consumer side keyed off the Kinesis sequence number or producer-supplied id.
Kafka failure modes that page on-call
ZooKeeper quorum loss
On pre-KRaft clusters. One AZ blips, the ZK ensemble loses majority, Kafka brokers survive but cannot accept metadata changes. Producers keep writing for a while, then start failing on leader-not-available. Fix: KRaft. If you can't move yet, run ZK across three AZs with its own observability and treat ZK incidents as P1 even when Kafka looks fine.
Rebalance storms from consumer churn
A pod restart loop in a stateful consumer group triggers eager rebalances every cycle; the group stops processing each time. Fix: cooperative-sticky assignor (KIP-429), longer session.timeout.ms, and stop deploying with rolling restarts that bounce too fast.
ISR shrinkage under load
Replicas fall behind because broker disk I/O is saturated; ISR drops to 1; min.insync.replicas=2 with acks=all blocks producers. The page is 'producers are stuck'; the cause is replication fetcher lag. Fix: bigger EBS IOPS, segment.bytes tuning, or fewer partitions per broker.
Log compaction lag
On compacted topics with high write rate. The cleaner can't keep up; your changelog topic grows unbounded. Fix: log.cleaner.threads, log.cleaner.io.max.bytes.per.second, and accepting that compacted topics are not free.
Controller failover after a broker JVM pause
The new controller has to refresh metadata for every partition; producers see leader-not-available for the duration. On a 5K-partition cluster, that's minutes of degraded write availability. Mitigation: smaller clusters, or KRaft (faster controller failover).
Kinesis failure modes that page on-call
ProvisionedThroughputExceededException on writes
The producer hits the per-shard 1 MB/sec or 1000 records/sec cap. The math was right on average but a hot key (a single high-traffic user_id) lands all writes on one shard. Fix: salt the partition key with a hash mod-N suffix, re-aggregate downstream. On-Demand mode doesn't save you; it scales total capacity, not per-key fanout.
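Key salting can be sketched as follows. Kinesis places a record by taking the MD5 of its partition key as a 128-bit hash key; the sketch assumes shards split that range evenly, which is not true after uneven splits. `shard_for`, `salted_key`, the `#` separator, and the choice of 8 buckets are all illustrative.

```python
import hashlib
import zlib


def shard_for(partition_key: str, num_shards: int) -> int:
    """Map MD5(partition key) onto evenly split hash-key ranges (an assumption)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")      # 0 <= hash_key < 2**128
    return (hash_key * num_shards) >> 128


def salted_key(key: str, event_id: str, buckets: int) -> str:
    """Append a deterministic hash-mod-N suffix so one hot key fans out."""
    salt = zlib.crc32(event_id.encode("utf-8")) % buckets
    return f"{key}#{salt}"


NUM_SHARDS, BUCKETS = 10, 8
event_ids = [f"evt-{i}" for i in range(1000)]

unsalted_shards = {shard_for("user-42", NUM_SHARDS) for _ in event_ids}
salted_keys = {salted_key("user-42", e, BUCKETS) for e in event_ids}
salted_shards = {shard_for(k, NUM_SHARDS) for k in salted_keys}

print(f"unsalted: {len(unsalted_shards)} shard; "
      f"salted: {len(salted_keys)} keys over {len(salted_shards)} shards")
```

The trade is ordering: events for `user-42` are now ordered only within each bucket, which is why the fix includes re-aggregating downstream.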
GetRecords throttling on reads
The 5 GetRecords/sec/shard ceiling kicks in when too many consumers share a shard. Fix: Enhanced Fan-Out (one HTTP/2 stream per consumer per shard, billed separately) or fewer consumers.
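Expired iterator exception
A consumer holds a shard iterator longer than 5 minutes, and the next GetRecords call fails with ExpiredIteratorException. Consumers that stall between polls hit this and have to request a fresh iterator from their last processed sequence number. Fix: faster polling, or accept the iterator refresh churn.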
Resharding mid-incident
You split a hot shard to recover from throttling; KCL leases for the parent shard are invalidated for the affected hash range; the application restarts processing from a new sequence number. If your downstream isn't idempotent, you double-process. Plan for this; don't discover it during the page.
On-Demand ramp lag
A promo launch multiplies your traffic 10x in 30 seconds. On-Demand ramps over roughly 15 minutes per doubling, so the first few minutes throttle. Fix: pre-warm by running synthetic load, or use Provisioned with explicit pre-splits ahead of the launch.
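Taking the rough "15 minutes per doubling" figure at face value, the throttling window for a spike can be estimated as below. This models the article's rule of thumb, not documented AWS scaling behavior; `doublings_needed` and `minutes_to_absorb` are hypothetical helpers.

```python
def doublings_needed(spike_multiple: float) -> int:
    """How many capacity doublings until capacity covers the spike."""
    doublings, capacity = 0, 1.0
    while capacity < spike_multiple:
        capacity *= 2
        doublings += 1
    return doublings


def minutes_to_absorb(spike_multiple: float, minutes_per_doubling: int = 15) -> int:
    """Rough time before capacity catches up with a sudden spike."""
    return doublings_needed(spike_multiple) * minutes_per_doubling


print(minutes_to_absorb(10))   # 10x spike: four doublings at 15 minutes each
```

Under this model a 10x launch is under-provisioned for about an hour, which is the argument for pre-warming or pre-splitting.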
Kafka exactly-once: producer-side mechanics
Idempotent producer (enable.idempotence=true)
Each producer gets a producer ID and sequence numbers per partition; the broker dedupes on retry within the same session. Default since 3.0.
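The dedupe rule can be illustrated with an in-memory stand-in for one partition. This is a simplified model, not broker code: `PartitionLog` is hypothetical, it tracks only the last sequence number per producer ID, and it ignores batching and producer epochs.

```python
class PartitionLog:
    """Toy partition that dedupes retries by (producer ID, sequence number)."""

    def __init__(self):
        self.records = []
        self.last_seq = {}   # producer_id -> last sequence number appended

    def append(self, producer_id: int, seq: int, value: str) -> bool:
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            return False     # retry of something already written: acknowledge, don't append
        if seq != last + 1:
            raise ValueError(f"sequence gap: expected {last + 1}, got {seq}")
        self.records.append(value)
        self.last_seq[producer_id] = seq
        return True


log = PartitionLog()
first = log.append(producer_id=7, seq=0, value="order-created")
retry = log.append(producer_id=7, seq=0, value="order-created")   # network retry
second = log.append(producer_id=7, seq=1, value="order-paid")
print(log.records)
```

A producer that restarts without a transactional ID gets a new producer ID, so this protection covers retries within a session, not application-level resends.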
Transactions
Wrap multi-partition writes (and consumed offsets) in an atomic commit via the transaction coordinator. A consumer reading with isolation.level=read_committed only sees committed data, never aborted batches.
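What read_committed hides can be shown with a toy single-partition log. This is a sketch with hypothetical names (`TxnLog`, `begin`, `write`, `commit`, `abort`, `read`); it models two behaviors only: aborted records are filtered out, and a read_committed reader stops at the first record of a still-open transaction.

```python
class TxnLog:
    """Toy log where each record belongs to a transaction with a status."""

    def __init__(self):
        self.entries = []    # (txn_id, value) in append order
        self.status = {}     # txn_id -> "open" | "committed" | "aborted"

    def begin(self, txn_id: str) -> None:
        self.status[txn_id] = "open"

    def write(self, txn_id: str, value: str) -> None:
        self.entries.append((txn_id, value))

    def commit(self, txn_id: str) -> None:
        self.status[txn_id] = "committed"

    def abort(self, txn_id: str) -> None:
        self.status[txn_id] = "aborted"

    def read(self, isolation: str = "read_committed") -> list:
        if isolation == "read_uncommitted":
            return [value for _, value in self.entries]
        visible = []
        for txn_id, value in self.entries:
            state = self.status[txn_id]
            if state == "open":
                break                    # cannot read past the first open transaction
            if state == "committed":
                visible.append(value)    # aborted records are skipped
        return visible


log = TxnLog()
log.begin("t1"); log.write("t1", "debit"); log.write("t1", "credit"); log.commit("t1")
log.begin("t2"); log.write("t2", "bad-write"); log.abort("t2")
log.begin("t3"); log.write("t3", "in-flight")

print(log.read("read_committed"))
print(log.read("read_uncommitted"))
```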
Read-process-write loops
The Kafka Streams pattern commits input offsets and output records in the same transaction, so a failure between reading and writing is rolled back atomically.
Cost
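Latency from coordinator round-trips, plus the operational weight of the transaction coordinator, a broker-side component backed by the internal __transaction_state topic, and of tuning transaction timeouts. Worth it for stream processing; overkill for fire-and-forget event ingest.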
Kinesis exactly-once: consumer-side mechanics
KCL checkpointing
Stores the last-processed sequence number per shard lease in DynamoDB. On restart, the consumer resumes from the checkpoint, not from where it died.
Idempotent sinks
Your responsibility. KCL guarantees at-least-once; exactly-once requires the application to dedupe on sequence number or a producer-supplied id.
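The interplay between checkpointing and an idempotent sink can be simulated without AWS. This is a sketch under loud assumptions: `IdempotentSink` and `run_worker` are hypothetical, sequence numbers are small integers rather than real Kinesis sequence strings, and the checkpoint store is a plain dict standing in for the DynamoDB lease table.

```python
class IdempotentSink:
    """Applies each (shard, sequence number) at most once."""

    def __init__(self):
        self.applied = {}
        self.duplicates = 0

    def write(self, shard_id: str, seq: int, value: str) -> bool:
        key = (shard_id, seq)
        if key in self.applied:
            self.duplicates += 1
            return False
        self.applied[key] = value
        return True


def run_worker(shard_id, records, sink, checkpoints, checkpoint_every=3, crash_after=None):
    """Process records past the stored checkpoint; optionally crash mid-run."""
    last = checkpoints.get(shard_id, -1)
    processed = 0
    for seq, value in records:
        if seq <= last:
            continue                      # already checkpointed
        sink.write(shard_id, seq, value)
        processed += 1
        if crash_after is not None and processed == crash_after:
            return                        # simulated crash: progress since last checkpoint is lost
        if processed % checkpoint_every == 0:
            checkpoints[shard_id] = seq
    if records:
        checkpoints[shard_id] = records[-1][0]


records = [(seq, f"event-{seq}") for seq in range(7)]
sink, checkpoints = IdempotentSink(), {}

run_worker("shard-0", records, sink, checkpoints, crash_after=5)   # dies after 5 records
run_worker("shard-0", records, sink, checkpoints)                  # restart replays from checkpoint

print(len(sink.applied), "applied,", sink.duplicates, "replayed duplicates absorbed")
```

The restart replays the records processed after the last checkpoint; without the sink's dedupe those would be applied twice, which is the at-least-once behavior described above.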
Resharding interaction
When you split or merge shards, the parent shard lease ends and KCL takes leases on the children. The checkpoint boundary is preserved, but the application has to handle the new sequence number space cleanly.
Flink as the alternative
If you don't want to manage KCL state, a Flink job with the Kinesis source connector and exactly-once sinks moves the guarantee back into the framework.
MSK and Confluent Cloud: when each fits
MSK Provisioned is the AWS-native default once throughput crosses where Kinesis economics break (~50 MB/sec sustained) and you need the Kafka API. AWS manages broker patching and placement; you keep ownership of topics, partitions, consumer groups, ACLs, MirrorMaker, and Schema Registry. If you already run Connect or Streams, MSK is the lift-and-shift option.
MSK Serverless goes further and absorbs broker capacity planning. You pay per partition-hour and per GB ingested. Trade-off: fewer dials. Good fit for variable workloads where the alternative was over-provisioning MSK Provisioned.
Confluent Cloud is the right answer for multi-cloud shops, for teams that need Confluent-specific features (Stream Designer, fully managed Connect with hundreds of source/sink connectors, Stream Lineage), or for organizations where the Kafka commit log is foundational and the team won't own broker ops. Available on AWS, GCP, Azure. Higher cost than MSK at equivalent throughput; the value is portability and ecosystem.
In an interview, naming MSK Serverless and Confluent Cloud as known options signals you've looked past the two-option framing the question often comes wrapped in. Senior cue.