Kafka vs Kinesis, from someone who has paged for both

Apache Kafka vs AWS Kinesis is one of the most-asked system design decisions in streaming DE interviews. Same job (durable, ordered, partitioned streams), different operational contracts. This guide unpacks the throughput math, real cost at three scales, ordering and exactly-once mechanics, on-call failure modes, and what interviewers grade on.

What this guide actually says

Both platforms solve the same problem (durable, ordered, partitioned streams) with different operational contracts. Kafka gives you primitives; Kinesis gives you a service. The right interview answer articulates which contract fits the team and the workload. Per-unit throughput differs wildly: a Kinesis shard caps at 1 MB/sec write, while a Kafka partition routinely sustains 50-100 MB/sec. That single fact drives most of the cost and design differences. Exactly-once exists on both, but the implementations differ: Kafka uses transactional producers, while Kinesis relies on consumer-side KCL checkpointing plus idempotent application logic. The cost comparison inverts around 50 MB/sec sustained: below that Kinesis is usually cheaper, above it Kafka is.

1 MB/s: per-shard Kinesis write ceiling (hard cap)
~100 MB/s: typical per-partition Kafka write ceiling (bounded by the broker, not a hard cap)
365 d: Kinesis retention maximum
Forever: Kafka retention ceiling (you pay for disk)

What interviewers actually grade on

They don't grade you on which is better. They grade on whether you can articulate trade-offs out loud under pressure. Five surfaces that separate L4 from L5 answers.

Ordering

Per-partition vs per-shard ordering

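Both order within a unit (Kafka partition, Kinesis shard), not across them. L5 nuance: in Kinesis, resharding (split or merge) closes the parent shard and opens child shards with new sequence-number ranges, so ordering for the affected hash ranges survives only if consumers finish the parent before reading the children. KCL handles that hand-off; a hand-rolled consumer has to do it itself. In Kafka, partition count can be increased but never decreased; increasing it changes which partition a key lands on for hash-mod producers, so records for the same key written before and after the change can sit on different partitions, silently breaking ordering across the boundary. Articulating this without prompting is what interviewers grade on.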
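The Kafka half of that nuance is easy to demonstrate. The sketch below is a minimal illustration, not Kafka's partitioner: it uses zlib.crc32 as a stand-in hash (Kafka's default partitioner uses murmur2), and the function names and the user-N keys are hypothetical. The partition numbers will not match a real cluster, but the remapping effect is the same.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Hash-mod placement. crc32 is a stand-in for Kafka's murmur2 hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def moved_keys(keys, old_count: int, new_count: int):
    """Return the keys whose partition changes when the partition count changes."""
    return [k for k in keys
            if partition_for(k, old_count) != partition_for(k, new_count)]

keys = [f"user-{i}" for i in range(1000)]  # hypothetical key set
moved = moved_keys(keys, old_count=6, new_count=8)
print(f"{len(moved)} of {len(keys)} keys change partition going from 6 to 8")
```

Every key in that list has older records on one partition and newer records on another, which is exactly the ordering break an interviewer wants named.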

Exactly-once

Exactly-once isn't the same on both

Kafka exactly-once is a producer story: idempotent writes + transactional semantics across partitions and topics, paired with read-committed consumers. Kinesis exactly-once is a consumer story: KCL checkpoints sequence numbers in DynamoDB and the application is responsible for keeping side-effects idempotent. If you say exactly-once without specifying which side, an L5 interviewer will follow up.

Rebalance

Consumer rebalance behavior

Kafka group rebalances historically stop the world for the entire group (eager protocol); cooperative-sticky assignors (KIP-429) reduce that to incremental moves but only if all consumers in the group support it. Kinesis has no consumer group protocol in the Kafka sense; KCL leases are renewed in DynamoDB and a worker death triggers a single lease takeover, not a fleet-wide rebalance. Matters for any design that talks about graceful deploys.

Replay

Replay and backfill

Kafka lets you reset a consumer group to any offset within retention and replay arbitrarily. Kinesis lets you start consumers from TRIM_HORIZON, AT_TIMESTAMP, or AFTER_SEQUENCE_NUMBER, but is bounded by the 365-day retention ceiling and resharding history. For multi-year backfill, neither is the right answer; you store raw events in S3 and replay from there.

Schema

Schema evolution discipline

Kafka has Confluent Schema Registry as a de facto standard for Avro/Protobuf/JSON Schema with compatibility checks. Kinesis Data Streams has AWS Glue Schema Registry, which works but has thinner client integration. Most production Kinesis pipelines bolt schema validation on at the producer or in a downstream Lambda rather than enforcing it centrally. Naming this gap honestly is a senior signal.

Self-hosted Kafka vs Kinesis Data Streams vs MSK

Three options, eleven dimensions. MSK inherits most of Kafka; you get the API, AWS handles brokers.

| Dimension | Self-hosted Kafka | AWS Kinesis Data Streams | AWS MSK |
| --- | --- | --- | --- |
| Operational model | You manage everything (brokers, KRaft/ZK, disk, security, mirroring) | AWS-managed end-to-end (sharding is your job in Provisioned mode) | AWS manages brokers; you manage topics, ACLs, Schema Registry, Connect |
| API and ecosystem | Native Kafka API; full Connect, Streams, ksqlDB, Schema Registry | Kinesis-specific; KCL/KPL or AWS SDK; Glue Schema Registry | Native Kafka API; Connect and Streams compatible; Glue or Confluent SR |
| Multi-cloud | Yes | AWS only | AWS only (but the API is portable, so app code is) |
| Per-unit write ceiling | Bounded by broker (typically 50-100 MB/sec per partition) | 1 MB/sec per shard (hard cap) | Same as Kafka |
| Per-unit read ceiling | Bounded by broker; many consumers per partition with no extra cost | 2 MB/sec per shard, 5 GetRecords/sec/shard; Enhanced Fan-Out is extra | Same as Kafka |
| Retention | Configurable; days to forever (you pay for disk) | 1 to 365 days | Configurable; days to forever |
| Ordering guarantee | Per-partition, sticky to partition assignment | Per-shard, with hash-range remapping on resharding | Per-partition (same as Kafka) |
| Exactly-once | Producer-side: idempotent + transactions + read-committed | Consumer-side: KCL checkpoints in DynamoDB + idempotent app logic | Same as Kafka |
| Schema evolution | Confluent Schema Registry (de facto standard) | AWS Glue Schema Registry (thinner client integration) | Either Confluent SR or Glue SR |
| Cross-region replication | MirrorMaker 2 (mature) | Custom via Lambda or Firehose (brittle at scale) | MirrorMaker 2 between MSK clusters (mature) |
| Pricing model | EC2 + EBS + ops headcount | Per shard hour + per PUT payload + Enhanced Fan-Out | Per broker hour + storage + data in/out |

What you'll actually be asked

Five questions from the streaming portion of cloud-native DE loops.

Q01

When would you pick MSK over self-hosted Kafka?

Strong: when the team needs the Kafka API and ecosystem but doesn't have headcount to operate brokers, ZooKeeper or KRaft, and disk. MSK absorbs broker patching, AZ-aware placement, encryption-at-rest, and basic monitoring. You still own topics, partitions, ACLs, mirroring, Schema Registry, and Connect clusters. Mention MSK Serverless as the further step that absorbs broker capacity planning. Trap: claiming MSK is fully managed Kafka. It's not. AWS manages the bottom half; your topics, partitions, and consumer groups are still yours.

Q02

How do you guarantee exactly-once from Kinesis to S3?

Two real paths. (1) Firehose, with deduplication handled downstream: carry a dedupe key derived from the Kinesis sequence number, write to a partitioned S3 prefix, and have the downstream consumer treat S3 object keys as the dedupe boundary. Firehose buffers and writes batches, so dedupe lives at the object level. (2) A Flink job sourced from Kinesis with a two-phase-commit S3 sink (StreamingFileSink with checkpointing), which keeps part files in an in-progress or pending state and finalizes them only when the checkpoint completes. Path 1 is simpler; choose path 2 if you also need stateful processing.

Q03

Your producer is dropping messages during a broker restart. Diagnose.

Walk the stack. (1) Producer config: acks=0 never waits for the broker, and acks=1 can lose acknowledged writes when the leader fails before followers replicate them. Fix: acks=all paired with min.insync.replicas=2. (2) retries and delivery.timeout.ms: the retries default is high, but delivery.timeout.ms caps the total time for a send, including retries. (3) If the restarted broker was the controller, failover delays metadata refresh and the producer can time out resolving the new leader. (4) ISR shrinkage: if the restart leaves a partition with fewer than two in-sync replicas, min.insync.replicas=2 with acks=all rejects writes until the ISR recovers. An L5 answer names the config, names the failure mode, and names what you'd change.
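One way to rehearse that walk is as a config audit. The sketch below is a hypothetical helper, not part of any Kafka client: audit_producer_config and the expected_failover_ms threshold are assumptions made for illustration, while acks, retries and delivery.timeout.ms are the real producer settings named above. min.insync.replicas is passed separately because it is a topic or broker setting rather than a producer one.

```python
def audit_producer_config(cfg: dict, min_insync_replicas: int,
                          expected_failover_ms: int = 30_000) -> list:
    """Return durability warnings; an empty list means none were found.

    expected_failover_ms is a hypothetical estimate of how long a leader
    election takes on your cluster, not a Kafka setting.
    """
    warnings = []
    acks = str(cfg.get("acks", "unset"))
    if acks not in ("all", "-1"):
        warnings.append(f"acks={acks}: acknowledged writes can be lost on leader failure")
    if min_insync_replicas < 2:
        warnings.append("min.insync.replicas < 2: acks=all can be satisfied by the leader alone")
    # Only flag retries when it was explicitly set to zero.
    if "retries" in cfg and int(cfg["retries"]) == 0:
        warnings.append("retries=0: a single failed send is dropped")
    timeout = cfg.get("delivery.timeout.ms")
    if timeout is not None and int(timeout) < expected_failover_ms:
        warnings.append("delivery.timeout.ms is shorter than the expected failover window")
    return warnings

risky = {"acks": "1", "retries": 0, "delivery.timeout.ms": 5_000}
safer = {"acks": "all", "delivery.timeout.ms": 120_000}

for warning in audit_producer_config(risky, min_insync_replicas=1):
    print("WARN:", warning)
print("safer config warnings:", audit_producer_config(safer, min_insync_replicas=2))
```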

Q04

Walk me through scaling Kinesis from 100 to 10,000 events/sec.

Sizing first. At 1 KB average, 10K events/sec is 10 MB/sec of writes. The per-shard ceiling is 1 MB/sec, so the floor is 10 shards; the per-shard 1000 records/sec ceiling gives the same floor at this record size. Add 30-50% headroom for hot keys and bursts: 13-15 shards. Then Provisioned vs On-Demand: On-Demand auto-scales but ramps over roughly 15 minutes per doubling, so a synthetic spike will throttle with ProvisionedThroughputExceededException. Provisioned with explicit splits reacts faster if you can predict the ramp. Mention that resharding invalidates KCL leases for the affected hash range, so you plan deploys around it. Closing signal: name a non-uniform partition key as the actual risk, not total throughput.
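The sizing arithmetic is worth being able to reproduce on a whiteboard. A minimal sketch, assuming the per-shard limits quoted in this guide (1 MB/sec and 1000 records/sec), 1 MB = 1000 KB as the guide's own math does, and the 30-50% headroom rule of thumb from the FAQ; the function names are hypothetical.

```python
import math

SHARD_WRITE_MB_PER_SEC = 1.0   # per-shard write ceiling quoted in this guide
SHARD_RECORDS_PER_SEC = 1000   # per-shard record ceiling quoted in this guide

def shard_floor(events_per_sec: float, avg_record_kb: float) -> int:
    """Minimum shard count: the larger of the byte bound and the record bound."""
    mb_per_sec = events_per_sec * avg_record_kb / 1000  # 1 MB = 1000 KB here
    by_bytes = math.ceil(mb_per_sec / SHARD_WRITE_MB_PER_SEC)
    by_records = math.ceil(events_per_sec / SHARD_RECORDS_PER_SEC)
    return max(by_bytes, by_records, 1)

def shard_range(events_per_sec: float, avg_record_kb: float,
                headroom_pct=(30, 50)):
    """Return (floor, low, high) shard counts with percentage headroom applied."""
    floor = shard_floor(events_per_sec, avg_record_kb)
    low = math.ceil(floor * (100 + headroom_pct[0]) / 100)
    high = math.ceil(floor * (100 + headroom_pct[1]) / 100)
    return floor, low, high

print(shard_range(10_000, 1.0))  # (10, 13, 15)
print(shard_floor(100, 1.0))     # 1
```

The floor assumes perfectly uniform keys. A skewed partition key can throttle a single shard long before the stream as a whole approaches these numbers.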

Q05

Compare Kafka Connect to Kinesis Firehose for a CDC pipeline.

Kafka Connect with the Debezium Postgres connector reads the WAL, emits change events to Kafka topics with full before/after images, supports schema evolution via Schema Registry. Firehose is not a CDC tool; it's a delivery layer. The AWS-native CDC story is Database Migration Service (DMS) writing to Kinesis Data Streams, optionally to Firehose for landing in S3. Honest comparison: Debezium is the mature, open-source, multi-database CDC standard. DMS + Kinesis works but has thinner schema-evolution semantics and weaker handling of long-running transactions. For non-trivial CDC, Kafka Connect + Debezium is the default; Firehose enters only as a sink.

What you actually pay at three scales

2026 list prices in us-east-1. Excludes inter-AZ data transfer (which on Kafka can dominate). Order-of-magnitude, not bid-quality.

| Scale | Kinesis Data Streams | MSK Provisioned | Self-hosted Kafka (EC2) | Reading |
| --- | --- | --- | --- | --- |
| 1 MB/sec sustained | ~$22/mo (1 shard) + $14/mo PUTs | n/a (below 3-broker minimum effective scale) | ~$430/mo (3x m7g.large brokers) + storage | Self-hosted noise; AZ-spread EBS dominates |
| 100 MB/sec sustained | ~$2.2K/mo (100 shards) + ~$1.4K/mo PUTs | ~$3.0K/mo (3x m7g.xlarge) + ~$700/mo storage | ~$1.7K/mo (3x m7g.2xlarge on-demand) + EBS + ops | MSK and self-hosted converge; Kinesis pays the managed premium |
| 1 GB/sec sustained | ~$22K/mo (1000 shards) + ~$14K/mo PUTs | ~$11K/mo (6x m7g.4xlarge) + storage | ~$6K/mo (6x m7g.4xlarge reserved) + EBS + ops | Kafka wins decisively; Kinesis economics break on per-shard pricing |

Decision matrix

| Situation | Pick | Reason |
| --- | --- | --- |
| AWS-only, under 100 MB/sec, fewer than 5 streaming engineers | Kinesis Data Streams | Lowest ops; fits the team and the throughput budget. |
| AWS-only, just landing data in S3, no streaming compute | Kinesis Firehose | Zero-code delivery with built-in batching and Parquet conversion. |
| AWS-only, over 100 MB/sec, need Kafka Connect or Schema Registry | MSK | Kafka API and ecosystem with AWS handling broker plumbing. |
| Multi-cloud or already running Kafka somewhere | Self-hosted Kafka or Confluent Cloud | Portability; avoid Kinesis re-platform later. |
| Need Kafka Streams or ksqlDB for stateful processing | MSK or self-hosted Kafka | Kinesis has no native equivalent; Flink is the closest. |
| Need exactly-once with stateful stream processing | Either source + Apache Flink | Flink's checkpoint barriers work over both Kafka and Kinesis sources. |
| Multi-region active-active replication required | Kafka with MirrorMaker 2 or Confluent Cloud | Kinesis cross-region replication is custom and brittle. |
| Lambda-triggered processing of every event | Kinesis Data Streams | Native Lambda event source with built-in batch and retry. |
| Sustained throughput above 1 GB/sec at lowest cost | Self-hosted Kafka on EC2 with reserved instances | Per-shard Kinesis economics break; managed-service premium dominates. |

Myth vs reality

Myth: Kinesis is fully managed, so you don't have to worry about scaling

Reality: Provisioned mode is manual sharding; you split and merge shards explicitly, and resharding invalidates KCL checkpoints in the affected hash range. On-Demand auto-scales but ramps slowly (roughly doubling every 15 minutes), so a synthetic 10x spike will throttle. Either way you own a runbook.
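To put a rough number on the ramp, the sketch below takes the "roughly doubling every 15 minutes" figure at face value and assumes capacity starts at the current traffic level. Both are simplifications: real On-Demand behaviour depends on the stream's recent peak, and the function name is hypothetical.

```python
def minutes_to_absorb(spike_multiple: float, doubling_minutes: int = 15) -> int:
    """Minutes until capacity reaches spike_multiple x baseline,
    if capacity doubles once per interval starting from 1x."""
    capacity, minutes = 1.0, 0
    while capacity < spike_multiple:
        capacity *= 2
        minutes += doubling_minutes
    return minutes

print(minutes_to_absorb(10))  # 60: four doublings (2x, 4x, 8x, 16x)
print(minutes_to_absorb(2))   # 15
```

Under this toy model a 10x spike needs four doublings before capacity catches up, which is why the runbook matters even in On-Demand mode.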

Myth: Kafka beats Kinesis on cost at any scale

Reality: only past roughly 50 MB/sec sustained. Below that, the three-broker minimum and the engineer-hour tax on self-hosted Kafka make Kinesis cheaper TCO. The cost crossover is real and belongs in your interview answer when someone says 'cheaper' without naming a scale.
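The crossover can be sanity-checked from the cost table's own figures. This is a back-of-envelope sketch, assuming Kinesis cost scales linearly at the table's ~$22 shard cost plus ~$14 PUT cost per 1 MB/sec, and comparing against the ~$1.7K/mo three-broker self-hosted row. It ignores EBS, ops time, and inter-AZ transfer, which the cost section says can dominate, so treat it as order-of-magnitude only. The constant and function names are hypothetical.

```python
# Figures taken from this guide's cost table; not live AWS prices.
KINESIS_USD_PER_MB_SEC_MONTH = 22 + 14   # shard-hours + PUTs per 1 MB/sec sustained
SELF_HOSTED_3_BROKER_USD_MONTH = 1700    # 3x m7g.2xlarge row, excluding EBS and ops

def kinesis_monthly_usd(mb_per_sec: float) -> float:
    """Linear Kinesis cost model: one shard per MB/sec of sustained writes."""
    return mb_per_sec * KINESIS_USD_PER_MB_SEC_MONTH

def crossover_mb_per_sec(kafka_fixed_usd: float) -> float:
    """Throughput at which linear Kinesis cost equals a fixed Kafka cluster cost."""
    return kafka_fixed_usd / KINESIS_USD_PER_MB_SEC_MONTH

print(kinesis_monthly_usd(100))  # 3600
print(round(crossover_mb_per_sec(SELF_HOSTED_3_BROKER_USD_MONTH), 1))  # 47.2
```

That lands just under 50 MB/sec, in line with the crossover quoted above. Pricing in engineer hours moves the crossover higher.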

Myth: MSK means AWS manages your Kafka

Reality: AWS manages broker hosts, patching, and AZ-aware placement. AWS does not manage your topics, partitions, consumer groups, ACLs, MirrorMaker, Schema Registry, or Connect clusters. The delta vs self-hosted is real but smaller than the marketing suggests. MSK Serverless absorbs broker capacity, which is meaningful.

Myth: Exactly-once is a checkbox both platforms tick the same way

Reality: Kafka exactly-once is producer-side (idempotent producers + transactions + read-committed consumers). Kinesis exactly-once is consumer-side (KCL checkpoints sequence numbers; the application keeps side-effects idempotent). Different mechanics, different failure modes. Saying 'exactly-once' without naming the side is an L4 ceiling.

Myth: Firehose handles ordering and exactly-once into S3 for free

Reality: Firehose buffers by size or time and writes batched objects. Within an object, records preserve arrival order from Kinesis; across objects, ordering is best-effort. Firehose retries on failure and can produce duplicate objects on retry boundaries. Real exactly-once requires deduplication on the consumer side keyed off the Kinesis sequence number or producer-supplied id.

Kafka failure modes that page on-call

ZooKeeper quorum loss

On pre-KRaft clusters. One AZ blips, the ZK ensemble loses majority, Kafka brokers survive but cannot accept metadata changes. Producers keep writing for a while, then start failing on leader-not-available. Fix: KRaft. If you can't move yet, run ZK across three AZs with its own observability and treat ZK incidents as P1 even when Kafka looks fine.

Rebalance storms from consumer churn

A pod restart loop in a stateful consumer group triggers eager rebalances every cycle; the group stops processing each time. Fix: cooperative-sticky assignor (KIP-429), longer session.timeout.ms, and stop deploying with rolling restarts that bounce too fast.

ISR shrinkage under load

Replicas fall behind because broker disk I/O is saturated; ISR drops to 1; min.insync.replicas=2 with acks=all blocks producers. The page is 'producers are stuck'; the cause is replication fetcher lag. Fix: more provisioned EBS IOPS, segment.bytes tuning, or fewer partitions per broker.

Log compaction lag

On compacted topics with high write rate. The cleaner can't keep up; your changelog topic grows unbounded. Fix: log.cleaner.threads, log.cleaner.io.max.bytes.per.second, and accepting that compacted topics are not free.

Controller failover after a broker JVM pause

The new controller has to refresh metadata for every partition; producers see leader-not-available for the duration. On a 5K-partition cluster, that's minutes of degraded write availability. Mitigation: smaller clusters, or KRaft (faster controller failover).

Kinesis failure modes that page on-call

ProvisionedThroughputExceededException on writes

The producer hits the per-shard 1 MB/sec or 1000 records/sec cap. The math was right on average but a hot key (a single high-traffic user_id) lands all writes on one shard. Fix: salt the partition key with a hash mod-N suffix, re-aggregate downstream. On-Demand mode doesn't save you; it scales total capacity, not per-key fanout.
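Key salting can be sketched in a few lines. This is a minimal illustration, assuming each record carries some unique id to derive the salt from; the function names, the "#" separator, and the choice of zlib.crc32 as the hash are all hypothetical.

```python
import zlib

def salted_partition_key(key: str, record_id: str, num_salts: int) -> str:
    """Spread one hot key across num_salts sub-keys, chosen by hashing the record id."""
    salt = zlib.crc32(record_id.encode("utf-8")) % num_salts
    return f"{key}#{salt}"

def original_key(salted: str) -> str:
    """Strip the salt suffix so downstream consumers can re-aggregate by the real key."""
    return salted.rsplit("#", 1)[0]

salted = [salted_partition_key("user-42", f"evt-{i}", num_salts=8) for i in range(1000)]
print(sorted(set(salted)))      # the distinct salted sub-keys for the hot user
print(original_key(salted[0]))  # user-42
```

The trade-off is that ordering now holds only within each salted sub-key, so the downstream re-aggregation step has to tolerate interleaving across sub-keys.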

GetRecords throttling on reads

The 5 GetRecords/sec/shard ceiling kicks in when too many consumers share a shard. Fix: Enhanced Fan-Out (one HTTP/2 stream per consumer per shard, billed separately) or fewer consumers.

Expired shard iterator (ExpiredIteratorException)

A consumer holds a shard iterator longer than 5 minutes. KCL consumers that lag past their lease get this and have to recover. Fix: faster polling, or accept the iterator refresh churn.

Resharding mid-incident

You split a hot shard to recover from throttling; KCL leases for the parent shard are invalidated for the affected hash range; the application restarts processing from a new sequence number. If your downstream isn't idempotent, you double-process. Plan for this; don't discover it during the page.

On-Demand ramp lag

A promo launch multiplies your traffic 10x in 30 seconds. On-Demand ramps over roughly 15 minutes per doubling, so the first few minutes throttle. Fix: pre-warm by running synthetic load, or use Provisioned with explicit pre-splits ahead of the launch.

Kafka exactly-once: producer-side mechanics

Idempotent producer (enable.idempotence=true)

Each producer gets a producer ID and sequence numbers per partition; the broker dedupes on retry within the same session. Enabled by default since Kafka 3.0.

Transactions

Wrap multi-partition writes (and consumed offsets) in an atomic commit via the transaction coordinator. A consumer reading with isolation.level=read_committed only sees committed data, never aborted batches.
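What read_committed changes is easiest to see on a toy log. The model below is an in-memory illustration only, not the Kafka protocol or client API: ToyLog and its methods are hypothetical, and a single list stands in for one partition. It mirrors two behaviours: aborted records are filtered out, and the reader does not advance past the first still-open transaction.

```python
from dataclasses import dataclass, field

@dataclass
class ToyLog:
    """Toy single-partition log; a hypothetical model, not Kafka's implementation."""
    entries: list = field(default_factory=list)   # (txn_id, value) in append order
    committed: set = field(default_factory=set)
    aborted: set = field(default_factory=set)

    def append(self, txn_id: str, value: str) -> None:
        self.entries.append((txn_id, value))

    def commit(self, txn_id: str) -> None:
        self.committed.add(txn_id)

    def abort(self, txn_id: str) -> None:
        self.aborted.add(txn_id)

    def read(self, isolation_level: str = "read_uncommitted") -> list:
        if isolation_level == "read_uncommitted":
            return [value for _, value in self.entries]
        visible = []
        for txn_id, value in self.entries:
            if txn_id in self.aborted:
                continue          # aborted batches are never delivered
            if txn_id not in self.committed:
                break             # stop at the first open transaction
            visible.append(value)
        return visible

log = ToyLog()
log.append("t1", "a")
log.append("t2", "b")
log.append("t1", "c")
log.commit("t1")
log.abort("t2")
print(log.read("read_uncommitted"))  # ['a', 'b', 'c']
print(log.read("read_committed"))    # ['a', 'c']
```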

Read-process-write loops

The Kafka Streams pattern commits input offsets and output records in the same transaction, so a failure between reading and writing is rolled back atomically.

Cost

Latency from coordinator round-trips, plus the operational weight of running a transaction coordinator. Worth it for stream processing; overkill for fire-and-forget event ingest.

Kinesis exactly-once: consumer-side mechanics

KCL checkpointing

Stores the last-processed sequence number per shard lease in DynamoDB. On restart, the consumer resumes from the checkpoint, not from where it died.

Idempotent sinks

Your responsibility. KCL guarantees at-least-once; exactly-once requires the application to dedupe on sequence number or a producer-supplied id.
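A minimal sketch of that dedupe, assuming the sink can enforce uniqueness on (shard id, sequence number). The IdempotentSink class is a hypothetical in-memory stand-in for whatever store you write to, and the records are made-up examples of a replay after a crash that happened before the checkpoint was saved.

```python
class IdempotentSink:
    """In-memory stand-in for a sink with a unique key on (shard_id, sequence_number)."""

    def __init__(self):
        self.rows = {}

    def write(self, shard_id: str, sequence_number: str, payload: dict) -> bool:
        """Apply the record once; return False if it was already applied."""
        key = (shard_id, sequence_number)
        if key in self.rows:
            return False  # replay after a restart: skip the side-effect
        self.rows[key] = payload
        return True

# Hypothetical records; Kinesis sequence numbers are strings, unique within a shard.
records = [
    ("shard-0", "101", {"order": 1}),
    ("shard-0", "102", {"order": 2}),
    ("shard-0", "103", {"order": 3}),
]

sink = IdempotentSink()
first_pass = [sink.write(*r) for r in records]   # processed, then crash before checkpoint
second_pass = [sink.write(*r) for r in records]  # at-least-once redelivery from old checkpoint
print(first_pass, second_pass, len(sink.rows))
```

The second pass is a no-op, which is what turns KCL's at-least-once delivery into effectively-once side-effects.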

Resharding interaction

When you split or merge shards, the parent shard lease ends and KCL takes leases on the children. The checkpoint boundary is preserved, but the application has to handle the new sequence number space cleanly.

Flink as the alternative

If you don't want to manage KCL state, a Flink job with the Kinesis source connector and exactly-once sinks moves the guarantee back into the framework.

MSK and Confluent Cloud: when each fits

MSK Provisioned is the AWS-native default once throughput crosses where Kinesis economics break (~50 MB/sec sustained) and you need the Kafka API. AWS manages broker patching and placement; you keep ownership of topics, partitions, consumer groups, ACLs, MirrorMaker, and Schema Registry. If you already run Connect or Streams, MSK is the lift-and-shift option.

MSK Serverless goes further and absorbs broker capacity planning. You pay per partition-hour and per GB ingested. Trade-off: fewer dials. Good fit for variable workloads where the alternative was over-provisioning MSK Provisioned.

Confluent Cloud is the right answer for multi-cloud shops, for teams that need Confluent-specific features (Stream Designer, fully managed Connect with hundreds of source/sink connectors, Stream Lineage), or for organizations where the Kafka commit log is foundational and the team won't own broker ops. Available on AWS, GCP, Azure. Higher cost than MSK at equivalent throughput; the value is portability and ecosystem.

In an interview, naming MSK Serverless and Confluent Cloud as known options signals you've looked past the two-option framing the question often comes wrapped in. Senior cue.

Kafka vs Kinesis FAQ

Is Kafka or Kinesis more popular in 2026?
Kafka has more total deployment globally; Kinesis is more common in pure-AWS shops. MSK adoption is growing as AWS-native teams want the Kafka API without the operational burden. Kinesis Data Streams adoption is flat or slightly declining as MSK closes the gap. Confluent Cloud holds the high end of multi-cloud Kafka.

Which is harder to operate, self-hosted Kafka or Kinesis?
Self-hosted Kafka is significantly harder: broker capacity planning, version upgrades, ZooKeeper or KRaft management, monitoring, security configuration, MirrorMaker. Kinesis is managed end-to-end, although Provisioned mode still leaves sharding to you. MSK is in between (managed brokers; you handle topic configuration, partitions, consumer groups, ACLs, schema registry).

Can I switch from Kinesis to Kafka later?
Yes, but it's a significant migration. APIs differ; producer and consumer code needs rewriting. The connector ecosystem differs. State migration (Kafka offsets vs Kinesis sequence numbers) is non-trivial. Plan 3-6 months for a serious migration, including dual-writes, cutover, and a backout plan.

Does the DE interview always ask about Kafka vs Kinesis?
In streaming-heavy loops, yes. In batch-heavy or analytics-engineer loops, less common. Most cloud-native DE system design rounds reference one or both as the message broker layer.

What's the cost difference at 100 MB/sec?
Roughly: self-hosted Kafka ~$1.7K/mo (3x m7g.2xlarge on-demand) plus EBS and engineer time. Kinesis Data Streams ~$3.6K/mo (100 shards plus PUT charges). MSK ~$3.7K/mo (3x m7g.xlarge at ~$3.0K plus ~$700 storage). The crossover depends heavily on operational expertise; once you price an SRE at $25/hour-equivalent, self-hosted is only cheaper if you'd have hired that headcount anyway.

What's the right shard count for X events/sec on Kinesis?
Floor: ceil(write_MB_per_sec / 1) shards by throughput and ceil(records_per_sec / 1000) by record count; take the max. Add 30-50% headroom for hot keys and bursts. Example: 10 MB/sec at 1 KB = 10K records/sec = 10 shards floor, 13-15 with headroom. The trap is sizing for the average and getting throttled by a hot partition key, not by total throughput.