Tooling Decision Guide

Kafka vs Kinesis, from someone who has paged for both

Apache Kafka vs AWS Kinesis is one of the most-asked system design decisions in streaming Data Engineer interviews. Same job (durable, ordered, partitioned streams), different operational contracts. This guide unpacks the throughput math, real cost at three scales, ordering and exactly-once mechanics, on-call failure modes, and what interviewers grade on. Pair with our data engineer interview prep hub.
The Short Answer
Pick Kinesis if you are AWS-only, under 100 MB/sec, and the team has fewer than 5 streaming engineers. Pick MSK if you are AWS-only and need the Kafka API or ecosystem (Connect, Schema Registry, Streams). Pick self-hosted Kafka or Confluent Cloud if you are multi-cloud, sustained over roughly 100 MB/sec, or already have Kafka expertise on the team. Use Firehose only as a sink, never as the stream itself.
Updated April 2026 · By The DataDriven Team
1 MB/s: per Kinesis shard write ceiling
100 MB/s: per Kafka partition write ceiling
365 d: Kinesis retention max
Forever: Kafka retention ceiling
What this guide actually says
  1. The two platforms solve the same problem (durable, ordered, partitioned streams) but ship with different operational contracts. Kafka gives you primitives. Kinesis gives you a service. The right answer in an interview is to articulate which contract fits the team and the workload, not to pick a winner.
  2. Per-unit throughput is wildly different. A Kinesis shard caps at 1 MB/sec write and 2 MB/sec read. A Kafka partition is bounded by the broker, not the partition itself, and routinely sustains 50 to 100 MB/sec. That single fact drives most of the cost and design differences downstream.
  3. Exactly-once semantics exist on both, but the implementations are different beasts. Kafka uses transactional producers and idempotent writes. Kinesis leans on the consumer side via KCL checkpointing against sequence numbers. Naming the mechanism is the L5 ceiling.
  4. The cost picture inverts around 50 MB/sec sustained. Below that, Kinesis (and especially Firehose) is cheaper TCO once you price ops time. Above that, Kafka (and at the high end, self-hosted Kafka) wins decisively. MSK splits the difference in AWS-native shops.
  5. Replay, backfill, and schema evolution are where teams actually get burned. Kinesis caps retention at 365 days and resharding invalidates KCL checkpoints. Kafka retains as long as you pay for disk but requires you to operate Schema Registry yourself. Real designs commit to one of these worlds.
  6. Interviewers do not grade you on which is better. They grade you on whether you can name the trade-offs out loud under pressure: ordering, rebalance behavior, hot-shard mitigation, exactly-once mechanics, cross-region replication, and what happens at 3 AM.
The framing

What interviewers actually grade on

They do not grade you on which platform is better. They grade you on whether you can articulate the trade-offs out loud under pressure. These are the five surfaces that separate L4 from L5 answers.

Ordering

Per-partition vs per-shard ordering

Both platforms order within a unit (Kafka partition, Kinesis shard) and not across them. The L5 nuance: in Kinesis, resharding (split or merge) ends ordering for affected hash ranges and forces consumers to recover from a new sequence number. In Kafka, partition count is sticky once set; increasing it changes which key lands on which partition for hash-mod producers, which silently breaks ordering for the same key across the boundary. Articulating this without prompting is the signal interviewers grade on.
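The key-remapping effect is easy to see in a few lines. The sketch below is only an illustration under stated assumptions: `partition_for` is a hypothetical helper, and `zlib.crc32` stands in for the client's real hash (the Java client's default partitioner uses murmur2), so the exact keys that move will differ from a real cluster. The point is the shape of the problem, not the numbers.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Hash-mod partitioning. crc32 is a stand-in for the client's real hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical key set; any stable set of keys shows the same effect.
keys = [f"user-{i}" for i in range(1000)]

before = {k: partition_for(k, 6) for k in keys}  # topic created with 6 partitions
after = {k: partition_for(k, 8) for k in keys}   # partition count raised to 8

# Keys whose new events now land on a different partition than their old events.
moved = [k for k in keys if before[k] != after[k]]
moved_fraction = len(moved) / len(keys)

print(f"{moved_fraction:.0%} of keys changed partition going from 6 to 8")
```

Every key in `moved` has older events on one partition and newer events on another, and nothing orders those two partitions relative to each other. That is the silent ordering break across the boundary.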
Exactly-once

Exactly-once is not the same on both

Kafka exactly-once is a producer story: idempotent writes plus transactional semantics across partitions and topics, paired with read-committed consumers. Kinesis exactly-once is a consumer story: KCL checkpoints sequence numbers in DynamoDB and the application is responsible for keeping side-effects idempotent. If you say exactly-once without specifying which side, an L5 interviewer will follow up.
Rebalance

Consumer rebalance behavior

Kafka group rebalances historically stop the world for the entire group (eager protocol); cooperative sticky assignors (KIP-429) reduce that to incremental moves but only if all consumers in the group support it. Kinesis has no consumer group protocol in the Kafka sense; KCL leases are renewed in DynamoDB and a worker death triggers a single lease takeover, not a fleet-wide rebalance. Knowing this matters for any design that talks about graceful deploys.
Replay

Replay and backfill

Kafka lets you reset a consumer group to any offset within retention and replay arbitrarily. Kinesis lets you start consumers from TRIM_HORIZON, AT_TIMESTAMP, or AFTER_SEQUENCE_NUMBER, but is bounded by the 365-day retention ceiling and resharding history. For multi-year backfill, neither is the right answer; you store raw events in S3 and replay from there.
Schema

Schema evolution discipline

Kafka has Confluent Schema Registry as a de-facto standard for Avro/Protobuf/JSON Schema with compatibility checks. Kinesis Data Streams has AWS Glue Schema Registry, which works but has thinner client integration. Most production Kinesis pipelines bolt schema validation in at the producer or downstream Lambda rather than at the broker. Naming this gap honestly is a senior signal.
L5 cue
When you say “exactly-once,” immediately follow with “producer-side via transactions on Kafka, or consumer-side via KCL checkpoints on Kinesis.” The qualifier is the ceiling. Without it the interviewer cannot tell whether you know the mechanism or just the marketing.
Side by side

Self-hosted Kafka vs Kinesis Data Streams vs MSK

Three options, eleven dimensions. The MSK column inherits most of Kafka and that is the point: you get the API, AWS handles the brokers.

Dimension | Self-hosted Kafka | AWS Kinesis Data Streams | AWS MSK
Operational model | You manage everything (brokers, KRaft/ZK, disk, security, mirroring) | AWS-managed end-to-end (sharding is your job in Provisioned mode) | AWS manages brokers; you manage topics, ACLs, Schema Registry, Connect
API and ecosystem | Native Kafka API; full Connect, Streams, ksqlDB, Schema Registry | Kinesis-specific; KCL/KPL or AWS SDK; Glue Schema Registry | Native Kafka API; Connect and Streams compatible; Glue or Confluent SR
Multi-cloud | Yes | AWS only | AWS only (but the API is portable, so app code is)
Per-unit write ceiling | Bounded by broker (typically 50-100 MB/sec per partition) | 1 MB/sec per shard (hard cap) | Same as Kafka
Per-unit read ceiling | Bounded by broker; many consumers per partition with no extra cost | 2 MB/sec per shard, 5 GetRecords/sec/shard; Enhanced Fan-Out is extra | Same as Kafka
Retention | Configurable; days to forever (you pay for disk) | 1 to 365 days | Configurable; days to forever
Ordering guarantee | Per-partition, sticky to partition assignment | Per-shard, with hash-range remapping on resharding | Per-partition (same as Kafka)
Exactly-once | Producer-side: idempotent + transactions + read-committed | Consumer-side: KCL checkpoints in DynamoDB + idempotent app logic | Same as Kafka
Schema evolution | Confluent Schema Registry (de-facto standard) | AWS Glue Schema Registry (thinner client integration) | Either Confluent SR or Glue SR
Cross-region replication | MirrorMaker 2 (mature) | Custom via Lambda or Firehose (brittle at scale) | MirrorMaker 2 between MSK clusters (mature)
Pricing model | EC2 + EBS + ops headcount | Per shard hour + per PUT payload + Enhanced Fan-Out | Per broker hour + storage + data in/out
The pull quote

One sentence to remember

Kafka is a database that pretends to be a queue. Kinesis is a queue that pretends to be a database. Pick which lie matches your operational tolerance.
What we tell engineers who ask which to learn first
The 5 questions

What you will actually be asked

These five appear in the streaming portion of cloud-native Data Engineer loops. Each entry is a sample answer outline at L5 ceiling.

Q01

When would you pick MSK over self-hosted Kafka?

Strong answer: when the team needs the Kafka API and ecosystem but does not have headcount to operate brokers, ZooKeeper or KRaft, and disk. MSK absorbs broker patching, AZ-aware placement, encryption-at-rest, and basic monitoring. You still own topics, partitions, ACLs, mirroring, Schema Registry, and Connect clusters. Mention MSK Serverless as the further step that absorbs broker capacity planning. The trap is claiming MSK is fully managed Kafka. It is not. AWS manages the bottom half; your topics, partitions, and consumer groups are still yours.
Q02

How do you guarantee exactly-once from Kinesis to S3?

Two real paths. Path one: Kinesis Data Firehose with a deduplication key derived from the Kinesis sequence number, plus idempotent writes to a partitioned S3 prefix and a downstream consumer that treats S3 object keys as the dedupe boundary. Firehose buffers and writes batches, so dedupe lives at the object level, not the record level. Path two: a Flink job sourced from Kinesis with a checkpointed, two-phase-commit-style file sink to S3 (FileSink, which supersedes StreamingFileSink). It gets exactly-once by moving part files from in-progress to pending to finished as checkpoints complete. Path one is simpler; path two is what you reach for if you also need stateful processing.
Q03

Your producer is dropping messages during a broker restart. Diagnose.

Walk the stack. First, check the producer config: acks=1 or acks=0 will silently drop on leader failure. The fix is acks=all paired with min.insync.replicas=2. Second, check retries and delivery.timeout.ms; the default retries value is high, but delivery.timeout.ms caps total time including retries. Third, check whether the broker that restarted was the controller; if so, controller failover delays metadata refresh and the producer can time out resolving the new leader. Fourth, ISR shrinkage: if the restart leaves a partition with only one in-sync replica, min.insync.replicas=2 rejects acks=all writes until a second replica catches up. The L5 answer reads as: name the config, name the failure mode, name what you would change.
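To make the acks and ISR interaction concrete, here is a small sketch. It is a toy model, not a client: `producer_config` is a plain dict using the standard producer property names, and `write_outcome` is a hypothetical helper that only encodes the rules described above. The timeout value is illustrative.

```python
# Durability-first producer settings, written as a plain dict.
# Keys are standard Kafka producer property names; values are illustrative.
producer_config = {
    "acks": "all",                   # wait for the in-sync replica set, not just the leader
    "enable.idempotence": True,      # broker dedupes retried batches
    "delivery.timeout.ms": 120_000,  # upper bound on send time, retries included
}


def write_outcome(acks: str, isr_size: int, min_insync_replicas: int) -> str:
    """Toy model of what happens to one produce request (hypothetical helper)."""
    if isr_size == 0:
        return "unavailable"   # no in-sync leader to write to
    if acks == "all" and isr_size < min_insync_replicas:
        return "rejected"      # broker refuses the write; the producer retries or times out
    if acks == "all":
        return "durable"       # acked only after the in-sync replicas have it
    return "at-risk"           # acks=0 or acks=1: acked before followers replicate


for isr in (3, 2, 1):
    print(isr, write_outcome(producer_config["acks"], isr, min_insync_replicas=2))
```

Note where min.insync.replicas lives: it is a broker or topic setting, not a producer one, which is why it is an argument to the helper rather than a key in the dict.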
Q04

Walk me through scaling Kinesis from 100 to 10,000 events/sec.

Sizing math first. At 1 KB average payload, 10K events/sec is 10 MB/sec write. Per-shard write ceiling is 1 MB/sec, so the floor is 10 shards, but you size for hot-key headroom and the per-shard 1000 records/sec ceiling, so 13 to 15 shards (30 to 50 percent headroom). Then choose between Provisioned and On-Demand: On-Demand auto-scales but ramps at roughly 15 minutes per doubling, so a synthetic spike will get throttled with PROVISIONED_THROUGHPUT_EXCEEDED. Provisioned with explicit splits is faster to react if you can predict the ramp. Mention that resharding invalidates KCL leases for the affected hash range, so you plan deploys around it. Closing signal: name a non-uniform partition key as the actual risk, not total throughput.
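The ramp arithmetic is worth doing out loud. The sketch below takes this guide's rule of thumb (roughly one doubling per 15 minutes) at face value; `on_demand_ramp_minutes` is a hypothetical helper and the doubling interval is an assumption you should check against current AWS documentation.

```python
import math


def on_demand_ramp_minutes(current_rate: float, target_rate: float,
                           minutes_per_doubling: int = 15) -> int:
    """Rough time for capacity to catch up if it can only double per interval.

    Assumes the rule of thumb used in this guide; not an AWS-published formula.
    """
    if target_rate <= current_rate:
        return 0
    doublings = math.ceil(math.log2(target_rate / current_rate))
    return doublings * minutes_per_doubling


# 100 -> 10,000 events/sec is a 100x jump: 7 doublings.
print(on_demand_ramp_minutes(100, 10_000))  # 105
```

A 100x jump needs seven doublings, so well over an hour of potential throttling under this assumption. That is the argument for pre-splitting in Provisioned mode when the ramp is predictable.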
Q05

Compare Kafka Connect to Kinesis Firehose for a CDC pipeline.

Kafka Connect with the Debezium Postgres connector reads the WAL, emits change events to Kafka topics with full before-and-after images, and supports schema evolution via Schema Registry. Kinesis Firehose is not a CDC tool; it is a delivery layer. The AWS-native CDC story is Database Migration Service (DMS) writing to Kinesis Data Streams, optionally to Firehose for landing in S3. The honest comparison: Debezium is the mature, open-source, multi-database CDC standard; DMS plus Kinesis works but has thinner schema-evolution semantics and weaker handling of long-running transactions. For a CDC pipeline at any non-trivial complexity, Kafka Connect plus Debezium is the default; Firehose enters the picture only as a sink.
Cost math

What you actually pay at three scales

2026 list prices in us-east-1. Numbers exclude data transfer between AZs (which on Kafka can dominate). Treat as order-of-magnitude, not bid-quality estimates.

Scale | Kinesis Data Streams | MSK Provisioned | Self-hosted Kafka (EC2) | Reading
1 MB/sec sustained | ~$22/mo (1 shard) + $14/mo PUTs | n/a (below 3-broker minimum effective scale) | ~$430/mo (3x m7g.large brokers) + storage | Self-hosted noise; AZ-spread EBS dominates
100 MB/sec sustained | ~$2.2K/mo (100 shards) + ~$1.4K/mo PUTs | ~$3.0K/mo (3x m7g.xlarge) + ~$700/mo storage | ~$1.7K/mo (3x m7g.2xlarge on-demand) + EBS + ops | MSK and self-hosted converge; Kinesis pays the managed premium
1 GB/sec sustained | ~$22K/mo (1000 shards) + ~$14K/mo PUTs | ~$11K/mo (6x m7g.4xlarge) + storage | ~$6K/mo (6x m7g.4xlarge reserved) + EBS + ops | Kafka wins decisively; Kinesis economics break on per-shard pricing
Watch out
None of these prices include the engineer-hour tax on self-hosted Kafka. Add one-third to one-half of an SRE FTE to the self-hosted column to get apples-to-apples TCO. That is what flips the under-50-MB/sec calculus toward Kinesis.
The shape of the choice

Decision matrix

If your situation looks like the left column, the middle column is the default and the right column is the reason you would defend in an interview.

If your situation is | Pick | Why
AWS-only, under 100 MB/sec, fewer than 5 streaming engineers | Kinesis Data Streams | Lowest ops; fits the team and the throughput budget.
AWS-only, just landing data in S3, no streaming compute | Kinesis Firehose | Zero-code delivery with built-in batching and Parquet conversion.
AWS-only, over 100 MB/sec, need Kafka Connect or Schema Registry | MSK | Kafka API and ecosystem with AWS handling broker plumbing.
Multi-cloud or already running Kafka somewhere | Self-hosted Kafka or Confluent Cloud | Portability; avoid Kinesis re-platform later.
Need Kafka Streams or ksqlDB for stateful processing | MSK or self-hosted Kafka | Kinesis has no native equivalent; Flink is the closest.
Need exactly-once with stateful stream processing | Either source plus Apache Flink | Flink's checkpoint barriers work over both Kafka and Kinesis sources.
Multi-region active-active replication required | Kafka with MirrorMaker 2 or Confluent Cloud | Kinesis cross-region replication is custom and brittle.
Lambda-triggered processing of every event | Kinesis Data Streams | Native Lambda event source with built-in batch and retry.
Sustained throughput above 1 GB/sec at lowest cost | Self-hosted Kafka on EC2 with reserved instances | Per-shard Kinesis economics break; managed-service premium dominates.
Practice

A streaming pipeline problem to chew on

Architecture interviews put real numbers on the table and ask you to defend a design. This one is the closest analog to the Kafka vs Kinesis discussion.

Architecture · Try this problem
Two Hundred Million Redirects

Billions of clicks. One tiny code. Two very different clocks.

Myth busting

What people think vs what actually happens

Five myths that show up in interview answers and hallway debates. The reality column is what a senior interviewer expects you to say instead.

The Myth
Kinesis is fully managed, so you do not have to worry about scaling.
The Reality
Provisioned mode is manual sharding; you split and merge shards explicitly, and resharding invalidates KCL checkpoints in the affected hash range. On-Demand auto-scales but ramps slowly (roughly doubling every 15 minutes), so a synthetic 10x spike will throttle. Either way you own a runbook.
The Myth
Kafka beats Kinesis on cost at any scale.
The Reality
Only past roughly 50 MB/sec sustained. Below that, the three-broker minimum and the engineer-hour tax on self-hosted Kafka make Kinesis cheaper TCO. The cost crossover is real and belongs in your interview answer when someone says “cheaper” without naming a scale.
The Myth
MSK means AWS manages your Kafka.
The Reality
AWS manages broker hosts, patching, and AZ-aware placement. AWS does not manage your topics, partitions, consumer groups, ACLs, MirrorMaker, Schema Registry, or Connect clusters. The delta vs self-hosted is real but smaller than the marketing copy suggests. MSK Serverless absorbs broker capacity, which is a meaningful step further.
The Myth
Exactly-once is a checkbox both platforms tick the same way.
The Reality
Kafka exactly-once is producer-side: idempotent producers plus transactions plus read-committed consumers. Kinesis exactly-once is consumer-side: KCL checkpoints sequence numbers and the application keeps side-effects idempotent. The mechanics are different and the failure modes are different. Saying just “exactly-once” without naming the side is an L4 ceiling.
The Myth
Kinesis Firehose handles ordering and exactly-once into S3 for free.
The Reality
Firehose buffers by size or time and writes batched objects. Within an object, records preserve arrival order from Kinesis; across objects, ordering is best-effort. Firehose retries on failure and can produce duplicate objects on retry boundaries. Real exactly-once requires deduplication on the consumer side keyed off the Kinesis sequence number or producer-supplied id.
The 3 AM page

What goes wrong when no one is watching

The honest test of any streaming platform is what it looks like under failure. These are the runbooks you wish you had read before you needed them.

Kafka, in production

Five failure modes you have to know cold to defend a Kafka design.

  • ZooKeeper quorum loss on pre-KRaft clusters. One AZ blips, the ZK ensemble loses majority, Kafka brokers survive but cannot accept metadata changes. Producers keep writing for a while, then start failing on leader-not-available. Fix: KRaft. If you cannot move yet, run ZK across three AZs with its own observability and treat ZK incidents as P1 even when Kafka looks fine.
  • Rebalance storms from consumer churn. A pod restart loop in a stateful consumer group triggers eager rebalances every cycle; the group stops processing each time. Fix: cooperative-sticky assignor (KIP-429), longer session.timeout.ms, and stop deploying with rolling restarts that bounce too fast.
  • ISR shrinkage under load. Replicas fall behind because broker disk I/O is saturated; ISR drops to 1; min.insync.replicas=2 with acks=all blocks producers. The page is “producers are stuck”; the cause is replication fetcher lag. Fix: bigger EBS IOPS, segment.bytes tuning, or fewer partitions per broker.
  • Log compaction lag on compacted topics with high write rate. The cleaner cannot keep up; your changelog topic grows unbounded. Fix: log.cleaner.threads, log.cleaner.io.max.bytes.per.second, and accepting that compacted topics are not free.
  • Controller failover after a broker JVM pause. The new controller has to refresh metadata for every partition; producers see leader-not-available for the duration. On a 5K-partition cluster, that is minutes of degraded write availability. Mitigation: smaller clusters, or KRaft (faster controller failover).
Kinesis, in production

Five throttling and lease pathologies that page Kinesis-on-call.

  • PROVISIONED_THROUGHPUT_EXCEEDED on writes. The producer hits the per-shard 1 MB/sec or 1000 records/sec cap. The math was right on average but a hot key (a single high-traffic user_id) lands all writes on one shard. Fix: salt the partition key with a hash mod-N suffix, re-aggregate downstream. On-Demand mode does not save you; it scales total capacity, not per-key fanout.
  • GetRecords throttling on reads. The 5 GetRecords/sec/shard ceiling kicks in when too many consumers share a shard. Fix: Enhanced Fan-Out (one HTTP/2 stream per consumer per shard, billed separately) or fewer consumers.
  • Stale iterator exception. A consumer holds a shard iterator longer than 5 minutes. KCL consumers that lag past their lease get this and have to recover. Fix: faster polling, or accept the iterator refresh churn.
  • Resharding mid-incident. You split a hot shard to recover from throttling; KCL leases for the parent shard are invalidated for the affected hash range; the application restarts processing from a new sequence number. If your downstream is not idempotent, you double-process. Plan for this; do not discover it during the page.
  • On-Demand ramp lag. A promo launch multiplies your traffic 10x in 30 seconds. On-Demand mode ramps over roughly 15 minutes per doubling, so the first few minutes throttle. Fix: pre-warm by running synthetic load, or use Provisioned with explicit pre-splits ahead of the launch.
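The hot-key fix from the first bullet can be sketched in a few lines. Kinesis maps the MD5 of the partition key into a 128-bit hash key space; everything else here is an assumption for illustration: `shard_for` pretends the shards split that space into equal ranges, and `salted_key`, the `#` separator, and the bucket count are hypothetical choices.

```python
import hashlib
import zlib

HASH_SPACE = 2 ** 128  # Kinesis hashes partition keys with MD5 into a 128-bit space


def shard_for(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index, assuming equal-width hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * num_shards // HASH_SPACE


def salted_key(key: str, event_id: str, buckets: int) -> str:
    """Append a deterministic mod-N suffix so one hot key spreads over N sub-keys."""
    suffix = zlib.crc32(event_id.encode("utf-8")) % buckets
    return f"{key}#{suffix}"


# Hypothetical traffic: 10,000 events that all belong to one hot user.
events = [f"evt-{i}" for i in range(10_000)]

unsalted_shards = {shard_for("user-42", 8) for _ in events}
salted_shards = {shard_for(salted_key("user-42", e, 16), 8) for e in events}

print("unsalted shards used:", len(unsalted_shards))
print("salted shards used:  ", len(salted_shards))
```

The trade-off is the one the bullet names: once the key is salted, per-key ordering only holds within each sub-key, so the downstream job has to re-aggregate on the original key (here, the part before the `#`).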
The exactly-once question

How each platform actually delivers exactly-once

The phrase is meaningless without naming the mechanism. Here is the mechanism.

Kafka exactly-once: producer-side
  • Idempotent producer (enable.idempotence=true). Each producer gets a producer ID and sequence numbers per partition; the broker dedupes on retry within the same session. Default since 3.0.
  • Transactions wrap multi-partition writes (and consumed offsets) in an atomic commit via the transaction coordinator. A consumer reading with isolation.level=read_committed only sees committed data, never aborted batches.
  • Read-process-write loops (the Kafka Streams pattern) commit input offsets and output records in the same transaction, so a failure between reading and writing is rolled back atomically.
  • Cost: latency from coordinator round-trips, plus the operational weight of running a transaction coordinator. Worth it for stream processing; overkill for fire-and-forget event ingest.
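The dedupe-on-retry idea behind the idempotent producer fits in a toy model. This is a simulation of the concept, not Kafka code: `PartitionLog` is a hypothetical class, and a real broker also rejects out-of-order sequence gaps and tracks producer epochs, which are left out here.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionLog:
    """Toy single-partition log that dedupes on (producer_id, sequence)."""
    records: list = field(default_factory=list)
    last_seq: dict = field(default_factory=dict)  # producer_id -> last sequence appended

    def append(self, producer_id: int, seq: int, value: str) -> bool:
        """Append unless this sequence was already written. True if appended."""
        if seq <= self.last_seq.get(producer_id, -1):
            return False  # duplicate from a retry: acknowledge, do not re-append
        self.records.append(value)
        self.last_seq[producer_id] = seq
        return True


log = PartitionLog()
log.append(producer_id=7, seq=0, value="order-created")
log.append(producer_id=7, seq=1, value="order-paid")

# The ack for seq=1 was lost, so the producer retries the same batch.
retry_appended = log.append(producer_id=7, seq=1, value="order-paid")

print(retry_appended)  # False
print(log.records)     # ['order-created', 'order-paid']
```

This only covers retries within one producer session on one partition. Atomic writes across partitions, and committing consumed offsets together with output, are what transactions add on top.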
Kinesis exactly-once: consumer-side
  • KCL checkpointing stores the last-processed sequence number per shard lease in DynamoDB. On restart, the consumer resumes from the checkpoint, not from where it died.
  • Idempotent sinks are your responsibility. KCL guarantees at-least-once; exactly-once requires the application to dedupe on sequence number or a producer-supplied id.
  • Resharding interaction. When you split or merge shards, the parent shard lease ends and KCL takes leases on the children. The checkpoint boundary is preserved, but the application has to handle the new sequence number space cleanly.
  • Flink as the alternative. If you do not want to manage KCL state, a Flink job with the Kinesis source connector and exactly-once sinks moves the guarantee back into the framework.
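The consumer-side story can be sketched the same way. The code below is a simulation under stated assumptions: `run_consumer` is a hypothetical stand-in for a KCL record processor, the checkpoint is a plain variable rather than a DynamoDB lease row, sequence numbers are small integers rather than Kinesis's long digit strings, and the sink is a dict keyed by sequence number.

```python
def run_consumer(records, checkpoint, sink, delivered, crash_after=None):
    """Process records past `checkpoint` and return the new checkpoint.

    crash_after simulates dying after N records were written to the sink
    but before the checkpoint was saved.
    """
    processed = 0
    last_seq = checkpoint
    for seq, payload in records:
        if checkpoint is not None and seq <= checkpoint:
            continue  # already checkpointed, skip
        delivered.append(seq)          # every delivery, duplicates included
        sink.setdefault(seq, payload)  # idempotent write keyed by sequence number
        last_seq = seq
        processed += 1
        if crash_after is not None and processed == crash_after:
            return checkpoint  # crash: sink was written, checkpoint was not
    return last_seq  # checkpoint only after the whole batch succeeded


shard = [(1, "a"), (2, "b"), (3, "c")]
sink, delivered = {}, []

# First run crashes after two records; the checkpoint never advances.
cp = run_consumer(shard, None, sink, delivered, crash_after=2)
# Restart resumes from the old checkpoint, so records 1 and 2 are redelivered.
cp = run_consumer(shard, cp, sink, delivered)

print(delivered)  # [1, 2, 1, 2, 3]
print(sink)       # {1: 'a', 2: 'b', 3: 'c'}
```

Delivery is at-least-once (five deliveries for three records); the result is exactly-once only because the sink ignores a sequence number it has already seen. Swap the dict for a non-idempotent side effect and the duplicates become visible.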
The middle ground

MSK and Confluent Cloud: when each fits

The third option that wins more interviews than naming Kafka or Kinesis alone.

MSK Provisioned is the AWS-native default once your throughput crosses where Kinesis economics break (roughly 50 MB/sec sustained) and you need the Kafka API. AWS manages broker patching and placement; you keep ownership of topics, partitions, consumer groups, ACLs, MirrorMaker, and Schema Registry. If you already run Connect or Streams, MSK is the lift-and-shift option.

MSK Serverless goes one step further and absorbs broker capacity planning. You pay per partition-hour and per GB ingested. The trade-off is fewer dials: you do not pick instance types, and some advanced configs are not exposed. Good fit for variable workloads where the alternative was over-provisioning MSK Provisioned.

Confluent Cloud is the right answer for multi-cloud shops, for teams that need Confluent-specific features (Stream Designer, fully managed Connect with hundreds of source/sink connectors, Stream Lineage), or for organizations where the Kafka commit log is foundational and the team will not own broker ops. Available on AWS, GCP, and Azure. The cost is higher than MSK at equivalent throughput; the value is portability and ecosystem.

In an interview, naming MSK Serverless and Confluent Cloud as known options signals that you have looked past the two-option framing the question often comes wrapped in. That is a senior cue.

Connecting the dots

How this decision fits the rest of the cluster

Streaming platform choice is upstream of stream processing, downstream of pipeline architecture, and adjacent to cloud platform choice.

Streaming platform choice is foundational for the Kafka and Flink interview prep guides and for the system design framework for data engineers. The Flink stateful streaming interview prep guide covers the most-tested stream processor that works with both Kafka and Kinesis as sources.

For cloud platform decisions, see Glue, Redshift, Kinesis, EMR interview prep (Kinesis is AWS-only; MSK is the AWS-native Kafka), BigQuery and Dataflow interview prep (GCP equivalent: Pub/Sub, with similar managed-queue semantics to Kinesis), and Synapse, Data Factory, Fabric interview prep (Azure equivalent: Event Hubs, which speaks the Kafka protocol and is the rare cloud-native service that does).

Kafka vs Kinesis FAQ

Is Kafka or Kinesis more popular in 2026?
Kafka has more total deployment globally; Kinesis is more common in pure-AWS shops. MSK adoption is growing as AWS-native teams want Kafka API without operational burden. Kinesis Data Streams adoption is flat or slightly declining as MSK closes the gap. Confluent Cloud holds the high end of multi-cloud Kafka.
Which is harder to operate, self-hosted Kafka or Kinesis?
Self-hosted Kafka is significantly harder: broker capacity planning, version upgrades, ZooKeeper or KRaft management, monitoring, security configuration, and MirrorMaker are all yours. Kinesis is managed end-to-end, though shard counts are still your job in Provisioned mode. MSK is in between (managed brokers, you handle topic configuration, partitions, consumer groups, ACLs, and schema registry).
Can I switch from Kinesis to Kafka later?
Yes, but it is a significant migration. APIs differ; producer and consumer code needs rewriting. Connector ecosystem differs. State migration (Kafka offsets vs Kinesis sequence numbers) is non-trivial. Plan three to six months for a serious migration, including dual-writes, cutover, and a backout plan.
Is Kafka Streams or KCL better for stream processing?
Kafka Streams is more mature and more featureful (state stores, interactive queries, KStream/KTable joins). KCL is a thinner library focused on shard lease management. For non-trivial stream processing, most teams reach for Apache Flink (which works with both Kafka and Kinesis as sources) regardless of broker.
Does the Data Engineer interview always ask about Kafka vs Kinesis?
In streaming-heavy loops, yes. In batch-heavy or analytics-engineer loops, less common. Most cloud-native data engineer system design rounds reference one or both as the message broker layer, often within the same question.
Are there other streaming alternatives worth naming?
Yes. Apache Pulsar (BookKeeper-based, less common but tiered storage is interesting). Redpanda (Kafka API-compatible, written in C++, simpler ops, no JVM). Google Pub/Sub (GCP equivalent of Kinesis). Azure Event Hubs (Azure equivalent with Kafka API compatibility). Naming these as known alternatives signals breadth without distracting from the core decision.
What is the cost difference between Kafka and Kinesis at 100 MB/sec?
Roughly: self-hosted Kafka costs around $1.7K/mo in on-demand instances (3x m7g.2xlarge) plus EBS and engineer time. Kinesis Data Streams costs around $3.6K/mo (100 shards plus PUT charges). MSK costs around $3.7K/mo (about $3.0K for 3x m7g.xlarge plus about $700 storage). The crossover depends heavily on your operational expertise; once you price an SRE at $25/hour-equivalent, self-hosted Kafka is only cheaper if you would have hired that headcount anyway.
How important is exactly-once semantics in the interview?
Critical. Both Kafka (transactions) and Kinesis (KCL checkpoints plus idempotent sinks) can deliver exactly-once. Naming the implementation matters; vague mentions of exactly-once without naming how are L4 ceiling signals. Senior interviewers will follow up: which side, what failure mode, what mitigation.
Should I name Confluent Cloud in an interview?
Yes, as the multi-cloud managed-Kafka option. It is more expensive than MSK and Kinesis at most scales, so do not default to it, but mentioning it signals familiarity with the streaming ecosystem beyond AWS-only thinking. Confluent Cloud is the right answer for shops that need Kafka and refuse to be locked to a cloud.
What is the right shard count for X events/sec on Kinesis?
Floor: ceil(write_MB_per_sec / 1) shards by throughput, and ceil(records_per_sec / 1000) by record count, take the max. Then add 30 to 50 percent headroom for hot keys and bursts. Example: 10 MB/sec at 1 KB average = 10K records/sec = 10 shards floor by throughput, 10 shards floor by records, with headroom: 13 to 15 shards. The trap is sizing for average and getting throttled by a hot partition key, not by total throughput.
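The formula above translates directly into a few lines. This is a minimal sketch: `kinesis_shard_range` is a hypothetical helper, the 1 MB/sec and 1000 records/sec limits are the per-shard write ceilings quoted in this guide, and the 30 to 50 percent headroom is the rule of thumb from the answer, not an AWS recommendation.

```python
import math


def kinesis_shard_range(write_mb_per_sec: float, records_per_sec: float,
                        headroom_pct: tuple = (30, 50)) -> tuple:
    """Return (floor, low, high) shard counts for a write workload."""
    by_throughput = math.ceil(write_mb_per_sec / 1)   # 1 MB/sec per shard
    by_records = math.ceil(records_per_sec / 1000)    # 1000 records/sec per shard
    floor = max(by_throughput, by_records)
    low = math.ceil(floor * (100 + headroom_pct[0]) / 100)
    high = math.ceil(floor * (100 + headroom_pct[1]) / 100)
    return floor, low, high


# The worked example: 10 MB/sec at 1 KB average is 10K records/sec.
print(kinesis_shard_range(10, 10_000))  # (10, 13, 15)

# Small records flip the binding constraint: 2 MB/sec but 5K records/sec.
print(kinesis_shard_range(2, 5_000))    # (5, 7, 8)
```

The second call shows why you take the max of the two floors: with small payloads the record-count ceiling binds long before the byte ceiling does. Neither number protects you from a hot partition key.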
