Which streaming engine should a data engineer pick: Flink, Spark Structured Streaming, or Kafka Streams?

Flink for stateful streaming at scale with exactly-once and event-time windowing as primary requirements (Netflix, Uber, Stripe). Spark Structured Streaming for unified batch and streaming with the familiar Spark API and Spark-team operational expertise (Databricks, generic Spark shops). Kafka Streams for in-Kafka transformations with lowest operational overhead and Kafka-only data (logging, metrics enrichment). Beam on Dataflow if you are on GCP.

How does a streaming pipeline achieve exactly-once semantics?

Through at-least-once delivery (Kafka with replication, source connector with snapshot recovery) plus idempotent processing (dedup on composite key, MERGE INTO with run_id, transactional writes). Pure exactly-once at the message level is impossible without producer-broker-consumer coordination; what matters is exactly-once effect. Flink + Kafka transactional commits provide end-to-end exactly-once within Flink-managed sinks. Spark Structured Streaming provides exactly-once with checkpoint-based fault tolerance and idempotent sinks (Delta, Iceberg).

What is the difference between event-time and processing-time?

Event-time is when the event happened at the source (user clicked at 14:23:05). Processing-time is when the streaming engine sees the event (14:25:30 due to network delay or retry). Event-time windows produce correct results for out-of-order events but require watermarks to bound waiting. Processing-time windows are simpler but produce wrong results when events arrive late. Most data engineer interview rounds expect event-time windows.

What is a watermark in streaming?

A watermark is the streaming engine's estimate of event-time progress: an assertion that no more events with event_time earlier than the watermark are expected. Events that arrive behind it anyway are treated as late. Flink and Spark Structured Streaming both have explicit watermark configuration. Typical setting: watermark = maximum observed event time minus 5 minutes for most workloads. Setting it too aggressively drops late events; setting it too conservatively delays results and grows state size. The watermark configuration is a senior data engineer system design rubric item.
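As a rough illustration of the "maximum event time minus a fixed delay" rule, here is a minimal Python sketch. The class name `BoundedOutOfOrdernessWatermark` and the sample timestamps are hypothetical. This is not Flink's or Spark's actual implementation, only the bookkeeping idea behind it.

```python
from dataclasses import dataclass


@dataclass
class BoundedOutOfOrdernessWatermark:
    """Hypothetical helper: watermark = max event time seen - max_delay (seconds)."""
    max_delay: float
    max_event_time: float = float("-inf")

    @property
    def watermark(self) -> float:
        return self.max_event_time - self.max_delay

    def observe(self, event_time: float) -> bool:
        """Record an event; return True if it arrived behind the current watermark."""
        late = event_time < self.watermark
        self.max_event_time = max(self.max_event_time, event_time)
        return late


wm = BoundedOutOfOrdernessWatermark(max_delay=300)  # 5-minute delay
arrivals = [1000, 1200, 900, 1600, 1250]  # event times in arrival order (made up)
late_flags = [wm.observe(t) for t in arrivals]
print(late_flags, wm.watermark)
```

With a 300-second delay, the out-of-order event at 900 is still accepted, while the event at 1250 is flagged late because an event at 1600 had already pushed the watermark to 1300.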

What is allowed lateness in Flink and Spark Structured Streaming?

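Allowed lateness keeps a window open for updates after the watermark has passed the window's end. Events arriving after the watermark passed but within allowed lateness update past results; events arriving after allowed lateness are dropped or sent to a side-output. Flink exposes this as a separate allowedLateness setting on windowed streams; Spark Structured Streaming has no separate setting, so the withWatermark delay threshold covers both roles. Useful for handling late data from phones that were offline without growing state indefinitely. Configure based on the longest legitimate lateness observed in production.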
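The three outcomes described above can be sketched as a small decision function. The names `window_end_for` and `classify`, and the 10-minute window with 2 minutes of allowed lateness, are assumptions chosen for illustration. Real engines differ in boundary details.

```python
def window_end_for(event_time: int, size: int) -> int:
    """End (exclusive) of the tumbling event-time window containing event_time."""
    return (event_time // size + 1) * size


def classify(event_time: int, watermark: int, size: int, allowed_lateness: int) -> str:
    """Decide what happens to an event given the watermark at its arrival."""
    end = window_end_for(event_time, size)
    if watermark < end:
        return "on_time"        # window has not fired yet
    if watermark < end + allowed_lateness:
        return "late_update"    # window fired, but state is kept; result is revised
    return "side_output"        # state already discarded; drop or divert the event


SIZE, LATENESS = 600, 120  # assumed: 10-minute windows, 2 minutes allowed lateness
results = [classify(590, wm, SIZE, LATENESS) for wm in (500, 650, 720, 800)]
print(results)
```

The same event (event time 590, window ending at 600) gets a different outcome depending only on how far the watermark has advanced when it arrives.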

What is the difference between Kafka and Kinesis from a data engineer interview perspective?

Functionally similar: both are distributed, partitioned, durable, append-only logs. Kafka is the open-source default with the largest ecosystem (Kafka Connect, Schema Registry, ksqlDB, Confluent). Kinesis is AWS-managed with simpler ops and tighter AWS integration (Firehose, Kinesis Data Analytics, IAM). Kafka typically offers lower latency, higher throughput per partition, and more flexibility, because partition throughput depends on broker hardware and configuration, whereas a Kinesis shard has a fixed write limit of 1MB/s. Kinesis has lower operational overhead. Choose based on the company's existing stack.

How does a data engineer size a streaming pipeline?

Start with throughput: average and peak events per second, and the size of each event in bytes. Convert to bytes per second. Divide by per-shard throughput (1MB/s on Kinesis; configurable on Kafka, where 10MB/s per partition is a typical planning figure with safe headroom). Add 2-3x headroom for peak and rebalances. Sample math: 10B events/day = 116k events/sec average; assuming a 5x peak-to-average ratio, that is 580k events/sec at peak; with 1KB events, 580MB/s at peak, which comes to 58 Kafka partitions or 580 Kinesis shards before the 2-3x headroom is applied.
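The sample math can be reproduced with a few lines of Python. The function name `units_needed` is hypothetical, and the 5x peak factor is the assumption implied by the 116k-to-580k figures. Exact arithmetic gives 579 Kinesis shards rather than 580, because the prose rounds the peak rate up to 580k events/sec.

```python
import math

SECONDS_PER_DAY = 86_400
MB = 1_000_000


def units_needed(events_per_day: int, peak_factor: float, event_bytes: int,
                 unit_bytes_per_sec: int, headroom: float = 1.0) -> int:
    """Partitions or shards needed to absorb peak traffic, with optional headroom."""
    avg_events_per_sec = events_per_day / SECONDS_PER_DAY
    peak_bytes_per_sec = avg_events_per_sec * peak_factor * event_bytes
    return math.ceil(peak_bytes_per_sec * headroom / unit_bytes_per_sec)


kafka = units_needed(10_000_000_000, 5, 1_000, 10 * MB)      # 10MB/s per partition
kinesis = units_needed(10_000_000_000, 5, 1_000, 1 * MB)     # 1MB/s per shard
kafka_2x = units_needed(10_000_000_000, 5, 1_000, 10 * MB, headroom=2.0)
print(kafka, kinesis, kafka_2x)
```

Applying 2x headroom roughly doubles the Kafka partition count, which is the number an interviewer would usually expect to see provisioned.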

Streaming System Design Questions: Kafka, Flink, Spark

Streaming System Design Interview Questions

Streaming pipeline design problems for data engineer interview prep.

Streaming system design interview questions for data engineer roles. Kafka + Flink + Spark Structured Streaming architectures. Exactly-once via at-least-once plus idempotency. Watermarks and allowed lateness. Event-time versus processing-time. Stateful streaming with RocksDB. The patterns that compose 80 percent of streaming data engineer interviews in 2026.

Streaming system design rounds for data engineer roles test six recurring concerns. Choosing the right streaming engine: Kafka Streams for in-Kafka transformations and lightweight processing; Flink for stateful streaming at scale with exactly-once and event-time windowing; Spark Structured Streaming for unified batch and streaming with the familiar Spark API; Beam on Dataflow for GCP-native; Kinesis Data Analytics for AWS-native. Tradeoffs: Flink has the strongest stateful and event-time story; Spark has the strongest batch-integration story; Kafka Streams has the lowest operational overhead but is limited to Kafka-only data.

Exactly-once semantics in streaming pipelines. The hard truth is that exactly-once at the message-delivery level is impossible without coordination across producer, broker, and consumer; what the business cares about is exactly-once effect, which is achievable with at-least-once delivery plus idempotent processing. Kafka offers transactional writes, read by consumers configured with isolation.level=read_committed, for end-to-end exactly-once within the Kafka ecosystem. Flink provides exactly-once with checkpoint snapshots and two-phase commit to transactional sinks. Spark Structured Streaming provides exactly-once with checkpoint-based fault tolerance and idempotent sinks (Delta, Iceberg with MERGE).
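A minimal sketch of "at-least-once delivery plus idempotent processing", assuming each event carries a composite key of `(source, event_id)`. The class `IdempotentSink` and the field names are hypothetical, and the in-memory set stands in for what would be a transactional store or MERGE target in a real pipeline.

```python
class IdempotentSink:
    """Hypothetical sink that applies each (source, event_id) at most once."""

    def __init__(self) -> None:
        self.totals: dict[str, int] = {}
        self.seen: set[tuple[str, str]] = set()

    def write(self, event: dict) -> bool:
        key = (event["source"], event["event_id"])
        if key in self.seen:
            return False  # redelivered duplicate: ignore it
        # In a real system the dedup record and the update commit atomically.
        self.seen.add(key)
        account = event["account"]
        self.totals[account] = self.totals.get(account, 0) + event["amount"]
        return True


deliveries = [
    {"source": "pos", "event_id": "e1", "account": "a", "amount": 10},
    {"source": "pos", "event_id": "e2", "account": "a", "amount": 5},
    {"source": "pos", "event_id": "e2", "account": "a", "amount": 5},  # retry after a failure
    {"source": "web", "event_id": "e2", "account": "b", "amount": 7},  # same id, other source
]
sink = IdempotentSink()
applied = [sink.write(e) for e in deliveries]
print(applied, sink.totals)
```

The retried event is delivered twice but counted once, so the totals match what exactly-once delivery would have produced.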

Event-time versus processing-time. Event-time is when the event happened at the source (the user clicked at 14:23:05). Processing-time is when the streaming engine sees the event (14:25:30 due to network latency, retries, batching). Streaming windows can be defined in either dimension. Event-time windows handle out-of-order and late-arriving events correctly but require watermarks to bound how long to wait. Processing-time windows are simpler but produce wrong results when events arrive late. Most data engineer interview rounds expect event-time windows with watermarks; processing-time is acceptable for monitoring and ops dashboards where staleness is tolerated.
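To make the difference concrete, the sketch below counts the same four events in one-minute tumbling windows twice: once keyed by event time, once by processing time. The timestamps are made-up seconds, and the third event is deliberately delayed by more than a minute.

```python
from collections import Counter

WINDOW = 60  # one-minute tumbling windows, in seconds


def window_start(ts: int, size: int = WINDOW) -> int:
    """Start of the tumbling window that contains timestamp ts."""
    return ts - ts % size


# (event_time, processing_time) pairs; the third event was delayed in transit
events = [(5, 7), (50, 55), (58, 125), (70, 72)]

by_event_time = Counter(window_start(event_ts) for event_ts, _ in events)
by_processing_time = Counter(window_start(proc_ts) for _, proc_ts in events)
print(dict(by_event_time))
print(dict(by_processing_time))
```

Event-time windowing attributes three events to the first minute, which is when they happened. Processing-time windowing reports only two there and moves the delayed event into a later window.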

Watermarks and allowed lateness. A watermark is the streaming engine's estimate of event-time progress: an assertion that no more events with event_time earlier than the watermark are expected. Flink and Spark Structured Streaming both have explicit watermark configuration. Setting the watermark too aggressively causes late events to be dropped (or sent to a side-output); setting it too conservatively delays results and increases state size. Typical configuration: watermark = maximum observed event time minus 5 minutes for most workloads, minus 1 hour for higher-latency sources. Allowed lateness keeps a window open for updates after the watermark has passed its end: events arriving within allowed lateness update past results.

Stateful streaming with RocksDB. Flink and Spark Structured Streaming both support RocksDB as an embedded state backend for large state (millions to billions of keys per executor). State is checkpointed to durable storage (S3, GCS) for fault tolerance. Sessionization, deduplication, joins, and aggregations all build state. State size grows with the cardinality of the partitioning key and with how long the watermark configuration keeps entries alive; oversized state causes executor OOM and checkpoint timeouts. Senior data engineer system design rounds test whether the candidate sizes state correctly and discusses RocksDB tuning (block cache size, compaction).
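A back-of-envelope way to reason about state size for a deduplication operator is keys per second times retention time times bytes per entry. The function `dedup_state_bytes` and every number below (100k unique keys/sec, 100 bytes per entry, parallelism of 32) are assumptions for illustration, not measurements; real RocksDB footprints also depend on serialization, compression, and compaction.

```python
GB = 1_000_000_000


def dedup_state_bytes(unique_keys_per_sec: int, retention_sec: int,
                      bytes_per_entry: int) -> int:
    """Rough state size: every key seen within the retention period is kept."""
    return unique_keys_per_sec * retention_sec * bytes_per_entry


# Assumed workload: 100k unique keys/sec, ~100 bytes per stored entry
five_minute_watermark = dedup_state_bytes(100_000, 300, 100)
one_hour_watermark = dedup_state_bytes(100_000, 3_600, 100)
per_task = one_hour_watermark / 32  # assumed parallelism of 32, keys evenly spread

print(five_minute_watermark / GB, one_hour_watermark / GB, per_task / GB)
```

Under these assumptions, moving from a 5-minute to a 1-hour watermark grows the retained state twelvefold, which is why the watermark setting and the state budget are discussed together.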

Companies whose data engineer interviews emphasize streaming heavily: Netflix (Mantis platform, Flink for ops monitoring, Spark Structured Streaming for analytics), Uber (Flink for ride dispatching analytics, Kafka Streams for some flows), Stripe (Kafka with idempotent consumers for financial-data exactly-once), Meta (internal stream-processing tools, late-arriving conversion handling), Pinterest (Kafka and Flink for real-time recommendations).

Data pipeline interview questions - The pipeline pillar: end-to-end design plus operational rounds, backfills, retries, schema drift.
Data engineer system design interview prep - Full design prep including streaming and batch.
Data engineering system design questions - Scenario catalog covering streaming and batch.
Data engineer system design problems - Practice problems with rubric-scored verdicts.
Clickstream pipeline interview questions - 10B-event-per-day streaming ingestion.
CDC pipeline interview questions - Streaming change data capture with Debezium and Kafka.
Kafka system design interview questions - Partition strategy, consumer groups, exactly-once.
Spark data engineer interview problems - Spark Structured Streaming patterns.
PySpark interview questions including streaming - Structured streaming code in PySpark.
Netflix data engineer interview questions - Streaming-heavy data engineer interviews at Netflix.

How does stateful streaming work with RocksDB?

Flink and Spark Structured Streaming both support RocksDB as an embedded state backend for large state (millions to billions of keys per executor). State is checkpointed to durable storage (S3, GCS) for fault tolerance. Sessionization, deduplication, joins, and aggregations all build state. State size grows with the partitioning-key cardinality and with how long the watermark configuration keeps entries alive; oversized state causes executor OOM and checkpoint timeouts.

144 practice problems matching this filter. Difficulty: medium (71), hard (72), easy (1).

Pipeline Architecture (144)

45 Minutes Turned Into 3.5 Hours - medium - Spark jobs are running. Just not fast enough.
600 Million Events a Day - hard - 600 million events a day. Two years of retention.
A Clean Number for Every Merchant - hard - Raw payment logs in. Clean merchant summaries out.
A Million Cars Phoning Home - hard - Every vehicle is a sensor. Deploy the pipeline to catch it all.
A Million Moving Dots - medium
Analysts Are Slowing the Store Down - medium - Orders placed. Data warehouse hungry.
A New Column on a Billion Rows - hard - Add and backfill a new column to a billion-row production table with zero downtime.
A Shared Drive Full of Contracts - medium - Buried in PDFs. The data is in there somewhere.
A Stream All Day and a File at Midnight - hard - Real-time and batch. Same pipeline. No compromises.
Badging Items That Already Sold Out - hard - Same-day delivery. The features have to be faster.
Basel, CCAR, and Monday Morning - medium - The regulator does not accept 'eventually consistent.'
Before the Batch Is Lost - hard
Bikes Before Rush Hour - hard - Bikes in, bikes out. The city needs to predict demand.
Credit for Every Touch - medium - They saw the ad, clicked the email, then bought. Who gets credit?
Disappearing Ink - easy
Everything Lands, Then It Ships - medium
Doubling Every Six Months - hard - Tuesdays are quiet. Black Friday is not.
Eight-Hour-Old Positions - hard - Positions shift by the second. The math cannot lag.
Eight Teams, Eight Latencies - medium - Millions of gamers. The architecture decision changes everything.
End of Day Is Too Late - medium - Every swipe tells a story.
Equities, ETFs, and the SEC - hard - Fractional shares, multi-currency, point-in-time. All of it.
Event System for Multiple Consumers - hard - One event, many hungry consumers.
Every Dataset Needs a Paper Trail - hard - The FDA has opinions about your data pipeline.
Every Deal Is a Financial Transaction - hard - Real money on the table. Reconstruct every hand.
Every Device, Every Impression - hard - Every ad seen. Every second watched. Real-time.
Every Device Has Its Own Dialect - medium - Three sources. Three formats. Same workout.
Every Firm Formats It Differently - medium - The regulator changed the format. Again. Handle it.
Every Format Imaginable - hard - PDFs, HL7, JSON. All of it lands in the same lake.
Everyone Wants the Same Data, Differently - hard - How you store it decides how fast you can read it.
Every Region Exports Its Own Way - medium - Sales data, BigQuery, Dataflow. Make it all sing.
Every Scan, Every Parcel, Every Pin Code - medium - Out for delivery. Delivered. Except the events arrived backwards.
Every Version of You - medium
Fifty Thousand Retailers - medium - Retail data at CPG scale. Every SKU, every store.
Five Times the Traffic, Five Times the Bill - hard - Scale up when needed. Do not bankrupt the team.
Five Years of Cron Jobs - hard - Half the jobs run on cron. Half run on events. All of it has to move.
Flying Blind Until Midnight - hard - Intraday risk, full lineage. The regulator is watching.
Four Teams, One Topic, No Agreement - hard - Everybody is writing to it. Nobody documented it. Now production is fragile.
Fresh and Forever - medium
Greenfield Build for Six Sources - hard - Infrastructure as code. Meaning as a service.
Half a Million Rental Cars - medium - Every vehicle is reporting. Every rental matters.
The Identity Problem - hard - Old systems. New demands. The same customer appears under three different names.
Listens From Everywhere, Counted Once - hard - Phones, tablets, laptops. And some of them report late.
Live Viewers, Live Billing - hard - The stream is live. The data cannot wait.
Mark to Market - medium
Near-Real-Time Trending Dishes Dashboard - hard - The dish rankings update faster than the kitchen.
Nested Docs, Flat Reports - medium - Two databases. One direction. No data left behind.
Nightly Exports Are Too Slow - medium - Healthcare claims change constantly. The warehouse cannot fall behind.
4,500 Stores Before Sunrise - medium - The shelves open at 7. The data better be there.
Not Every Team Can See Every Row - hard - Everyone can see the bucket. Not everyone should.
One Bill Across Three Clouds - medium - AWS, Azure, GCP. Three bills. One truth.
One Earthquake, Ten Thousand Tweets - hard - The firehose is on. Separate signal from noise.
Out of the Data Center - medium - The on-prem servers are not getting any younger.
The Speed Layer - medium - Dashboards can't wait for raw logs. Something has to happen upstream.
Prove the Number Is Right - hard - Bad data in fintech is not just messy. It is expensive.
Real Data, Fake Patients - hard - Dev needs production data. HIPAA says absolutely not.
The Register Never Sleeps - medium - Every swipe lands in the warehouse. The table has to stay current without breaking.
Recommendations Now, Royalties Later - medium - The catalog updated. Did anyone notice?
Replicate It Without Breaking It - hard - The source changed. The lake needs to know immediately.
Risk Models on Week-Old Data - medium - Loan approved. Loan denied. Every decision is an event.
SaaS API Connector with Incremental Sync - medium - The API has rate limits. You have deadlines.
Same-Day Sales, Every Store - medium - The cash register data needs to be queryable by morning.
The Living Table - medium - Data lands continuously. History must survive every update.
Score It Before It Clears - hard - The fraudsters move fast. Your pipeline has to move faster.
Seconds and Months - medium
Seconds to Trend - medium
Ship Before Fraud Finishes Checking - hard - The claim looks clean. The fraud model disagrees.
Six Hours to Miss a Deadline - medium - The rebuild works. It just doesn't finish in time.
Six Hours to Refresh Every Number - medium - Ratings change. The incremental model has to keep pace.
Six Million Rows Before the Market Opens - medium - One massive CSV. Millions of timestamps.
Six Sources, One Platform - medium - ADF orchestrates. Unity Catalog governs. Nothing leaks.
Sixty Minutes, Every Hour - medium - Every hour, on the hour. No excuses.
Someone Else's Server - hard
Stores and the Site, Together - hard - The registers never stop ringing.
Store, Site, and Distributor - medium - Sales data is piling up. Someone has to make sense of it.
The Acquisition Still Taking Bookings - hard - Two systems, two schemas. One truth.
The Agency That Changes the Columns - medium - The schema changed overnight. Again.
The Analysts Cannot Touch Production - medium - Production is the source. Analytics needs its own copy.
The Analyst Who Saw the Salary Data - hard - Two incidents. One shared lake. The access model was never designed, just assumed.
The API Drip Feed - medium - The API gives you 100 records at a time. You need millions.
The Bad Row That Broke the Dashboard - medium - Bad records cannot reach the warehouse.
The Binding and the Claim - medium - Policies are instant. Claims take their time.
The Booking That Came Three Ways - hard - PMS, OTA, and website all think they took the reservation first.
The Boutique That Sold in Six Currencies - hard - Every sale is real. The rate it was converted at depends on who is asking.
The Bucket Full of Resumes - medium - A thousand resumes. Structured data inside each one.
The Carrier Moving to Azure - medium - Claims arrive messy. The medallion cleans them up.
The Claim That Picks Its Own Lane - medium - Three entry points. Different workflows. All must route correctly.
The Clicks We Throw Away - hard - Every tap, swipe, and scroll. At scale.
The Clock That Runs Two Ways - hard - Nightly batch and live events. One dashboard.
The Consent Stitcher - medium - Consent was given. Or was it? Stitch the records together.
The Dashboard and the Attribution Model - hard - Streaming and batch. One pipeline to rule them.
The Decision Before the Door Closes - hard - The window to stop it is smaller than you think.
The Distributor Filing Problem - medium - Hundreds of suppliers. One warehouse. One deadline.
The Early Warning - medium
The Event Pile - hard - 600 million clicks a day. The budget is not infinite.
The Fare Aggregator - medium - Airfares shift every minute. Catch the best ones.
The Firehose and the Ledger - hard
The Fleet That Never Stops - hard - Every truck is talking. Not everyone can hear them yet.
The Leaderboard That Costs $25K a Month - hard - Product wants it live. Engineering has a price tag.
The Ledger and the Live Wire - hard
The Meal Kit That Knows You - medium - What they ordered says a lot about what they want next.
The Metric That Moved - medium
The Migration That Cannot Break Morning - hard - It all works today. Moving it without losing a single report is the hard part.
The Models Going Stale - hard - The model is only as good as what you feed it.
The Morning File - medium
The Next Track - medium
The Panel and the Set-Top Boxes - hard - Set-top boxes tell you who watched. Projection tells you how many.
The Patients We Cannot Move - hard - Patient data stays local. Insights have to be global.
The Points Arrive Two Days Late - medium - The bank data shows up late. The rewards were already sent.
The Provider That Sometimes Sleeps - medium - The models run at dawn. The data has to be there first.
The Query That Used to Be Fast - medium - Queries used to be fast. Something changed.
The Queue That Wouldn't Stop Growing - medium - 500,000 messages behind and the number keeps climbing.
The Revenue That Was Wrong for Two Weeks - medium - Nobody caught it until the CFO asked a question. Design the system that catches it first.
The Sale That Needs to Land Now - medium - Three channels feeding one view. Not all of them speak the same language.
The Same Stream Twice - hard
The Signals That Power Recommendations - medium - Fresh signals, many teams, one pipeline.
The Thirty-Second Rule - medium
Counted Once, Remembered Forever - hard
The User Who Asked to Be Forgotten - hard - Users want their data erased. Completely.
The Vendor Who Never Warns You - medium - Every month, something is different. The dashboards have no idea.
The What-If Machine - hard - A million slots. A thousand campaigns. Every combination matters.
The Whiteboard Exercise - medium - Marker in hand. Draw the whole thing.
Thirty Cities, One Forecast - hard - Five cities. Five data formats. One prediction.
Thirty Countries, One Solvency Number - hard - Premiums collected globally. Losses happen locally.
Thirty Million Unique Jobs a Year - hard - One press run, many orders. Group them right.
Thousands of Practices, One Dataset - hard - Patient records in, operational insights out.
Three Providers, One Workout - hard - The same ride, reported three times.
Three Regions, One Finance Team - hard - Payments from everywhere. One consistent report.
Three Regions, One Report - hard - Three regions, billions of payments, one merchant summary by 6 AM.
Towers and Phones, Same Story - hard - Tower signals meet app events. Somewhere in between is the truth.
Traders, Risk, and the Regulators - medium - Markets move in milliseconds. The pipeline has to keep up.
Two Hundred Million Redirects - medium - Billions of clicks. One tiny code. Two very different clocks.
Two Million Boxes by Monday Morning - hard - Shipped, maybe. Delivered, debatable.
Two Sources of Truth - medium
Two Systems, One Room Count - hard - Two booking systems. Rooms do not duplicate themselves.
Two Ways to Catch a Change - medium - Two ways to watch the database. Each has a cost.
Two Years of Every Click - hard - Every click, every aisle, every day for two years.
Two Years of Clicks, Cheap - hard - Two years of clicks. Every query has to be affordable.
What Everyone Is Watching - hard - Someone is watching. Capture everything.
What Should We Recommend Tonight - hard - They ordered pad thai twice. That means something.
Where Is Every Truck, Right Now - medium - Trucks are moving. Every ping counts.
Where the Crowd Goes - medium
Which Promotion Is Actually Working - hard - Was the promotion worth it? The data knows.
Who Is Churning and Why - medium - Subscribers churn. The pipeline cannot.
Who Saw the Ad Twice - hard - TV and digital. Same viewer, two measurement worlds.