Flink Interview Questions
Apache Flink interview questions for data engineer roles at streaming-heavy companies (Netflix, Uber, Lyft, Pinterest, Twitter/X, AWS managed Flink customers). 35+ questions covering Flink architecture (TaskManagers, JobManagers, slot-based parallelism), stateful streaming (RocksDB state backend, checkpointing, savepoints), exactly-once semantics (two-phase commit, transactional sinks), event-time processing (watermarks, allowed lateness, side outputs), and Flink SQL. Pair with the data engineer interview prep guide and the streaming data engineer interview guide.
Flink Topics in Streaming Interviews
Topics with a rough indication of how often they appear in streaming data engineer loops.
| Topic | Frequency | Depth Expected |
|---|---|---|
| Exactly-once via two-phase commit | Very common | How it differs from at-least-once + idempotent |
| State backend choice (heap vs RocksDB) | Common | When each is right; state size implications |
| Checkpointing and savepoints | Very common | Frequency tuning, incremental checkpoints, recovery |
| Event-time and watermarks | Very common | Watermark generation strategies, allowed lateness |
| Window types (tumbling, sliding, session) | Common | When to pick each, window assigners |
| Keyed state vs operator state | Common | ValueState, ListState, MapState, broadcast state |
| Backpressure handling | Common | Detection via metrics, mitigation strategies |
| Job parallelism and slot allocation | Common | Sizing TaskManagers and slots |
| Flink SQL (Table API) | Occasional | When to use vs DataStream API |
| Side outputs for late data | Occasional | Routing late events to dead-letter |
| Async I/O for external lookups | Occasional | Pattern for enriching from external services |
| Connectors (Kafka, Kinesis, Iceberg, JDBC) | Common | Connector-level exactly-once semantics |
| Schema evolution in stateful jobs | Occasional | POJO migration, Avro evolution |
| Flink on Kubernetes vs YARN vs standalone | Occasional | Deployment trade-offs |
Exactly-Once: Flink’s Defining Capability
Flink achieves end-to-end exactly-once by tying a two-phase commit on the sink to its checkpoints. The source must provide replayable offsets (for example, Kafka). Processing must produce deterministic output. The sink writes into a transaction that is pre-committed when a checkpoint is taken and committed once that checkpoint completes, so output becomes visible atomically with the source offsets stored in the same checkpoint. On failure, the open transaction is aborted and the job restarts from the last completed checkpoint; replaying the same input produces the same output, which is then committed exactly once.
Most candidates can recite this; fewer can explain the practical implications. True exactly-once has costs: latency increases (output becomes visible only when the transaction commits at checkpoint completion, so end-to-end latency is tied to the checkpoint interval), throughput decreases (transactional commits add overhead), and sink choice is constrained (the sink must support transactions: Kafka, JDBC, Iceberg, and a few others). For sinks without transactions (HTTP services, legacy databases), Flink can only deliver at-least-once, and idempotent consumers are needed to get effectively-once results.
The senior-level signal in the interview is stating exactly-once guarantees together with their constraints. For example, "the Flink-Kafka pipeline is end-to-end exactly-once because the Kafka producer participates in the two-phase commit; the Flink-HTTP sink would degrade to at-least-once with consumer-side idempotency required."
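The two-phase commit flow can be sketched in plain Java without a Flink dependency. This is a toy model under stated assumptions, not Flink's API: `ToyTransactionalSink`, `TwoPhaseCommitDemo`, and the methods `preCommit`, `commit`, and `abort` are hypothetical names that loosely mirror the phases a transactional sink goes through, and the "source" is just a list replayed from the last checkpointed offset.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a transactional sink; all names are hypothetical, not Flink's API.
class ToyTransactionalSink {
    private final List<String> committed = new ArrayList<>(); // visible to downstream readers
    private final List<String> open = new ArrayList<>();      // current uncommitted transaction
    private List<String> preCommitted = null;                 // flushed, awaiting checkpoint completion

    void write(String record) { open.add(record); }

    // Phase 1: at the checkpoint, flush the open transaction and start a new one.
    void preCommit() {
        preCommitted = new ArrayList<>(open);
        open.clear();
    }

    // Phase 2: once the checkpoint is confirmed complete, make the records visible.
    void commit() {
        if (preCommitted != null) {
            committed.addAll(preCommitted);
            preCommitted = null;
        }
    }

    // On failure, discard everything that was not committed.
    void abort() {
        open.clear();
        preCommitted = null;
    }

    List<String> committed() { return committed; }
}

class TwoPhaseCommitDemo {
    // Processes `input` with a checkpoint every `checkpointEvery` records and one
    // simulated crash right after the record at index `failAt` is written (-1 = no crash).
    static List<String> run(List<String> input, int checkpointEvery, int failAt) {
        ToyTransactionalSink sink = new ToyTransactionalSink();
        int checkpointedOffset = 0; // source offset stored in the last completed checkpoint
        boolean failed = false;
        int i = 0;
        while (i < input.size()) {
            sink.write(input.get(i));
            i++;
            if (!failed && i - 1 == failAt) {
                failed = true;
                sink.abort();           // uncommitted output is rolled back
                i = checkpointedOffset; // source rewinds to the checkpointed offset
                continue;
            }
            if (i % checkpointEvery == 0) {
                sink.preCommit();
                checkpointedOffset = i; // offset and transaction belong to the same checkpoint
                sink.commit();
            }
        }
        sink.preCommit(); // final checkpoint at end of input
        sink.commit();
        return sink.committed();
    }

    public static void main(String[] args) {
        List<String> input = List.of("a", "b", "c", "d", "e");
        System.out.println(run(input, 2, 2));
    }
}
```

In this model, record `c` is written once before the crash and once after the replay, but only the second write is ever committed. A sink that made each write visible immediately, with no `abort()`, would show `c` twice, which is the at-least-once behavior described above.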
State Backends: Heap vs RocksDB
Flink stores keyed state in a state backend. There are two production options, with different performance and operational characteristics.
Heap state backend (`HashMapStateBackend`): state lives as objects in the JVM heap. Reads and writes are fast (microseconds) because no serialization is involved. State size is limited by heap size (typically 4-32 GB per TaskManager). Best for small-state workloads where speed matters and total state fits in memory.
RocksDB state backend (`EmbeddedRocksDBStateBackend`): state lives in an embedded RocksDB instance that spills to local disk. Reads and writes are slower (up to milliseconds) because every access is serialized and may hit disk. State size is limited by disk (typically 100s of GB to TB per TaskManager). Required for large-state workloads (sessionization with long TTLs, feature pipelines with many keys). Supports incremental checkpoints, which is critical at scale.
Choosing between them is mostly a state-size question. Roughly: under 1 GB of state per TaskManager, use heap; over 10 GB, use RocksDB; in between, it depends on latency requirements. Strong candidates can estimate state size from the workload (number of keys * size per key * retention) and pick accordingly.
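The estimate is simple arithmetic and can be sketched as follows. The 1 GB and 10 GB thresholds come from the rule of thumb above; everything else is an illustrative assumption: the class and method names are hypothetical, "number of keys" is read as new keys per day, retention is measured in days, and keys are assumed to spread evenly across TaskManagers.

```java
// Hypothetical helper; names and the example workload are illustrative assumptions.
class StateSizeEstimator {
    static final long GB = 1024L * 1024L * 1024L;

    // Total state = new keys per day * bytes held per key * days the state is retained.
    static long totalStateBytes(long newKeysPerDay, long bytesPerKey, long retentionDays) {
        return newKeysPerDay * bytesPerKey * retentionDays;
    }

    // Applies the rule of thumb: under 1 GB per TaskManager -> heap, over 10 GB -> RocksDB.
    static String suggestBackend(long totalBytes, int taskManagers) {
        long perTaskManager = totalBytes / taskManagers; // assumes an even key distribution
        if (perTaskManager < GB) return "heap";
        if (perTaskManager > 10 * GB) return "rocksdb";
        return "depends on latency requirements";
    }

    public static void main(String[] args) {
        // Example workload: 50M new keys/day, 2 KB per key, 7-day retention, 8 TaskManagers.
        long total = totalStateBytes(50_000_000L, 2_048L, 7L);
        System.out.println(total / GB + " GB total -> " + suggestBackend(total, 8));
    }
}
```

For the example workload this comes to roughly 667 GB in total, or about 83 GB per TaskManager, which is well past the 10 GB threshold. A skewed key distribution would make the busiest TaskManager hold more than the average, so the even-spread assumption is the first thing to question in a real sizing exercise.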
Six Real Flink Interview Questions
Implement a sessionization Flink job with 30-min inactivity gap
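    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    // Event, Session, and SessionAccumulator are application types defined elsewhere.
    public class SessionizeFunction extends KeyedProcessFunction<String, Event, Session> {
        private static final long GAP_MS = 30 * 60 * 1000L; // 30-minute inactivity gap
        private ValueState<SessionAccumulator> sessionState;

        @Override
        public void open(Configuration parameters) {
            // One accumulator per key, stored in keyed state.
            sessionState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("session", SessionAccumulator.class));
        }

        @Override
        public void processElement(Event event, Context ctx, Collector<Session> out) throws Exception {
            SessionAccumulator current = sessionState.value();
            if (current == null || event.ts - current.lastEventTs > GAP_MS) {
                // Gap exceeded: emit the previous session (if any) and start a new one.
                if (current != null) {
                    out.collect(current.toSession());
                }
                current = new SessionAccumulator(event);
            } else {
                current.add(event);
            }
            sessionState.update(current);
            // Timers made stale by a newer event are ignored by the check in onTimer,
            // assuming add() advances lastEventTs.
            ctx.timerService().registerEventTimeTimer(event.ts + GAP_MS);
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<Session> out) throws Exception {
            SessionAccumulator current = sessionState.value();
            // Close the session only if no event arrived within the gap.
            if (current != null && current.lastEventTs + GAP_MS <= timestamp) {
                out.collect(current.toSession());
                sessionState.clear();
            }
        }
    }

Design a Flink job that maintains a 24-hour rolling unique user count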
How do watermarks work and what’s allowed lateness?
Design checkpoint configuration for a state-heavy Flink job
Handle a hot key in a Flink keyed stream
Design a multi-region Flink deployment with cross-region failover
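For the watermark and allowed-lateness question in the list above, the core rules can be sketched in plain Java. This is a simplified model rather than Flink code. It assumes a bounded-out-of-orderness watermark (highest timestamp seen, minus a fixed delay, minus 1 ms, which matches how Flink's built-in bounded-out-of-orderness generator computes it), tumbling windows, non-negative timestamps, and a watermark that advances on every event instead of periodically. The class name, method name, and verdict strings are hypothetical.

```java
// Simplified model of event-time lateness; names are hypothetical, not Flink classes.
class WatermarkModel {
    private final long maxOutOfOrdernessMs;
    private final long windowSizeMs;
    private final long allowedLatenessMs;
    private long maxTimestampSeen = Long.MIN_VALUE;
    private long watermark = Long.MIN_VALUE;

    WatermarkModel(long maxOutOfOrdernessMs, long windowSizeMs, long allowedLatenessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
        this.windowSizeMs = windowSizeMs;
        this.allowedLatenessMs = allowedLatenessMs;
    }

    // Classifies one event against the watermark as it stood before the event arrived,
    // then advances the watermark.
    String onEvent(long eventTs) {
        long windowEnd = eventTs - (eventTs % windowSizeMs) + windowSizeMs; // exclusive end
        long windowMaxTs = windowEnd - 1;                                   // last timestamp in the window
        String verdict;
        if (watermark < windowMaxTs) {
            verdict = "ON_TIME";                 // window has not fired yet
        } else if (watermark < windowMaxTs + allowedLatenessMs) {
            verdict = "LATE_BUT_ACCEPTED";       // window already fired, state still kept
        } else {
            verdict = "DROPPED_OR_SIDE_OUTPUT";  // window state already cleaned up
        }
        if (eventTs > maxTimestampSeen) {
            maxTimestampSeen = eventTs;
            watermark = maxTimestampSeen - maxOutOfOrdernessMs - 1;
        }
        return verdict;
    }

    public static void main(String[] args) {
        // 5 s out-of-orderness bound, 10 s tumbling windows, 3 s allowed lateness.
        WatermarkModel model = new WatermarkModel(5_000, 10_000, 3_000);
        long[] events = {1_000, 12_000, 9_000, 16_000, 9_500, 19_000, 9_900};
        for (long ts : events) {
            System.out.println(ts + " -> " + model.onEvent(ts));
        }
    }
}
```

In the example run, the event at 9,000 ms is out of order but still on time because the watermark has not passed the end of its window. The event at 9,500 ms arrives after the window fired but within allowed lateness, so it would trigger a late firing. The event at 9,900 ms arrives after the lateness budget is spent, which is the case a side output for late data is meant to capture.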
Flink vs Spark Structured Streaming vs Kafka Streams
All three are production stream processors. The choice depends on workload, team expertise, and ecosystem.
| Dimension | Flink | Spark Structured Streaming | Kafka Streams |
|---|---|---|---|
| Processing model | True streaming | Micro-batch (configurable) | Streaming |
| State management | Heap or RocksDB, well-documented | HDFS-backed store by default, RocksDB optional (Spark 3.2+) | RocksDB, application-embedded |
| Exactly-once | True end-to-end via 2PC | End-to-end with transactional sinks | End-to-end with Kafka transactions |
| Window semantics | Most flexible, all window types | Tumbling, sliding, session | Tumbling, hopping, sliding, session |
| Connectors | Most extensive | Spark ecosystem | Kafka-native only |
| Operational complexity | Highest | Moderate (Spark expertise transfers) | Lowest (embedded in JVM apps) |
| Best fit | Complex stateful, high throughput | Spark-native teams, mixed batch+stream | Kafka-only ecosystems, embedded |
How Flink Connects to the Rest of the Cluster
Flink is the most-tested stream processor in dedicated streaming data engineer loops; see the streaming data engineer interview guide. The system design framework from the system design round prep guide applies to streaming architectures with Flink as the primary stream processor.
For broader streaming context, see the streaming guide. For the message broker choice, see the Kafka vs Kinesis decision page (Flink works equally well with both). Companies most likely to test deep Flink are covered in the Netflix, Uber, Lyft, and Pinterest data engineer interview guides.
Data engineer interview prep FAQ
Should I learn Flink DataStream API or Table API / SQL?
Is RocksDB knowledge required for Flink interviews?
How do checkpoints differ from savepoints in Flink?
What’s the difference between Flink’s exactly-once and at-least-once with idempotent consumers?
How do I size a Flink TaskManager?
Is Flink production-ready on Kubernetes?
How does Flink compare to AWS Kinesis Data Analytics?
Are Flink certifications useful?