Spark interview problems across all data engineer-relevant Spark surfaces. PySpark coding (DataFrame + Window + UDF). Spark SQL (MERGE, broadcast hints, partition pruning). Structured Streaming (watermarks, allowed lateness, checkpoint). Spark UI reading (skew, spill, GC). Delta and Iceberg MERGE patterns. The complete Spark surface for data engineer interview prep.

Spark data engineer interview problems span 5 surfaces: PySpark DataFrame coding, Spark SQL, Structured Streaming, Spark UI reading, and optimization. A Spark-first data engineer interview at Databricks, Netflix, Uber, Airbnb, DoorDash, or Spotify samples from all 5 across the 5-6 round loop, with the dedicated PySpark coding round (45-60 minutes) as the focus and supporting questions in SQL and system design rounds.

PySpark DataFrame coding: write a join between an 800M-row events table and a 2M-row users table. Decide broadcast versus sort-merge. Handle skew on user_id. Window function for top-N per user. SCD Type 2 merge with Delta. Convert SQL to DataFrame and back without thinking.
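A minimal PySpark sketch of the join-plus-top-N shape described above. It assumes two hypothetical DataFrames: `events` (columns `user_id`, `event_ts`) and `users` (column `user_id`, small enough to broadcast). The function and column names are illustrative assumptions, not taken from any specific interview.

```python
def top_n_events_per_user(events, users, n=3):
    """Return the n most recent events per user after a broadcast join.

    events: large DataFrame with at least user_id and event_ts columns.
    users:  small DataFrame with at least a user_id column.
    """
    # Imported lazily so this module still loads where pyspark is not installed.
    from pyspark.sql import Window, functions as F

    # Broadcast the small side so the large events table is not shuffled for the join.
    joined = events.join(F.broadcast(users), on="user_id", how="inner")

    # Number each user's events from newest to oldest, then keep the first n.
    newest_first = Window.partitionBy("user_id").orderBy(F.col("event_ts").desc())
    return (
        joined
        .withColumn("rn", F.row_number().over(newest_first))
        .filter(F.col("rn") <= n)
        .drop("rn")
    )
```

An explicit broadcast hint bypasses spark.sql.autoBroadcastJoinThreshold, so it is worth estimating the size of the users table first. A 2M-row table may or may not fit comfortably in executor memory depending on row width.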

Spark SQL: MERGE INTO on Delta or Iceberg for upsert. Partition pruning with proper WHERE clauses. Broadcast hints. QUALIFY-equivalent filtering via an outer-query filter on a window-function column. Reading EXPLAIN output to verify the physical plan. The patterns translate directly from the Postgres SQL practice catalog, with Spark-specific syntax tagged.
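The MERGE INTO upsert shape can be sketched as a small Python helper that builds the statement text. The helper name, table names, and column names are hypothetical. The helper only assembles a string and assumes the identifiers come from trusted code, not user input.

```python
def build_merge_sql(target, source, key_cols, value_cols):
    """Build a MERGE INTO upsert: update matched rows, insert unmatched ones."""
    # Join condition on the business key(s).
    on_clause = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    # Columns overwritten when the key already exists in the target.
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in value_cols)
    all_cols = list(key_cols) + list(value_cols)
    insert_cols = ", ".join(all_cols)
    insert_vals = ", ".join(f"s.{c}" for c in all_cols)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON {on_clause}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )


# Hypothetical usage; with a live session the text would be passed to spark.sql(stmt).
stmt = build_merge_sql("users_dim", "users_staging", ["user_id"], ["email", "country"])
print(stmt)
```

Listing columns explicitly keeps the statement readable in an interview and makes it obvious which columns an update touches.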

Structured Streaming: read from Kafka, dedup on composite key, apply windowed aggregations with watermark, write to Delta in append mode or upsert via MERGE inside foreachBatch. Trigger configuration (processingTime for micro-batch, continuous for experimental millisecond-latency processing). Checkpoint location for fault tolerance. Watermark delay threshold (Spark's equivalent of allowed lateness) for late-arriving events. End-to-end exactly-once via a replayable Kafka source, checkpointed offsets, and an idempotent Delta sink.
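A hedged PySpark sketch of the Kafka-to-Delta path with a watermark, a windowed aggregation, a checkpoint, and a processing-time trigger. The JSON payload schema, the 10-minute watermark, the 5-minute window, and every parameter name are assumptions chosen for illustration. Running it also assumes the Kafka source and Delta Lake packages are available on the cluster.

```python
def start_windowed_counts(spark, bootstrap_servers, topic, output_path, checkpoint_path):
    """Start a streaming query: Kafka -> 5-minute counts per user -> Delta (append)."""
    # Imported lazily so this module still loads where pyspark is not installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    # Hypothetical JSON payload carried in the Kafka message value.
    schema = StructType([
        StructField("event_id", StringType()),
        StructField("user_id", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .load()
    )

    # Kafka delivers bytes; cast to string and parse the JSON into columns.
    events = raw.select(
        F.from_json(F.col("value").cast("string"), schema).alias("e")
    ).select("e.*")

    # The watermark bounds state: events more than 10 minutes behind the
    # latest observed event time may be dropped.
    counts = (
        events
        .withWatermark("event_ts", "10 minutes")
        .groupBy(F.window("event_ts", "5 minutes"), "user_id")
        .count()
    )

    # Append mode emits a window only after the watermark passes its end.
    return (
        counts.writeStream.format("delta")
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)
        .trigger(processingTime="1 minute")
        .start(output_path)
    )
```

The checkpoint location is what lets a restarted query resume from the last committed Kafka offsets instead of reprocessing or skipping data.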

Spark UI reading: present a screenshot, identify the cause and propose the fix. Summary Metrics row anomalies: max task duration around 10x the median points to skew, spill greater than 0 points to memory pressure, GC time above 10 percent of task time points to GC pressure. The Tasks table sorted descending by duration shows the culprit partition. Stage timing distribution. This is the senior-versus-mid signal.
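The Summary Metrics rules of thumb above can be written down as a small pure-Python check. The function name and parameters are hypothetical. The thresholds (10x median, any spill, 10 percent GC) are the heuristics stated in the text, not hard limits from Spark itself.

```python
def diagnose_stage(median_task_s, max_task_s, spill_bytes, gc_time_s, task_time_s):
    """Map Summary Metrics values for one stage to likely causes."""
    findings = []
    # One task running roughly 10x longer than the median suggests a skewed partition.
    if median_task_s > 0 and max_task_s >= 10 * median_task_s:
        findings.append("skew")
    # Any spill means some task ran out of execution memory and wrote to disk.
    if spill_bytes > 0:
        findings.append("memory pressure")
    # GC taking more than 10 percent of task time suggests heap churn.
    if task_time_s > 0 and gc_time_s / task_time_s > 0.10:
        findings.append("GC pressure")
    return findings


# Made-up numbers: median 2 s, max 45 s, 3 GB spilled, 30 s of GC in 120 s of task time.
print(diagnose_stage(2.0, 45.0, 3 * 1024**3, 30.0, 120.0))
```

In an interview the same checks are done by eye; writing them as code mainly helps fix the order in which to read the row.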

Optimization: skew handling with salt-and-rebalance, AQE override scenarios, partition strategy tuning, broadcast threshold adjustment. Predicate pushdown verification. Avoiding collect() and other driver-pulling actions. The L5+ optimization round expects EXPLAIN-driven and UI-driven diagnosis.
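Salting can be sketched in PySpark as follows, assuming hypothetical `events` (large, skewed on `user_id`) and `users` (smaller) DataFrames and an illustrative default of 16 salt values. This is one common way to write it, not the only one. With AQE enabled, Spark's own skew-join handling (spark.sql.adaptive.skewJoin.enabled) may make manual salting unnecessary.

```python
def salted_join(events, users, num_salts=16):
    """Inner-join events to users on user_id, spreading hot keys over num_salts buckets."""
    # Imported lazily so this module still loads where pyspark is not installed.
    from pyspark.sql import functions as F

    # Large side: attach a random salt in [0, num_salts) so rows for one hot
    # user_id land in many shuffle partitions instead of one.
    salted_events = events.withColumn("salt", (F.rand() * num_salts).cast("int"))

    # Small side: replicate each row once per salt value so every
    # (user_id, salt) pair on the large side still finds its match.
    all_salts = F.array(*[F.lit(i) for i in range(num_salts)])
    salted_users = users.withColumn("salt", F.explode(all_salts))

    # Join on the composite key, then drop the helper column.
    return (
        salted_events
        .join(salted_users, on=["user_id", "salt"], how="inner")
        .drop("salt")
    )
```

The trade-off is that the smaller side grows by a factor of num_salts, so the salt count is usually kept as low as the skew allows.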

Companies whose data engineer interviews emphasize Spark across all surfaces: Databricks (founded by Spark's original creators; deepest expertise expected), Netflix (Spark at extreme scale with Iceberg and Mantis), Uber (large-scale batch and Spark Streaming), Airbnb (Spark with Druid and Airflow), DoorDash and Spotify (similar Spark+Kafka+warehouse stacks), Capital One and Comcast (enterprise Spark adopters).

Spark Data Engineer Interview Problems

End-to-end Spark interview problems for data engineer interview prep.

Common questions

What does a Spark-first data engineer interview cover?
5 surfaces across 5-6 rounds: PySpark DataFrame coding (45-60 min dedicated round), Spark SQL in the SQL round, Structured Streaming in the design round, Spark UI reading as a senior-signal question, optimization in EXPLAIN-driven questions. Each company samples differently but expects fluency across all 5 for a data engineer hire.
How do I prepare for a Spark-first data engineer interview?
Practice the 4 PySpark coding shapes (broadcast join, sort-merge with skew, window function, SCD merge). Practice Spark SQL with MERGE INTO patterns. Walk through 8 Spark UI screenshots identifying anomalies. Design a Structured Streaming pipeline with watermark and Delta sink. Two timed mock PySpark coding rounds in the final 2 weeks.
Which companies most emphasize Spark in data engineer interviews?
Databricks (founded by Spark's original creators), Netflix (Spark at extreme scale with Iceberg and Mantis), Uber (large-scale batch and Spark Streaming), Airbnb (Spark with Druid), DoorDash, Spotify, Capital One, Comcast. Each runs a 45-60 minute PySpark coding round plus supporting questions in SQL and design rounds.
What is the Spark UI question format?
The interviewer presents a screenshot (Summary Metrics, Tasks table, Stage detail) with a specific anomaly. The data engineer identifies the cause (skew on join key, partition under-parallelism, memory pressure, GC overhead) and proposes the fix. The rubric scores cause identification and fix correctness.
How is Spark SQL different from generic SQL in interviews?
Spark SQL adds MERGE INTO via Delta/Iceberg, broadcast hints, and AQE-driven runtime optimization, and has historically lacked recursive CTEs (none in Spark 3.x). Practice in Postgres is portable for ~85 percent of patterns. The Spark-specific pieces (MERGE INTO, /*+ BROADCAST() */ hints, AQE settings) are tagged on the relevant problems.
What is Structured Streaming and when does it appear in data engineer interviews?
Structured Streaming is Spark's stream processing engine, built on the same DataFrame API used for batch. Read from Kafka or Delta as source, transform with DataFrame operations, write to sink with checkpoint for fault tolerance. Appears in system design rounds at Spark-first companies. Watermark configuration (how long to wait for late-arriving events) is the senior signal.
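To make the watermark behavior concrete, here is a deliberately simplified pure-Python model, not Spark code. It assumes event times are plain seconds, that the watermark is recomputed after each micro-batch as the maximum event time seen minus the delay, and that any event older than the current watermark is dropped. Real Spark applies the check per window and operator, so treat this as an approximation.

```python
def run_batches(batches, delay_s):
    """Simulate watermark-based late-data handling over a list of micro-batches."""
    watermark = float("-inf")  # nothing is late before the first batch completes
    kept, dropped = [], []
    for batch in batches:
        for event_time in batch:
            # Events older than the watermark set by earlier batches are too late.
            if event_time < watermark:
                dropped.append(event_time)
            else:
                kept.append(event_time)
        if batch:
            # The watermark only moves forward, trailing the max event time by delay_s.
            watermark = max(watermark, max(batch) - delay_s)
    return kept, dropped, watermark


# Three micro-batches of event times (seconds) with a 30-second delay threshold.
kept, dropped, watermark = run_batches([[100, 105], [160, 95], [90, 155]], delay_s=30)
print(kept, dropped, watermark)
```

In this run the event at 95 survives because the watermark was still 75 when it arrived, while the event at 90 arrives after the watermark has advanced to 130 and is dropped.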
How does a data engineer answer a Spark optimization question?
Tie each proposed fix to specific evidence from EXPLAIN or Spark UI. SortMergeJoin where BroadcastHashJoin expected: stats are stale (run ANALYZE TABLE) or spark.sql.autoBroadcastJoinThreshold is too low (default 10MB; raise it, for example to 100MB). Skew: salt and rebalance. No PartitionFilters in plan: a function in the WHERE clause is preventing pruning, so rewrite the predicate on the bare partition column. Evidence-driven, not guess-driven.
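The evidence-to-fix mapping above can be sketched as a text check over EXPLAIN output. The function name is hypothetical, and the sample plan is a hand-written, heavily simplified fragment for illustration, not real Spark output. The check for an empty PartitionFilters list is only meaningful when the scanned table is actually partitioned.

```python
def triage_plan(plan_text, broadcast_expected=True):
    """Return suggested next steps based on operators found in a physical plan."""
    findings = []
    has_sort_merge = "SortMergeJoin" in plan_text
    has_broadcast = "BroadcastHashJoin" in plan_text
    # A sort-merge join where a broadcast was expected points at stats or threshold.
    if broadcast_expected and has_sort_merge and not has_broadcast:
        findings.append(
            "SortMergeJoin instead of BroadcastHashJoin: "
            "run ANALYZE TABLE or raise spark.sql.autoBroadcastJoinThreshold"
        )
    # An empty PartitionFilters list on a partitioned table means no pruning happened.
    if "PartitionFilters: []" in plan_text:
        findings.append(
            "Empty PartitionFilters: rewrite the WHERE clause to filter on the "
            "bare partition column"
        )
    return findings


# Hand-written illustrative fragment (real plans carry attribute ids and more detail).
SAMPLE_PLAN = """
== Physical Plan ==
SortMergeJoin [user_id], [user_id], Inner
:- FileScan parquet events[user_id,dt] PartitionFilters: [], PushedFilters: []
+- FileScan parquet users[user_id] PartitionFilters: [], PushedFilters: []
"""

for finding in triage_plan(SAMPLE_PLAN):
    print(finding)
```

After applying a fix, rerunning EXPLAIN and seeing the operator change is the evidence that closes the loop.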
Does Spark expertise help in non-Spark-first data engineer interviews?
Yes. Even at non-Spark-first companies (Snowflake-and-BigQuery shops like Stripe, Block, Coinbase), Spark is mentioned in design rounds as the alternative for heavy joins or ML feature pipelines. Demonstrating Spark depth shows engineering range. But the dedicated 45-60 minute PySpark coding round is only at Spark-first companies.