# Spark Data Engineer Interview Problems

> End-to-end Spark interview problems for data engineer interview prep.

Canonical URL: <https://datadriven.io/spark-data-engineer-interview-problems>

## Summary

Spark interview problems across all data engineer-relevant Spark surfaces. PySpark coding (DataFrame + Window + UDF). Spark SQL (MERGE, broadcast hints, partition pruning). Structured Streaming (watermarks, late-data handling, checkpoints). Spark UI reading (skew, spill, GC). Delta and Iceberg MERGE patterns. The complete Spark surface for data engineer interview prep.

## What this page covers

Spark data engineer interview problems span 5 surfaces: PySpark DataFrame coding, Spark SQL, Structured Streaming, Spark UI reading, and optimization. A Spark-first data engineer interview at Databricks, Netflix, Uber, Airbnb, DoorDash, or Spotify samples from all 5 across the 5-6 round loop, with the dedicated PySpark coding round (45-60 minutes) as the focus and supporting questions in SQL and system design rounds.

PySpark DataFrame coding: write a join between an 800M-row events table and a 2M-row users table. Decide broadcast versus sort-merge. Handle skew on user_id. Window function for top-N per user. SCD Type 2 merge with Delta. Convert SQL to DataFrame and back without thinking.
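The join and top-N shapes above can be sketched as follows. This is a minimal illustration rather than a reference solution: the `event_ts` column and both function names are assumptions, `should_broadcast` only mirrors the idea of Spark's size check against `spark.sql.autoBroadcastJoinThreshold` (10 MB by default), and the pyspark imports sit inside the function so the size helper can be tried without a Spark install.

```python
DEFAULT_BROADCAST_THRESHOLD = 10 * 1024 * 1024  # Spark's default autoBroadcastJoinThreshold (10 MB)


def should_broadcast(small_side_bytes, threshold_bytes=DEFAULT_BROADCAST_THRESHOLD):
    """Rough stand-in for the planner's check: broadcast only if the estimated
    size of the small side is known (non-negative) and within the threshold."""
    return 0 <= small_side_bytes <= threshold_bytes


def top_n_events_per_user(events, users, n=3, force_broadcast=True):
    """Join events to users, then keep each user's n most recent events.

    `events` and `users` are DataFrames sharing a `user_id` column;
    `event_ts` is a hypothetical timestamp column on `events`.
    """
    # Imported lazily so the helper above works without pyspark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # An explicit hint bypasses the size threshold; without it the planner
    # decides from table statistics and may fall back to sort-merge.
    small_side = F.broadcast(users) if force_broadcast else users
    joined = events.join(small_side, on="user_id", how="inner")

    # Rank each user's events newest first and keep the first n.
    newest_first = Window.partitionBy("user_id").orderBy(F.col("event_ts").desc())
    return (
        joined.withColumn("rn", F.row_number().over(newest_first))
        .filter(F.col("rn") <= n)
        .drop("rn")
    )
```

As an illustration, a 2M-row users table at an assumed 50 bytes per row is about 100 MB, so `should_broadcast(2_000_000 * 50)` is false at the default threshold. The choice is then between an explicit hint, a higher threshold, or a sort-merge join.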

Spark SQL: MERGE INTO on Delta or Iceberg for upsert. Partition pruning with proper WHERE clauses. Broadcast hints. Use of QUALIFY-equivalent via outer-query filter. EXPLAIN reading to verify physical plan. The patterns translate directly from the Postgres SQL practice catalog with Spark-specific syntax tagged.
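The QUALIFY-equivalent pattern can be sketched as a window function in a subquery with the filter in the outer query. The table and column names here are hypothetical, and the standard-library `sqlite3` module is used only so the snippet runs without a cluster. The query text itself is plain window-function SQL of the kind that could be passed to `spark.sql`.

```python
import sqlite3

# Top 2 most recent events per user, filtering on the row number
# in the outer query instead of using QUALIFY.
TOP_N_SQL = """
SELECT user_id, event_ts, amount
FROM (
    SELECT user_id, event_ts, amount,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts DESC) AS rn
    FROM events
) ranked
WHERE rn <= 2
ORDER BY user_id, event_ts DESC
"""


def run_top_n_demo():
    """Load a few made-up rows into an in-memory table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id INTEGER, event_ts TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        [
            (1, "2024-01-01", 10.0),
            (1, "2024-01-02", 20.0),
            (1, "2024-01-03", 30.0),
            (2, "2024-01-01", 5.0),
        ],
    )
    rows = conn.execute(TOP_N_SQL).fetchall()
    conn.close()
    return rows
```

The window function cannot be referenced in the same query's WHERE clause, which is why the ranking lives in the subquery and the `rn <= 2` filter sits outside it.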

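Structured Streaming: read from Kafka, dedup on composite key, apply windowed aggregations with watermark, write to Delta in append mode or upsert with MERGE inside foreachBatch. Trigger configuration (processingTime for micro-batch; the experimental continuous trigger targets millisecond latency but supports only map-like operations, not aggregations). Checkpoint location for fault tolerance. Late-arriving events are handled by the watermark delay threshold, which is Spark's counterpart to the allowed lateness setting in Flink and Beam. End-to-end exactly-once via a replayable Kafka source, checkpointed offsets, and an idempotent Delta sink.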
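A pipeline of this shape might look like the sketch below. The payload schema, the 10-minute watermark, the 5-minute window, and both function names are assumptions chosen for illustration. It also assumes the Kafka connector and Delta Lake packages are available on the cluster. The pyspark import is placed inside the function so the options helper can be exercised on its own.

```python
def kafka_source_options(bootstrap_servers, topic):
    """Options for Spark's Kafka source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def start_event_counts(spark, bootstrap_servers, topic, output_path, checkpoint_path):
    """Start a streaming query: Kafka -> dedup -> 5-minute counts -> Delta."""
    from pyspark.sql import functions as F  # lazy import, see the note above

    # Hypothetical payload schema, written as a DDL string.
    schema = "event_id STRING, user_id STRING, event_ts TIMESTAMP"

    raw = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )
    # Kafka delivers the payload as bytes in the `value` column.
    events = raw.select(
        F.from_json(F.col("value").cast("string"), schema).alias("e")
    ).select("e.*")

    # The watermark bounds state: events more than 10 minutes behind the
    # latest event time seen may be dropped as too late.
    deduped = events.withWatermark("event_ts", "10 minutes").dropDuplicates(
        ["event_id", "event_ts"]
    )
    counts = deduped.groupBy(F.window("event_ts", "5 minutes"), "user_id").count()

    return (
        counts.writeStream.format("delta")
        .outputMode("append")  # a window is emitted once the watermark passes it
        .option("checkpointLocation", checkpoint_path)
        .trigger(processingTime="1 minute")
        .start(output_path)
    )
```

Including the event-time column in the dedup key lets Spark expire dedup state as the watermark advances, instead of keeping every key forever.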

Spark UI reading: the interviewer presents a screenshot; the candidate identifies the cause and proposes the fix. Summary Metrics row anomalies, as rules of thumb: max task duration around 10x the median suggests skew, any spill above 0 suggests memory pressure, and GC time above roughly 10 percent of task time suggests GC pressure. Tasks table sorted descending by duration shows the culprit partition. Stage timing distribution. This round is the senior-versus-mid signal.
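The rules of thumb above can be written down as a small checker. This is a plain-Python sketch over numbers read off the Summary Metrics row, not an interface to the Spark UI. The `StageSummary` fields and the `diagnose` function are hypothetical names, and the thresholds are the heuristics quoted in the text rather than hard limits.

```python
from dataclasses import dataclass


@dataclass
class StageSummary:
    """Numbers read off a stage's Summary Metrics row (hypothetical container)."""
    median_task_s: float      # median task duration, seconds
    max_task_s: float         # longest task duration, seconds
    spill_bytes: int          # total spill reported for the stage
    gc_time_s: float          # total GC time across tasks, seconds
    total_task_time_s: float  # total task time across tasks, seconds


def diagnose(stage, skew_ratio=10.0, gc_fraction=0.10):
    """Return the anomalies suggested by the rule-of-thumb thresholds."""
    findings = []
    # One task taking many times the median points at a skewed partition.
    if stage.median_task_s > 0 and stage.max_task_s / stage.median_task_s >= skew_ratio:
        findings.append("skew")
    # Any spill means execution memory ran out for at least one task.
    if stage.spill_bytes > 0:
        findings.append("memory pressure (spill)")
    # A large share of task time spent in garbage collection.
    if stage.total_task_time_s > 0 and stage.gc_time_s / stage.total_task_time_s > gc_fraction:
        findings.append("GC pressure")
    return findings
```

For example, a stage with a 2-second median task and a 45-second maximum is flagged as skew, which is the cue to sort the Tasks table by duration and look at the slowest task's input size.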

Optimization: skew handling with salt-and-rebalance, AQE override scenarios, partition strategy tuning, broadcast threshold adjustment. Predicate pushdown verification. Avoiding collect() and other driver-pulling actions. The L5+ optimization round expects EXPLAIN-driven and UI-driven diagnosis.
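One way to sketch the skew options above: let AQE split skewed partitions first, and fall back to manual salting when that is not enough. The two configuration keys are standard Spark 3 settings. The function names, the `user_id` default key, and the choice of 16 salts are assumptions for illustration, and the pyspark import is kept inside the function so the config helper can be tried without Spark.

```python
# Settings that let Adaptive Query Execution split skewed join partitions.
AQE_SKEW_CONF = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def apply_conf(spark, conf=AQE_SKEW_CONF):
    """Apply each setting to the session's runtime configuration."""
    for name, value in conf.items():
        spark.conf.set(name, value)


def salted_join(spark, events, users, key="user_id", num_salts=16):
    """Manual fallback: spread a hot key across num_salts join buckets."""
    from pyspark.sql import functions as F  # lazy import, see the note above

    # Large side: assign each row a random salt in [0, num_salts).
    salted_events = events.withColumn("salt", (F.rand() * num_salts).cast("int"))

    # Small side: replicate every row once per salt value so that each
    # (key, salt) pair on the large side still finds its match.
    salts = spark.range(num_salts).select(F.col("id").cast("int").alias("salt"))
    replicated_users = users.crossJoin(salts)

    return salted_events.join(
        replicated_users, on=[key, "salt"], how="inner"
    ).drop("salt")
```

Salting trades a `num_salts`-fold copy of the small side for an even spread of the hot key, so it pays off only when the replicated side stays small relative to the skewed one.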

Companies whose data engineer interviews emphasize Spark across all surfaces: Databricks (founded by the original creators of Spark; deepest expertise expected), Netflix (Spark at extreme scale with Iceberg and Mantis), Uber (large-scale batch and Spark Streaming), Airbnb (Spark with Druid and Airflow), DoorDash and Spotify (similar Spark+Kafka+warehouse stacks), Capital One and Comcast (enterprise Spark adopters).

## Frequently asked questions

### What does a Spark-first data engineer interview cover?

5 surfaces across 5-6 rounds: PySpark DataFrame coding (45-60 min dedicated round), Spark SQL in the SQL round, Structured Streaming in the design round, Spark UI reading as a senior-signal question, optimization in EXPLAIN-driven questions. Each company samples differently but expects fluency across all 5 for a data engineer hire.

### How do I prepare for a Spark-first data engineer interview?

Practice the 4 PySpark coding shapes (broadcast join, sort-merge with skew, window function, SCD merge). Practice Spark SQL with MERGE INTO patterns. Walk through 8 Spark UI screenshots identifying anomalies. Design a Structured Streaming pipeline with watermark and Delta sink. Run two timed mock PySpark coding rounds in the final 2 weeks.

### Which companies most emphasize Spark in data engineer interviews?

Databricks (founded by Spark's original creators), Netflix (Spark at extreme scale with Iceberg and Mantis), Uber (large-scale batch and Spark Streaming), Airbnb (Spark with Druid), DoorDash, Spotify, Capital One, Comcast. Each runs a 45-60 minute PySpark coding round plus supporting questions in SQL and design rounds.

### What is the Spark UI question format?

The interviewer presents a screenshot (Summary Metrics, Tasks table, Stage detail) with a specific anomaly. The data engineer identifies the cause (skew on join key, partition under-parallelism, memory pressure, GC overhead) and proposes the fix. The rubric scores cause identification and fix correctness.

### How is Spark SQL different from generic SQL in interviews?

Spark SQL adds MERGE INTO via Delta/Iceberg, broadcast hints, and AQE-driven runtime optimization, and Spark 3.x has no recursive CTEs. Practice in Postgres is portable for ~85 percent of patterns. The Spark-specific features (MERGE INTO, /*+ BROADCAST() */, AQE) are tagged on the relevant problems.

### What is Structured Streaming and when does it appear in data engineer interviews?

Structured Streaming is Spark's stream processing engine, built on the DataFrame API so the same operations apply to batch and streaming data. Read from Kafka or Delta as source, transform with DataFrame operations, write to sink with checkpoint for fault tolerance. Appears in system design rounds at Spark-first companies. Watermark configuration, meaning how long to wait for late events, is the senior signal.

### How does a data engineer answer a Spark optimization question?

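Tie each proposed fix to specific evidence from EXPLAIN or Spark UI. SortMergeJoin where BroadcastHashJoin expected: stats stale (run ANALYZE TABLE) or spark.sql.autoBroadcastJoinThreshold too low for the small side (default 10MB; raise it, for example to 100MB, if memory allows). Skew: salt and rebalance. No PartitionFilters in plan: a function wrapped around the partition column in WHERE is preventing pruning, so rewrite the predicate against the bare column. Evidence-driven, not guess-driven.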
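The evidence-to-fix mapping above can be sketched as a check over the text of a physical plan. This is a simplified illustration: `diagnose_plan` is a hypothetical helper, it matches only on operator names and the `PartitionFilters: []` marker that Spark prints for file scans, and a real diagnosis would read the full plan alongside the Spark UI.

```python
def diagnose_plan(plan_text, expect_broadcast=True):
    """Map plan evidence to candidate fixes, following the rules in the text."""
    hints = []

    # A sort-merge join where a broadcast was expected: stale statistics
    # or a small side larger than the broadcast threshold.
    if (
        expect_broadcast
        and "SortMergeJoin" in plan_text
        and "BroadcastHashJoin" not in plan_text
    ):
        hints.append(
            "SortMergeJoin chosen: run ANALYZE TABLE or raise "
            "spark.sql.autoBroadcastJoinThreshold"
        )

    # An empty PartitionFilters list on a scan means no partition pruning.
    if "PartitionFilters: []" in plan_text:
        hints.append(
            "no partition pruning: filter on the bare partition column "
            "instead of wrapping it in a function"
        )

    return hints
```

The plan text would come from `df.explain()` or `EXPLAIN` output. Each returned hint names the evidence first and the fix second, which is the order the answer should follow in the interview.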

### Does Spark expertise help in non-Spark-first data engineer interviews?

Yes. Even at non-Spark-first companies (Snowflake-and-BigQuery shops like Stripe, Block, Coinbase), Spark is mentioned in design rounds as the alternative for heavy joins or ML feature pipelines. Demonstrating Spark depth shows engineering range. But the dedicated 45-60 minute PySpark coding round is only at Spark-first companies.

## How a data engineer prepares for a Spark-first interview loop

Six-step prep framework for the full Spark interview surface.

### Step 1: Master the 4 PySpark coding shapes

Broadcast join, sort-merge with skew handling, window function for top-N per user, SCD Type 2 merge with Delta.

### Step 2: Master Spark SQL MERGE INTO patterns

Idempotent MERGE with composite natural key plus run_id. Delta and Iceberg syntax.
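One possible reading of this pattern, sketched as a statement builder: match on the composite natural key, and update a matched row only when its `run_id` differs, so replaying the same run changes nothing. The function name, the table names in the usage note, and this interpretation of `run_id` are assumptions. `UPDATE SET *` and `INSERT *` are shorthand accepted by both Delta and Iceberg MERGE.

```python
def build_merge_sql(target, source, key_cols, run_id_col="run_id"):
    """Build a MERGE INTO statement keyed on a composite natural key.

    Identifiers are interpolated directly, so pass only trusted table
    and column names.
    """
    if not key_cols:
        raise ValueError("at least one key column is required")

    on_clause = " AND ".join(f"t.{col} = s.{col}" for col in key_cols)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON {on_clause}\n"
        # Skip rows already written by this run: a replay becomes a no-op.
        f"WHEN MATCHED AND t.{run_id_col} <> s.{run_id_col} THEN UPDATE SET *\n"
        f"WHEN NOT MATCHED THEN INSERT *"
    )
```

With hypothetical tables, `build_merge_sql("dim_orders", "staged_orders", ["order_id", "line_no"])` produces a statement that could be passed to `spark.sql`. The source must contain at most one row per key, otherwise MERGE fails on ambiguous matches.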

### Step 3: Master 8 Spark UI screenshots

Anomaly types: skew, spill, GC, under-parallelism, memory pressure, shuffle imbalance, stage timing. For each screenshot, identify the cause and the fix.

### Step 4: Master Structured Streaming basics

Kafka source, dedup, windowed aggregation with watermark, Delta sink with checkpoint.

### Step 5: Master optimization diagnosis

EXPLAIN reading. Spark UI Summary Metrics. AQE behavior. Broadcast threshold tuning.

### Step 6: Two timed mock PySpark coding rounds

Schedule these in the final 2 weeks before the onsite. Pair with someone or use an AI mock interviewer to simulate the 45-60 minute round.

## Related practice catalogs

- [PySpark interview questions](https://datadriven.io/pyspark-interview-questions): Full PySpark catalog organized by frequency.
- [PySpark practice problems](https://datadriven.io/pyspark-practice-problems): Live Spark sandbox with skew-engineered tests.
- [PySpark coding questions](https://datadriven.io/pyspark-coding-questions): Open-ended coding questions for Spark-first interviews.
- [Spark SQL interview questions](https://datadriven.io/spark-sql-interview-questions): SQL syntax on Spark with MERGE INTO and broadcast hints.
- [Spark DataFrame interview questions](https://datadriven.io/spark-dataframe-interview-questions): Transformations, actions, lazy evaluation, partition strategy.
- [Spark optimization interview questions](https://datadriven.io/spark-optimization-interview-questions): Skew, AQE, broadcast, partition pruning.
- [Databricks interview problems](https://datadriven.io/databricks-interview-problems): Delta, Photon, Unity Catalog, Auto Loader.
- [Streaming system design with Spark Structured Streaming](https://datadriven.io/streaming-system-design-interview-questions): Streaming patterns in system design rounds.
- [Netflix data engineer interview questions](https://datadriven.io/netflix-data-engineer-interview-questions): Spark-heavy Netflix data engineer interviews.

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.