# Spark SQL Interview Questions

> Spark SQL interview questions for data engineer interview prep.

Canonical URL: <https://datadriven.io/spark-sql-interview-questions>

Breadcrumb: [Home](https://datadriven.io/) > [Spark SQL](https://datadriven.io/spark-sql-interview-questions)

## Summary

Spark SQL interview questions for data engineer roles. SQL syntax executed by the Spark engine. Partition pruning, broadcast hints, MERGE INTO on Delta and Iceberg, the differences between Spark SQL and Postgres SQL or Snowflake SQL. For data engineer candidates targeting Databricks, Netflix, Uber, or any Spark-first company.

## What this page covers

Spark SQL is the SQL interface to the Spark engine. spark.sql("SELECT ...") returns a DataFrame; under the hood, Catalyst (Spark's query optimizer) compiles the SQL to the same physical plan as the DataFrame API. Spark SQL is the most common data engineer interface to Spark in 2026 because it is portable from analytical SQL skills and produces the same performance as DataFrame code.

The Spark SQL dialect differs from Postgres and Snowflake in several ways data engineer interview rounds test. Spark SQL supports ANSI SQL for most operations. UNION versus UNION ALL behavior matches ANSI. CTEs (WITH clause) are supported and frequently used. Window functions are full ANSI plus QUALIFY-equivalent via filter on outer query. Recursive CTEs are NOT supported (a difference from Postgres and Snowflake). Lateral joins are supported in Spark 3.0+. Sub-second precision in timestamps requires explicit cast in some versions.

MERGE INTO on Delta and Iceberg is the dominant write pattern in modern Spark SQL data engineer pipelines. MERGE INTO target t USING source s ON t.pk = s.pk WHEN MATCHED AND s.op_type = 'DELETE' THEN DELETE WHEN MATCHED THEN UPDATE SET col1 = s.col1, col2 = s.col2 WHEN NOT MATCHED THEN INSERT (pk, col1, col2) VALUES (s.pk, s.col1, s.col2). Delta and Iceberg both support this syntax; the underlying storage differs but the SQL is portable. Idempotency comes from MERGE-on-natural-key plus run_id baked into source.

Partition pruning is the most-important performance pattern. Spark scans only the partitions referenced by WHERE clauses on the partition column. WHERE event_date = '2026-05-27' on a date-partitioned table reads only that day's files. Function-wrapped predicates (WHERE DATE(event_ts) = '2026-05-27') prevent pruning because Catalyst cannot reason about DATE() in reverse; rewrite as WHERE event_ts >= '2026-05-27' AND event_ts < '2026-05-28'. EXPLAIN ANALYZE shows whether pruning fired (look for PartitionFilters in the plan).

Broadcast hints. Spark SQL supports /*+ BROADCAST(table) */ to force a broadcast join when the optimizer's autoBroadcastJoinThreshold heuristic would not pick it. Useful when the small side is between 10MB and 100MB (default threshold is 10MB) and broadcasting is faster than the inferred sort-merge. /*+ COALESCE(N) */ to reduce partitions before write. /*+ REPARTITION(N) */ to shuffle to N partitions. These hints matter in optimization rounds at L5+.

Spark SQL adaptive execution. spark.sql.adaptive.enabled = true (default in Spark 3.2+) lets the runtime adjust the plan based on actual statistics: skew-join splitting, broadcast-threshold adjustment, partition coalescing. Override with explicit hints when the adaptive choice is wrong for the workload. AQE is on by default in most production Spark deployments; the data engineer should know what AQE does and when to override.

Companies whose data engineer interviews emphasize Spark SQL: Databricks (Spark SQL on Delta), Netflix (Spark SQL on Iceberg), Uber (Spark SQL plus Presto), Airbnb (Spark SQL plus Druid), Snowflake (vendor-neutral testing of Spark SQL versus Snowflake SQL).

## Frequently asked questions

### What is the difference between Spark SQL and the DataFrame API in PySpark?

Spark SQL is the SQL interface (spark.sql('SELECT ...') returns a DataFrame). DataFrame API uses Python methods (df.filter, df.groupBy). Both compile to the same physical plan via Catalyst. Performance is identical for equivalent queries. Pick the one your team is faster in; mention the equivalence to show fluency.

### Does Spark SQL support recursive CTEs?

No. Recursive CTEs are not supported in Spark SQL as of 2026. The Postgres and Snowflake WITH RECURSIVE syntax does not compile. For hierarchical traversal in Spark, the workaround is iterative DataFrame joins until convergence, or using a graph library (GraphFrames) for one-shot hierarchical queries.

### How does partition pruning work in Spark SQL?

Spark scans only the partitions referenced by WHERE clauses on the partition column. WHERE event_date = '2026-05-27' on a date-partitioned table reads only that day's files. Function-wrapped predicates (WHERE DATE(event_ts) = '2026-05-27') prevent pruning because Catalyst cannot reason about DATE() in reverse. EXPLAIN ANALYZE shows whether pruning fired by listing PartitionFilters.

### How does a data engineer force a broadcast join in Spark SQL?

Use the /*+ BROADCAST(table) */ hint: SELECT /*+ BROADCAST(users) */ ... FROM events e JOIN users u ON e.user_id = u.user_id. Useful when the small side is between the autoBroadcastJoinThreshold (default 10MB) and the maximum broadcast size (typically 100-200MB depending on cluster memory). Without the hint, the optimizer falls back to sort-merge for sides above the threshold.

### What is the MERGE INTO pattern in Spark SQL on Delta or Iceberg?

MERGE INTO target t USING source s ON t.pk = s.pk WHEN MATCHED AND s.op_type = 'DELETE' THEN DELETE WHEN MATCHED THEN UPDATE SET col1 = s.col1 WHEN NOT MATCHED THEN INSERT (pk, col1) VALUES (s.pk, s.col1). Delta and Iceberg both support this syntax. Idempotency from MERGE-on-natural-key plus run_id baked into source.

### What is Spark AQE and how does it affect SQL queries?

Adaptive Query Execution adjusts the query plan at runtime based on actual statistics from completed stages. Three main optimizations: skew-join detection (splits a skewed partition automatically), broadcast-threshold adjustment (uses runtime data sizes), partition coalescing (combines small post-shuffle partitions). Enable with spark.sql.adaptive.enabled = true (default in 3.2+). Override with explicit hints when needed.

### What is the difference between Spark SQL and Snowflake SQL?

Both are ANSI-compatible for most operations. Spark SQL adds MERGE INTO via Delta/Iceberg, broadcast hints, AQE-driven runtime optimization. Snowflake adds QUALIFY (filter window results without CTE wrap), micro-partitions for automatic clustering, time travel and zero-copy clone. Practice in Postgres is portable to both for ~85 percent of patterns; the engine-specific features are tagged.

### How does a data engineer use EXPLAIN in Spark SQL?

EXPLAIN [EXTENDED|FORMATTED] query shows the physical plan. Look for: SortMergeJoin vs BroadcastHashJoin (join strategy), PartitionFilters (partition pruning fired), Exchange (shuffle steps), Filter (predicate pushdown), Project (column pruning). EXPLAIN ANALYZE in Spark 3.0+ also shows actual runtime statistics. Compare your plan to the reference solution's plan for optimization rounds.

## How a data engineer writes Spark SQL in an interview

Six-step framework for writing performant Spark SQL.

### Step 1: Check for partition pruning opportunity

WHERE on the partition column directly, not wrapped in a function. EXPLAIN to verify PartitionFilters fired.

### Step 2: Choose the join strategy

Broadcast for small-and-large with /*+ BROADCAST(table) */. Sort-merge default for large-and-large.

### Step 3: Use window functions for top-N and dedup

ROW_NUMBER OVER (PARTITION BY ... ORDER BY ... DESC, tiebreaker) for dedup-latest. Filter rn = 1 in outer query.

### Step 4: Use MERGE INTO for upsert

MERGE INTO target USING source ON natural_key WHEN MATCHED THEN UPDATE WHEN NOT MATCHED THEN INSERT. Idempotent on re-run.

### Step 5: Leverage AQE for runtime adjustment

spark.sql.adaptive.enabled = true. AQE handles skew, broadcast threshold, partition coalesce automatically.

### Step 6: EXPLAIN to verify the plan

Check for partition pruning, join strategy, shuffle counts. Compare to reference solution for optimization rounds.

## Related practice catalogs

- [PySpark interview questions (DataFrame API equivalent)](https://datadriven.io/pyspark-interview-questions): Same patterns expressed in DataFrame API instead of SQL.
- [Spark DataFrame interview questions](https://datadriven.io/spark-dataframe-interview-questions): DataFrame API as the alternative to Spark SQL.
- [Spark optimization interview questions](https://datadriven.io/spark-optimization-interview-questions): Partition pruning, broadcast hints, AQE overrides.
- [SQL interview questions catalog (portable patterns)](https://datadriven.io/sql-interview-questions): 927 SQL problems with ~85 percent dialect portability to Spark SQL.
- [Advanced SQL: MERGE patterns for Delta and Iceberg](https://datadriven.io/advanced-sql-interview-questions): MERGE INTO syntax used in Spark SQL.
- [Databricks interview problems with Delta](https://datadriven.io/databricks-interview-problems): Spark SQL on Delta tables.
- [Spark data engineer interview problems](https://datadriven.io/spark-data-engineer-interview-problems): All Spark surfaces including SQL.
- [ETL design with Spark SQL](https://datadriven.io/etl-design-interview-prep): ETL patterns using Spark SQL for transformation.

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.