Spark SQL interview questions for data engineer roles. SQL syntax executed by the Spark engine. Partition pruning, broadcast hints, MERGE INTO on Delta and Iceberg, the differences between Spark SQL and Postgres SQL or Snowflake SQL. For data engineer candidates targeting Databricks, Netflix, Uber, or any Spark-first company.
Spark SQL is the SQL interface to the Spark engine. spark.sql("SELECT ...") returns a DataFrame; under the hood, Catalyst (Spark's query optimizer) compiles the SQL to the same physical plan as the DataFrame API. Spark SQL is the most common data engineer interface to Spark in 2026 because it is portable from analytical SQL skills and produces the same performance as DataFrame code.
The Spark SQL dialect differs from Postgres and Snowflake in several ways data engineer interview rounds test. Spark SQL supports ANSI SQL for most operations. UNION versus UNION ALL behavior matches ANSI. CTEs (WITH clause) are supported and frequently used. Window functions are full ANSI plus QUALIFY-equivalent via filter on outer query. Recursive CTEs are NOT supported (a difference from Postgres and Snowflake). Lateral joins are supported in Spark 3.0+. Sub-second precision in timestamps requires explicit cast in some versions.
MERGE INTO on Delta and Iceberg is the dominant write pattern in modern Spark SQL data engineer pipelines. MERGE INTO target t USING source s ON t.pk = s.pk WHEN MATCHED AND s.op_type = 'DELETE' THEN DELETE WHEN MATCHED THEN UPDATE SET col1 = s.col1, col2 = s.col2 WHEN NOT MATCHED THEN INSERT (pk, col1, col2) VALUES (s.pk, s.col1, s.col2). Delta and Iceberg both support this syntax; the underlying storage differs but the SQL is portable. Idempotency comes from MERGE-on-natural-key plus run_id baked into source.
Partition pruning is the most-important performance pattern. Spark scans only the partitions referenced by WHERE clauses on the partition column. WHERE event_date = '2026-05-27' on a date-partitioned table reads only that day's files. Function-wrapped predicates (WHERE DATE(event_ts) = '2026-05-27') prevent pruning because Catalyst cannot reason about DATE() in reverse; rewrite as WHERE event_ts >= '2026-05-27' AND event_ts < '2026-05-28'. EXPLAIN ANALYZE shows whether pruning fired (look for PartitionFilters in the plan).
Broadcast hints. Spark SQL supports /*+ BROADCAST(table) */ to force a broadcast join when the optimizer's autoBroadcastJoinThreshold heuristic would not pick it. Useful when the small side is between 10MB and 100MB (default threshold is 10MB) and broadcasting is faster than the inferred sort-merge. /*+ COALESCE(N) */ to reduce partitions before write. /*+ REPARTITION(N) */ to shuffle to N partitions. These hints matter in optimization rounds at L5+.
Spark SQL adaptive execution. spark.sql.adaptive.enabled = true (default in Spark 3.2+) lets the runtime adjust the plan based on actual statistics: skew-join splitting, broadcast-threshold adjustment, partition coalescing. Override with explicit hints when the adaptive choice is wrong for the workload. AQE is on by default in most production Spark deployments; the data engineer should know what AQE does and when to override.
Companies whose data engineer interviews emphasize Spark SQL: Databricks (Spark SQL on Delta), Netflix (Spark SQL on Iceberg), Uber (Spark SQL plus Presto), Airbnb (Spark SQL plus Druid), Snowflake (vendor-neutral testing of Spark SQL versus Snowflake SQL).
Spark SQL Interview Questions
Spark SQL interview questions for data engineer interview prep.
Common questions
- What is the difference between Spark SQL and the DataFrame API in PySpark?
- Spark SQL is the SQL interface (spark.sql('SELECT ...') returns a DataFrame). DataFrame API uses Python methods (df.filter, df.groupBy). Both compile to the same physical plan via Catalyst. Performance is identical for equivalent queries. Pick the one your team is faster in; mention the equivalence to show fluency.
- Does Spark SQL support recursive CTEs?
- No. Recursive CTEs are not supported in Spark SQL as of 2026. The Postgres and Snowflake WITH RECURSIVE syntax does not compile. For hierarchical traversal in Spark, the workaround is iterative DataFrame joins until convergence, or using a graph library (GraphFrames) for one-shot hierarchical queries.
- How does partition pruning work in Spark SQL?
- Spark scans only the partitions referenced by WHERE clauses on the partition column. WHERE event_date = '2026-05-27' on a date-partitioned table reads only that day's files. Function-wrapped predicates (WHERE DATE(event_ts) = '2026-05-27') prevent pruning because Catalyst cannot reason about DATE() in reverse. EXPLAIN ANALYZE shows whether pruning fired by listing PartitionFilters.
- How does a data engineer force a broadcast join in Spark SQL?
- Use the /*+ BROADCAST(table) */ hint: SELECT /*+ BROADCAST(users) */ ... FROM events e JOIN users u ON e.user_id = u.user_id. Useful when the small side is between the autoBroadcastJoinThreshold (default 10MB) and the maximum broadcast size (typically 100-200MB depending on cluster memory). Without the hint, the optimizer falls back to sort-merge for sides above the threshold.
- What is the MERGE INTO pattern in Spark SQL on Delta or Iceberg?
- MERGE INTO target t USING source s ON t.pk = s.pk WHEN MATCHED AND s.op_type = 'DELETE' THEN DELETE WHEN MATCHED THEN UPDATE SET col1 = s.col1 WHEN NOT MATCHED THEN INSERT (pk, col1) VALUES (s.pk, s.col1). Delta and Iceberg both support this syntax. Idempotency from MERGE-on-natural-key plus run_id baked into source.
- What is Spark AQE and how does it affect SQL queries?
- Adaptive Query Execution adjusts the query plan at runtime based on actual statistics from completed stages. Three main optimizations: skew-join detection (splits a skewed partition automatically), broadcast-threshold adjustment (uses runtime data sizes), partition coalescing (combines small post-shuffle partitions). Enable with spark.sql.adaptive.enabled = true (default in 3.2+). Override with explicit hints when needed.
- What is the difference between Spark SQL and Snowflake SQL?
- Both are ANSI-compatible for most operations. Spark SQL adds MERGE INTO via Delta/Iceberg, broadcast hints, AQE-driven runtime optimization. Snowflake adds QUALIFY (filter window results without CTE wrap), micro-partitions for automatic clustering, time travel and zero-copy clone. Practice in Postgres is portable to both for ~85 percent of patterns; the engine-specific features are tagged.
- How does a data engineer use EXPLAIN in Spark SQL?
- EXPLAIN [EXTENDED|FORMATTED] query shows the physical plan. Look for: SortMergeJoin vs BroadcastHashJoin (join strategy), PartitionFilters (partition pruning fired), Exchange (shuffle steps), Filter (predicate pushdown), Project (column pruning). EXPLAIN ANALYZE in Spark 3.0+ also shows actual runtime statistics. Compare your plan to the reference solution's plan for optimization rounds.