Question 1

What is the difference between Spark SQL and the DataFrame API in PySpark?

Accepted Answer

Spark SQL is the SQL interface: spark.sql('SELECT ...') takes a query string and returns a DataFrame. The DataFrame API expresses the same operations as Python method calls (df.filter, df.groupBy). Both go through the Catalyst optimizer, so equivalent queries compile to the same physical plan and perform identically. Pick the one your team is faster in; in an interview, mention the equivalence to show fluency.
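
As a minimal sketch of that equivalence, the two functions below express one query both ways. The events table, its status and country columns, and the function names are assumptions made for illustration, not part of any real schema.

```python
# Hypothetical query: count active events per country.
# Assumes a table or temp view named `events` with `status` and `country` columns.
ACTIVE_BY_COUNTRY_SQL = """
SELECT country, COUNT(*) AS n
FROM events
WHERE status = 'active'
GROUP BY country
"""


def active_by_country_sql(spark):
    """SQL interface: `spark` is a SparkSession; the result is a DataFrame."""
    return spark.sql(ACTIVE_BY_COUNTRY_SQL)


def active_by_country_df(events):
    """DataFrame API: `events` is a DataFrame with the same columns."""
    return (
        events.filter("status = 'active'")
        .groupBy("country")
        .count()                          # produces a column named "count"
        .withColumnRenamed("count", "n")  # match the SQL alias
    )
```

Calling .explain() on each result should show matching physical plans, which is the quickest way to demonstrate the point.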

Question 2

Does Spark SQL support recursive CTEs?

Accepted Answer

Not in Spark 3.x: the Postgres and Snowflake WITH RECURSIVE syntax fails to parse there. Newer releases have added recursive CTE support (Apache Spark 4.1 and recent Databricks runtimes), so check the version you run before answering with a flat no. On versions without it, the workaround for hierarchical traversal is iterative DataFrame joins until convergence, or a graph library (GraphFrames) for one-shot hierarchical queries.
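
The iterative-join workaround can be sketched roughly as follows. The edges(parent_id, child_id) and roots(node_id) schemas, the function name, and the max_iters cap are assumptions for illustration. The function uses only DataFrame methods, so it works on DataFrames you already have.

```python
def descendants(edges, roots, max_iters=50):
    """Collect every node reachable from `roots` by repeated joins.

    edges: DataFrame with columns parent_id, child_id (hypothetical schema).
    roots: DataFrame with column node_id (hypothetical schema).
    """
    reached = roots.select("node_id").distinct()
    frontier = reached
    for _ in range(max_iters):
        # One level down: children of the current frontier, minus known nodes.
        children = (
            edges.join(
                frontier.withColumnRenamed("node_id", "parent_id"),
                on="parent_id",
                how="inner",
            )
            .select("child_id")
            .withColumnRenamed("child_id", "node_id")
            .distinct()
            .subtract(reached)  # dropping known nodes also guards against cycles
        )
        if not children.head(1):  # convergence: no new nodes at this level
            break
        reached = reached.union(children)
        frontier = children
    return reached
```

Each iteration triggers a Spark job and lengthens the lineage, so for deep hierarchies it may be worth caching or checkpointing the intermediate results.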

Question 3

How does partition pruning work in Spark SQL?

Accepted Answer

Spark reads only the partitions that match WHERE predicates on the partition column. On a table partitioned by event_date, WHERE event_date = '2026-05-27' reads only that day's files. A predicate on a derived value, such as WHERE DATE(event_ts) = '2026-05-27', does not prune, because Catalyst cannot work backwards from DATE(event_ts) to a condition on event_date. To check whether pruning fired, run EXPLAIN (Spark SQL has no EXPLAIN ANALYZE) and look for a non-empty PartitionFilters entry on the scan node.
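
A small helper can make the pruning check repeatable. This is a sketch under assumptions: the helper names are invented here, and the regular expression is a heuristic over the plan text that EXPLAIN FORMATTED prints, not an official API.

```python
import re


def explain_text(spark, query):
    """Return the EXPLAIN FORMATTED output for `query` as one string.

    `spark` is a SparkSession; EXPLAIN yields a single row with one column.
    """
    return spark.sql("EXPLAIN FORMATTED " + query).collect()[0][0]


def pruning_fired(plan_text):
    """Heuristic: True if any scan lists a non-empty PartitionFilters entry."""
    entries = re.findall(r"PartitionFilters: \[([^\]]*)\]", plan_text)
    return any(entry.strip() for entry in entries)
```

For example, pruning_fired(explain_text(spark, "SELECT * FROM events WHERE event_date = '2026-05-27'")) would be expected to return True on a table partitioned by event_date, where events is a hypothetical table name.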

Question 4

How does a data engineer force a broadcast join in Spark SQL?

Accepted Answer

Use the /*+ BROADCAST(table) */ hint, naming the table by its alias when the query gives it one: SELECT /*+ BROADCAST(u) */ ... FROM events e JOIN users u ON e.user_id = u.user_id. The hint is useful when the small side is larger than spark.sql.autoBroadcastJoinThreshold (default 10MB) but still fits comfortably in driver and executor memory (typically up to 100-200MB, depending on the cluster). Without the hint, the optimizer falls back to a sort-merge join for sides above the threshold.
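
For illustration, here is the hint in a full statement next to a DataFrame-side equivalent. The selected columns and the function name are assumptions; the SQL hint references the alias u, and DataFrame.hint("broadcast") marks the small side in the same way.

```python
# Hypothetical tables: `events` (large) and `users` (small), joined on user_id.
BROADCAST_SQL = """
SELECT /*+ BROADCAST(u) */ e.*, u.country
FROM events e
JOIN users u ON e.user_id = u.user_id
"""


def broadcast_join(events, users):
    """DataFrame equivalent: mark `users` for broadcast, then join on user_id."""
    return events.join(users.hint("broadcast"), on="user_id", how="inner")
```

Running spark.sql(BROADCAST_SQL).explain() should then show BroadcastHashJoin instead of SortMergeJoin, assuming users fits in memory.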

Question 5

What is the MERGE INTO pattern in Spark SQL on Delta or Iceberg?

Accepted Answer

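MERGE INTO target t USING source s ON t.pk = s.pk
WHEN MATCHED AND s.op_type = 'DELETE' THEN DELETE
WHEN MATCHED THEN UPDATE SET t.col1 = s.col1
WHEN NOT MATCHED AND s.op_type <> 'DELETE' THEN INSERT (pk, col1) VALUES (s.pk, s.col1)

Delta and Iceberg both support this syntax. The condition on WHEN NOT MATCHED keeps a delete for a key that is absent from the target from being inserted as a new row. Idempotency comes from merging on the natural key, plus a run_id baked into the source: replaying the same run updates the matched rows to the same values instead of inserting duplicates.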
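
When the same pattern is applied to many tables, the statement can be generated. The sketch below is illustrative only: build_merge_sql and its parameters are hypothetical names, op_type is assumed to be the source column that marks deletes, and identifiers are interpolated directly, so it should only receive trusted names.

```python
def build_merge_sql(target, source, key, columns):
    """Build a MERGE INTO statement for an upsert that also applies deletes.

    Assumption: unmatched source rows flagged as deletes are skipped
    rather than inserted.
    """
    all_cols = [key] + list(columns)
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in columns)
    insert_cols = ", ".join(all_cols)
    insert_vals = ", ".join(f"s.{c}" for c in all_cols)
    return (
        f"MERGE INTO {target} t\n"
        f"USING {source} s\n"
        f"ON t.{key} = s.{key}\n"
        f"WHEN MATCHED AND s.op_type = 'DELETE' THEN DELETE\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED AND s.op_type <> 'DELETE' "
        f"THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )
```

The result would be passed to spark.sql(...) on a session configured for Delta or Iceberg.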

Question 6

What is Spark AQE and how does it affect SQL queries?

Accepted Answer

Adaptive Query Execution (AQE) adjusts the query plan at runtime based on actual statistics from completed stages. It has three main optimizations: skew-join handling (automatically splits a skewed partition), dynamic join-strategy switching (converts a sort-merge join to a broadcast join when the runtime size of one side is small enough), and partition coalescing (combines small post-shuffle partitions). Enable it with spark.sql.adaptive.enabled = true (on by default since 3.2). Override its choices with explicit hints when needed.
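
The relevant switches can be set on a session as sketched below. The configuration keys are standard Spark SQL settings; the AQE_SETTINGS dictionary and the apply_aqe_settings helper are hypothetical names used only for this example.

```python
# AQE master switch plus the skew-join and partition-coalescing features.
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
}


def apply_aqe_settings(spark, settings=AQE_SETTINGS):
    """Apply each setting to the session and return the values read back."""
    for key, value in settings.items():
        spark.conf.set(key, value)
    return {key: spark.conf.get(key) for key in settings}
```

Reading the values back is a cheap way to confirm that a managed platform has not overridden them.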

Question 7

What is the difference between Spark SQL and Snowflake SQL?

Accepted Answer

Both are ANSI-compatible for most operations. Spark SQL's distinguishing features are broadcast hints, AQE-driven runtime optimization, and MERGE INTO through the Delta and Iceberg table formats. Snowflake's are QUALIFY (filters window-function results without wrapping the query in a CTE), micro-partitions with automatic clustering, time travel, and zero-copy clone. Practice in Postgres is portable to both for roughly 85 percent of patterns; the remainder is engine-specific.

Question 8

How does a data engineer use EXPLAIN in Spark SQL?

Accepted Answer

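EXPLAIN [EXTENDED | CODEGEN | COST | FORMATTED] query prints the plan. Plain EXPLAIN shows only the physical plan; EXTENDED adds the parsed, analyzed, and optimized logical plans. Look for: SortMergeJoin vs BroadcastHashJoin (join strategy), PartitionFilters (partition pruning fired), Exchange (shuffle steps), Filter and PushedFilters (predicate pushdown), Project and ReadSchema (column pruning). Spark SQL has no EXPLAIN ANALYZE; actual runtime statistics appear in the SQL tab of the Spark UI once the query has run. Compare your plan to the reference solution's plan for optimization rounds.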
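
When comparing two plans, a rough tally of the operators named above is often enough to spot the difference. The sketch below counts operator names in plan text obtained from EXPLAIN; summarize_plan and the OPERATORS tuple are hypothetical names, and the count is a text heuristic, not a parse of the plan.

```python
import re
from collections import Counter

# Operator names to look for in the physical plan text.
OPERATORS = ("SortMergeJoin", "BroadcastHashJoin", "Exchange", "Filter", "Project")


def summarize_plan(plan_text):
    """Count whole-word occurrences of each operator in the plan text.

    Word boundaries keep BroadcastExchange from counting as a shuffle
    Exchange and PartitionFilters from counting as a Filter node.
    """
    pattern = r"\b(?:%s)\b" % "|".join(OPERATORS)
    counts = Counter(re.findall(pattern, plan_text))
    return {op: counts[op] for op in OPERATORS}
```

Two Exchange nodes under a SortMergeJoin, for instance, indicate that both sides were shuffled, which is the cue to consider a broadcast hint.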

Spark SQL Interview Questions
