Question 1

What is the difference between a Spark transformation and an action?

Accepted Answer

Transformations (filter, select, groupBy, join, withColumn) are lazy: they extend the logical plan but do not execute anything. Actions (count, collect, show, and writes such as write.parquet) are eager: each one triggers execution of the entire transformation chain behind it. A data engineer who confuses the two writes code that calls collect() in a loop, then is surprised by the slowdown: every call re-runs the whole chain unless the DataFrame is cached, and pulls the full result onto the driver.
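A minimal PySpark sketch of the distinction, assuming an existing SparkSession is passed in as `spark`. The function name `lazy_vs_eager` and the column name `twice` are illustrative only.

```python
def lazy_vs_eager(spark):
    """Build a lazy chain of transformations, then run it with one action."""
    # Imported inside the function so this sketch can be loaded without Spark.
    from pyspark.sql import functions as F

    df = spark.range(1_000_000)                           # lazy: only a plan so far
    evens = df.filter(F.col("id") % 2 == 0)               # transformation, no job runs
    doubled = evens.withColumn("twice", F.col("id") * 2)  # still no job

    # Action: Spark now executes range -> filter -> withColumn as one job.
    return doubled.count()
```

Nothing is computed until `count()` is reached. Calling an action like this inside a loop would repeat the full chain on every iteration, which is the slowdown described above.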

Question 2

What is Catalyst in Spark?

Accepted Answer

Catalyst is Spark's query optimizer. It compiles the logical plan (built from DataFrame code or SQL) into a physical plan, applying rule-based optimizations (predicate pushdown, column pruning, constant folding) and cost-based optimizations (join reordering, when the cost-based optimizer is enabled and table statistics are available). The physical plan that Catalyst produces feeds Tungsten for code generation. Interviewers expect a data engineer to know Catalyst as the layer underneath the DataFrame API.
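One way to see Catalyst at work is to print the plans for a small query. The sketch below assumes Spark 3.0+ (for `explain(mode=...)`) and an existing SparkSession passed in as `spark`; the function name and the `bucket` column are made up for illustration.

```python
def show_catalyst_plans(spark):
    """Print the logical and physical plans Catalyst derives for a small query."""
    # Imported inside the function so this sketch can be loaded without Spark.
    from pyspark.sql import functions as F

    df = spark.range(100).withColumn("bucket", F.col("id") % 10)

    # 1 + 2 is a constant expression; the optimizer is expected to fold it
    # into a single literal in the optimized logical plan.
    query = df.filter(F.col("id") > F.lit(1) + F.lit(2)).select("bucket")

    # Prints the parsed, analyzed, and optimized logical plans,
    # followed by the physical plan.
    query.explain(mode="extended")
    return query
```

Comparing the analyzed and optimized logical plans in the output shows which rewrites the optimizer applied to this query.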

Question 3

What is Tungsten in Spark?

Accepted Answer

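Tungsten is Spark's execution engine. Its whole-stage code generation compiles the physical operators produced by Catalyst into Java bytecode, avoiding the virtual function calls and per-row interpretation overhead of an operator-at-a-time model. It stores rows in a compact binary format with explicitly managed memory (on-heap by default, off-heap if enabled), which reduces garbage-collection pressure and improves cache locality. The combination of Catalyst and Tungsten makes DataFrame code as fast as, and often faster than, equivalent hand-written Scala RDD code, with far less effort.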
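The generated code can be inspected directly. This is a rough sketch, assuming Spark 3.0+ and an existing SparkSession passed in as `spark`; the function name and the `next_id` alias are hypothetical.

```python
def show_generated_code(spark):
    """Print the physical plan and the Java source generated for it."""
    # Imported inside the function so this sketch can be loaded without Spark.
    from pyspark.sql import functions as F

    query = (
        spark.range(1000)
        .filter(F.col("id") % 3 == 0)
        .select((F.col("id") + 1).alias("next_id"))
    )

    # Operators prefixed with *(n) in the physical plan are fused into
    # whole-stage codegen stage n.
    query.explain()

    # Prints the generated Java source for each codegen stage.
    query.explain(mode="codegen")
    return query
```

The generated source is verbose and mostly useful as evidence that the filter and projection were fused into a single loop, not as something to read line by line.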

Question 4

What is the difference between repartition and coalesce in Spark?

Accepted Answer

repartition(N) performs a full shuffle into N partitions: expensive, but it can increase or decrease the partition count and gives an even distribution. coalesce(N) merges existing partitions without a full shuffle: cheap, but it can only decrease the count and can leave partitions uneven. Use repartition before a join or write where balance matters; use coalesce after a filter that has sharply reduced data volume.
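The difference shows up in the partition counts. A small sketch, assuming an existing SparkSession passed in as `spark`; the function name and the specific counts (8, 2, 16) are arbitrary choices for illustration.

```python
def partition_counts(spark):
    """Return partition counts after repartition and coalesce calls."""
    df = spark.range(1_000_000)

    wide = df.repartition(8)      # full shuffle into 8 evenly sized partitions
    narrow = wide.coalesce(2)     # merges existing partitions, no full shuffle
    grown = narrow.coalesce(16)   # coalesce cannot increase the count

    return (
        wide.rdd.getNumPartitions(),
        narrow.rdd.getNumPartitions(),
        grown.rdd.getNumPartitions(),
    )
```

On a typical setup this is expected to return (8, 2, 2): asking coalesce for more partitions than currently exist leaves the count unchanged, so increasing parallelism requires repartition.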

Question 5

When should a data engineer use a UDF in Spark?

Accepted Answer

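Only when no built-in function covers the logic. Built-in functions (when, regexp_extract, date_format, and other column expressions) are visible to Catalyst and compile to Tungsten bytecode, whereas a Python UDF is a black box to the optimizer and has to serialize every row between the JVM and a Python worker; built-ins are often cited as 5-10x faster. When a UDF is necessary, prefer pandas UDFs (@pandas_udf), which exchange batches of rows via Apache Arrow and process them in vectorized form, much faster than traditional row-at-a-time Python UDFs.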
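To make the comparison concrete, here is the same transformation written both ways. It is only a sketch: it assumes Spark 3.0+ with pandas and pyarrow installed, and an input DataFrame with a string column called `name` (a hypothetical column, as are the function and output column names).

```python
def add_upper_columns(df):
    """Uppercase the `name` column with a built-in and with a pandas UDF."""
    # Imported inside the function so this sketch can be loaded without Spark.
    import pandas as pd
    from pyspark.sql import functions as F
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("string")
    def upper_udf(names: pd.Series) -> pd.Series:
        # Receives a whole batch of rows as a pandas Series.
        return names.str.upper()

    return (
        df.withColumn("upper_builtin", F.upper(F.col("name")))  # preferred
          .withColumn("upper_pandas", upper_udf(F.col("name")))  # fallback
    )
```

Here the built-in `upper` already covers the logic, so the pandas UDF is shown only for contrast; it would be the right tool for logic that has no built-in equivalent.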

Question 6

How does a data engineer manage schemas in PySpark?

Accepted Answer

Explicit schemas at read time avoid the inferSchema pass: spark.read.schema(schema).csv(path). For production pipelines, define schemas with StructType + StructField + type primitives; version them in the codebase. Auto-inference is acceptable for ad-hoc exploration but not for pipelines where schema drift matters.
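A sketch of an explicit schema for a CSV read, assuming an existing SparkSession passed in as `spark`. The function name, column names, and types are invented for illustration, and the FAILFAST mode is one possible choice for surfacing malformed rows, not a requirement.

```python
def read_orders(spark, path):
    """Read a CSV file with a declared schema instead of inferSchema."""
    # Imported inside the function so this sketch can be loaded without Spark.
    from pyspark.sql.types import (
        DoubleType, IntegerType, StringType, StructField, StructType,
    )

    # Hypothetical schema; in a real pipeline this would live in a
    # versioned module next to the job code.
    orders_schema = StructType([
        StructField("order_id", IntegerType()),
        StructField("customer", StringType()),
        StructField("amount", DoubleType()),
    ])

    return (
        spark.read
        .schema(orders_schema)        # skips the extra inference pass
        .option("header", True)
        .option("mode", "FAILFAST")   # raise on rows that do not fit the schema
        .csv(path)
    )
```

Keeping the schema in code means a change to the expected columns shows up in code review, while unexpected changes in the data fail the read instead of silently altering downstream types.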

Question 7

What is the default value of spark.sql.shuffle.partitions and when should it be changed?

Accepted Answer

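The default is 200. That is often too high for small data (many tiny tasks and, on write, many small files) and too low for very large data (each partition is oversized and slow, and may spill to disk). Rule of thumb: target 100-200 MB per partition, so a 100 GB shuffle calls for roughly 500-1000 partitions. Adaptive Query Execution (AQE), available since Spark 3.0 and enabled by default since 3.2, coalesces small post-shuffle partitions automatically.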
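The rule of thumb is simple arithmetic, sketched below in plain Python. The helper name `shuffle_partitions` and its default target of 128 MB are assumptions for illustration, not Spark settings.

```python
import math


def shuffle_partitions(shuffle_bytes: int, target_partition_mb: int = 128) -> int:
    """Estimate a partition count from shuffle size and a target partition size."""
    target_bytes = target_partition_mb * 1024 * 1024
    # Round up so no partition exceeds the target; never return fewer than 1.
    return max(1, math.ceil(shuffle_bytes / target_bytes))


GIB = 1024 ** 3

# A 100 GB shuffle at the two ends of the 100-200 MB range.
low = shuffle_partitions(100 * GIB, target_partition_mb=200)   # 512
high = shuffle_partitions(100 * GIB, target_partition_mb=100)  # 1024
```

The result could then be applied with something like `spark.conf.set("spark.sql.shuffle.partitions", str(low))` before the shuffle-heavy stage, treating it as a starting point to tune, not a fixed answer.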

Question 8

What is the lifecycle of a Spark DataFrame query?

Accepted Answer

1. Logical plan built from transformations.
2. Catalyst applies rule-based and cost-based optimization.
3. Physical plan generated.
4. Tungsten generates Java bytecode.
5. Execution on executors.
6. Action returns result to driver or writes to sink.

Lazy evaluation means that step 1 happens as transformations are declared, while steps 2-5 are deferred until an action is called.

Spark DataFrame Interview Questions
