Spark DataFrame interview questions for data engineer roles. Transformations versus actions. Lazy evaluation and the execution model. Partition strategy and repartition versus coalesce. Catalyst optimizer and Tungsten execution engine. Schema management with StructType. UDFs and pandas UDFs.

Spark DataFrame interview questions for data engineer roles test the conceptual model of Spark execution and the practical use of the DataFrame API. A DataFrame is Spark's distributed collection of rows organized into named columns with a schema. Transformations (filter, select, groupBy, join) are lazy: they build up a logical plan but do not execute. Actions (count, collect, show, write) are eager: they trigger execution of the entire transformation chain. Because nothing is cached by default, every action re-runs that chain from the source. The data engineer who confuses transformations with actions sprinkles show() or count() between transformations for debugging, or calls collect() in a loop, and is then surprised by the slowdown.

The Catalyst optimizer turns the DataFrame's logical plan into a physical plan. It applies rule-based optimizations (predicate pushdown, column pruning, constant folding) and cost-based optimizations (join reordering when statistics are available and the cost-based optimizer is enabled). The Tungsten execution engine then generates Java bytecode for the physical operators (whole-stage code generation), which avoids per-row virtual function calls and interpretation overhead. Together they make DataFrame code at least as fast as equivalent hand-written RDD code, and usually faster, with far less effort. The data engineer interview round expects awareness of Catalyst and Tungsten as the abstractions underneath the API.

Partition strategy is the most-asked DataFrame topic at L5+. df.rdd.getNumPartitions() returns the current partition count. spark.sql.shuffle.partitions controls the post-shuffle count (default 200, often too high for small data and too low for large data). repartition(N) performs a full shuffle into N partitions: expensive, but it can increase or decrease the count and produces an even distribution. coalesce(N) merges existing partitions without a shuffle: cheap, but it can only decrease the count and can leave partitions uneven. Use repartition before a join or write where balance matters; use coalesce after a filter that reduced data volume.

UDFs and pandas UDFs. A traditional Python UDF (created with udf(lambda x: ...)) runs in a separate Python worker process: each row is serialized from the JVM to Python and back, which incurs per-row overhead. Pandas UDFs (@pandas_udf) exchange batches of rows with the Python worker through Apache Arrow and operate on them in vectorized form, with much lower overhead. For most data engineer interview problems, built-in functions (col, when, regexp_extract, date_format) are faster than UDFs because they run inside the JVM, are visible to Catalyst, and compile to Tungsten bytecode. Use UDFs only when no built-in covers the logic; use pandas UDFs over Python UDFs when UDFs are necessary.

Schema management with StructType. df.schema returns the StructType describing the columns. Explicit schemas at read time (spark.read.schema(schema).csv(...)) avoid the inferSchema pass that scans the file for type detection. For pipeline reliability, define schemas explicitly and version them; auto-inference is acceptable for ad-hoc exploration but not production pipelines. PySpark's StructType plus StructField plus type primitives compose the schema; the data engineer interview round expects fluency with this API.

Companies whose data engineer interviews emphasize Spark DataFrame internals: Databricks (Spark itself; deep Catalyst/Tungsten questions), Netflix (Iceberg plus DataFrame), Uber (DataFrame workloads at very large scale), and Airbnb. Most other Spark-first companies focus more on usage than internals.

Spark DataFrame Interview Questions

DataFrame API interview questions for data engineer prep.

Common questions

What is the difference between a Spark transformation and an action?
Transformations (filter, select, groupBy, join, withColumn) are lazy: they build the logical plan but do not execute. Actions (count, collect, show, write) are eager: they trigger execution through the entire transformation chain. The data engineer who confuses them writes code that calls collect() in a loop and is shocked by the slowdown.
What is Catalyst in Spark?
Spark's query optimizer. Compiles the logical plan (DataFrame or SQL) to a physical plan with rule-based optimizations (predicate pushdown, column pruning, constant folding) and cost-based optimizations (join reordering when stats are available). The output of Catalyst feeds Tungsten for code generation. The data engineer interview round expects awareness of Catalyst as the abstraction underneath the API.
What is Tungsten in Spark?
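Spark's execution engine. Generates Java bytecode for the physical operators produced by Catalyst (whole-stage code generation), avoiding per-row virtual function calls and interpretation overhead. Stores rows in a compact binary format with explicitly managed memory (optionally off-heap), which reduces garbage-collection pressure and improves cache locality. The combination of Catalyst + Tungsten makes DataFrame code at least as fast as equivalent hand-written RDD code, and usually faster, with far less effort.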
What is the difference between repartition and coalesce in Spark?
repartition(N) shuffles to N partitions: expensive but lets you increase or decrease and gives even distribution. coalesce(N) merges existing partitions without shuffle: cheap, but only decreases and can produce uneven distribution. Use repartition before a join or write where balance matters; use coalesce after a filter that reduced data volume.
When should a data engineer use a UDF in Spark?
Only when no built-in function covers the logic. Built-in functions (col, when, regexp_extract, date_format) run inside the JVM, compile to Tungsten bytecode, and are often several times faster than Python UDFs (figures in the 5-10x range are commonly quoted, but the gap depends on the workload). When a UDF is necessary, use pandas UDFs (@pandas_udf), which operate on batches of rows in vectorized form and are much faster than traditional Python UDFs.
How does a data engineer manage schemas in PySpark?
Explicit schemas at read time avoid the inferSchema pass: spark.read.schema(schema).csv(path). For production pipelines, define schemas with StructType + StructField + type primitives; version them in the codebase. Auto-inference is acceptable for ad-hoc exploration but not for pipelines where schema drift matters.
What is the default value of spark.sql.shuffle.partitions and when should it be changed?
Default is 200. Often too high for small data (many tiny tasks and, on write, many small files) and too low for very large data (each partition is too big and each task takes too long). Rule of thumb: target 100-200 MB per partition, so a 100 GB shuffle needs roughly 500-1000 partitions. Adaptive Query Execution (AQE, available since Spark 3.0 and enabled by default since 3.2) coalesces small post-shuffle partitions automatically.
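The rule-of-thumb arithmetic can be written as a small helper. This is a sketch, not a Spark API: the function name shuffle_partitions and the 128 MiB default target are assumptions chosen to sit inside the 100-200 MB range.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def shuffle_partitions(shuffle_bytes: int, target_partition_bytes: int = 128 * MIB) -> int:
    """Return a partition count that keeps each partition near the target size."""
    if shuffle_bytes <= 0:
        return 1
    # Integer ceiling division: round up so no partition exceeds the target.
    return (shuffle_bytes + target_partition_bytes - 1) // target_partition_bytes


# A 100 GiB shuffle at three different target sizes.
at_100_mib = shuffle_partitions(100 * GIB, 100 * MIB)
at_128_mib = shuffle_partitions(100 * GIB)
at_200_mib = shuffle_partitions(100 * GIB, 200 * MIB)
print(at_100_mib, at_128_mib, at_200_mib)

# The chosen value would then be applied with something like:
#   spark.conf.set("spark.sql.shuffle.partitions", str(at_128_mib))
```

With binary units the results land slightly above the round 500-1000 figures, which is close enough for a sizing estimate.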
What is the lifecycle of a Spark DataFrame query?
1. Logical plan built from transformations. 2. Catalyst applies rule-based and cost-based optimization. 3. Physical plan generated. 4. Tungsten generates Java bytecode. 5. Execution on executors. 6. Action returns result to driver or writes to sink. Lazy evaluation means step 1 happens as each transformation is declared, while steps 2-6 run only when an action is called.