Data Engineering Interview Prep

What Is PySpark? Python API for Apache Spark

PySpark is the Python interface to Apache Spark. Based on GitHub activity, it accounts for roughly 70% of all Spark API usage, compared to about 25% for Scala and 5% for Java. When a job posting says "Spark experience required," it almost always means PySpark.

Spark powers petabyte-scale daily pipelines at companies like Netflix, Uber, and Apple. Databricks, built on Spark, reached $1.6B annual revenue in 2024.

PySpark Is the Python API for Apache Spark

PySpark lets you write distributed data processing code in Python. Under the hood, your DataFrame operations compile to the same execution plan as Scala. The Catalyst optimizer treats them identically, applying 100+ optimization rules before generating a physical plan.

The DataFrame API was introduced in Spark 1.3 (2015) and replaced RDDs as the primary interface. Today, Spark 3.5 ships with hundreds of built-in functions in pyspark.sql.functions. Interviewers expect you to use these functions instead of writing Python UDFs, which serialize data between the JVM and Python for every row.

Why PySpark Matters for Data Engineers

Every major data platform runs Spark: Databricks, AWS EMR, Google Dataproc, Azure Synapse. A typical production cluster runs 50 to 500 executors with 4 to 8 cores each. If a company processes more than a few terabytes daily, they are almost certainly running Spark.

Knowing PySpark is not optional for data engineering roles at these companies. Interviewers test whether you have built and maintained a real project, not whether you can recite documentation. The tradeoff questions matter: when to broadcast vs. sort-merge join, when to cache vs. recompute, when to repartition vs. coalesce.

PySpark vs Pandas

Pandas runs on a single machine and holds everything in memory. PySpark distributes work across a cluster. As a rule of thumb, for datasets under 10GB Pandas is simpler and faster to iterate with; beyond that, a single machine runs out of memory and a distributed engine like PySpark becomes necessary.

The key behavioral difference: PySpark operations are lazy. Nothing executes until you call an action such as .show(), .count(), or a write (e.g., .write.parquet(...)). Pandas operations execute immediately. This catches people who call .count() inside a loop, forcing a full DAG evaluation on every iteration.

PySpark vs Spark Scala

Both compile to the same physical plan. Performance is identical for DataFrame operations. The difference appears with UDFs: Python UDFs serialize data between the JVM and Python, adding overhead that can make them 10x to 100x slower than native functions. Pandas UDFs (vectorized) close most of this gap by operating on Arrow batches instead of individual rows.

Choose PySpark if your team writes Python. Choose Scala if you need custom RDD operations or maximum UDF performance. For interview prep, the concepts are identical across both languages.

The PySpark Execution Model

When you write df.filter(...).groupBy(...).agg(...), nothing happens: PySpark just builds a logical plan. When you trigger an action (.show(), .count(), a write), the Catalyst optimizer converts that logical plan to a physical plan, splits it into stages at shuffle boundaries, and distributes tasks across executors.

Shuffle write/read is the most common bottleneck in slow Spark jobs. spark.sql.shuffle.partitions defaults to 200. Executor memory is split roughly 60% to the unified pool (spark.memory.fraction = 0.6). Understanding this model is what separates tutorial knowledge from interview readiness.

Core PySpark APIs You Need to Know

DataFrame API

df.select('name', 'salary').filter(F.col('salary') > 100000)

The primary interface since Spark 1.3. Column-oriented and optimized by Catalyst.

Spark SQL

spark.sql('SELECT name, salary FROM employees WHERE salary > 100000')

SQL strings on registered views. Same execution plan as DataFrames. Performance is identical.

Window Functions

F.row_number().over(Window.partitionBy('dept').orderBy(F.desc('salary')))

Ranking, running totals, and lag/lead across partitions. The most common interview topic after joins.

GroupBy Aggregations

df.groupBy('dept').agg(F.sum('salary'), F.countDistinct('employee_id'))

Wide transformation. Creates a shuffle boundary. Know the difference between groupBy().agg() and groupByKey().

Joins

fact_df.join(F.broadcast(dim_df), 'customer_id')

BroadcastHashJoin is O(n). SortMergeJoin is O(n log n). The join strategy matters more than the syntax. autoBroadcastJoinThreshold defaults to 10MB.

PySpark FAQ

Is PySpark hard to learn?
If you know Python and SQL, the DataFrame API is straightforward. The syntax mirrors Pandas in many ways. The hard part is understanding distributed execution: shuffles, partitions, and executor memory. That is what separates someone who completed a tutorial from someone who can debug a production job.

Do I need a cluster to learn PySpark?
No. You can run PySpark locally with spark.master set to local[*]. For interview prep, local execution covers everything you need. DataDriven also runs a Spark-compatible engine in the browser that handles PySpark syntax without any local installation.

Is PySpark still relevant in 2026?
Yes. PySpark accounts for roughly 70% of Spark API usage based on GitHub activity. Databricks, which reached $1.6B annual revenue in 2024, is built on Spark. Most data engineering job postings at companies processing over 1TB daily list PySpark as a requirement.

What is the difference between PySpark and regular Python?
Regular Python runs on one machine. PySpark distributes work across a cluster of executors (often 50 to 500 in production). You write Python code, but PySpark translates your DataFrame operations into a distributed execution plan that the Catalyst optimizer refines through 100+ rules before a single byte moves.

Practice PySpark Interview Questions

DataDriven runs PySpark and Scala Spark code in your browser against real datasets. Write joins, window functions, and aggregations, then defend your approach in AI mock interviews.

Start Practicing