PySpark Tutorial: SparkSession to Production Write in 6 Steps
Each step covers the API, what interviewers test, and why the operation matters on a 500GB table. PySpark accounts for ~70% of Spark API usage in production.
Create a SparkSession and Load Data
from pyspark.sql import SparkSession
# SparkSession controls your app name, memory, and cluster connection.
# Interviewers ask what happens when two sessions share a JVM.
spark = SparkSession.builder \
    .appName("tutorial") \
    .getOrCreate()
# inferSchema reads the file twice: once to detect types, once to load.
# In production, always define the schema explicitly to avoid that double read.
df = spark.read.csv("data/employees.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)

SparkSession is the entry point for all Spark operations. In production, you configure memory (spark.executor.memory), parallelism (spark.sql.shuffle.partitions, 200 by default), and serialization here. Interviewers test whether you know that inferSchema doubles the read cost on large files, and that appName is how you find your job in the Spark UI.
Select, Filter, and Transform Columns
from pyspark.sql import functions as F
# select() picks columns. At scale, dropping unused columns early
# reduces shuffle volume in downstream joins.
df.select("name", "department", "salary").show()
# filter() is a narrow transform: no shuffle, each partition processes independently.
senior = df.filter(F.col("salary") > 100000)
# withColumn creates a new DataFrame. Chaining 20+ withColumn calls
# builds a deep logical plan that slows Catalyst optimization.
with_tax = df.withColumn("after_tax", F.col("salary") * 0.7)
# Rename before joining to avoid ambiguous column errors.
renamed = df.withColumnRenamed("dept", "department")

These are narrow transformations: each partition processes independently, with no data shuffled across the network. Operations are lazy: nothing executes until you call an action like show() or count(). Interviewers test whether you know that pruning columns early (selecting only the ones you need) reduces shuffle bytes in later stages. On a 2TB table, dropping 3 unused columns before a join can cut shuffle volume by 40%.
Join Two DataFrames
# Inner join: most common in interviews.
# Spark picks sort-merge join when both sides exceed 10MB (spark.sql.autoBroadcastJoinThreshold).
result = employees.join(departments, on="dept_id", how="inner")
# Broadcast join: send the small table to every executor.
# Eliminates the shuffle of the large table entirely.
# Only works when one side fits comfortably in driver and executor memory.
result = employees.join(
    F.broadcast(departments),
    on="dept_id"
)
# Left anti join: find rows with no match.
# Interviewers use this to test whether you know all join types.
orphans = employees.join(departments, on="dept_id", how="left_anti")

Joins trigger shuffles. A shuffle moves data across the network so rows with the same key land on the same partition. For a 500GB table, that shuffle can take 30+ minutes. Broadcast joins avoid the shuffle by sending the small table (under 10MB by default) to every executor. Interviewers test whether you know the threshold, when to override it, and what happens when the broadcast table grows past driver memory. A strong answer mentions that Spark defaults to sort-merge for large joins.
GroupBy and Aggregate
# groupBy triggers a shuffle: all rows with the same key
# must land on the same partition.
dept_stats = employees.groupBy("department").agg(
    F.count("*").alias("headcount"),
    F.avg("salary").alias("avg_salary"),
    F.max("salary").alias("max_salary")
)
# Multiple grouping columns increase partition count.
# If one group holds 80% of the data, you get skew.
monthly = sales.groupBy("store", F.month("date").alias("month")).agg(
    F.sum("revenue").alias("total_revenue")
)

groupBy is a wide transformation. Spark redistributes all rows so records with the same key end up on the same partition, using 200 partitions by default (spark.sql.shuffle.partitions). If one key holds a disproportionate share of the data, its partition becomes a bottleneck. Interviewers test whether you can spot aggregation skew and fix it with salting or two-phase aggregation: first aggregate within salt groups, then aggregate across them.
Window Functions
from pyspark.sql.window import Window
# partitionBy defines groups. orderBy defines sort within each group.
# Interviewers test whether you know the difference between
# rank() (ties share a rank, leaving gaps after) and
# row_number() (unique per row, arbitrary tiebreak on ties).
w = Window.partitionBy("department").orderBy(F.desc("salary"))
ranked = employees.withColumn("rank", F.rank().over(w))
# Running total with explicit frame boundary.
# rowsBetween(unboundedPreceding, 0) means "from first row to current row."
# Without this, Spark uses rangeBetween, which behaves differently on ties.
w2 = Window.partitionBy("account_id").orderBy("date") \
    .rowsBetween(Window.unboundedPreceding, 0)
with_running = transactions.withColumn(
    "running_balance", F.sum("amount").over(w2)
)

Window functions compute a value for each row from a group of related rows, without collapsing the DataFrame. They appear in 15.3% of PySpark interview questions. The partitionBy clause acts like a groupBy that does not reduce rows. The frame boundary (rowsBetween vs rangeBetween) changes results when the orderBy column has ties, and interviewers specifically test that edge case.
Write Results
# Parquet: columnar, compressed, splittable.
# This is the default production format.
result.write.parquet("output/employees_enriched", mode="overwrite")
# partitionBy on write creates a directory per partition value.
# Enables partition pruning: queries filtering on department
# skip reading other directories entirely.
result.write.partitionBy("department").parquet("output/by_dept")
# coalesce(10) reduces 200 partitions to 10 output files.
# Without this, you get 200 small files per write.
# The tradeoff: coalesce can create uneven files if partition sizes vary.
result.coalesce(10).write.parquet("output/compact")

In production, Parquet is the standard. It compresses 5-10x better than CSV and supports predicate pushdown: Spark reads only the row groups that match your filter. The small-files problem is the most common production issue: 200 shuffle partitions means 200 output files per write, multiplied across daily runs. coalesce() reduces file count but can create skewed files. Delta Lake and Iceberg solve this with automatic compaction. Interviewers test whether you know the tradeoff between file count and file size.