PySpark Tutorial: SparkSession to Production Write in 6 Steps
Each step covers the API, what interviewers test, and why the operation matters on a 500GB table. PySpark accounts for ~70% of Spark API usage in production.
Create a SparkSession and Load Data
from pyspark.sql import SparkSession
# SparkSession controls your app name, memory, and cluster connection.
# Interviewers ask what happens when two sessions share a JVM.
spark = SparkSession.builder \
    .appName("tutorial") \
    .getOrCreate()
# inferSchema reads the file twice: once to detect types, once to load.
# In production, always define the schema explicitly to avoid that double read.
df = spark.read.csv("data/employees.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)

SparkSession is the entry point for all Spark operations. In production, you configure memory (spark.executor.memory), parallelism (spark.sql.shuffle.partitions, 200 by default), and serialization here. Interviewers test whether you know that inferSchema doubles the read cost on large files, and that appName is how you find your job in the Spark UI.
Select, Filter, and Transform Columns
from pyspark.sql import functions as F
# select() picks columns. At scale, dropping unused columns early
# reduces shuffle volume in downstream joins.
df.select("name", "department", "salary").show()
# filter() is a narrow transform: no shuffle, each partition processes independently.
senior = df.filter(F.col("salary") > 100000)
# withColumn creates a new DataFrame. Chaining 20+ withColumn calls
# builds a deep logical plan that slows Catalyst optimization.
with_tax = df.withColumn("after_tax", F.col("salary") * 0.7)
# Rename before joining to avoid ambiguous column errors.
renamed = df.withColumnRenamed("dept", "department")

These are narrow transformations: each partition processes independently, with no data shuffled across the network. Operations are lazy: nothing executes until you call an action like show() or count(). Interviewers test whether you know that pruning columns early (selecting only the ones you need) reduces shuffle bytes in later stages. On a 2TB table, dropping 3 unused columns before a join can cut shuffle volume by 40%.
Join Two DataFrames
# Inner join: most common in interviews.
# Spark picks sort-merge join when both sides exceed 10MB (spark.sql.autoBroadcastJoinThreshold).
result = employees.join(departments, on="dept_id", how="inner")
# Broadcast join: send the small table to every executor.
# Eliminates the shuffle of the large table entirely.
# Only works when one side fits comfortably in driver and executor memory.
result = employees.join(
    F.broadcast(departments),
    on="dept_id"
)
# Left anti join: find rows with no match.
# Interviewers use this to test whether you know all join types.
orphans = employees.join(departments, on="dept_id", how="left_anti")

Joins trigger shuffles. A shuffle moves data across the network so rows with the same key land on the same partition. For a 500GB table, that shuffle can take 30+ minutes. Broadcast joins avoid the shuffle by sending the small table (under 10MB by default) to every executor. Interviewers test whether you know the threshold, when to override it, and what happens when the broadcast table grows past driver memory. A strong answer mentions that Spark defaults to sort-merge for large joins.
GroupBy and Aggregate
# groupBy triggers a shuffle: all rows with the same key
# must land on the same partition.
dept_stats = employees.groupBy("department").agg(
    F.count("*").alias("headcount"),
    F.avg("salary").alias("avg_salary"),
    F.max("salary").alias("max_salary")
)
# Multiple grouping columns increase partition count.
# If one group holds 80% of the data, you get skew.
monthly = sales.groupBy("store", F.month("date").alias("month")).agg(
    F.sum("revenue").alias("total_revenue")
)

groupBy is a wide transformation. Spark redistributes all rows so records with the same key end up on the same partition, using 200 partitions by default (spark.sql.shuffle.partitions). If one key holds a disproportionate share of the data, its partition becomes a bottleneck. Interviewers test whether you can spot aggregation skew and fix it with salting or two-phase aggregation: first aggregate within salt groups, then aggregate across them.
Window Functions
from pyspark.sql.window import Window
# partitionBy defines groups. orderBy defines sort within each group.
# Interviewers test whether you know the difference between
# rank() (ties share a rank, leaving gaps after) and
# row_number() (unique per row, arbitrary tiebreak on ties).
w = Window.partitionBy("department").orderBy(F.desc("salary"))
ranked = employees.withColumn("rank", F.rank().over(w))
# Running total with explicit frame boundary.
# rowsBetween(unboundedPreceding, 0) means "from first row to current row."
# Without this, Spark uses rangeBetween, which behaves differently on ties.
w2 = Window.partitionBy("account_id").orderBy("date") \
    .rowsBetween(Window.unboundedPreceding, 0)
with_running = transactions.withColumn(
    "running_balance", F.sum("amount").over(w2)
)

Window functions compute a value for each row from a group of related rows, without collapsing the DataFrame. They appear in 15.3% of PySpark interview questions. The partitionBy clause acts like a groupBy that does not reduce rows. The frame boundary (rowsBetween vs rangeBetween) changes results when the orderBy column has ties, and interviewers specifically test that edge case.
Write Results
# Parquet: columnar, compressed, splittable.
# This is the default production format.
result.write.parquet("output/employees_enriched", mode="overwrite")
# partitionBy on write creates a directory per partition value.
# Enables partition pruning: queries filtering on department
# skip reading other directories entirely.
result.write.partitionBy("department").parquet("output/by_dept")
# coalesce(10) reduces 200 partitions to 10 output files.
# Without this, you get 200 small files per write.
# The tradeoff: coalesce can create uneven files if partition sizes vary.
result.coalesce(10).write.parquet("output/compact")

In production, Parquet is the standard. It compresses 5-10x better than CSV and supports predicate pushdown: Spark reads only the row groups that match your filter. The small-files problem is the most common production issue: 200 shuffle partitions means 200 output files per write, multiplied across daily runs. coalesce() reduces file count but can create skewed files. Delta Lake and Iceberg solve this with automatic compaction. Interviewers test whether you know the tradeoff between file count and file size.