Quick Reference

PySpark Functions Cheat Sheet

Spark 3.5 ships with hundreds of built-in functions. These are the ones that appear in data engineering interviews, organized by category: column ops, aggregation, window, string, date, array/map, plus the standard import pattern.

7 categories · 55+ functions · interview-weighted

Column Operations

F.col('name') - Reference a column by name
F.lit(42) - Create a literal value column
F.when(cond, val).otherwise(default) - Conditional (CASE WHEN)
F.coalesce('a', 'b', 'c') - First non-null value
F.isnull('col') / F.isnan('col') - NULL vs NaN check
F.col('a').cast('int') - Type casting
F.col('a').alias('new_name') - Rename in select()

Aggregation

F.count('*') - Row count
F.countDistinct('col') - Unique value count
F.sum('col') / F.avg('col') - Sum and mean
F.min('col') / F.max('col') - Min and max
F.collect_list('col') - Gather into array (with dups)
F.collect_set('col') - Gather into array (unique)
F.approx_count_distinct('col') - HyperLogLog approximate count (default 5% relative error, tunable via rsd)
F.stddev('col') / F.variance('col') - Standard deviation and variance
F.percentile_approx('col', 0.5) - Approximate median/percentile

Window Functions

F.row_number().over(w) - Sequential number (no ties, arbitrary tiebreak)
F.rank().over(w) - Rank with gaps on ties
F.dense_rank().over(w) - Rank without gaps
F.lag('col', 1).over(w) - Previous row value
F.lead('col', 1).over(w) - Next row value
F.sum('col').over(w) - Running sum over window frame
F.first('col').over(w) / F.last('col').over(w) - First/last in window
F.ntile(4).over(w) - Split into N equal buckets

String Functions

F.upper('col') / F.lower('col') - Case conversion
F.trim('col') / F.ltrim() / F.rtrim() - Whitespace removal
F.length('col') - String length
F.substring('col', pos, len) - Extract substring (1-indexed)
F.regexp_extract('col', pattern, group) - Regex capture group
F.regexp_replace('col', pattern, replacement) - Regex replace
F.split('col', delimiter) - Split into array
F.concat_ws(sep, 'a', 'b') - Join with separator
F.lpad('col', len, pad) / F.rpad() - Pad to fixed length

Date and Timestamp

F.current_date() / F.current_timestamp() - Current date/time
F.to_date('col', fmt) - Parse string to date
F.to_timestamp('col', fmt) - Parse string to timestamp
F.year('col') / F.month() / F.dayofmonth() - Extract date parts
F.date_trunc('month', 'col') - Truncate to period start
F.datediff('end', 'start') - Days between two dates
F.months_between('end', 'start') - Months between (decimal)
F.date_add('col', days) - Add days to date
F.date_format('col', fmt) - Format date as string

Array and Map

F.explode('array_col') - One row per element (drops rows with null/empty arrays)
F.explode_outer('array_col') - One row per element (keeps null/empty rows as a null element)
F.array('a', 'b', 'c') - Create array from columns
F.array_contains('arr', value) - Check if array contains value
F.size('arr') - Array or map length
F.map_keys('map') / F.map_values('map') - Extract keys or values
F.create_map('k', 'v') - Build map from key-value columns

Import Pattern

from pyspark.sql import functions as F - Standard import (always use this)
from pyspark.sql.window import Window - Window spec builder
from pyspark.sql.types import * - Type definitions for explicit schemas

Frequently Asked Questions

What is the standard import for pyspark.sql.functions?
from pyspark.sql import functions as F. This is the universal convention. Call F.col(), F.sum(), F.when(), and so on. Importing individual functions clutters the namespace and makes code harder to read in interviews.
Which functions appear most often in interviews?
F.col() and F.when().otherwise() for transforms. F.row_number().over() and F.lag() for window problems. F.sum/avg/count for aggregations. F.broadcast() for join optimization. F.regexp_extract() for string parsing. These cover roughly 90% of interview questions.
What is the difference between row_number, rank, and dense_rank?
row_number assigns sequential numbers with no ties (arbitrary tiebreak). rank assigns the same number to ties but leaves gaps (1, 1, 3). dense_rank assigns the same number to ties without gaps (1, 1, 2). Interviewers test this distinction with deduplication problems: row_number picks one row per group, rank keeps all tied rows.