Quick Reference

PySpark Functions Cheat Sheet

Spark 3.5 ships with hundreds of built-in functions. These are the ones that appear in data engineering interviews, organized by category: column ops, aggregation, window, string, date, array/map, plus the standard import pattern.

7 categories · 55+ functions · interview-weighted

Column Operations

F.col('name') - Reference a column by name
F.lit(42) - Create a literal value column
F.when(cond, val).otherwise(default) - Conditional (CASE WHEN)
F.coalesce('a', 'b', 'c') - First non-null value
F.isnull('col') / F.isnan('col') - NULL vs NaN check
F.col('a').cast('int') - Type casting
F.col('a').alias('new_name') - Rename in select()

Aggregation

F.count('*') - Row count
F.countDistinct('col') - Unique value count
F.sum('col') / F.avg('col') - Sum and mean
F.min('col') / F.max('col') - Min and max
F.collect_list('col') - Gather into array (with dups)
F.collect_set('col') - Gather into array (unique)
F.approx_count_distinct('col') - HyperLogLog approximate count (default 5% relative error, tunable via rsd)
F.stddev('col') / F.variance('col') - Standard deviation and variance
F.percentile_approx('col', 0.5) - Approximate median/percentile

Window Functions

F.row_number().over(w) - Sequential number (no ties, arbitrary tiebreak)
F.rank().over(w) - Rank with gaps on ties
F.dense_rank().over(w) - Rank without gaps
F.lag('col', 1).over(w) - Previous row value
F.lead('col', 1).over(w) - Next row value
F.sum('col').over(w) - Running sum over window frame
F.first('col').over(w) / F.last('col').over(w) - First/last in window
F.ntile(4).over(w) - Split into N equal buckets

String Functions

F.upper('col') / F.lower('col') - Case conversion
F.trim('col') / F.ltrim() / F.rtrim() - Whitespace removal
F.length('col') - String length
F.substring('col', pos, len) - Extract substring (1-indexed)
F.regexp_extract('col', pattern, group) - Regex capture group
F.regexp_replace('col', pattern, replacement) - Regex replace
F.split('col', delimiter) - Split into array
F.concat_ws(sep, 'a', 'b') - Join with separator
F.lpad('col', len, pad) / F.rpad() - Pad to fixed length

Date and Timestamp

F.current_date() / F.current_timestamp() - Current date/time
F.to_date('col', fmt) - Parse string to date
F.to_timestamp('col', fmt) - Parse string to timestamp
F.year('col') / F.month() / F.dayofmonth() - Extract date parts
F.date_trunc('month', 'col') - Truncate to period start
F.datediff('end', 'start') - Days between two dates
F.months_between('end', 'start') - Months between (decimal)
F.date_add('col', days) - Add days to date
F.date_format('col', fmt) - Format date as string

Array and Map

F.explode('array_col') - One row per element (drops rows with null/empty arrays)
F.explode_outer('array_col') - One row per element (keeps null/empty rows as a null element)
F.array('a', 'b', 'c') - Create array from columns
F.array_contains('arr', value) - Check if array contains value
F.size('arr') - Array or map length
F.map_keys('map') / F.map_values('map') - Extract keys or values
F.create_map('k', 'v') - Build map from key-value columns

Import Pattern

from pyspark.sql import functions as F - Standard import (always use this)
from pyspark.sql.window import Window - Window spec builder
from pyspark.sql.types import * - Type definitions for explicit schemas

Frequently Asked Questions

What is the standard import for pyspark.sql.functions?
from pyspark.sql import functions as F. This is the universal convention. Call F.col(), F.sum(), F.when(), and so on. Importing individual functions clutters the namespace and makes code harder to read in interviews.
Which functions appear most often in interviews?
F.col() and F.when().otherwise() for transforms. F.row_number().over() and F.lag() for window problems. F.sum/avg/count for aggregations. F.broadcast() for join optimization. F.regexp_extract() for string parsing. These cover roughly 90% of interview questions.
What is the difference between row_number, rank, and dense_rank?
row_number assigns sequential numbers with no ties (arbitrary tiebreak). rank assigns the same number to ties but leaves gaps (1, 1, 3). dense_rank assigns the same number to ties without gaps (1, 1, 2). Interviewers test this distinction with deduplication problems: row_number picks one row per group, rank keeps all tied rows.