PySpark Functions Cheat Sheet (2026)
Spark 3.5 ships with 1,500+ built-in functions. These are the ones that appear in data engineering interviews, organized by category: column ops, aggregation, window, string, date, and array/map.
Column Operations
| F.col('name') | Reference a column by name |
| F.lit(42) | Create a literal value column |
| F.when(cond, val).otherwise(default) | Conditional (CASE WHEN) |
| F.coalesce('a', 'b', 'c') | First non-null value |
| F.isnull('col') / F.isnan('col') | NULL vs NaN check |
| F.col('a').cast('int') | Type casting |
| F.col('a').alias('new_name') | Rename in select() |
Aggregation
| F.count('*') | Row count |
| F.countDistinct('col') | Unique value count |
| F.sum('col') / F.avg('col') | Sum and mean |
| F.min('col') / F.max('col') | Min and max |
| F.collect_list('col') | Gather into array (with dups) |
| F.collect_set('col') | Gather into array (unique) |
| F.approx_count_distinct('col') | HyperLogLog approximate count (2% error) |
| F.stddev('col') / F.variance('col') | Standard deviation and variance |
| F.percentile_approx('col', 0.5) | Approximate median/percentile |
Window Functions
| F.row_number().over(w) | Sequential number (no ties, arbitrary tiebreak) |
| F.rank().over(w) | Rank with gaps on ties |
| F.dense_rank().over(w) | Rank without gaps |
| F.lag('col', 1).over(w) | Previous row value |
| F.lead('col', 1).over(w) | Next row value |
| F.sum('col').over(w) | Running sum over window frame |
| F.first('col').over(w) / F.last('col').over(w) | First/last in window |
| F.ntile(4).over(w) | Split into N equal buckets |
String Functions
| F.upper('col') / F.lower('col') | Case conversion |
| F.trim('col') / F.ltrim() / F.rtrim() | Whitespace removal |
| F.length('col') | String length |
| F.substring('col', pos, len) | Extract substring (1-indexed) |
| F.regexp_extract('col', pattern, group) | Regex capture group |
| F.regexp_replace('col', pattern, replacement) | Regex replace |
| F.split('col', delimiter) | Split into array |
| F.concat_ws(sep, 'a', 'b') | Join with separator |
| F.lpad('col', len, pad) / F.rpad() | Pad to fixed length |
Date and Timestamp
| F.current_date() / F.current_timestamp() | Current date/time |
| F.to_date('col', fmt) | Parse string to date |
| F.to_timestamp('col', fmt) | Parse string to timestamp |
| F.year('col') / F.month() / F.dayofmonth() | Extract date parts |
| F.date_trunc('month', 'col') | Truncate to period start |
| F.datediff('end', 'start') | Days between two dates |
| F.months_between('end', 'start') | Months between (decimal) |
| F.date_add('col', days) | Add days to date |
| F.date_format('col', fmt) | Format date as string |
Array and Map
| F.explode('array_col') | One row per element (drops nulls) |
| F.explode_outer('array_col') | One row per element (keeps nulls) |
| F.array('a', 'b', 'c') | Create array from columns |
| F.array_contains('arr', value) | Check if array contains value |
| F.size('arr') | Array or map length |
| F.map_keys('map') / F.map_values('map') | Extract keys or values |
| F.create_map('k', 'v') | Build map from key-value columns |
Import Pattern
| from pyspark.sql import functions as F | Standard import (always use this) |
| from pyspark.sql.window import Window | Window spec builder |
| from pyspark.sql.types import * | Type definitions for explicit schemas |
Frequently Asked Questions
What is the standard import for pyspark.sql.functions?+
Which functions appear most often in interviews?+
What is the difference between row_number, rank, and dense_rank?+
The candidate who gets the offer
- 01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
- 02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
- 03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition