Spark SQL Functions for Data Engineers (2026)
Spark 3.5 ships with 1,500+ built-in functions in pyspark.sql.functions. These are the ones data engineers use in production and get tested on in interviews: aggregations, window functions, string operations, date handling, and conditionals.
Spark SQL Aggregation Functions
F.sum()
df.groupBy('dept').agg(F.sum('salary'))Sums a column within each group.
F.count()
df.groupBy('dept').agg(F.count('*'))Counts rows per group. Use F.countDistinct() for unique values.
F.avg()
df.groupBy('dept').agg(F.avg('salary').alias('avg_salary'))Mean value. Alias the output column or it becomes avg(salary).
F.min() / F.max()
df.agg(F.min('created_at'), F.max('created_at'))Works on numbers, strings, and dates.
F.countDistinct()
df.groupBy('dept').agg(F.countDistinct('employee_id'))Approximate alternative: F.approx_count_distinct() for large datasets.
F.collect_list() / F.collect_set()
df.groupBy('dept').agg(F.collect_list('name'))Gathers values into an array. collect_set removes duplicates. Beware: pulls all values to one partition.
Spark SQL Window Functions
F.row_number()
F.row_number().over(Window.partitionBy('dept').orderBy(F.desc('salary')))Sequential numbering within each partition. The most common window function in interviews.
F.rank() / F.dense_rank()
F.rank().over(Window.partitionBy('dept').orderBy('score'))rank() leaves gaps on ties; dense_rank() does not.
F.lag() / F.lead()
F.lag('revenue', 1).over(Window.partitionBy('store').orderBy('date'))Access previous/next row values. Essential for calculating period-over-period changes.
F.sum().over()
F.sum('amount').over(Window.partitionBy('account').orderBy('date').rowsBetween(Window.unboundedPreceding, 0))Running total. The rowsBetween clause controls the window frame.
Spark SQL String Functions
F.regexp_extract()
F.regexp_extract('url', r'/product/(\d+)', 1)Extracts a regex capture group. Group 0 returns the full match.
F.concat() / F.concat_ws()
F.concat_ws('-', 'year', 'month', 'day')concat_ws adds a separator. concat does not.
F.trim() / F.lower() / F.upper()
df.withColumn('clean', F.trim(F.lower('name')))Chain them for normalization before joins.
F.split()
F.split('tags', ',')Returns an ArrayType column. Often followed by F.explode().
Spark SQL Date and Timestamp Functions
F.current_date() / F.current_timestamp()
df.withColumn('today', F.current_date())Evaluated once per query, not per row.
F.datediff() / F.months_between()
F.datediff('end_date', 'start_date')Returns integer days. months_between returns a decimal.
F.date_trunc()
F.date_trunc('month', 'event_time')Truncates to the start of the given unit. Useful for time-series grouping.
F.to_date() / F.to_timestamp()
F.to_date('date_str', 'yyyy-MM-dd')Parses strings into date/timestamp types. The format string matters.
Spark SQL Conditional Functions
F.when() / F.otherwise()
F.when(F.col('score') >= 90, 'A').when(F.col('score') >= 80, 'B').otherwise('C')Spark's equivalent of CASE WHEN. Chain multiple conditions.
F.coalesce()
F.coalesce('preferred_email', 'work_email', 'personal_email')Returns the first non-null value. Different from the partition coalesce() method.
F.isnull() / F.isnan()
df.filter(F.isnull('email'))isnull checks for SQL NULL. isnan checks for NaN (float). They are not the same.
Spark SQL Functions FAQ
What is the difference between import pyspark.sql.functions and from pyspark.sql import functions as F?+
Are Spark SQL functions the same as PySpark functions?+
Can I use these functions in Spark Scala too?+
How many built-in functions does Spark have?+
Why should I avoid UDFs when a built-in function exists?+
Practice Spark SQL Functions
- 01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
- 02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
- 03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition