Data Engineering Interview Prep

Spark SQL Functions for Data Engineers

Spark 3.5 ships with 1,500+ built-in functions in pyspark.sql.functions. These are the ones data engineers use in production and get tested on in interviews: aggregations, window functions, string operations, date handling, and conditionals.

Every built-in function runs natively on the JVM. A Python UDF doing the same work can be 10x to 100x slower because every row must be serialized between the JVM and the Python worker.
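All snippets below assume the standard imports and an active SparkSession; the session name and the DataFrame df are placeholders for your own:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
# df = spark.read.parquet(...)  # any DataFrame with the columns shown below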

Spark SQL Aggregation Functions

F.sum()

df.groupBy('dept').agg(F.sum('salary'))

Sums a column within each group.

F.count()

df.groupBy('dept').agg(F.count('*'))

Counts rows per group. Use F.countDistinct() for unique values.

F.avg()

df.groupBy('dept').agg(F.avg('salary').alias('avg_salary'))

Computes the mean per group. Alias the output column or it will be named avg(salary).
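A minimal end-to-end sketch of the aggregation pattern above (the sample data is illustrative):

data = [('eng', 100), ('eng', 120), ('sales', 90)]
df = spark.createDataFrame(data, ['dept', 'salary'])

df.groupBy('dept').agg(
    F.sum('salary').alias('total_salary'),
    F.avg('salary').alias('avg_salary'),
    F.count('*').alias('headcount'),
).show()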

F.min() / F.max()

df.agg(F.min('created_at'), F.max('created_at'))

Works on numbers, strings, and dates.

F.countDistinct()

df.groupBy('dept').agg(F.countDistinct('employee_id'))

Counts unique values per group, exactly. For large datasets, F.approx_count_distinct() is a faster approximate alternative (HyperLogLog-based).
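If an approximation is acceptable, the error bound is tunable via the rsd (relative standard deviation) parameter; a sketch:

df.groupBy('dept').agg(
    F.approx_count_distinct('employee_id', rsd=0.01).alias('distinct_employees')
)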

F.collect_list() / F.collect_set()

df.groupBy('dept').agg(F.collect_list('name'))

Gathers values into an array. collect_set removes duplicates. Beware: every value for a group ends up in a single array in one row, which can exhaust executor memory on skewed groups.
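A sketch of the difference, with illustrative data:

df = spark.createDataFrame([('eng', 'a'), ('eng', 'a'), ('eng', 'b')], ['dept', 'name'])
df.groupBy('dept').agg(
    F.collect_list('name').alias('all_names'),    # ['a', 'a', 'b']
    F.collect_set('name').alias('unique_names'),  # ['a', 'b'], order not guaranteed
).show(truncate=False)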

Spark SQL Window Functions

F.row_number()

F.row_number().over(Window.partitionBy('dept').orderBy(F.desc('salary')))

Sequential numbering within each partition. The most common window function in interviews.
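The classic interview pattern built on row_number is top-N per group; a sketch assuming a DataFrame with dept and salary columns:

w = Window.partitionBy('dept').orderBy(F.desc('salary'))
top3 = (
    df.withColumn('rn', F.row_number().over(w))
      .filter(F.col('rn') <= 3)  # keep the 3 highest-paid per dept
      .drop('rn')
)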

F.rank() / F.dense_rank()

F.rank().over(Window.partitionBy('dept').orderBy('score'))

rank() leaves gaps on ties; dense_rank() does not.
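A sketch of the tie behavior for scores 95, 95, 90 (ordering the whole DataFrame, so no partitionBy):

w = Window.orderBy(F.desc('score'))
df.select(
    'score',
    F.rank().over(w).alias('rank'),              # 1, 1, 3  (gap after the tie)
    F.dense_rank().over(w).alias('dense_rank'),  # 1, 1, 2  (no gap)
)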

F.lag() / F.lead()

F.lag('revenue', 1).over(Window.partitionBy('store').orderBy('date'))

Access previous/next row values. Essential for calculating period-over-period changes.
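A period-over-period sketch (the store, date, and revenue columns are illustrative):

w = Window.partitionBy('store').orderBy('date')
df.withColumn('prev_revenue', F.lag('revenue', 1).over(w)) \
  .withColumn('revenue_change', F.col('revenue') - F.col('prev_revenue'))

The first row in each partition has no predecessor, so prev_revenue (and therefore revenue_change) is NULL there.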

F.sum().over()

F.sum('amount').over(Window.partitionBy('account').orderBy('date').rowsBetween(Window.unboundedPreceding, Window.currentRow))

Running total. The rowsBetween clause sets the window frame; with an orderBy but no explicit frame, Spark defaults to a RANGE frame from the start of the partition to the current row.

Spark SQL String Functions

F.regexp_extract()

F.regexp_extract('url', r'/product/(\d+)', 1)

Extracts a regex capture group. Group 0 returns the full match.
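A runnable sketch with an illustrative URL:

df = spark.createDataFrame([('/product/42?ref=home',)], ['url'])
df.select(F.regexp_extract('url', r'/product/(\d+)', 1).alias('product_id'))
# product_id = '42' -- the result is a string; non-matching rows return ''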

F.concat() / F.concat_ws()

F.concat_ws('-', 'year', 'month', 'day')

concat_ws adds a separator and skips NULL inputs; concat has no separator and returns NULL if any input is NULL.

F.trim() / F.lower() / F.upper()

df.withColumn('clean', F.trim(F.lower('name')))

Chain them for normalization before joins.

F.split()

F.split('tags', ',')

Returns an ArrayType column. Often followed by F.explode().
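split followed by explode is a common normalization step, producing one output row per array element; a sketch:

df = spark.createDataFrame([(1, 'red,green')], ['id', 'tags'])
df.select('id', F.explode(F.split('tags', ',')).alias('tag'))
# (1, 'red') and (1, 'green')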

Spark SQL Date and Timestamp Functions

F.current_date() / F.current_timestamp()

df.withColumn('today', F.current_date())

Evaluated once per query, not per row.

F.datediff() / F.months_between()

F.datediff('end_date', 'start_date')

Returns integer days. months_between returns a decimal.

F.date_trunc()

F.date_trunc('month', 'event_time')

Truncates to the start of the given unit. Useful for time-series grouping.
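A typical monthly rollup, assuming an event_time timestamp column:

df.groupBy(F.date_trunc('month', 'event_time').alias('month')) \
  .agg(F.count('*').alias('events')) \
  .orderBy('month')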

F.to_date() / F.to_timestamp()

F.to_date('date_str', 'yyyy-MM-dd')

Parses strings into date/timestamp types using Java DateTimeFormatter patterns (yyyy-MM-dd, not Python's %Y-%m-%d style); a string that does not match the format typically comes back NULL.

Spark SQL Conditional Functions

F.when() / F.otherwise()

F.when(F.col('score') >= 90, 'A').when(F.col('score') >= 80, 'B').otherwise('C')

Spark's equivalent of CASE WHEN. Chain multiple conditions.
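In context the chain is usually assigned with withColumn; conditions are evaluated in order, and otherwise is the fallback:

df.withColumn(
    'grade',
    F.when(F.col('score') >= 90, 'A')
     .when(F.col('score') >= 80, 'B')
     .otherwise('C'),
)

Without .otherwise(), rows matching no condition get NULL.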

F.coalesce()

F.coalesce('preferred_email', 'work_email', 'personal_email')

Returns the first non-null value. Not to be confused with df.coalesce(n), which reduces the number of partitions.

F.isnull() / F.isnan()

df.filter(F.isnull('email'))

isnull checks for SQL NULL. isnan checks for NaN (float). They are not the same.
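A sketch of the distinction; NaN is a legitimate non-NULL double value:

df = spark.createDataFrame([(None,), (float('nan'),)], 'value double')
df.select(
    'value',
    F.isnull('value').alias('is_null'),  # True for the NULL row, False for NaN
    F.isnan('value').alias('is_nan'),    # False for the NULL row, True for NaN
)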

Spark SQL Functions FAQ

What is the difference between import pyspark.sql.functions and from pyspark.sql import functions as F?
They are the same module. The convention is to import as F for brevity: from pyspark.sql import functions as F. Then you call F.col(), F.sum(), etc. This is the standard pattern in production PySpark code and what interviewers expect to see.
Are Spark SQL functions the same as PySpark functions?
Yes. pyspark.sql.functions maps directly to Spark SQL built-in functions. df.select(F.upper('name')) produces the same execution plan as spark.sql('SELECT upper(name) FROM table'). The Catalyst optimizer treats them identically.
Can I use these functions in Spark Scala too?
Yes. The same functions exist in org.apache.spark.sql.functions in Scala. The API surface is nearly identical. If you know the PySpark version, the Scala translation is straightforward.
How many built-in functions does Spark have?
Spark 3.5 (latest stable) ships with 1,500+ built-in functions in pyspark.sql.functions. This covers aggregation, window, string, date, math, collection, and conditional operations. Interviewers expect you to reach for built-in functions before writing a UDF.
Why should I avoid UDFs when a built-in function exists?
Python UDFs force data serialization between the JVM and Python for every row. This can be 10x to 100x slower than using a native Spark function. The Catalyst optimizer also cannot see inside a UDF, so it cannot apply predicate pushdown or column pruning through one. Pandas UDFs (vectorized) reduce this penalty by operating on Arrow batches, but built-in functions are still faster.

Practice Spark SQL Functions

DataDriven runs PySpark and Scala Spark code in your browser against real datasets. Write aggregations, window functions, and joins, then see results without spinning up a cluster.

Start Practicing