Spark SQL Functions for Data Engineers
Spark 3.5 ships with hundreds of built-in functions in pyspark.sql.functions. These are the ones data engineers use in production and get tested on in interviews: aggregations, window functions, string operations, date handling, and conditionals.
Every built-in function runs natively on the JVM. A Python UDF doing the same work can be 10x to 100x slower because every row must be serialized between the JVM and a Python worker process.
Spark SQL Aggregation Functions
F.sum()
df.groupBy('dept').agg(F.sum('salary'))
Sums a column within each group.
F.count()
df.groupBy('dept').agg(F.count('*'))
Counts rows per group. Use F.countDistinct() for unique values.
F.avg()
df.groupBy('dept').agg(F.avg('salary').alias('avg_salary'))
Computes the mean. Alias the output column or it defaults to avg(salary).
F.min() / F.max()
df.agg(F.min('created_at'), F.max('created_at'))
Works on numbers, strings, and dates.
F.countDistinct()
df.groupBy('dept').agg(F.countDistinct('employee_id'))
Approximate alternative: F.approx_count_distinct() for large datasets.
F.collect_list() / F.collect_set()
df.groupBy('dept').agg(F.collect_list('name'))
Gathers values into an array. collect_set removes duplicates. Beware: each group's entire array is built in a single task's memory, so very large groups can cause out-of-memory errors.
Spark SQL Window Functions
F.row_number()
F.row_number().over(Window.partitionBy('dept').orderBy(F.desc('salary')))
Sequential numbering within each partition. The most common window function in interviews.
F.rank() / F.dense_rank()
F.rank().over(Window.partitionBy('dept').orderBy('score'))
rank() leaves gaps on ties; dense_rank() does not.
F.lag() / F.lead()
F.lag('revenue', 1).over(Window.partitionBy('store').orderBy('date'))
Access previous/next row values. Essential for calculating period-over-period changes.
F.sum().over()
F.sum('amount').over(Window.partitionBy('account').orderBy('date').rowsBetween(Window.unboundedPreceding, 0))
Running total. The rowsBetween clause controls the window frame; 0 here is shorthand for Window.currentRow.
Spark SQL String Functions
F.regexp_extract()
F.regexp_extract('url', r'/product/(\d+)', 1)
Extracts a regex capture group. Group 0 returns the full match; if the pattern finds no match, the result is an empty string.
F.concat() / F.concat_ws()
F.concat_ws('-', 'year', 'month', 'day')
concat_ws adds a separator and skips nulls. concat has no separator and returns null if any input is null.
F.trim() / F.lower() / F.upper()
df.withColumn('clean', F.trim(F.lower('name')))
Chain them for normalization before joins.
F.split()
F.split('tags', ',')
Returns an ArrayType column; the delimiter is a regex pattern. Often followed by F.explode().
Spark SQL Date and Timestamp Functions
F.current_date() / F.current_timestamp()
df.withColumn('today', F.current_date())
Evaluated once per query, not per row.
F.datediff() / F.months_between()
F.datediff('end_date', 'start_date')
Returns integer days. months_between returns a decimal.
F.date_trunc()
F.date_trunc('month', 'event_time')
Truncates to the start of the given unit and returns a timestamp. Useful for time-series grouping.
F.to_date() / F.to_timestamp()
F.to_date('date_str', 'yyyy-MM-dd')
Parses strings into date/timestamp types. Values that don't match the format come back null (or raise an error with ANSI mode enabled).
Spark SQL Conditional Functions
F.when() / F.otherwise()
F.when(F.col('score') >= 90, 'A').when(F.col('score') >= 80, 'B').otherwise('C')
Spark's equivalent of SQL CASE WHEN. Chain multiple conditions; without otherwise(), non-matching rows get null.
F.coalesce()
F.coalesce('preferred_email', 'work_email', 'personal_email')
Returns the first non-null value. Different from the partition coalesce() method.
F.isnull() / F.isnan()
df.filter(F.isnull('email'))
isnull checks for SQL NULL. isnan checks for floating-point NaN. They are not the same.
Spark SQL Functions FAQ
What is the difference between import pyspark.sql.functions and from pyspark.sql import functions as F?
Are Spark SQL functions the same as PySpark functions?
Can I use these functions in Spark Scala too?
How many built-in functions does Spark have?
Why should I avoid UDFs when a built-in function exists?
Practice Spark SQL Functions
DataDriven runs PySpark and Scala Spark code in your browser against real datasets. Write aggregations, window functions, and joins, then see results without spinning up a cluster.
Start Practicing