Spark SQL Functions for Data Engineers
Spark 3.5 ships with 1,500+ built-in functions in pyspark.sql.functions. These are the ones data engineers use in production and get tested on in interviews: aggregations, window functions, string operations, date handling, and conditionals.
Every built-in function runs natively on the JVM. A Python UDF doing the same work is 10x to 100x slower because of JVM/Python serialization overhead.
Spark SQL Aggregation Functions
F.sum()
df.groupBy('dept').agg(F.sum('salary'))Sums a column within each group.
F.count()
df.groupBy('dept').agg(F.count('*'))Counts rows per group. Use F.countDistinct() for unique values.
F.avg()
df.groupBy('dept').agg(F.avg('salary').alias('avg_salary'))Mean value. Alias the output column or it becomes avg(salary).
F.min() / F.max()
df.agg(F.min('created_at'), F.max('created_at'))Works on numbers, strings, and dates.
F.countDistinct()
df.groupBy('dept').agg(F.countDistinct('employee_id'))Approximate alternative: F.approx_count_distinct() for large datasets.
F.collect_list() / F.collect_set()
df.groupBy('dept').agg(F.collect_list('name'))Gathers values into an array. collect_set removes duplicates. Beware: pulls all values to one partition.
Spark SQL Window Functions
F.row_number()
F.row_number().over(Window.partitionBy('dept').orderBy(F.desc('salary')))Sequential numbering within each partition. The most common window function in interviews.
F.rank() / F.dense_rank()
F.rank().over(Window.partitionBy('dept').orderBy('score'))rank() leaves gaps on ties; dense_rank() does not.
F.lag() / F.lead()
F.lag('revenue', 1).over(Window.partitionBy('store').orderBy('date'))Access previous/next row values. Essential for calculating period-over-period changes.
F.sum().over()
F.sum('amount').over(Window.partitionBy('account').orderBy('date').rowsBetween(Window.unboundedPreceding, 0))Running total. The rowsBetween clause controls the window frame.
Spark SQL String Functions
F.regexp_extract()
F.regexp_extract('url', r'/product/(\d+)', 1)Extracts a regex capture group. Group 0 returns the full match.
F.concat() / F.concat_ws()
F.concat_ws('-', 'year', 'month', 'day')concat_ws adds a separator. concat does not.
F.trim() / F.lower() / F.upper()
df.withColumn('clean', F.trim(F.lower('name')))Chain them for normalization before joins.
F.split()
F.split('tags', ',')Returns an ArrayType column. Often followed by F.explode().
Spark SQL Date and Timestamp Functions
F.current_date() / F.current_timestamp()
df.withColumn('today', F.current_date())Evaluated once per query, not per row.
F.datediff() / F.months_between()
F.datediff('end_date', 'start_date')Returns integer days. months_between returns a decimal.
F.date_trunc()
F.date_trunc('month', 'event_time')Truncates to the start of the given unit. Useful for time-series grouping.
F.to_date() / F.to_timestamp()
F.to_date('date_str', 'yyyy-MM-dd')Parses strings into date/timestamp types. The format string matters.
Spark SQL Conditional Functions
F.when() / F.otherwise()
F.when(F.col('score') >= 90, 'A').when(F.col('score') >= 80, 'B').otherwise('C')Spark's equivalent of CASE WHEN. Chain multiple conditions.
F.coalesce()
F.coalesce('preferred_email', 'work_email', 'personal_email')Returns the first non-null value. Different from the partition coalesce() method.
F.isnull() / F.isnan()
df.filter(F.isnull('email'))isnull checks for SQL NULL. isnan checks for NaN (float). They are not the same.
Spark SQL Functions FAQ
What is the difference between import pyspark.sql.functions and from pyspark.sql import functions as F?+
Are Spark SQL functions the same as PySpark functions?+
Can I use these functions in Spark Scala too?+
How many built-in functions does Spark have?+
Why should I avoid UDFs when a built-in function exists?+
Practice Spark SQL Functions
DataDriven runs PySpark and Scala Spark code in your browser against real datasets. Write aggregations, window functions, and joins, then see results without spinning up a cluster.
Start PracticingContinue your prep
Data Engineer Interview Prep, explore the full guide
50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.
Interview Rounds
By Company
- Stripe Data Engineer Interview
- Airbnb Data Engineer Interview
- Uber Data Engineer Interview
- Netflix Data Engineer Interview
- Databricks Data Engineer Interview
- Snowflake Data Engineer Interview
- Lyft Data Engineer Interview
- DoorDash Data Engineer Interview
- Instacart Data Engineer Interview
- Robinhood Data Engineer Interview
- Pinterest Data Engineer Interview
- Twitter/X Data Engineer Interview
By Role
- Senior Data Engineer Interview
- Staff Data Engineer Interview
- Principal Data Engineer Interview
- Junior Data Engineer Interview
- Entry-Level Data Engineer Interview
- Analytics Engineer Interview
- ML Data Engineer Interview
- Streaming Data Engineer Interview
- GCP Data Engineer Interview
- AWS Data Engineer Interview
- Azure Data Engineer Interview