Data Engineering Interview Prep

Spark SQL Functions for Data Engineers

Spark 3.5 ships with 1,500+ built-in functions in pyspark.sql.functions. These are the ones data engineers use in production and get tested on in interviews: aggregations, window functions, string operations, date handling, and conditionals.

Every built-in function runs natively on the JVM. A Python UDF doing the same work can be 10x to 100x slower because every row must be serialized between the JVM and the Python worker.
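All snippets below assume the standard imports and an active SparkSession; the session name and the DataFrame df are placeholders for your own:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
# df = spark.read.parquet(...)  # any DataFrame with the columns shown below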

Spark SQL Aggregation Functions

F.sum()

df.groupBy('dept').agg(F.sum('salary'))

Sums a column within each group.

F.count()

df.groupBy('dept').agg(F.count('*'))

Counts rows per group. Use F.countDistinct() for unique values.

F.avg()

df.groupBy('dept').agg(F.avg('salary').alias('avg_salary'))

Computes the mean per group. Alias the output column or it will be named avg(salary).
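A minimal end-to-end sketch of the aggregation pattern above (the sample data is illustrative):

data = [('eng', 100), ('eng', 120), ('sales', 90)]
df = spark.createDataFrame(data, ['dept', 'salary'])

df.groupBy('dept').agg(
    F.sum('salary').alias('total_salary'),
    F.avg('salary').alias('avg_salary'),
    F.count('*').alias('headcount'),
).show()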

F.min() / F.max()

df.agg(F.min('created_at'), F.max('created_at'))

Works on numbers, strings, and dates.

F.countDistinct()

df.groupBy('dept').agg(F.countDistinct('employee_id'))

Counts unique values per group, exactly. For large datasets, F.approx_count_distinct() is a faster approximate alternative (HyperLogLog-based).
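If an approximation is acceptable, the error bound is tunable via the rsd (relative standard deviation) parameter; a sketch:

df.groupBy('dept').agg(
    F.approx_count_distinct('employee_id', rsd=0.01).alias('distinct_employees')
)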

F.collect_list() / F.collect_set()

df.groupBy('dept').agg(F.collect_list('name'))

Gathers values into an array. collect_set removes duplicates. Beware: every value for a group ends up in a single array in one row, which can exhaust executor memory on skewed groups.
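A sketch of the difference, with illustrative data:

df = spark.createDataFrame([('eng', 'a'), ('eng', 'a'), ('eng', 'b')], ['dept', 'name'])
df.groupBy('dept').agg(
    F.collect_list('name').alias('all_names'),    # ['a', 'a', 'b']
    F.collect_set('name').alias('unique_names'),  # ['a', 'b'], order not guaranteed
).show(truncate=False)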

Spark SQL Window Functions

F.row_number()

F.row_number().over(Window.partitionBy('dept').orderBy(F.desc('salary')))

Sequential numbering within each partition. The most common window function in interviews.
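The classic interview pattern built on row_number is top-N per group; a sketch assuming a DataFrame with dept and salary columns:

w = Window.partitionBy('dept').orderBy(F.desc('salary'))
top3 = (
    df.withColumn('rn', F.row_number().over(w))
      .filter(F.col('rn') <= 3)  # keep the 3 highest-paid per dept
      .drop('rn')
)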

F.rank() / F.dense_rank()

F.rank().over(Window.partitionBy('dept').orderBy('score'))

rank() leaves gaps on ties; dense_rank() does not.
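A sketch of the tie behavior for scores 95, 95, 90 (ordering the whole DataFrame, so no partitionBy):

w = Window.orderBy(F.desc('score'))
df.select(
    'score',
    F.rank().over(w).alias('rank'),              # 1, 1, 3  (gap after the tie)
    F.dense_rank().over(w).alias('dense_rank'),  # 1, 1, 2  (no gap)
)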

F.lag() / F.lead()

F.lag('revenue', 1).over(Window.partitionBy('store').orderBy('date'))

Access previous/next row values. Essential for calculating period-over-period changes.
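A period-over-period sketch (the store, date, and revenue columns are illustrative):

w = Window.partitionBy('store').orderBy('date')
df.withColumn('prev_revenue', F.lag('revenue', 1).over(w)) \
  .withColumn('revenue_change', F.col('revenue') - F.col('prev_revenue'))

The first row in each partition has no predecessor, so prev_revenue (and therefore revenue_change) is NULL there.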

F.sum().over()

F.sum('amount').over(Window.partitionBy('account').orderBy('date').rowsBetween(Window.unboundedPreceding, Window.currentRow))

Running total. The rowsBetween clause sets the window frame; with an orderBy but no explicit frame, Spark defaults to a RANGE frame from the start of the partition to the current row.

Spark SQL String Functions

F.regexp_extract()

F.regexp_extract('url', r'/product/(\d+)', 1)

Extracts a regex capture group. Group 0 returns the full match.
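A runnable sketch with an illustrative URL:

df = spark.createDataFrame([('/product/42?ref=home',)], ['url'])
df.select(F.regexp_extract('url', r'/product/(\d+)', 1).alias('product_id'))
# product_id = '42' -- the result is a string; non-matching rows return ''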

F.concat() / F.concat_ws()

F.concat_ws('-', 'year', 'month', 'day')

concat_ws adds a separator and skips NULL inputs; concat has no separator and returns NULL if any input is NULL.

F.trim() / F.lower() / F.upper()

df.withColumn('clean', F.trim(F.lower('name')))

Chain them for normalization before joins.

F.split()

F.split('tags', ',')

Returns an ArrayType column. Often followed by F.explode().
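split followed by explode is a common normalization step, producing one output row per array element; a sketch:

df = spark.createDataFrame([(1, 'red,green')], ['id', 'tags'])
df.select('id', F.explode(F.split('tags', ',')).alias('tag'))
# (1, 'red') and (1, 'green')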

Spark SQL Date and Timestamp Functions

F.current_date() / F.current_timestamp()

df.withColumn('today', F.current_date())

Evaluated once per query, not per row.

F.datediff() / F.months_between()

F.datediff('end_date', 'start_date')

Returns integer days. months_between returns a decimal.

F.date_trunc()

F.date_trunc('month', 'event_time')

Truncates to the start of the given unit. Useful for time-series grouping.
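A typical monthly rollup, assuming an event_time timestamp column:

df.groupBy(F.date_trunc('month', 'event_time').alias('month')) \
  .agg(F.count('*').alias('events')) \
  .orderBy('month')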

F.to_date() / F.to_timestamp()

F.to_date('date_str', 'yyyy-MM-dd')

Parses strings into date/timestamp types using Java DateTimeFormatter patterns (yyyy-MM-dd, not Python's %Y-%m-%d style); a string that does not match the format typically comes back NULL.

Spark SQL Conditional Functions

F.when() / F.otherwise()

F.when(F.col('score') >= 90, 'A').when(F.col('score') >= 80, 'B').otherwise('C')

Spark's equivalent of CASE WHEN. Chain multiple conditions.
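In context the chain is usually assigned with withColumn; conditions are evaluated in order, and otherwise is the fallback:

df.withColumn(
    'grade',
    F.when(F.col('score') >= 90, 'A')
     .when(F.col('score') >= 80, 'B')
     .otherwise('C'),
)

Without .otherwise(), rows matching no condition get NULL.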

F.coalesce()

F.coalesce('preferred_email', 'work_email', 'personal_email')

Returns the first non-null value. Not to be confused with df.coalesce(n), which reduces the number of partitions.

F.isnull() / F.isnan()

df.filter(F.isnull('email'))

isnull checks for SQL NULL. isnan checks for NaN (float). They are not the same.
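A sketch of the distinction; NaN is a legitimate non-NULL double value:

df = spark.createDataFrame([(None,), (float('nan'),)], 'value double')
df.select(
    'value',
    F.isnull('value').alias('is_null'),  # True for the NULL row, False for NaN
    F.isnan('value').alias('is_nan'),    # False for the NULL row, True for NaN
)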

Spark SQL Functions FAQ

What is the difference between import pyspark.sql.functions and from pyspark.sql import functions as F?
They are the same module. The convention is to import as F for brevity: from pyspark.sql import functions as F. Then you call F.col(), F.sum(), etc. This is the standard pattern in production PySpark code and what interviewers expect to see.
Are Spark SQL functions the same as PySpark functions?
Yes. pyspark.sql.functions maps directly to Spark SQL built-in functions. df.select(F.upper('name')) produces the same execution plan as spark.sql('SELECT upper(name) FROM table'). The Catalyst optimizer treats them identically.
Can I use these functions in Spark Scala too?
Yes. The same functions exist in org.apache.spark.sql.functions in Scala. The API surface is nearly identical. If you know the PySpark version, the Scala translation is straightforward.
How many built-in functions does Spark have?
Spark 3.5 (latest stable) ships with 1,500+ built-in functions in pyspark.sql.functions. This covers aggregation, window, string, date, math, collection, and conditional operations. Interviewers expect you to reach for built-in functions before writing a UDF.
Why should I avoid UDFs when a built-in function exists?
Python UDFs force data serialization between the JVM and Python for every row. This can be 10x to 100x slower than using a native Spark function. The Catalyst optimizer also cannot see inside a UDF, so it cannot apply predicate pushdown or column pruning through one. Pandas UDFs (vectorized) reduce this penalty by operating on Arrow batches, but built-in functions are still faster.

Practice Spark SQL Functions

DataDriven runs PySpark and Scala Spark code in your browser against real datasets. Write aggregations, window functions, and joins, then see results without spinning up a cluster.

Start Practicing