Kill the UDF

A hard Spark mock interview question on DataDriven. Practice with AI-powered feedback, real code execution, and a hire/no-hire decision.

Domain: Spark
Difficulty: hard
Seniority: L5

Interview Prompt

A PySpark pipeline scores 2 billion credit transactions per day for fraud using a Python UDF. The UDF computes a risk score from 12 columns using thresholds, lookups, and weighted sums, all of which can be expressed with native Spark functions. The job takes 4 hours. Profiling shows 70% of the time is spent in ArrowEvalPython (serializing data between the JVM and Python). Rewrite the UDF as native Spark SQL expressions.
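As a rough illustration of the kind of rewrite the prompt asks for, the sketch below builds a single Spark SQL expression from the three ingredients it names (thresholds, lookups, weighted sums). The prompt does not list the 12 columns or the scoring rules, so every column name, cutoff, lookup value, and weight here is a hypothetical placeholder, and only three inputs are shown.

```python
# Hypothetical inputs: the real UDF's columns, cutoffs, and weights are not given.
CATEGORY_RISK = {"gambling": 0.8, "grocery": 0.1}  # assumed lookup table


def threshold_sql(column, cutoff):
    """Threshold rule: 1.0 when the column exceeds the cutoff, else 0.0."""
    return f"CASE WHEN {column} > {cutoff} THEN 1.0 ELSE 0.0 END"


def lookup_sql(column, table, default):
    """Small lookup table expressed as a simple CASE with a default branch."""
    branches = " ".join(
        f"WHEN '{key}' THEN {value}" for key, value in table.items()
    )
    return f"CASE {column} {branches} ELSE {default} END"


def weighted_sum_sql(terms):
    """Combine (expression, weight) pairs into one weighted sum."""
    return " + ".join(f"{weight} * ({expr})" for expr, weight in terms)


def build_score_sql():
    """Assemble the full risk-score expression from the assumed rules."""
    terms = [
        (threshold_sql("amount", 1000.0), 0.5),
        (threshold_sql("txn_count_1h", 5), 0.3),
        (lookup_sql("merchant_category", CATEGORY_RISK, 0.5), 0.2),
    ]
    return weighted_sum_sql(terms)


def with_native_score(df):
    """Attach the score to a DataFrame without calling a Python UDF."""
    # Imported here so the expression builder can be checked without Spark.
    from pyspark.sql import functions as F

    return df.withColumn("risk_score", F.expr(build_score_sql()).cast("double"))


if __name__ == "__main__":
    print(build_score_sql())
```

Calling `with_native_score(transactions_df)` on a DataFrame that has the assumed columns would evaluate the score inside the JVM. Null handling can differ from the original Python code: a SQL comparison against NULL falls through to the ELSE branch instead of raising, so results should be compared against the UDF on a sample before switching over.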

Summary

2 billion rows through Python. One UDF at a time.

How This Interview Works

Read the vague prompt (just like a real interview)
Ask clarifying questions to the AI interviewer
Write your Spark solution with real code execution
Get instant feedback and a hire/no-hire decision