
Kill the UDF

A hard Spark interview practice problem on DataDriven. Write and execute real Spark code with instant grading.

Domain: spark
Difficulty: hard
Seniority: L5

Problem

A PySpark pipeline scores 2 billion credit transactions per day for fraud using a Python UDF. The UDF computes a risk score from 12 columns using thresholds, lookups, and weighted sums, all of which are expressible with native Spark functions. The job takes 4 hours, and profiling shows 70% of that time in ArrowEvalPython (serializing data between the JVM and Python). Rewrite the UDF as native Spark SQL expressions.

Summary

2 billion rows through Python. One UDF at a time.
