Kill the UDF
A hard Spark interview practice problem on DataDriven. Write and execute real Spark code with instant grading.
- Domain: spark
- Difficulty: hard
- Seniority: senior
Problem
A PySpark pipeline scores 2 billion credit transactions per day for fraud using a Python UDF. The UDF computes a risk score from 12 columns using thresholds, lookups, and weighted sums, all of which can be expressed with native Spark functions. The job takes 4 hours, and profiling shows 70% of that time in ArrowEvalPython, the operator that serializes data between the JVM and Python. Rewrite the UDF as native Spark SQL expressions.