A hard Spark interview practice problem on DataDriven. Write and run real Spark code with instant grading.
- Domain: Spark
- Difficulty: Hard
- Seniority: L5
Problem
A PySpark pipeline scores 2 billion credit transactions per day for fraud using a Python UDF. The UDF computes a risk score from 12 columns using thresholds, lookups, and weighted sums, all of which can be expressed with native Spark functions. The job takes 4 hours, and profiling shows that 70% of that time is spent in ArrowEvalPython, which serializes data between the JVM and Python. Rewrite the UDF as native Spark SQL expressions.
Summary
2 billion rows through Python. One UDF at a time.