DataDriven

A hard Spark interview practice problem on DataDriven. Write and execute real Spark code with instant grading.

Domain: Spark
Difficulty: Hard
Seniority: L5

Problem

A PySpark pipeline scores 2 billion credit transactions per day for fraud using a Python UDF. The UDF computes a risk score from 12 columns using thresholds, lookups, and weighted sums, all of which can be expressed with native Spark functions. The job takes 4 hours, and profiling shows that 70% of that time is spent in ArrowEvalPython, the step that serializes data between the JVM and Python. Rewrite the UDF as native Spark SQL expressions.

Summary

2 billion rows through Python. One UDF at a time.

Practice This Problem

Solve this Spark problem with real code execution. DataDriven runs your solution and grades it automatically.
