DataDriven

The Word Count Shuffle Trap

An easy Spark interview practice problem on DataDriven. Write and execute real Spark code with instant grading.

Domain: Spark
Difficulty: Easy
Seniority: L5

Problem

Your team's text analytics pipeline runs a word count job over a 50 GB corpus every night. It worked fine for months, but after the corpus grew 3x last quarter, the job started failing. The Spark UI shows 48 GB of shuffle write and three executors dead from OOM. The code uses groupByKey. Fix it.
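
One possible shape of the fix, as a minimal PySpark sketch. The original job's code is not shown in the problem, so `word_counts_group` is only an assumed reconstruction of the groupByKey pattern, and both function names are hypothetical. The sketch assumes `lines` is an RDD of strings (for example, the result of `sc.textFile(...)`) and that splitting on whitespace is an acceptable tokenizer.

```python
from operator import add


def word_counts_group(lines):
    """Assumed original pattern: groupByKey shuffles every (word, 1) pair
    and gathers all values for a word on one executor before summing."""
    return (
        lines.flatMap(lambda line: line.split())
        .map(lambda word: (word, 1))
        .groupByKey()
        .mapValues(lambda ones: sum(ones))
    )


def word_counts_reduce(lines):
    """Candidate fix: reduceByKey sums counts inside each partition first
    (map-side combine), so each partition shuffles at most one record
    per distinct word."""
    return (
        lines.flatMap(lambda line: line.split())
        .map(lambda word: (word, 1))
        .reduceByKey(add)
    )
```

Both functions return the same (word, count) pairs. The difference is in how much data crosses the shuffle and how much a single executor has to hold for a frequent word.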

Summary

groupByKey works. Your cluster disagrees.
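
To make the summary concrete, here is a small plain-Python model (not Spark code) of how many records each approach sends through the shuffle. It is a simplification under stated assumptions: the toy `partitions` list and both function names are invented for illustration, and real shuffle size also depends on serialization, compression, and partitioning.

```python
from collections import Counter


def shuffle_records_group_by_key(partitions):
    """groupByKey model: every (word, 1) pair is shuffled as-is."""
    return sum(len(line.split()) for part in partitions for line in part)


def shuffle_records_reduce_by_key(partitions):
    """reduceByKey model: each partition pre-aggregates locally, then
    shuffles one (word, partial_count) record per distinct word."""
    total = 0
    for part in partitions:
        local = Counter(word for line in part for word in line.split())
        total += len(local)
    return total


# Hypothetical toy corpus split across two partitions.
partitions = [
    ["the cat sat on the mat", "the dog sat"],
    ["the cat ran", "the the the"],
]

print(shuffle_records_group_by_key(partitions))   # 15
print(shuffle_records_reduce_by_key(partitions))  # 9
```

In this model the groupByKey count grows with the total number of words, while the reduceByKey count is bounded by the number of distinct words per partition. On natural-language text, where a few words dominate, that gap is typically large.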

Practice This Problem

Solve this Spark problem with real code execution. DataDriven runs your solution and grades it automatically.
