DataDriven

Too Many Small Files

An easy Spark interview practice problem on DataDriven. Write and execute real Spark code with instant grading.

Domain: Spark
Difficulty: Easy
Seniority: L5

Problem

A client's daily export pipeline reads 200 GB of transaction data, filters it to about 2 GB of flagged records, and writes Parquet to S3. Downstream Athena queries on this table are taking 45 seconds for a simple COUNT(*). You check S3 and find 2,000 Parquet files averaging 1 MB each. The job has spark.sql.shuffle.partitions set to 2000. Fix the write so Athena can actually query this table.
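One possible approach, sketched below, is to choose the output file count from the data size and repartition just before the write. This is a minimal sketch, not the graded reference solution. The 128 MB target file size is an assumption (a commonly cited size for Parquet queried by Athena), and `flagged_df`, `output_path`, `target_file_count`, and `write_compacted` are hypothetical names introduced for illustration.

```python
import math

# Assumption: ~128 MB per Parquet file is a reasonable target for Athena.
TARGET_FILE_BYTES = 128 * 1024**2


def target_file_count(total_bytes: int, target_file_bytes: int = TARGET_FILE_BYTES) -> int:
    """Return how many output files keep each file near target_file_bytes."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


def write_compacted(flagged_df, output_path: str, total_bytes: int, mode: str = "overwrite") -> int:
    """Write flagged_df (a Spark DataFrame) as a small number of larger Parquet files.

    repartition(n) adds a shuffle of the filtered (~2 GB) data, so the upstream
    read and filter of the 200 GB input keep their parallelism. coalesce(n)
    would skip that shuffle but can also shrink the parallelism of the stage
    feeding the write. The "overwrite" default is an assumption about how the
    daily export is meant to behave.
    """
    n = target_file_count(total_bytes)
    flagged_df.repartition(n).write.mode(mode).parquet(output_path)
    return n


# About 2 GB of flagged records at ~128 MB per file:
num_files = target_file_count(2 * 1024**3)
print(num_files)  # 16
```

With the numbers from the problem, this replaces 2,000 files of about 1 MB with 16 files of about 128 MB, so Athena opens far fewer S3 objects per query. Lowering `spark.sql.shuffle.partitions` would also reduce the file count, but it changes the parallelism of every shuffle in the job, which is why the sketch controls the count only at the write.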

Summary

Two thousand files. One megabyte each. Athena says no.

Practice This Problem

Solve this Spark problem with real code execution. DataDriven runs your solution and grades it automatically.
