Too Many Small Files

An easy Spark interview practice problem on DataDriven. Write and execute real Spark code with instant grading.

Domain
spark
Difficulty
easy
Seniority
senior

Problem

A client's daily export pipeline reads 200 GB of transaction data, filters it to about 2 GB of flagged records, and writes Parquet to S3. Downstream Athena queries on this table are taking 45 seconds for a simple COUNT(*). You check S3 and find 2,000 Parquet files averaging 1 MB each. The job has spark.sql.shuffle.partitions set to 2000. Fix the write so Athena can actually query this table.
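One way to reason about the fix: with only ~2 GB of output, the write should produce a handful of files near the 128 MB range rather than 2,000 tiny ones, so the partition count must be reduced before the write. A minimal sketch of the sizing arithmetic, assuming PySpark and a hypothetical DataFrame named `flagged` (the actual Spark call is shown in comments):

```python
def target_partitions(total_bytes: int,
                      target_file_bytes: int = 128 * 1024 * 1024) -> int:
    """Number of output files so each lands near the target size.

    128 MB is a commonly recommended Parquet file size for S3/Athena;
    the exact sweet spot depends on the query engine and workload.
    """
    return max(1, -(-total_bytes // target_file_bytes))  # ceiling division

# ~2 GB of flagged records -> 16 files instead of 2,000 one-megabyte ones
n = target_partitions(2 * 1024**3)

# In the Spark job itself (hypothetical names/paths):
#   flagged.coalesce(n).write.mode("overwrite").parquet("s3://bucket/table/")
#
# coalesce() merges partitions without a full shuffle. Alternatively,
# lower spark.sql.shuffle.partitions from 2000, or enable adaptive query
# execution (spark.sql.adaptive.coalescePartitions.enabled) so Spark
# coalesces the post-filter shuffle partitions automatically.
```

A repartition(n) would also work and gives evenly sized files at the cost of a full shuffle; for a 2 GB result either is cheap.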

Practice This Problem

Solve this Spark problem with real code execution. DataDriven runs your solution and grades it automatically.