Too Many Small Files

An easy Spark interview practice problem on DataDriven. Write and execute real Spark code with instant grading.

Domain
spark
Difficulty
easy
Seniority
senior

Problem

A client's daily export pipeline reads 200 GB of transaction data, filters it to about 2 GB of flagged records, and writes Parquet to S3. Downstream Athena queries on this table are taking 45 seconds for a simple COUNT(*). You check S3 and find 2,000 Parquet files averaging 1 MB each. The job has spark.sql.shuffle.partitions set to 2000. Fix the write so Athena can actually query this table.
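One way to reason about the fix: with only ~2 GB of output, the write should produce a handful of files near the 128 MB range rather than 2,000 tiny ones, so the partition count must be reduced before the write. A minimal sketch of the sizing arithmetic, assuming PySpark and a hypothetical DataFrame named `flagged` (the actual Spark call is shown in comments):

```python
def target_partitions(total_bytes: int,
                      target_file_bytes: int = 128 * 1024 * 1024) -> int:
    """Number of output files so each lands near the target size.

    128 MB is a commonly recommended Parquet file size for S3/Athena;
    the exact sweet spot depends on the query engine and workload.
    """
    return max(1, -(-total_bytes // target_file_bytes))  # ceiling division

# ~2 GB of flagged records -> 16 files instead of 2,000 one-megabyte ones
n = target_partitions(2 * 1024**3)

# In the Spark job itself (hypothetical names/paths):
#   flagged.coalesce(n).write.mode("overwrite").parquet("s3://bucket/table/")
#
# coalesce() merges partitions without a full shuffle. Alternatively,
# lower spark.sql.shuffle.partitions from 2000, or enable adaptive query
# execution (spark.sql.adaptive.coalescePartitions.enabled) so Spark
# coalesces the post-filter shuffle partitions automatically.
```

A repartition(n) would also work and gives evenly sized files at the cost of a full shuffle; for a 2 GB result either is cheap.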

Practice This Problem

Solve this Spark problem with real code execution. DataDriven runs your solution and grades it automatically.