Too Many Small Files

An easy Spark mock interview question on DataDriven. Practice with AI-powered feedback, real code execution, and a hire/no-hire decision.

Domain: Spark
Difficulty: easy
Seniority: L5

Interview Prompt

A client's daily export pipeline reads 200 GB of transaction data, filters it to about 2 GB of flagged records, and writes Parquet to S3. Downstream Athena queries on this table are taking 45 seconds for a simple COUNT(*). You check S3 and find 2,000 Parquet files averaging 1 MB each. The job has spark.sql.shuffle.partitions set to 2000. Fix the write so Athena can actually query this table.
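One way to start on the prompt is to size the output: 2 GB spread over 2,000 files gives the 1 MB average, and the usual fix is to write far fewer, larger files. The sketch below is plain Python arithmetic and assumes a target of about 128 MB per file, which is a commonly cited guideline and not something the prompt states. The function name `target_file_count`, the DataFrame name `flagged`, and the output path in the comments are hypothetical.

```python
import math

# Assumed target size per Parquet file (not given in the prompt).
TARGET_FILE_BYTES = 128 * 1024**2


def target_file_count(total_bytes: int, target_file_bytes: int = TARGET_FILE_BYTES) -> int:
    """Return how many output files keep each one near target_file_bytes."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


# About 2 GB of flagged records after the filter, per the prompt.
output_bytes = 2 * 1024**3
num_files = target_file_count(output_bytes)
print(num_files)

# In PySpark the count would then be applied just before the write, e.g.:
#   flagged.repartition(num_files).write.mode("overwrite").parquet(output_path)
# where `flagged` is the filtered DataFrame and `output_path` the S3 location
# (both hypothetical names here).
```

Whether `repartition` or `coalesce` is the better call is a reasonable thing to discuss with the interviewer: `coalesce` avoids an extra shuffle but can reduce the parallelism of the stage feeding it, while `repartition` adds a shuffle over only the roughly 2 GB of filtered data.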

Summary

Two thousand files. One megabyte each. Athena says no.

How This Interview Works

Read the vague prompt (just like a real interview)
Ask clarifying questions to the AI interviewer
Write your Spark solution with real code execution
Get instant feedback and a hire/no-hire decision