Too Many Small Files

An easy Spark mock interview question on DataDriven. Practice with AI-powered feedback, real code execution, and a hire/no-hire decision.

Domain: Spark
Difficulty: easy
Seniority: L5

Interview Prompt

A client's daily export pipeline reads 200 GB of transaction data, filters it to about 2 GB of flagged records, and writes Parquet to S3. Downstream Athena queries on this table are taking 45 seconds for a simple COUNT(*). You check S3 and find 2,000 Parquet files averaging 1 MB each. The job has spark.sql.shuffle.partitions set to 2000. Fix the write so Athena can actually query this table.

Summary

Two thousand files. One megabyte each. Athena says no.
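
The arithmetic behind the prompt can be sketched as follows: 2 GB spread over 2,000 shuffle partitions gives roughly 1 MB per file, so one way to address it is to cut the partition count just before the write. This is a minimal sketch, not the expected answer. The 128 MB target file size is an assumed rule of thumb, and `target_file_count` and `write_compacted` are hypothetical helper names. The 2 GB figure is the on-disk Parquet size from the prompt; the in-memory size of the same data will differ.

```python
import math


def target_file_count(total_bytes: int, target_file_bytes: int = 128 * 1024**2) -> int:
    """Return how many output files put each file near target_file_bytes.

    The 128 MB default is an assumption, not a value from the prompt.
    """
    return max(1, math.ceil(total_bytes / target_file_bytes))


def write_compacted(df, path: str, total_bytes: int) -> None:
    """Hypothetical helper: reduce the partition count right before the write.

    `df` is assumed to be a PySpark DataFrame holding the filtered records.
    """
    n = target_file_count(total_bytes)
    # repartition(n) adds a shuffle but tends to produce evenly sized files.
    # coalesce(n) skips that shuffle, at the cost of possibly uneven files.
    df.repartition(n).write.mode("overwrite").parquet(path)


flagged_bytes = 2 * 1024**3  # about 2 GB of flagged records, from the prompt
print(target_file_count(flagged_bytes))  # 16
```

With these assumptions the write produces 16 files instead of 2,000. Whether `repartition` or `coalesce` is the better choice depends on how skewed the filtered data is, which is worth raising as a clarifying question.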

How This Interview Works

  1. Read the vague prompt (just like a real interview)
  2. Ask clarifying questions to the AI interviewer
  3. Write your Spark solution with real code execution
  4. Get instant feedback and a hire/no-hire decision
