# Too Many Small Files

> Two thousand files. One megabyte each. Athena says no.

Canonical URL: <https://datadriven.io/problems/spark_repartition_vs_coalesce_daily_export>

Domain: PySpark · Difficulty: easy · Seniority: L5

## Problem

A client's daily export pipeline reads 200 GB of transaction data, filters it to about 2 GB of flagged records, and writes Parquet to S3. Downstream Athena queries on this table are taking 45 seconds for a simple `COUNT(*)`. You check S3 and find 2,000 Parquet files averaging 1 MB each. The job has `spark.sql.shuffle.partitions` set to 2000. Fix the write so Athena can actually query this table.
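One possible direction, sketched under stated assumptions rather than given as the reference solution: size the output so that each file lands near a target such as 128 MB, then reduce the partition count just before the write. The helper names `target_file_count` and `write_compacted`, the 128 MB target, and the `overwrite` save mode are all hypothetical choices made for illustration. `df` is assumed to be the already-filtered PySpark DataFrame, and the 2 GB figure is taken from the problem statement.

```python
import math


def target_file_count(total_bytes: int, target_file_bytes: int = 128 * 1024**2) -> int:
    """Return how many output files keep each one near target_file_bytes."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


def write_compacted(df, output_path: str, num_files: int, shuffle: bool = True) -> None:
    """Write df as Parquet using roughly num_files output files.

    df is assumed to be a pyspark.sql.DataFrame (hypothetical helper).
    """
    if shuffle:
        # repartition(n) adds a full shuffle of the filtered data (about 2 GB here),
        # so the 200 GB read and the filter keep their original parallelism.
        compacted = df.repartition(num_files)
    else:
        # coalesce(n) merges partitions without a shuffle. If no shuffle boundary
        # sits between the scan and the write, the whole stage, including the
        # 200 GB read, may run with only n tasks.
        compacted = df.coalesce(num_files)
    # "overwrite" is an assumption about how the daily export replaces its output.
    compacted.write.mode("overwrite").parquet(output_path)


# About 2 GB of flagged records at about 128 MB per file.
num_files = target_file_count(2 * 1024**3)
print(num_files)
```

With these numbers the sketch asks for 16 files instead of 2,000. Whether `repartition` or `coalesce` is the better call depends on whether the plan already contains a shuffle after the filter: `coalesce` avoids an extra shuffle, while `repartition` protects the upstream read from being squeezed into a handful of tasks.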

---

Source: DataDriven (https://datadriven.io).