Small File Problem
Concepts: paSmallFiles
What They Want to Hear

'Too many small files kill read performance. Each file requires a separate metadata lookup, a separate file open, and a separate read request. Thousands of 1KB files are far slower to read than one 128MB file with the same data. The target file size is 128MB to 256MB. To fix small files, I use coalesce() to reduce the number of output partitions before writing, or run a compaction job that rewrites small files into larger ones.'

That is the answer. Target size, the problem, and two fixes.
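
If the interviewer asks how you would pick the number of output files, a rough sketch like the one below can help. It assumes PySpark and Parquet, and the names target_file_count, compact, src_path, dst_path, and total_bytes are hypothetical, chosen only for illustration. It also assumes you already know the approximate on-disk size of the data; it is one way to size a compaction job, not the only one.

```python
import math

# Lower end of the 128MB-256MB target range from the answer above.
TARGET_FILE_BYTES = 128 * 1024 * 1024


def target_file_count(total_bytes, target_file_bytes=TARGET_FILE_BYTES):
    """Return how many output files give roughly target_file_bytes per file."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_file_bytes))


def compact(spark, src_path, dst_path, total_bytes):
    """Rewrite many small Parquet files into fewer, larger ones.

    Hypothetical helper: `spark` is an existing SparkSession and
    `total_bytes` is the approximate size of the data under src_path,
    which the caller must supply (an assumption of this sketch).
    """
    n = target_file_count(total_bytes)
    df = spark.read.parquet(src_path)
    # coalesce() only reduces the partition count and avoids a full shuffle;
    # if the data has fewer than n partitions, the count is left unchanged.
    # Writing to a separate dst_path avoids overwriting files while reading them.
    df.coalesce(n).write.mode("overwrite").parquet(dst_path)
    return n
```

For example, target_file_count(1024 ** 3) suggests 8 files for 1GB of data. The compact function needs a running Spark session, so treat it as a starting point to adapt to your own storage layout.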