# The File Size Profiler

> File types and their disk footprint. One type dominates.

Canonical URL: <https://datadriven.io/problems/the_file_size_profiler>

Domain: Python · Difficulty: medium · Seniority: L3

## Problem

Given entries (list of 'filepath size_bytes timestamp' strings), sum size_bytes per file extension (everything after the last '.' in the filename). Files without an extension (no '.' in basename) are grouped under the key 'no_extension'. Return a dict mapping each extension (or 'no_extension') to the total size.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests **string parsing combined with grouping aggregation**, a common log analysis pattern. It probes file path parsing, extension extraction, and case-insensitive grouping.

---

### Break down the requirements

#### Step 1: Parse each metadata string

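Split each entry on whitespace to get the filepath and `size_bytes`; the timestamp is not needed. Take the basename (the text after the last `/`) and extract the extension from that, so a dot in a directory name is not mistaken for an extension.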
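
As a rough illustration of this parsing step, the sketch below pulls the basename and size out of one entry. The helper name `parse_entry` and the sample entry (including its timestamp format) are hypothetical, and the sketch assumes file paths contain no spaces.

```python
def parse_entry(entry: str) -> tuple[str, int]:
    """Return (filename, size_bytes) for one 'filepath size_bytes timestamp' string."""
    parts = entry.split()               # split on any run of whitespace
    filepath = parts[0]
    size = int(parts[1])                # parts[2], the timestamp, is not needed
    filename = filepath.split('/')[-1]  # basename: text after the last '/'
    return filename, size


# Hypothetical sample entry; the real timestamp format may differ.
print(parse_entry("/var/log/app/server.log 2048 1700000000"))
# ('server.log', 2048)
```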

#### Step 2: Handle files without extensions

If the basename contains no dot, group the file under `'no_extension'`, the key the problem statement requires.

#### Step 3: Aggregate sizes by extension, case-insensitive

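Convert extensions to lowercase before grouping, so that `.CSV` and `.csv` add to the same total. The problem statement does not spell out case handling, so mention this choice to the interviewer.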
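
Steps 2 and 3 can be sketched as one small helper that maps a basename to its grouping key. The function name `extension_key` is hypothetical; the `'no_extension'` key comes from the problem statement, and lowercasing is this solution's own assumption.

```python
def extension_key(filename: str) -> str:
    """Map a basename to its grouping key."""
    if '.' not in filename:
        return 'no_extension'                  # key required by the problem statement
    return filename.rsplit('.', 1)[1].lower()  # text after the last dot, lowercased


print(extension_key("report.CSV"))      # csv
print(extension_key("archive.tar.gz"))  # gz
print(extension_key("Makefile"))        # no_extension
```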

---

### The solution

**Parse, extract extension, and aggregate**

```python
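def profile_storage(metadata: list) -> dict:
    totals = {}
    for entry in metadata:
        # Each entry is 'filepath size_bytes timestamp'; the timestamp is unused.
        parts = entry.split()
        filepath = parts[0]
        size = int(parts[1])
        # Use the basename so dots in directory names are ignored.
        filename = filepath.split('/')[-1]
        if '.' in filename:
            # Text after the last dot, lowercased for case-insensitive grouping.
            ext = filename.rsplit('.', 1)[1].lower()
        else:
            ext = 'no_extension'
        totals[ext] = totals.get(ext, 0) + size
    return totals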
```
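
For comparison, here is a sketch of the same aggregation using `collections.defaultdict`, followed by a small usage example. The function name `profile_storage_defaultdict` and the sample entries are hypothetical, and the sketch uses the `'no_extension'` key from the problem statement.

```python
from collections import defaultdict


def profile_storage_defaultdict(metadata: list) -> dict:
    totals = defaultdict(int)  # missing keys start at 0
    for entry in metadata:
        filepath, size = entry.split()[:2]  # ignore the timestamp
        filename = filepath.split('/')[-1]
        if '.' in filename:
            ext = filename.rsplit('.', 1)[1].lower()
        else:
            ext = 'no_extension'
        totals[ext] += int(size)
    return dict(totals)  # return a plain dict to the caller


# Hypothetical sample data; timestamps are placeholders.
sample = [
    "/data/a.csv 100 1700000000",
    "/data/b.CSV 50 1700000001",
    "/data/archive.tar.gz 400 1700000002",
    "/data/README 25 1700000003",
]
print(profile_storage_defaultdict(sample))
# {'csv': 150, 'gz': 400, 'no_extension': 25}
```

Either form is acceptable in an interview; `defaultdict` mainly removes the membership check.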

> **Time and Space Complexity**
>
> **Time:** O(n) where n is the number of metadata entries, treating the length of each entry as bounded.
> 
> **Space:** O(e) where e is the number of distinct extensions.

> **Interviewers Watch For**
>
> Whether you take the text after the last dot, for example with `rsplit('.', 1)`, rather than indexing into the result of `split('.')`. For a file like `archive.tar.gz`, the extension should be `gz`, not `tar`.
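
A quick sketch of the difference on the `archive.tar.gz` case. The variable names are illustrative only.

```python
name = "archive.tar.gz"

all_parts = name.split('.')       # splits on every dot
last_split = name.rsplit('.', 1)  # splits once, starting from the right

print(all_parts)      # ['archive', 'tar', 'gz']
print(all_parts[1])   # tar  (wrong if you index position 1)
print(last_split)     # ['archive.tar', 'gz']
print(last_split[1])  # gz
```

Taking `split('.')[-1]` also yields `gz`; `rsplit('.', 1)` simply makes the "last dot only" intent explicit and avoids building the full list of parts.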

> **Common Pitfall**
>
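> The solution above treats `.gitignore` (a dotfile with no extension) as having extension `gitignore`, which matches the literal problem statement because the basename contains a dot. Whether that is the desired behavior depends on convention; clarify with the interviewer.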
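
If the interviewer prefers the convention where dotfiles have no extension, one possible sketch uses the standard library's `os.path.splitext`, which treats leading dots as part of the name. The function name `extension_key_dotfile_aware` is hypothetical, and this variant deliberately deviates from the literal problem statement.

```python
import os.path


def extension_key_dotfile_aware(filename: str) -> str:
    """Grouping key where a leading dot does not start an extension."""
    # splitext('.gitignore') returns ('.gitignore', ''), so ext is empty here.
    ext = os.path.splitext(filename)[1][1:].lower()  # drop the leading '.'
    return ext or 'no_extension'  # also covers a trailing dot such as 'file.'


print(extension_key_dotfile_aware(".gitignore"))      # no_extension
print(extension_key_dotfile_aware("archive.tar.gz"))  # gz
print(extension_key_dotfile_aware("notes.TXT"))       # txt
```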

---

## Common follow-up questions

- What if the metadata format varies? _(Tests robust parsing with error handling for malformed entries.)_
- How would you find the top-3 extensions by total size? _(Tests sorting the aggregated dict by value.)_
- What if you needed average file size per extension? _(Tests tracking both sum and count for each group.)_
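
The follow-ups above could be approached along these lines. This is a minimal sketch, not a reference answer: the names `profile_stats` and `top_extensions` are hypothetical, and skipping malformed entries silently is an assumption (logging or raising may be preferable in practice).

```python
def profile_stats(entries: list[str]) -> dict[str, dict]:
    """Per-extension total, count, and average; malformed entries are skipped."""
    sums_counts: dict[str, list[int]] = {}
    for entry in entries:
        parts = entry.split()
        try:
            size = int(parts[1])
        except (IndexError, ValueError):
            continue  # assumption: ignore entries with a missing or non-integer size
        filename = parts[0].split('/')[-1]
        if '.' in filename:
            ext = filename.rsplit('.', 1)[1].lower()
        else:
            ext = 'no_extension'
        bucket = sums_counts.setdefault(ext, [0, 0])  # [total, count]
        bucket[0] += size
        bucket[1] += 1
    return {
        ext: {'total': total, 'count': count, 'average': total / count}
        for ext, (total, count) in sums_counts.items()
    }


def top_extensions(stats: dict[str, dict], n: int = 3) -> list[str]:
    """Extensions ordered by total size, largest first, limited to n."""
    return sorted(stats, key=lambda ext: stats[ext]['total'], reverse=True)[:n]


# Hypothetical sample data, including two malformed entries.
sample_entries = [
    "/d/a.csv 100 1",
    "/d/b.csv 50 2",
    "/d/x.log 400 3",
    "bad-line",
    "/d/README 25 4",
    "/d/c.json notanumber 5",
    "/d/d.json 10 6",
]
stats = profile_stats(sample_entries)
print(stats['csv'])           # {'total': 150, 'count': 2, 'average': 75.0}
print(top_extensions(stats))  # ['log', 'csv', 'no_extension']
```

Tracking the count alongside the sum is what makes the average possible without a second pass over the entries.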

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_file_size_profiler)
- [Python Interview Questions](https://datadriven.io/python-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.