# Cloud File Storage Metadata Schema

> A file is also a folder. A folder is also a file.

Canonical URL: <https://datadriven.io/problems/cloud_file_storage_metadata_schema>

Domain: Data Modeling · Difficulty: hard · Seniority: L6

## Problem

We run a cloud file storage platform. Users upload files from multiple devices, organize them into folders, and share files with other users. We need to model the metadata for analytics: upload volume, storage consumption per user, sharing behavior, and quota enforcement. Design the data model.

## Worked solution and explanation

### Why this problem exists in real interviews

This probes whether a candidate recognizes **self-referential hierarchies**, **semi-additive measures**, and **polymorphic subtypes**. Folders and files share the same parent relationship, quota is a periodic snapshot rather than a summable measure, and EXIF metadata exists only for photos. Strong candidates see all three patterns.

> **Trick to Solving**
>
> Before drawing tables, a strong candidate asks: do folders and files share any behavior, and is quota a sum or a snapshot? The signal here is that a folder is just a file with a different type, which points to a single `dim_objects` table with `parent_folder_id` self-referencing.
> 
> 1. Unify files and folders into one objects dimension
> 2. Treat EXIF as a subtype table joined only for photos
> 3. Model storage quota as a daily periodic snapshot
> 4. Preserve soft-deleted files as active quota contributors

---

### Break down the requirements

#### Step 1: Unify objects

Folders and files are both nodes in the same tree. Give them one `dim_objects` table with `object_type` and `parent_folder_id` pointing back to the same table. Two separate tables explode join complexity for anything that crosses file and folder boundaries.

#### Step 2: Subtype EXIF onto its own table

EXIF applies to photos only. Storing it on `dim_objects` either creates a wide sparse table or blocks schema evolution. `dim_photo_metadata` holds EXIF keyed by `object_id`, joined only when needed.

#### Step 3: Model quota as a periodic snapshot

Quota is semi-additive: you can sum across users but not across days. Daily snapshots on `fact_storage_snapshots` give analysts a stable as-of view without risky SUM-over-time queries on the event log.

#### Step 4: Keep soft-deleted bytes in quota

Trashed files still consume storage. The snapshot job counts `is_deleted = TRUE` rows until the file is purged. A candidate who filters `is_deleted = FALSE` at the snapshot boundary will under-bill the user.

#### Step 5: Split upload and sharing facts

Uploads are per-byte transaction events; shares are per-grant transaction events. Different measures, different grains, same user dimension.

---

### The solution

Below is one defensible design: unified objects, subtyped photo metadata, and a periodic snapshot fact for quota enforcement.

> **Why this works**
>
> Self-referential objects collapse folder and file modeling into one recursive tree, EXIF subtype avoids sparse columns, and the periodic snapshot gives quota a stable grain. The trade-off is explicit daily snapshots in exchange for correct semi-additive reporting.

> **Interviewers watch for**
>
> Strong candidates say the words `periodic snapshot` and `semi-additive`. They also catch the trashed-bytes trap on their own. Weak candidates model folders and files as separate tables, then spend five minutes explaining how to recurse across two tables to compute folder size.

> **Common pitfall**
>
> Summing `bytes_transferred` from `fact_upload_events` to estimate storage. That double counts a re-upload of the same file and ignores deletions entirely. Storage is a state, not a sum of events.

---

### The analysis pattern

**Users approaching quota**

```sql
SELECT
    u.email,
    u.quota_bytes,
    s.total_bytes,
    ROUND(100.0 * s.total_bytes / NULLIF(u.quota_bytes, 0), 1) AS pct_used,
    s.trashed_bytes
FROM fact_storage_snapshots s
JOIN dim_users u ON u.user_key = s.user_key
WHERE s.snapshot_date = CURRENT_DATE
  AND s.total_bytes > u.quota_bytes * 0.9
ORDER BY pct_used DESC
```

---

### Trade-offs and alternatives

**Unified objects plus snapshot**

One tree, one dimension, snapshots keep quota correct. Cost: recursive queries for deep folder ancestry and a daily snapshot job.

**Separate files and folders, event-only**

Simpler per-table schema, no snapshot job. Cost: folder size requires a join across two tables and quota must be summed over mutable events, which is slow and error prone.

---

## Common follow-up questions

- How would you compute the full folder path for an object at query time without recursing to the root? _(Tests materialized path or closure table alternatives for deep hierarchies.)_
- What if a file is shared into another user's namespace and both should see it in their folder tree? _(Tests whether sharing introduces a junction table distinct from ownership.)_
- How do you enforce GDPR when a user deletes their account but shared files remain? _(Tests ownership transfer semantics and cascading deletes on the object tree.)_
- At 10B objects globally, how do you partition `dim_objects` and `fact_storage_snapshots`? _(Tests partitioning by owner hash and by snapshot_date respectively.)_
- How would you back-compute storage for a historical date after the snapshot has drifted from reality? _(Tests whether the candidate can reconcile upload events against the snapshot for repair.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/cloud_file_storage_metadata_schema)
- [Data Modeling Interview Questions](https://datadriven.io/data-modeling-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.