# Nested Docs, Flat Reports

> Two databases. One direction. No data left behind.

Canonical URL: <https://datadriven.io/problems/nested_docs_flat_reports>

Domain: Pipeline Design · Difficulty: medium · Seniority: L4

## Problem

Our product team stores application data in MongoDB, but the analytics team needs it in Redshift for reporting. The documents are nested and schemaless. Design a batch ETL pipeline to move this data from MongoDB to Redshift.

## Worked solution and explanation

### Why this problem exists in real interviews

Application data is nested and updates continuously. Reporting wants flat tables that standard SQL can join. Reports run all day, so the load can never produce a moment when half the tables are updated and half aren't. The interesting pressure isn't the flattening, it's the atomicity.

First instinct is to flatten and load each table as it's transformed. The orders table updates first, then line_items, then customers. A report that runs partway through the load joins the new orders to yesterday's customers, gets a number that doesn't reconcile, and an analyst files a data-quality ticket the next morning. The bug isn't in the flattening; it's in not making the run visible all-or-nothing.

> **Trick to Solving**
>
> When reports run while the warehouse is loading, atomicity is a feature, not a nice-to-have. Land into staging, swap into production in one move.
> 
> 1. Flatten once, swap atomically. Two distinct concerns; don't blur them into one job.
> 2. Staging is the cheap insurance policy. The cost of an extra storage layer is small compared to the cost of inconsistent reports.

---

### Walk the requirements

#### Step 1: Flatten nested documents into relational tables before the warehouse

Reports query with standard SQL joins. That means the documents have to be unpacked into rows where each repeated structure becomes its own table, order header, order line, address, joinable on the natural keys. The flattening rules belong in the transformation layer (Spark, dbt, or whatever your team reaches for), separate from the warehouse load, so when the source adds a field you change one place.

#### Step 2: Make each load atomic by swapping from staging in one step

Land all the flattened tables in a staging schema. Once every table for that run is built and validated, swap them into the production schema in a single move (a rename, a view repointing, or a transactional batch, whatever your warehouse supports cheaply). On the canvas this is an 'idempotency: staging_table' annotation on the load step. Reports running mid-load see either the prior run or the new one, never half-and-half.

---

### The shape that fits

> **What this design gives up**
>
> Staging costs you double storage during the load window and a couple extra minutes on each run. Atomicity isn't free. What you're buying is that analysts trust the warehouse: nobody files a ticket for an inconsistent join, and the team isn't on the hook for a class of bug that only happens during the load window.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - Analytics is in the warehouse and the documents have to land there as flat tables analysts can join with standard SQL.

> **The mistake that ships**
>
> The version that ships writes each flattened table directly into the production schema as it finishes. A morning report runs while the load is in flight, joins yesterday's `customers` to today's `orders`, and the revenue total comes in 2% off. An analyst spends two days investigating, the team adds 'don't query during the load window' to the runbook, and somebody eventually proposes a flag column to indicate freshness. The cleaner fix has always been the staging swap; the runbook entry is a workaround for a missing atomicity guarantee.

---

## Common follow-up questions

- How would you make sure a long-running report that started before the swap still sees a consistent view? _(Probes whether 'atomic load' is actually atomic across long-running reads, not just at the moment of swap.)_
- If the source had hard deletes, how would you make sure deleted records actually leave the warehouse rather than just stop appearing? _(Tests handling of removals, the easiest thing to miss in a copy-from-source pipeline.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/nested_docs_flat_reports)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.