# The Shape They All Share

> A pile of exports, no two quite alike. Keep only what every record agrees on.

Canonical URL: <https://datadriven.io/problems/the-shape-they-all-share>

Domain: Python · Difficulty: medium · Seniority: mid

## Problem

A nightly job loads dozens of regional CSV exports into memory as `tables`, each one a list of row dicts. Over the years each region added its own extra columns, so the records no longer line up. Combine every row into a single list, keeping only the fields present in every record across all tables and dropping the rest. If there are no records at all, return an empty list.

## Worked solution and explanation

### Why this problem exists in real interviews

This is a schema-intersection projection wearing a 'merge a pile of CSVs' costume. The phrase that decides the whole problem is 'present in every record': you are not just concatenating tables, you are concatenating them after pinning every row to the set of fields that all rows share. Almost everyone flattens the rows correctly. What separates candidates is realizing that you cannot decide which columns survive until you have seen every record, so the answer needs one pass to intersect the field sets and a second pass to project. Skip the intersection and you ship ragged rows, where one region's bonus column rides along on the records that have it and is missing from the rest.

---

### Break down the requirements

#### Step 1: Flatten first, but do not emit yet

Walk every table and every row into one flat list. This is the easy half. The mistake is returning here: at this point you still know nothing about which columns are universal, so anything you emit now is ragged by construction.
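A minimal sketch of the flattening pass, using two made-up regional tables (the names `north` and `south` and their columns are illustrative, not part of the problem). `itertools.chain.from_iterable` is used here as one equivalent of a nested comprehension; the point is that the flat list still carries mismatched schemas.

```python
from itertools import chain

# Hypothetical sample data: two regions, one with an extra column.
north = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
south = [{"id": 3, "amount": 30, "bonus": 5}]
tables = [north, south]

# chain.from_iterable flattens exactly one level: tables -> rows.
rows = list(chain.from_iterable(tables))

# All rows are present, but their key sets still disagree.
schemas = {frozenset(row) for row in rows}
print(len(rows))     # 3
print(len(schemas))  # 2 distinct key sets, so the list is still ragged
```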

#### Step 2: Intersect the field sets across all rows

Seed a 'common' set with the keys of the first row, then intersect it with each remaining row's keys. A single short row, one record missing a column, drags that column out of the result for everyone. That is the whole point of 'common'.
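As a small illustration of the intersection pass, assuming three invented rows where the second one lacks `region`:

```python
# Hypothetical rows: the second one is missing "region".
rows = [
    {"id": 1, "amount": 10, "region": "north"},
    {"id": 2, "amount": 20},
    {"id": 3, "amount": 30, "region": "south", "bonus": 5},
]

common = set(rows[0])        # seed with the first row's keys
for row in rows[1:]:
    common &= row.keys()     # dict key views support set intersection

print(sorted(common))  # ['amount', 'id']
```

The single short row is enough to remove `region` for every record, even though two of the three rows carry it.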

#### Step 3: Project every row to the common set

Now make a second pass and rebuild each row keeping only keys in 'common'. Guard the empty case up front: if there are no rows at all, the intersection is undefined, so return an empty list before you ever touch rows[0].
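The projection pass can be sketched on its own as a small helper. The name `project` and the sample rows are assumptions made for illustration only.

```python
def project(rows, common):
    """Hypothetical helper: keep only the keys in `common` on every row."""
    return [{k: v for k, v in row.items() if k in common} for row in rows]

rows = [
    {"id": 1, "amount": 10, "region": "north"},
    {"id": 2, "amount": 20},
]
print(project(rows, {"id", "amount"}))
# [{'id': 1, 'amount': 10}, {'id': 2, 'amount': 20}]

# With no rows there is nothing to seed the intersection from:
# indexing the first row raises IndexError, hence the up-front guard.
no_rows = []
try:
    no_rows[0]
except IndexError:
    print("empty input needs an early return")
```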

---

### The solution

**Intersect field sets, then project**

```python
def merge_common(tables):
    rows = [row for table in tables for row in table]
    if not rows:
        return []
    common = set(rows[0])
    for row in rows[1:]:
        common &= row.keys()
    return [{k: v for k, v in row.items() if k in common} for row in rows]
```
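A short usage sketch with invented data, including an empty table and the two empty-input cases. The function is repeated so the snippet runs on its own; the region names and columns are hypothetical.

```python
def merge_common(tables):
    # Same function as above, repeated so this snippet is self-contained.
    rows = [row for table in tables for row in table]
    if not rows:
        return []
    common = set(rows[0])
    for row in rows[1:]:
        common &= row.keys()
    return [{k: v for k, v in row.items() if k in common} for row in rows]

# Hypothetical exports: "bonus" and "channel" each exist in one region only.
north = [{"id": 1, "amount": 10, "bonus": 5}]
south = [{"id": 2, "amount": 20, "channel": "web"}, {"id": 3, "amount": 30}]
empty_region = []  # zero rows, so it adds no constraint

print(merge_common([north, empty_region, south]))
# [{'id': 1, 'amount': 10}, {'id': 2, 'amount': 20}, {'id': 3, 'amount': 30}]
print(merge_common([]))        # []
print(merge_common([[], []]))  # []
```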

> **Complexity**
>
> Time is O(N * K), where N is the total number of rows and K is the average number of columns per row: one pass to intersect and one pass to project, each O(K) per row. Space is O(N * K) for the output, plus O(N) for the flattened list of row references. At the scale this models, dozens to hundreds of files of a few thousand rows each, that should be a fraction of a second; the cost is typically dominated by the CSV parse you did upstream, not this merge.

**Naive: flatten and return**

The naive version builds `out = []`, calls `out.extend(t)` for each table `t`, and returns `out`. It reads clean, passes a test where every file happens to share columns, and silently fails the moment one region adds a field: the extra column rides along on the rows that have it.
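Spelled out as a runnable function, the naive approach looks roughly like this. The name `merge_naive` and the sample tables are illustrative assumptions.

```python
def merge_naive(tables):
    # Concatenate only: no schema check at all.
    out = []
    for t in tables:
        out.extend(t)
    return out

# Hypothetical data: only the second region has a "bonus" column.
north = [{"id": 1, "amount": 10}]
south = [{"id": 2, "amount": 20, "bonus": 5}]

merged = merge_naive([north, south])
print([sorted(row) for row in merged])
# [['amount', 'id'], ['amount', 'bonus', 'id']]
```

The two output rows have different key sets, which is the ragged result the problem asks you to avoid.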

**Correct: intersect then project**

Compute the shared field set across every row, then keep only those keys on each row. Every output row has an identical schema, which is exactly the contract 'keeping only the fields present in every record' asks for.

> **Interviewers Watch For**
>
> Whether you ask what 'common' means before coding (common across files, or across every row?) and whether you handle empty input without a crash. Strong candidates also note that `row.keys()` is a set-like view that supports `&` directly, so `common &= row.keys()` works without first converting each row to a `set`.

> **Common Pitfall**
>
> Indexing `rows[0]` before the empty guard, which raises `IndexError` on `tables=[]` or `tables=[[], []]`. The second trap is treating an empty table as a row with no keys and intersecting it to nothing: an empty table has zero rows, so it contributes no constraint and must not wipe out every column.

---

## Common follow-up questions

- What if instead of dropping non-common columns you must keep the union, filling missing fields with None? _(Flips intersection to union and forces a per-row backfill; tests whether the candidate sees the two requirements as the same skeleton with a different reducer.)_
- The files no longer fit in memory. How would you compute the common columns across hundreds of files without loading them all at once? _(Tests a streaming two-pass design: one pass over headers to intersect, a second to project; the natural bridge to Spark or a chunked pandas read.)_
- Two files describe the same entity with the same key but conflicting values. How do you resolve duplicates instead of just concatenating? _(Moves from concat to a keyed merge with a conflict policy (last-wins, newest-timestamp), which is the real shape of most ETL consolidation jobs.)_
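For the first follow-up, one possible sketch of the union variant is shown below. The function name `merge_union`, its `fill` parameter, and the sample data are assumptions for illustration; the skeleton matches the intersection solution, with the reducer swapped and a backfill added.

```python
def merge_union(tables, fill=None):
    """Hypothetical variant: keep every column, backfilling gaps with `fill`."""
    rows = [row for table in tables for row in table]
    # dict.fromkeys keeps first-seen order, giving a stable column order.
    columns = list(dict.fromkeys(k for row in rows for k in row))
    return [{k: row.get(k, fill) for k in columns} for row in rows]

north = [{"id": 1, "amount": 10}]
south = [{"id": 2, "amount": 20, "bonus": 5}]
print(merge_union([north, south]))
# [{'id': 1, 'amount': 10, 'bonus': None}, {'id': 2, 'amount': 20, 'bonus': 5}]
```

One caveat of this sketch: with `fill=None`, a backfilled field cannot be told apart from a field whose original value was `None`, so a distinct sentinel may be preferable when that difference matters.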

---

Source: DataDriven (https://datadriven.io).