# Not Every Team Can See Every Row

> Everyone can see the bucket. Not everyone should.

Canonical URL: <https://datadriven.io/problems/not_every_team_can_see_every_row>

Domain: Pipeline Design · Difficulty: hard · Seniority: L7

## Problem

We store all our data in S3 and want to build a warehouse on top of it. The challenge is that different teams have strict data access requirements: some data is confidential and can only be queried by specific groups, and we need that access control enforced at the file level in S3, not just at the query layer. Design the warehouse architecture.

## Worked solution and explanation

### Why this problem exists in real interviews

Access control with two boundaries that have to hold together: storage-layer file isolation (so a direct API call can't bypass the query engine) and column-level visibility on mixed tables (so an analyst sees public columns and HR sees everything). The trap is putting all governance in the query engine; one direct API call to S3 makes that the wrong abstraction.

The default reach is one S3 prefix for everything, with the warehouse enforcing access in views. An engineer copies a confidential file directly via the AWS API because the bucket policy didn't say no; the privacy office finds out from a screenshot. A mixed table is exposed as two views (public-only and full); the analyst's query against the underlying table without going through the view returns confidential columns. Compliance asks who's read which files for the last year and the answer is nothing.

> **Trick to Solving**
>
> Per-sensitivity prefixes / buckets, classification at ingest, column policies on mixed tables in the engine, audit log of every access.
> 
> 1. Sensitivity is assigned at ingest; records route to per-sensitivity prefixes (or buckets) with file-level policies. Direct API calls hit the policy at the storage layer.
> 2. Mixed tables (public columns plus confidential columns) live in the warehouse with column-level access policies tied to roles. Analysts query the table and get public columns; HR gets everything.
> 3. Every query and file access writes to an audit log retained for the regulatory window. The log is queryable for compliance.
> 4. Reclassification is rare and explicit; the design makes the at-ingest classification the contract.

---

### Walk the requirements

#### Step 1: Storage-layer access enforced by per-sensitivity prefixes or buckets

Sensitivity is assigned to each record at ingest; records route to per-sensitivity locations in cold storage (separate buckets for confidential, restricted, public, with bucket policies tied to roles). A direct API call to a confidential bucket from a role that doesn't have access fails at the storage layer, not just at the query engine. Without storage-layer policies the access boundary lives only in the warehouse, and a single direct API call bypasses it.

#### Step 2: Classify at ingest; routing follows the classification

Reclassifying data after it lands is the version where confidential rows sit briefly in the public bucket and somebody downloads them. The ingest step assigns sensitivity (from a classification rule, a column name match, or an upstream tag) and routes the record to the matching prefix at write time. The classification is durable on the record. A 'we'll classify later' approach is the version where the audit finds confidential rows where they shouldn't be.

#### Step 3: Column-level access on mixed tables in the warehouse

Some tables (employee, customer) hold public columns like name and department alongside confidential columns like salary. Column-level access policies tied to roles enforce that an analyst's query returns only public columns; HR's query returns everything. The same table; different visible columns per role. A 'hide the columns in the BI tool' approach is the version where a query against the underlying table bypasses the BI tool and the columns are visible.

#### Step 4: Audit log of every query and file access, retained for years

Compliance asks who accessed what. Every query (engine query log) and every file access (storage access log) writes to an audit archive retained for the regulatory window. Compliance queries the archive for any window and gets a record. Without the archive the answer to 'who accessed what last March' is a forensic exercise; with it, the answer is a SQL query.

---

### The shape that fits

> **What this design gives up**
>
> Per-sensitivity buckets multiply the storage layout and the IAM configuration; classification at ingest needs the classification rule to be reliable on each record; column-level policies are configuration that has to be reviewed when teams or roles change; the audit log grows for years. Implementation cost is the price; the win is access that holds against direct API calls, mixed tables that show the right columns to the right user, and an audit answer that's a query.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - Storage-layer access policies on per-sensitivity prefixes / buckets enforce file-level access regardless of which query path is used.
> - Records classify at ingest and route to the location matching their sensitivity rather than reclassifying after landing.
> - Mixed tables apply column-level access in the warehouse so each user sees only their authorized columns.
> - Every query and file access writes to a queryable audit log retained for the regulatory window.

> **The mistake that ships**
>
> What gets shipped puts everything in one S3 prefix and trusts the warehouse views to enforce access. An engineer downloads a confidential file via direct API because the bucket policy didn't say no. A mixed table's columns are 'hidden' in BI but a direct query returns them. Compliance asks who saw what last quarter and the answer is a forensic exercise nobody finishes. The eventual rebuild adds per-sensitivity zones, ingest-time classification, column-level policies, and an audit archive , each was reachable up front if 'enforce access at the storage layer' had been the architecture rather than a phrase in the requirements.

---

## Common follow-up questions

- A new department is granted access to a previously-confidential dataset. What in this design changes, and where? _(Tests whether the candidate sees the change as policy-only: the new role gets read access to the confidential bucket and column-level access on the relevant warehouse tables; no data movement, no reclassification. The audit log records the policy change and subsequent accesses.)_
- An ingest-time classification rule misses a row that should have been confidential. What does this design do when the misclassification is found, and what does the audit show? _(Tests whether the candidate sees that the misclassified row is moved (a deletion from the lower zone, an insert into the higher zone), the audit log shows the access events from while the row was in the wrong zone, and compliance can identify which users may have read it. The fix is at the data; the discoverability is in the audit log.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/not_every_team_can_see_every_row)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.