# The Bucket Full of Resumes

> A thousand resumes. Structured data inside each one.

Canonical URL: <https://datadriven.io/problems/the_bucket_full_of_resumes>

Domain: Pipeline Design · Difficulty: medium · Seniority: L7

## Problem

Our HR platform receives thousands of resumes monthly as PDFs and scanned images. Right now they sit in an S3 bucket and searching them means opening files manually. We need a pipeline that extracts structured information from every document - candidate name, skills, work history, education - and makes it queryable. Design the end-to-end ingestion and extraction pipeline.

## Worked solution and explanation

### Why this problem exists in real interviews

This is an L7 search-and-extraction pipeline with three requirements that often conflict: full-text search combined with structured filters, manual corrections that have to survive reprocessing, and per-tier access to PII. The trap is treating extraction as a one-shot transformation: reprocessing will happen as the model improves, and in a one-shot design the first reprocessing run overwrites every correction.

The default reach is to OCR every resume, extract structured fields, push the text into a search index, and tell recruiters to query it. That design fails in three ways. Recruiters can grep but not filter by years of experience. HR corrects a mis-extracted skill set, and the next reprocessing run overwrites the correction with the model's output. Recruiters and analysts both read the same store, so analytics queries pull names along with skills, which the privacy team flags.

> **Trick to Solving**
>
> Originals immutable, extraction outputs structured plus text, corrections layer that survives reprocessing, per-tier access on PII.
> 
> 1. Originals stay in cold storage immutably; extraction is the fallible layer that improves over time and reprocesses against the originals.
> 2. Each resume produces both full-text (for search) and structured fields (for filters) into a warehouse the recruiter UI joins across.
> 3. A corrections layer captures HR's edits with a higher precedence than extraction output; reprocessing reads extraction plus corrections, and the corrected fields persist.
> 4. Per-tier access enforces PII visibility at the warehouse: recruiters see PII, analytics reads a de-identified view that exposes skills and experience without names.

---

### Walk the requirements

#### Step 1: Move resumes through extraction into a search layer within hours

Uploads land in an immutable original archive in cold storage; an extraction pipeline OCRs scans, parses text, extracts structured fields, and writes both full-text records to the search index and structured rows to the warehouse. Recruiters query within hours of upload. The originals stay around because reprocessing reads from them when extraction improves; without that, every model improvement requires re-uploading documents.
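
The relationship between the immutable archive and repeatable extraction can be sketched as follows. This is a minimal in-memory illustration, not a definitive implementation: `OriginalArchive`, `extractor_v1`, `extractor_v2`, and `run_extraction` are hypothetical names, the dictionary stands in for cold storage, and the toy extractors assume plain-text input rather than real OCR output.

```python
import hashlib
from typing import Callable, Dict, Tuple


class OriginalArchive:
    """In-memory stand-in for the immutable cold-storage archive (assumption)."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # Content-addressed id: the same bytes always map to the same key.
        resume_id = hashlib.sha256(data).hexdigest()
        # Write-once: an existing original is never replaced.
        self._blobs.setdefault(resume_id, data)
        return resume_id

    def get(self, resume_id: str) -> bytes:
        return self._blobs[resume_id]

    def ids(self) -> list:
        return sorted(self._blobs)


def extractor_v1(raw: bytes) -> dict:
    # Toy extractor: treats the first line as the candidate name.
    return {"name": raw.decode("utf-8").splitlines()[0].strip()}


def extractor_v2(raw: bytes) -> dict:
    # "Improved" toy extractor: also parses a "Skills:" line.
    fields = extractor_v1(raw)
    fields["skills"] = []
    for line in raw.decode("utf-8").splitlines():
        if line.lower().startswith("skills:"):
            fields["skills"] = [s.strip() for s in line.split(":", 1)[1].split(",")]
    return fields


def run_extraction(
    archive: OriginalArchive,
    extractor: Callable[[bytes], dict],
    version: str,
    outputs: Dict[Tuple[str, str], dict],
) -> None:
    # Every run reads the originals, so a new version needs no re-upload.
    for resume_id in archive.ids():
        outputs[(resume_id, version)] = extractor(archive.get(resume_id))


archive = OriginalArchive()
outputs: Dict[Tuple[str, str], dict] = {}
original = b"Ada Lovelace\nSkills: Python, SQL\n"
rid = archive.put(original)
run_extraction(archive, extractor_v1, "v1", outputs)
run_extraction(archive, extractor_v2, "v2", outputs)  # reprocess from originals
```

Keying outputs on `(resume_id, version)` is one possible choice; it keeps older extraction results available for comparison while the archive itself is never modified.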

#### Step 2: Both full-text and structured filters in one query

Recruiters search by free text and filter by skills, years of experience, and degree. The extraction produces structured fields (skills as a list, experience years as a number, degree as an enum) into the warehouse keyed on resume id; the search index holds the full text keyed on the same id. The recruiter UI joins index hits with warehouse filters. A 'put it all in search and filter on text patterns' approach is the version where 'five years of experience' becomes a regex problem and recruiters give up.
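
A rough sketch of the join between index hits and warehouse filters is shown below. It assumes a toy inverted index and an in-memory dictionary in place of a real search engine and warehouse; `ResumeRow`, `build_index`, `full_text_hits`, and `search` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Set


@dataclass(frozen=True)
class ResumeRow:
    """Structured fields for one resume, keyed on the same id as the index."""
    resume_id: str
    skills: FrozenSet[str]
    experience_years: int
    degree: str


def build_index(texts: Dict[str, str]) -> Dict[str, Set[str]]:
    # Toy inverted index: token -> set of resume ids containing it.
    index: Dict[str, Set[str]] = {}
    for rid, text in texts.items():
        for token in text.lower().split():
            index.setdefault(token, set()).add(rid)
    return index


def full_text_hits(index: Dict[str, Set[str]], query: str) -> Set[str]:
    # AND semantics: a resume must contain every query term.
    terms = query.lower().split()
    if not terms:
        return set()
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return hits


def search(
    index: Dict[str, Set[str]],
    warehouse: Dict[str, ResumeRow],
    query: str,
    min_years: int = 0,
    skill: Optional[str] = None,
    degree: Optional[str] = None,
) -> List[str]:
    # Join index hits with structured rows on resume_id, then filter.
    results = []
    for rid in sorted(full_text_hits(index, query)):
        row = warehouse.get(rid)
        if row is None or row.experience_years < min_years:
            continue
        if skill is not None and skill not in row.skills:
            continue
        if degree is not None and row.degree != degree:
            continue
        results.append(rid)
    return results


texts = {
    "r1": "built streaming pipelines on kafka",
    "r2": "kafka administrator and pipelines",
    "r3": "frontend developer",
}
warehouse = {
    "r1": ResumeRow("r1", frozenset({"kafka", "python"}), 6, "MSC"),
    "r2": ResumeRow("r2", frozenset({"kafka"}), 2, "BSC"),
    "r3": ResumeRow("r3", frozenset({"typescript"}), 8, "BSC"),
}
index = build_index(texts)
senior_kafka = search(index, warehouse, "kafka pipelines", min_years=5)
```

Here "at least five years" is a numeric comparison on `experience_years`, not a pattern match over resume text, which is the point of keeping structured fields separate from the index.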

#### Step 3: Corrections persist through reprocessing

When HR corrects a wrongly-extracted field, the correction writes to a corrections store keyed on (resume_id, field). The recruiter view reads the union of extraction output and corrections, with corrections taking precedence. The next reprocessing run reads extraction's new output but the corrections store stays as it was; the union still surfaces the corrected value. A 'reprocessing overwrites everything' approach is the version where HR redoes the same correction every time the model improves.
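
The precedence rule can be sketched in a few lines. This is an illustrative model under stated assumptions: both stores are plain dictionaries keyed on `(resume_id, field)`, and `recruiter_view` and `reprocess` are hypothetical helper names, not part of any particular system.

```python
from typing import Any, Dict, Tuple

FieldKey = Tuple[str, str]  # (resume_id, field)


def recruiter_view(
    resume_id: str,
    extraction: Dict[FieldKey, Any],
    corrections: Dict[FieldKey, Any],
) -> Dict[str, Any]:
    # Start from extraction output, then let corrections override per field.
    merged = {f: v for (rid, f), v in extraction.items() if rid == resume_id}
    merged.update({f: v for (rid, f), v in corrections.items() if rid == resume_id})
    return merged


def reprocess(extraction: Dict[FieldKey, Any], new_output: Dict[FieldKey, Any]) -> None:
    # Reprocessing replaces extraction output only; it never touches corrections.
    extraction.clear()
    extraction.update(new_output)


extraction: Dict[FieldKey, Any] = {
    ("r1", "name"): "Ada Lovelace",
    ("r1", "skills"): ["Pyhton"],  # mis-extracted by the first model
}
corrections: Dict[FieldKey, Any] = {}

# HR fixes the skills field once.
corrections[("r1", "skills")] = ["Python", "SQL"]
before = recruiter_view("r1", extraction, corrections)

# A newer model reruns and produces different output for the same resume.
reprocess(
    extraction,
    {
        ("r1", "name"): "Ada Lovelace",
        ("r1", "skills"): ["Python"],
        ("r1", "degree"): "BSC",
    },
)
after = recruiter_view("r1", extraction, corrections)
```

After reprocessing, the view picks up the newly extracted `degree` field while `skills` still shows HR's corrected value, because the correction is stored separately and applied at read time.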

#### Step 4: Per-tier access on PII at the warehouse

Resumes contain names, addresses, and other PII recruiters need to see and analytics doesn't. Per-tier policies on the warehouse: recruiters' role sees PII columns; analytics' role reads a de-identified view that exposes skills, experience, and degree without identifying fields. The platform enforces the boundary regardless of how the query is written. A 'we'll filter PII in BI' approach is the version where one analytics query without the filter pulls names alongside skills.
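
A de-identified view might look like the sketch below, which uses SQLite from the Python standard library purely for illustration. SQLite has no roles or grants, so the `ROLE_RELATION` mapping is an assumption that stands in for the warehouse's own access policies; the table, view, and column names are hypothetical as well.

```python
import sqlite3

# Assumption: each role is allowed to read exactly one relation. In a real
# warehouse this mapping would be enforced by grants, not application code.
ROLE_RELATION = {"recruiter": "candidates", "analytics": "candidates_deid"}


def connect() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE candidates (
            resume_id TEXT PRIMARY KEY,
            name TEXT,
            address TEXT,
            skills TEXT,
            experience_years INTEGER,
            degree TEXT
        );
        -- De-identified view: no id, name, or address columns.
        CREATE VIEW candidates_deid AS
            SELECT skills, experience_years, degree FROM candidates;
        """
    )
    conn.execute(
        "INSERT INTO candidates VALUES (?, ?, ?, ?, ?, ?)",
        ("r1", "Ada Lovelace", "12 Example St", "python,sql", 6, "MSC"),
    )
    return conn


def visible_columns(conn: sqlite3.Connection, role: str) -> list:
    # The relation name comes from the fixed mapping above, never user input.
    relation = ROLE_RELATION[role]
    cursor = conn.execute(f"SELECT * FROM {relation}")
    return [column[0] for column in cursor.description]


def read_rows(conn: sqlite3.Connection, role: str) -> list:
    relation = ROLE_RELATION[role]
    return conn.execute(f"SELECT * FROM {relation}").fetchall()


conn = connect()
recruiter_columns = visible_columns(conn, "recruiter")
analytics_columns = visible_columns(conn, "analytics")
analytics_rows = read_rows(conn, "analytics")
```

Even a `SELECT *` from the analytics side returns no identifying columns, because the boundary is defined by what the view exposes rather than by the column list an analyst remembers to write.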

---

### The shape that fits

> **What this design gives up**
>
> An immutable original archive plus reprocessing infrastructure costs more storage and compute than a one-shot extract-and-discard; the corrections store adds a precedence layer that recruiters and reprocessing have to know about; per-tier policies and the de-identified view are configuration that has to be reviewed when teams change. Implementation cost is the price; the win is recruiters who can find candidates with structured filters, HR corrections that don't get overwritten again, and analytics paths that survive a privacy review.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - Original PDFs and scans are preserved unchanged in cold storage; reprocessing reads from the originals.
> - Extraction produces both full-text search and structured fields in the warehouse.
> - A corrections layer persists HR's manual edits with precedence over extraction; reprocessing doesn't overwrite them.
> - Per-tier access enforces PII visibility at the warehouse so analytics reads a de-identified view.

> **The mistake that ships**
>
> What gets shipped OCRs each resume once, writes to a search index, and tells everyone to query it. Recruiters can grep but not filter by skills or years of experience. HR corrects a mis-extracted skill; the next reprocessing run overwrites the correction. An analytics query pulls names alongside skills because the privacy filter was a column list the analyst forgot. The eventual rebuild adds the original archive, the structured warehouse, the corrections precedence layer, and per-tier access; each was reachable up front if 'recruiters need to filter' had been treated as a structured-data problem rather than a search problem.

---

## Common follow-up questions

- Extraction improves and reprocesses every resume. How does this design preserve corrections, and what does HR see for a correction the new model agrees with? _(Tests whether the candidate sees the precedence layer: corrections always win in the recruiter view, even when the new model agrees. The corrections store is unchanged; the recruiter sees the corrected value (which now matches the new model's output anyway). HR can prune corrections that no longer differ if they want to clean up.)_
- An analytics user wants to know how many candidates have a specific skill in a specific city. What in this design lets them, and what doesn't? _(Tests whether the candidate sees the de-identified analytics view exposing aggregated skills by location without exposing identifying fields. A query for 'count of candidates with skill X in city Y' returns a number without revealing names; if location alone could re-identify in small cities, the view applies a minimum cohort size before exposing it.)_
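
The minimum cohort size mentioned in the second follow-up can be sketched as a suppression step on aggregated counts. This is an assumption-laden illustration: the `MIN_COHORT` threshold of 5, the `skill_city_counts` function, and the row shape are all hypothetical, and the input rows are assumed to come from the de-identified view (no names).

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

MIN_COHORT = 5  # hypothetical threshold; the real value is a policy decision


def skill_city_counts(
    rows: Iterable[dict], min_cohort: int = MIN_COHORT
) -> Dict[Tuple[str, str], int]:
    counts: Counter = Counter()
    for row in rows:
        # set() guards against a skill listed twice on one resume.
        for skill in set(row["skills"]):
            counts[(skill, row["city"])] += 1
    # Suppress any (skill, city) cell smaller than the minimum cohort.
    return {key: n for key, n in counts.items() if n >= min_cohort}


rows = [{"skills": ["python", "sql"], "city": "Berlin"} for _ in range(6)]
rows.append({"skills": ["python"], "city": "Smalltown"})
exposed = skill_city_counts(rows)
```

The single Smalltown candidate is dropped from the output entirely, so a count of one in a small city cannot be used to single out a person.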

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_bucket_full_of_resumes)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io).