# The Resume Sifter

> Pull what's useful. Skip what you know.

Canonical URL: <https://datadriven.io/problems/the_resume_sifter>

Domain: Python · Difficulty: medium · Seniority: L3

## Problem

Given a list of URLs in the format `https://resumes.io/firstname_lastname_id` (or `firstname_id` for single-name variants) and a set `existing_ids` of known ID strings, return a list of `[name_prefix, id]` pairs for the URLs whose id is NOT in `existing_ids`, preserving input order. The `name_prefix` is everything before the last underscore; the `id` is everything after it.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests **URL parsing, string extraction, and set-based filtering**, a practical combination in ETL workflows. Interviewers check whether candidates can extract structured data from URLs and filter against a known set.

---

### Break down the requirements

#### Step 1: Parse the URL to extract the path segment

Strip any trailing slash, then split on `/` and take the last path segment, which holds `firstname_lastname_id` (or `firstname_id` for single-name variants).
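
A minimal sketch of this step, assuming plain URLs with no query string or fragment. The helper name `last_segment` is hypothetical and used only for illustration:

```python
def last_segment(url):
    # Hypothetical helper: drop a trailing slash if present,
    # then keep the text after the final '/'.
    return url.rstrip('/').split('/')[-1]


print(last_segment('https://resumes.io/jane_doe_123'))   # jane_doe_123
print(last_segment('https://resumes.io/jane_doe_123/'))  # jane_doe_123
```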

#### Step 2: Split the segment into name parts and ID

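The segment is `firstname_lastname_id` or, for single-name variants, `firstname_id`. Split once on the last `_`: everything before it is the name prefix, and everything after it is the ID.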
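
One way to separate the ID from the name is to split from the right, which works whether the name has one part or two. This is a sketch; `split_segment` is a hypothetical helper name, and it assumes every segment contains at least one underscore:

```python
def split_segment(segment):
    # rsplit with maxsplit=1 cuts only at the LAST underscore, so the
    # name prefix keeps any underscores of its own.
    name_prefix, candidate_id = segment.rsplit('_', 1)
    return [name_prefix, candidate_id]


print(split_segment('jane_doe_123'))  # ['jane_doe', '123']
print(split_segment('cher_77'))       # ['cher', '77']
```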

#### Step 3: Filter out known IDs

Check each extracted ID against the provided set of known IDs. Only keep new candidates.
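
The filter itself can be sketched as a list comprehension over already-parsed pairs. The name `keep_new` is hypothetical, and the sketch assumes each pair is `[name_prefix, id]` with the ID stored as a string:

```python
def keep_new(pairs, existing_ids):
    # A list comprehension preserves input order; `in` on a set is an
    # average-case O(1) hash lookup.
    return [pair for pair in pairs if pair[1] not in existing_ids]


parsed = [['jane_doe', '123'], ['cher', '77'], ['sam_lee', '5']]
print(keep_new(parsed, {'123'}))  # [['cher', '77'], ['sam_lee', '5']]
```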

---

### The solution

**URL parsing with set-based dedup filtering**

```python
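def sift_resumes(urls, known_ids):
    """Return [name_prefix, id] pairs for URLs whose id is not in known_ids."""
    result = []
    for url in urls:
        # Drop any trailing slash, then take the last path segment.
        segment = url.rstrip('/').split('/')[-1]
        # Split on the LAST underscore only, so both 'firstname_id' and
        # 'firstname_lastname_id' yield a name prefix and an ID.
        name_prefix, candidate_id = segment.rsplit('_', 1)
        # known_ids is the problem's existing_ids set; membership is O(1).
        if candidate_id not in known_ids:
            result.append([name_prefix, candidate_id])
    return result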
```

> **Time and Space Complexity**
>
> **Time:** O(n) where n is the number of URLs, treating the length of each URL as bounded. Each set membership check is O(1) on average.
> 
> **Space:** O(k) where k is the number of new candidates.

> **Interviewers Watch For**
>
> Do you use set membership (`in known_ids`) rather than list membership? This is the O(1) vs O(m) distinction that interviewers specifically look for.
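
If the known IDs might arrive as a list, one defensive option is to convert them to a set once, before the loop. The sketch below assumes that situation; `filter_new_ids` is a hypothetical name, not part of the problem statement:

```python
def filter_new_ids(candidate_ids, known_ids):
    # Build the set once, outside the loop: O(m) up front, then each
    # lookup is O(1) on average instead of an O(m) scan of a list.
    known = set(known_ids)
    return [cid for cid in candidate_ids if cid not in known]


print(filter_new_ids(['1', '2', '3'], ['2']))  # ['1', '3']
```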

> **Common Pitfall**
>
> Assuming the URL always ends without a trailing slash. Using `rstrip('/')` before splitting handles both cases cleanly.

---

## Common follow-up questions

- What if names could have multiple underscores (e.g., firstname_middle_lastname_id)? _(Tests using `rsplit('_', 1)` to split from the right for the ID.)_
- What if URLs had query parameters? _(Tests using `urllib.parse.urlparse` for robust URL parsing.)_
- How would you also deduplicate across the incoming batch? _(Tests adding extracted IDs to a running seen set.)_
- What if the known_ids set had millions of entries? _(Tests that sets handle this efficiently in Python with O(1) lookups.)_
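
Two of these follow-ups, query parameters and in-batch deduplication, can be combined in one sketch. The function name `sift_resumes_robust` is hypothetical, and skipping segments that have no underscore is an assumption, since the problem does not say how to treat malformed URLs:

```python
from urllib.parse import urlparse


def sift_resumes_robust(urls, known_ids):
    # Copy the known IDs so the caller's set is not mutated.
    seen = set(known_ids)
    result = []
    for url in urls:
        # urlparse separates the path from any '?query' or '#fragment'.
        path = urlparse(url).path
        segment = path.rstrip('/').rsplit('/', 1)[-1]
        name_prefix, sep, candidate_id = segment.rpartition('_')
        if not sep:
            continue  # assumption: ignore segments without an underscore
        if candidate_id in seen:
            continue  # already known, or already emitted in this batch
        seen.add(candidate_id)
        result.append([name_prefix, candidate_id])
    return result


urls = [
    'https://resumes.io/jane_doe_1?ref=x',
    'https://resumes.io/jane_doe_1/',
    'https://resumes.io/ann_marie_lee_2#top',
    'https://resumes.io/bob_3',
]
print(sift_resumes_robust(urls, {'3'}))
# [['jane_doe', '1'], ['ann_marie_lee', '2']]
```

Adding each emitted ID to `seen` means a repeated URL later in the same batch is dropped without a second pass over the input.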

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.