# The Original Keeper

> Clean up duplicate events without losing the timeline.

Canonical URL: <https://datadriven.io/problems/the_original_keeper>

Domain: Python · Difficulty: easy · Seniority: L3

## Problem

Given a list of items (possibly with duplicates), return a new list containing only the first occurrence of each value, preserving input order.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests **order-preserving deduplication**, a fundamental ETL operation. It checks whether candidates use a set for O(1) lookups while maintaining the original insertion order in the output.

---

### Break down the requirements

#### Step 1: Track seen values in a set

Use a set for O(1) membership checks.

#### Step 2: Iterate in order, keeping first occurrences

For each element, if it has not been seen, add it to both the result list and the seen set.

#### Step 3: Skip duplicates

If the value is already in the seen set, skip it.

---

### The solution

**Set-tracked first-occurrence dedup**

```python
def deduplicate(items):
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result
```

> **Time and Space Complexity**
>
> **Time:** O(n) with a single pass.
> 
> **Space:** O(k) where k is the number of unique values.

> **Interviewers Watch For**
>
> Do you use `dict.fromkeys(items)` or a set? While `dict.fromkeys` is a clever one-liner that preserves order in Python 3.7+, interviewers want to see the explicit approach to confirm understanding.

> **Common Pitfall**
>
> Using `list(set(items))`. This deduplicates but destroys the original order since sets are unordered.

---

## Common follow-up questions

- What if elements are unhashable (e.g., lists)? _(Tests converting to tuples or using a different tracking strategy.)_
- What if you needed to count how many times each value was dropped? _(Tests adding a counter alongside the seen set.)_
- How would you handle this for a very large streaming dataset? _(Tests Bloom filters for approximate deduplication at scale.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_original_keeper)
- [Python Interview Questions](https://datadriven.io/python-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.