# Running Distinct Count

> New values keep appearing. Track the count.

Canonical URL: <https://datadriven.io/problems/running_distinct_count>

Domain: Python · Difficulty: medium · Seniority: L3

## Problem

Given a list, return a list of the same length where position i is the count of distinct values among values[0..i] inclusive.

## Worked solution and explanation

### Why this problem exists in real interviews

Tracking the distinct count over a growing window tests **set accumulation** alongside list construction. It mirrors HyperLogLog-style cardinality estimation in data pipelines.

---

### Break down the requirements

#### Step 1: Maintain a set of seen values

Add each element to the set as you scan left to right.

#### Step 2: Record the set size at each position

After adding each element, the set size is the distinct count up to that index.

---

### The solution

**Incremental set tracking for running distinct counts**

```python
def running_distinct(values):
    seen = set()
    result = []
    for val in values:
        seen.add(val)
        result.append(len(seen))
    return result
```

> **Time and Space Complexity**
>
> **Time:** O(n) where n is the list length. Each set add and len call is O(1) average.
> 
> **Space:** O(n) for the set (worst case all unique) and the result list.

> **Interviewers Watch For**
>
> Using a set for O(1) add and cardinality tracking. Candidates who use a list and check `if val not in seen_list` get O(n^2) total time.

> **Common Pitfall**
>
> Using a list instead of a set for tracking seen values. The `in` operator on a list is O(n), making the overall approach O(n^2).

---

## Common follow-up questions

- How would you handle this for a stream too large to fit in memory? _(Tests knowledge of HyperLogLog or similar probabilistic cardinality estimators.)_
- What if you needed a sliding window distinct count? _(Tests maintaining a Counter dict and decrementing/removing when elements leave the window.)_
- How would you compute this in SQL? _(Tests `COUNT(DISTINCT col) OVER (ORDER BY ...)` which may not be supported in all databases.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/running_distinct_count)
- [Python Interview Questions](https://datadriven.io/python-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.