# The Genre Filter

> Three tables, two conditions, one actor's total.

Canonical URL: <https://datadriven.io/problems/the_genre_filter>

Domain: Python · Difficulty: medium · Seniority: L4

## Problem

Given movies (list of dicts with movie_id, genre, rating, box_office), actors (list of dicts with actor_id, name), mapping (list of dicts with actor_id and movie_id), a target genre, and a min_rating, return a dict mapping each actor's name to their total box_office summed across movies whose genre matches AND rating >= min_rating. Include only actors with at least one qualifying movie.

## Worked solution and explanation

### Why this problem exists in real interviews

This is the Pythonic version of a SQL join plus group by plus filter, and interviewers use it to see whether you build lookup dicts up front instead of nested O(n*m) loops. It is the canonical test of whether you think in indexes when the data arrives as plain lists of dicts.

---

### Break down the requirements

#### Step 1: Index the dimension tables

Build `movies_by_id = {m['movie_id']: m for m in movies}` and `actors_by_id = {a['actor_id']: a for a in actors}`. Both are O(n) one-time costs that turn every later lookup into O(1) and avoid the trap of scanning movies inside the actor loop.

#### Step 2: Filter eligible movies first

Pre-compute the set of `movie_id` values whose genre matches and rating clears the threshold. Storing both the id and the box_office in a small dict means you do not re-evaluate the predicate per mapping row.

#### Step 3: Walk the mapping table once and aggregate

Iterate `mapping`, skip rows whose `movie_id` is not in the eligible set, and add `box_office` to a `defaultdict(int)` keyed by actor name. Returning `dict(totals)` at the end gives a plain dict and naturally excludes actors with no qualifying movies.

---

### The solution

**Index, filter, aggregate**

```python
from collections import defaultdict

def actor_genre_box_office(movies: list[dict], actors: list[dict], mapping: list[dict], genre: str, min_rating: float) -> dict:
    actors_by_id = {a['actor_id']: a['name'] for a in actors}
    eligible = {
        m['movie_id']: m['box_office']
        for m in movies
        if m['genre'] == genre and m['rating'] >= min_rating
    }
    totals = defaultdict(int)
    for row in mapping:
        movie_id = row['movie_id']
        if movie_id not in eligible:
            continue
        actor_name = actors_by_id.get(row['actor_id'])
        if actor_name is None:
            continue
        totals[actor_name] += eligible[movie_id]
    return dict(totals)
```

> **Cost Analysis**
>
> Time is O(M + A + K) where M is movies, A is actors, and K is mapping rows. Space is O(M + A) for the lookup dicts plus O(R) for the result where R is the number of qualifying actors. Naive nested loops would be O(M * K) and time out on realistic inputs.

> **Interviewers Watch For**
>
> Whether you build lookup dicts before iterating, whether you filter movies once rather than per mapping row, and whether you correctly exclude actors with zero qualifying movies (a `defaultdict` plus `dict()` cast does this naturally).

> **Common Pitfall**
>
> Including every actor in the output with a default of 0 violates the 'at least one qualifying movie' rule. Only insert into the totals dict when you actually add a non zero contribution.

---

## Common follow-up questions

- How would you sort the output by total box_office descending? _(Wrap with `dict(sorted(totals.items(), key=lambda kv: -kv[1]))`. Discuss whether dict insertion order is preserved (it is, since Python 3.7).)_
- What if two actors have the same name but different ids? _(Key the totals by `actor_id` instead of name and only translate to name in the output projection. Otherwise you silently merge their revenues.)_
- How would you handle this in pandas? _(Three DataFrames, two `merge` calls, then `groupby('name')['box_office'].sum()`. Compare readability and memory cost against the dict approach.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_genre_filter)
- [Python Interview Questions](https://datadriven.io/python-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.