# Content Search and Discovery Schema

> Searchable from every angle. Design it so nothing gets lost.

Canonical URL: <https://datadriven.io/problems/content_search_and_discovery_schema>

Domain: Data Modeling · Difficulty: hard · Seniority: L4

## Problem

We run a content platform where users can search for movies by title, by the actors who appeared in them, by the director, or by any other person who worked on the production. Design the data model to support multi-attribute search, and describe how you would build the architecture to support it.

## Worked solution and explanation

### Why this problem exists in real interviews

This probes whether a candidate can model a **many-to-many with role attributes** and understand that **multi-attribute search at scale lives outside SQL**. Persons appear on many titles in different roles, and a candidate who collapses role into a scalar attribute on the junction loses the entire query plan.

> **Trick to Solving**
>
> Before drawing tables, a strong candidate asks: how is the same actor represented when they play multiple roles on the same title, and where does fuzzy search run? The signal is that role lives on the junction row, and the search index is a downstream projection.
> 
> 1. Model titles and persons as independent dimensions
> 2. Put role on the junction table, not on the person
> 3. Represent genres through a separate junction
> 4. Treat the search engine as a materialized projection

---

### Break down the requirements

#### Step 1: Separate titles and persons

Both are first-class entities. A person has a stable identity independent of their filmography, and a title has a stable identity independent of its cast. Two dimensions, one junction.

#### Step 2: Model role on the junction

`person_title_roles` has `(person_id, title_id, role)` as its natural key. Supporting multiple roles per person per title (writer and director, for example) requires role to be part of the key, not a scalar attribute on the person.

#### Step 3: Separate genres into their own junction

Genres are also many-to-many with titles. Conform genres as their own small dimension with a `title_genres` junction rather than stuffing them into an array column on `titles`.

#### Step 4: Push search to a dedicated engine

Full-text and multi-attribute search at scale is not a SQL `LIKE` problem. The relational model is the source of truth; a search index (Elasticsearch, OpenSearch) is a denormalized projection rebuilt from change streams on these tables.

---

### The solution

Below is one defensible design: two dimensions, two junctions, and a note that the search index is a projection of this relational model.

> **Why this works**
>
> Role as a junction attribute supports the 'who directed and also wrote' case without duplicating a person row. The trade-off is that search queries still fan out across four tables, which is why a secondary search index is part of the architecture rather than an afterthought.

> **Interviewers watch for**
>
> Strong candidates mention the search engine as a projection in the first minute, not at the end. They also propose a change data capture pipeline from the OLTP tables into the index. Weak candidates try to solve multi-attribute search with SQL `LIKE` and trigram indexes on ten million titles.

> **Common pitfall**
>
> Adding a `role` column to `persons` as a scalar. An actor who later directs becomes two different person rows, which breaks every credit query and doubles their appearance in search results.

---

### The analysis pattern

**Titles matching an actor search**

```sql
SELECT
    t.name,
    t.release_year,
    STRING_AGG(DISTINCT g.genre_name, ', ') AS genres,
    STRING_AGG(DISTINCT ptr.role, ', ') AS roles
FROM titles t
JOIN person_title_roles ptr ON ptr.title_id = t.title_id
JOIN persons p ON p.person_id = ptr.person_id
LEFT JOIN title_genres tg ON tg.title_id = t.title_id
LEFT JOIN genres g ON g.genre_id = tg.genre_id
WHERE p.full_name = 'Tilda Swinton'
GROUP BY t.name, t.release_year
ORDER BY t.release_year DESC
```

---

### Trade-offs and alternatives

**Relational with search index projection**

Strong relational integrity, clean updates, search index is a materialized view. Cost: two systems to keep in sync via CDC.

**Document store as source of truth**

Denormalized document per title with embedded cast and genres. Fast single-document reads. Cost: updating one actor across every title they ever appeared in is a fan-out write, and referential integrity is the application's problem.

---

## Common follow-up questions

- How would you support searching by alias when an actor uses multiple stage names? _(Tests whether the candidate adds a person_aliases table.)_
- How would the model handle a TV series with per-episode guest appearances? _(Tests adding an episode entity between title and person_title_roles.)_
- How do you propagate a corrected release year from `titles` into the search index? _(Tests CDC and search index rebuild strategy.)_
- At 100M titles and 500M credits, how do you partition `person_title_roles`? _(Tests sharding strategies on high-cardinality many-to-many.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/content_search_and_discovery_schema)
- [Data Modeling Interview Questions](https://datadriven.io/data-modeling-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.