# Livestream Analytics Schema

> Someone goes live, thousands tune in, chat explodes, and virtual gifts start flying.

Canonical URL: <https://datadriven.io/problems/livestream_analytics_schema>

Domain: Data Modeling · Difficulty: medium · Seniority: L5

## Problem

We're building the analytics backend for a livestream platform. Creators go live, viewers watch and interact through chat and gifts. We need to track everything for creator payouts, content recommendations, and engagement analytics. Can you design the data model?

## Worked solution and explanation

### Why this problem exists in real interviews

This probes whether a candidate can resist the urge to build one giant `events` table and instead model **different event grains as different facts**. Viewer sessions, gifts, and chat all arrive on the same stream but have different volumes, retention rules, and analytic uses.

> **Trick to Solving**
>
> When a prompt mentions several event types with wildly different cardinality and one is monetary, the trick is to notice gift events must snapshot dollar amount at purchase time. Before drawing tables, a strong candidate asks: does chat volume dwarf gift volume, and is payout accuracy audited?
> 
> 1. Split viewer_sessions, gifts, and chat into separate fact tables
> 2. Snapshot gift amount onto the gift event row
> 3. Model viewer_sessions at per-visit grain, not per-stream grain
> 4. Streams and creators stay as dimensions

---

### Break down the requirements

#### Step 1: Separate fact tables by grain

`viewer_sessions`, `gifts`, and `chat_messages` each get their own fact. Chat is 100x the volume of gifts; overloading crushes partition pruning.

#### Step 2: Model viewer sessions at per-visit grain

A viewer may join and leave a stream five times. Each visit is a row with `join_time` and `leave_time`. Per-stream aggregation is a GROUP BY on top.

#### Step 3: Snapshot gift amounts

`gifts.amount_usd` is captured at purchase time. If the gift catalog repriced tomorrow, payout math would not change yesterday’s numbers.

#### Step 4: Keep creators and categories as dimensions

`creators` and `stream_categories` rarely change per row of fact and are shared across every event type, so they stay conformed dimensions.

---

### The solution

Below is one defensible model. The anchor is per-visit grain for viewer sessions and snapshotted gift amounts, which keeps both engagement and payouts correct under change.

> **Why this works**
>
> Three fact tables sharing the same dimensions honor Kimball’s grain rule: a fact is atomic if and only if every measure makes sense at that row. Sessions, gifts, and chat do not share a grain, so they do not share a table.

> **Interviewers watch for**
>
> A strong candidate immediately says “viewer sessions are per visit, not per stream” and mentions snapshotting the gift amount because payouts are legally binding. They also ask about retention: chat may live 30 days, gifts forever.

> **Common pitfall**
>
> A single `events` table with `event_type` = 'session' or 'gift' or 'chat'. The sparse columns, mixed cardinality, and mutable gift amount all combine into a maintenance nightmare. Creator payouts in particular become uninvestigable.

---

### The analysis pattern

**Creator payout by stream**

```sql
SELECT
    c.handle AS creator,
    s.stream_id,
    SUM(g.amount_usd) AS gross_gifts_usd,
    COUNT(DISTINCT vs.viewer_id) AS unique_viewers
FROM streams s
JOIN creators c ON c.creator_id = s.creator_id
LEFT JOIN gifts g ON g.stream_id = s.stream_id
LEFT JOIN viewer_sessions vs ON vs.stream_id = s.stream_id
WHERE s.started_at >= NOW() - INTERVAL '7 days'
GROUP BY c.handle, s.stream_id
ORDER BY gross_gifts_usd DESC
```

---

### Trade-offs and alternatives

**Fact per event type**

Separate viewer_sessions, gifts, chat_messages.

* Each fact has a clean grain and its own retention
* Partition pruning works per workload
* Cross-event queries require UNION ALL

**Unified events table**

One events table with event_type enum.

* Simpler ingest
* Null-heavy rows and coarse retention
* Payout queries scan chat volume for no reason

---

## Common follow-up questions

- A creator disputes a payout. How does the schema support forensic audit? _(Tests whether gift events are immutable and whether amount_usd is snapshotted.)_
- How do you compute average concurrent viewers over a stream’s duration? _(Tests whether viewer_sessions supports range overlap aggregations.)_
- Chat retention is 30 days by law but gifts must be kept seven years. How is that enforced? _(Tests whether per-fact retention policies are expressible and whether the candidate reaches for partitioning.)_
- How would the schema support replay for content moderation review? _(Tests whether chat_messages and streams have enough temporal fidelity to reconstruct a window.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/livestream_analytics_schema)
- [Data Modeling Interview Questions](https://datadriven.io/data-modeling-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.