# Content Engagement Data Model

> Post published. Now measure everything that happens next.

Canonical URL: <https://datadriven.io/problems/content_engagement_data_model>

Domain: Data Modeling · Difficulty: hard · Seniority: L5

## Problem

We run a large social content platform. Creators publish posts (text, images, video). Users engage through views, reactions, comments, and shares. The product team needs a data model to power dashboards for content virality, creator performance, and feed ranking signals. Data visualization is also required. Sketch how a virality chart would query this model.

## Worked solution and explanation

### Why this problem exists in real interviews

This probes whether a candidate understands **atomic versus pre-aggregated fact grain** and can defend when each one is appropriate. Billion-event-per-day dashboards cannot scan the raw event table every load; a candidate who only models events at the atomic grain is missing half the design.

> **Trick to Solving**
>
> Before drawing tables, a strong candidate asks: what latency can the dashboard tolerate, and how many events per hour per post do we see? The signal here is that a second fact table, pre-aggregated at the hour grain, is the cheap path to fast dashboards.
> 
> 1. Model events at the atomic grain for drilldowns
> 2. Add an hourly rollup fact for virality dashboards
> 3. Make comments self-referential for threading
> 4. Use `shares` as a distinct fact because they create new posts

---

### Break down the requirements

#### Step 1: Declare the atomic grain

`engagement_events` is one row per user action on a post: view, reaction, comment, save. This is the drilldown layer for ad hoc analysis and anomaly investigation.

#### Step 2: Pre-aggregate for dashboards

`post_engagement_hourly` is one row per post per hour per event type. Dashboards scan megabytes, not terabytes. The rollup is rebuilt incrementally each hour.

#### Step 3: Self-reference comments

A comment can reply to another comment. Store `parent_comment_id` on `engagement_events` for comment threads so threading is a simple recursive query rather than a bridge table.

#### Step 4: Separate shares from engagement

A share creates a new post referencing the original. Modeling it as a row in `shares` preserves attribution and lets virality queries walk the share graph without filtering events.

#### Step 5: Keep users and posts as conformed dimensions

Both fact tables share `users` and `posts`. Creator performance rolls up through `posts.creator_id`, feed ranking signals read from the same tables.

---

### The solution

Below is one defensible design: an atomic event fact, an hourly rollup fact, and a shares fact that captures virality propagation.

> **Why this works**
>
> The rollup fact carries dashboards, and the atomic fact carries investigation. The trade-off is explicit duplication: the hourly table is a derived summary that must stay in sync. That is a standard price for sub-second virality queries on billion-event volumes.

> **Interviewers watch for**
>
> A strong candidate defends the rollup with a specific latency target and a query volume estimate. They also name the rebuild cadence and whether backfills are full or incremental. Weak candidates model only the atomic event table and then cannot answer how a dashboard loads in under a second.

> **Common pitfall**
>
> Storing `reaction_count` as a denormalized column on `posts` and incrementing it from the application layer. Every read becomes cheap, but retroactive corrections are nearly impossible, and the count drifts from the event log within days.

---

### The analysis pattern

**Top creators by 24-hour virality**

```sql
SELECT
    u.handle,
    SUM(h.views) AS views_24h,
    SUM(h.reactions) AS reactions_24h,
    SUM(h.shares) AS shares_24h,
    SUM(h.shares)::NUMERIC / NULLIF(SUM(h.views), 0) AS share_rate
FROM post_engagement_hourly h
JOIN posts p ON p.post_id = h.post_id
JOIN users u ON u.user_id = p.creator_id
WHERE h.hour_bucket >= NOW() - INTERVAL '24 hours'
GROUP BY u.handle
ORDER BY shares_24h DESC
LIMIT 50
```

---

### Trade-offs and alternatives

**Atomic plus hourly rollup**

Fast dashboards, cheap drilldown, explicit rebuild job. Cost: two tables to backfill after a correction and storage for the rollup.

**Atomic only with materialized views**

Single source of truth. Cost: materialized view refresh at this volume becomes a large operational surface, and ad hoc query cost is higher because views cover only anticipated aggregations.

---

## Common follow-up questions

- How would you backfill `post_engagement_hourly` after a corrupted upstream pipeline? _(Tests whether the rollup is reproducible from the atomic fact.)_
- How do you rank feed signals across posts of different ages without penalizing older posts? _(Tests whether the candidate introduces a decay factor on the hourly rollup.)_
- What if a reaction is toggled on and off within one hour? _(Tests whether the event fact is idempotent or stores net state.)_
- How would you handle deletion of a post to keep the rollups consistent? _(Tests tombstone propagation into the hourly rollup.)_
- At 50B engagements per day, how do you partition `engagement_events`? _(Tests partitioning by date and possibly by post_id hash for parallel reads.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/content_engagement_data_model)
- [Data Modeling Interview Questions](https://datadriven.io/data-modeling-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.