# Social Platform Data Model

> Follows, likes, replies to replies. It never stops.

Canonical URL: <https://datadriven.io/problems/social_platform_data_model>

Domain: Data Modeling · Difficulty: medium · Seniority: L5

## Problem

We're building an analytics warehouse for a platform like Instagram. Users create posts with photos or videos, follow other users, and engage through likes and comments. The growth team wants to measure content virality and the product team needs engagement metrics per content type. Design the data model.

## Worked solution and explanation

### Why this problem exists in real interviews

Social graphs are a probe for two concepts at once: self-referential relationships and many-to-many modeling via junction tables. The `follows` table is a user-to-user many-to-many, and `posts.parent_post_id` is a self-reference for threading. Interviewers watch for whether the candidate reaches for composite primary keys on the junctions rather than surrogate IDs.

> **Trick to Solving**
>
> The tell is "users follow other users". Before drawing any tables, a strong candidate asks: can a user follow the same user twice? The answer is no, which means the junction table needs a composite primary key on (`follower_id`, `followed_id`) and a CHECK constraint to prevent self-follows.
> 
> 1. Spot the self-referential many-to-many
> 2. Use composite PKs on follows and likes
> 3. Add a parent_post_id self-reference on posts for threading
> 4. Keep comments as a distinct entity, not a subtype of posts

---

### Break down the requirements

#### Step 1: Identify the core entities

Users, posts, and comments are entities with their own identity. Follows and likes are relationship tables, not entities, and live as junctions.

#### Step 2: Model follows as a junction with composite PK

`(follower_id, followed_id)` is the primary key. This naturally prevents duplicate follows without a secondary unique constraint.

#### Step 3: Handle likes the same way

`(user_id, post_id)` is the PK on `likes`. A user either likes a post or does not; storing multiple likes is a bug that the PK catches at insert time.

#### Step 4: Self-reference posts for threading

`posts.parent_post_id` points at another post. This supports replies and quoted posts without a separate threads table.

#### Step 5: Keep comments separate from posts

Comments have different moderation rules, different cardinality, and different query patterns (by post, not by feed). Making them a distinct entity keeps both simple.

---

### The solution

Below is one defensible model. The composite primary keys on the junction tables are the anchor; they make duplicate prevention a schema-level guarantee rather than an application concern.

> **Why This Design Works**
>
> Composite primary keys on junction tables turn a business rule (a user cannot follow the same person twice) into a database invariant. They also give you a natural clustered index for the common query patterns ("who does this user follow" and "who follows this user"). The cost is slightly bulkier indexes than a surrogate key would give.

> **Interviewers Watch For**
>
> Strong candidates call out the composite PK immediately and add a CHECK constraint to prevent self-follows. They also distinguish posts from comments based on different lifecycles. Weaker candidates add surrogate IDs to follows and miss the duplicate-prevention invariant.

> **Common Pitfall**
>
> Adding a `follow_id` surrogate to the follows table. This makes duplicate-prevention a matter of remembering to add a unique constraint, and most implementations forget.

---

### The analysis pattern

**Top creators by follower growth this week**

```sql
SELECT
    u.handle,
    COUNT(*) FILTER (WHERE f.followed_at >= NOW() - INTERVAL '7 days') AS new_followers,
    COUNT(*) AS total_followers,
    COUNT(DISTINCT p.post_id) AS posts_this_week
FROM users u
JOIN follows f ON f.followed_id = u.user_id
LEFT JOIN posts p
    ON p.user_id = u.user_id
   AND p.created_at >= NOW() - INTERVAL '7 days'
GROUP BY u.handle
HAVING COUNT(*) FILTER (WHERE f.followed_at >= NOW() - INTERVAL '7 days') > 0
ORDER BY new_followers DESC
LIMIT 50
```

---

### Trade-offs and alternatives

**Junction tables with composite PKs**

Schema-level duplicate prevention, natural clustered indexes, simple queries. Cost: slightly bulkier indexes than surrogate-keyed alternatives and more effort if a relationship needs to carry many attributes.

**Graph database backing store**

Optimized traversal, native mutual-friend and recommendation queries. Cost: second storage system to operate, duplicate analytics layer, and most BI tools do not speak Cypher or Gremlin.

---

## Common follow-up questions

- A celebrity has 100M followers. Does the follows table partitioning strategy hold up? _(Tests partitioning by followed_id to keep hot-user fanout queries tractable.)_
- Users can mute other users without unfollowing. Where does that relationship live? _(Tests whether the candidate adds a new junction table rather than widening follows.)_
- A deleted user's posts must disappear for other users but remain for moderators. How? _(Tests soft delete on posts and whether the query layer filters on a tombstone column.)_
- Comments threads go 50 levels deep. Does the self-reference still work? _(Tests recursive CTE feasibility versus a materialized path or ltree alternative.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/social_platform_data_model)
- [Data Modeling Interview Questions](https://datadriven.io/data-modeling-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.