# Duplicated User Event Messages

> Duplicated messages from the alerts topic.

Canonical URL: <https://datadriven.io/problems/duplicated_user_event_messages>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

We need a load-testing replay tool that doubles every row from the 'user-events' topic. For each row, produce two identical copies showing the message ID, topic, partition key, payload, produced timestamp, and consumer. The output should be ordered by message ID.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests event stream deduplication, a critical skill when working with Kafka or similar messaging systems. The interviewer checks whether you can identify the right grouping key for replay detection in `stream_msgs`.

---

### Break down the requirements

#### Step 1: Identify the duplicate key

For replay detection, `topic` and `part_key` and `payload` define a unique message. The `offset` naturally differs between duplicates.

#### Step 2: Group and filter

`GROUP BY topic, part_key, payload` with `HAVING COUNT(*) > 1` isolates replayed messages.

#### Step 3: Include the count

The duplicate count tells the team how many times each message was replayed.

---

### The solution

**Detect replayed stream messages via GROUP BY**

```sql
SELECT topic, part_key, payload, COUNT(*) AS replay_count
FROM stream_msgs
WHERE topic = 'user-events'
GROUP BY topic, part_key, payload
HAVING COUNT(*) > 1
ORDER BY replay_count DESC
```

> **Cost Analysis**
>
> Hash aggregation over 500M rows is expensive. Filtering to `topic = 'user-events'` first reduces the working set significantly. An index on `(topic, part_key, payload)` helps.

> **Interviewers Watch For**
>
> Strong candidates ask whether the offset column should be part of the duplicate key. Since offsets differ between replays, they should be excluded.

> **Common Pitfall**
>
> Including `offset` or `produced` in the GROUP BY makes every row unique. The duplicate key must only include columns that are identical across replays.

---

## Common follow-up questions

- How would you deduplicate and keep only the earliest offset per message? _(Tests ROW_NUMBER partitioning to select the first occurrence.)_
- What if payloads are semantically identical but differ in whitespace? _(Tests normalization before grouping.)_
- How would you prevent replay duplicates at the consumer level? _(Tests idempotent consumer design patterns.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/duplicated_user_event_messages)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.