# Top Identified Event Types

> The top event types by volume, but only the ones where identified events dominate.

Canonical URL: <https://datadriven.io/problems/top_identified_event_types>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

Find the top 3 event types by total records, but only those where identified (non-null user_id) events outnumber anonymous (null user_id) events. Rank by total events, most frequent first. Return the event type and total event count.

## Worked solution and explanation

### Why this problem exists in real interviews

This problem uses the `event_data` table and asks for the top rows after aggregation. It tests whether you can aggregate per group, filter the groups on a conditional count (in `HAVING`, or in a CTE or subquery that aggregates first), and then rank what remains.

---

### Break down the requirements

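#### Step 1: Aggregate per event_type

`GROUP BY event_type` produces one summary row per event type from the `event_data` table. Within each group, `COUNT(*)` gives the total, and conditional counts split it into identified (non-null `user_id`) and anonymous (null `user_id`) events.

#### Step 2: Filter groups with HAVING

Use a `HAVING` clause to keep only the event types whose identified count exceeds their anonymous count. A `WHERE` clause cannot express this: it runs before aggregation, and filtering out null `user_id` rows there would also remove them from the total.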

#### Step 3: Rank the results

`ORDER BY` the total event count descending, with `LIMIT 3` to surface the top three event types.
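
The requirements can also be read as "aggregate first, then filter and rank", which maps naturally onto a CTE. The sketch below is one possible shape, not the reference answer. It assumes `event_data` exposes `event_type` and `user_id` columns, as the prompt implies, and the names `per_type`, `total_events`, `identified_events`, and `anonymous_events` are illustrative.

```sql
-- Sketch: aggregate once in a CTE, then filter and rank in the outer query.
-- per_type and the *_events aliases are hypothetical names for illustration.
WITH per_type AS (
    SELECT
        event_type,
        COUNT(*)                  AS total_events,       -- every row in the group
        COUNT(user_id)            AS identified_events,  -- rows with non-null user_id
        COUNT(*) - COUNT(user_id) AS anonymous_events    -- rows with null user_id
    FROM event_data
    GROUP BY event_type
)
SELECT event_type, total_events
FROM per_type
WHERE identified_events > anonymous_events  -- plain WHERE works: counts are already computed
ORDER BY total_events DESC
LIMIT 3;
```

Because the CTE has already collapsed the table to one row per event type, the outer `WHERE` plays the role that `HAVING` plays in a single-level query.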

---

### The solution

**Conditional counts of identified vs anonymous events per type with HAVING filter**

```sql
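SELECT
    event_type,
    COUNT(*) AS total_events  -- identified + anonymous
FROM event_data
GROUP BY event_type
-- Keep only types where identified events outnumber anonymous ones
HAVING SUM(CASE WHEN user_id IS NOT NULL THEN 1 ELSE 0 END)
     > SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END)
ORDER BY total_events DESC
LIMIT 3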
```

> **Cost Analysis**
>
> The GROUP BY reduces the 200M-row `event_data` table to one row per distinct `event_type`. A covering index on `(event_type, user_id)` can enable an index-only aggregate scan, since the query reads no other columns.
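>
> For a query that groups by `event_type` and only inspects whether `user_id` is null, an index covering those two columns might look like the sketch below. The index name is hypothetical, and whether Postgres actually picks an index-only scan depends on the planner's cost estimates and on the table's visibility map being mostly up to date (for example, after a recent `VACUUM`).
>
> ```sql
> -- Hypothetical index name; the grouping key comes first,
> -- followed by the only other column the query inspects.
> CREATE INDEX idx_event_data_type_user
>     ON event_data (event_type, user_id);
> ```
>
> Postgres B-tree indexes store null keys, so anonymous rows are present in the index as well.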

> **Interviewers Watch For**
>
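> Interviewers verify that you aggregate before filtering and sorting. Sorting raw rows gives per-row values, not group totals, and the identified-versus-anonymous comparison is a group-level condition that belongs in `HAVING`. The correct grain is one row per `event_type`.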

> **Common Pitfall**
>
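> Using the wrong aggregate function. `SUM` gives totals, `COUNT` gives volume, `AVG` gives rates. This prompt asks for volume: `COUNT(*)` counts every row in the group, while `COUNT(user_id)` skips rows where `user_id` is null, so using it for the total silently drops anonymous events.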
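>
> To make the difference between these aggregates concrete for this problem, here is a small self-contained check. The `sample` rows are made up for illustration and stand in for a single event type.
>
> ```sql
> -- Three hypothetical rows for one event type: two identified, one anonymous.
> WITH sample (event_type, user_id) AS (
>     VALUES
>         ('click', 1),
>         ('click', NULL),
>         ('click', 2)
> )
> SELECT
>     COUNT(*)                                         AS all_rows,        -- counts nulls too
>     COUNT(user_id)                                   AS identified_rows, -- skips nulls
>     SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) AS anonymous_rows
> FROM sample;
> -- all_rows = 3, identified_rows = 2, anonymous_rows = 1
> ```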

---

## Common follow-up questions

- How do you express 'identified outnumber anonymous' in a HAVING clause using conditional counts? _(Tests HAVING SUM(CASE WHEN user_id IS NOT NULL THEN 1 ELSE 0 END) > SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END).)_
- If an event_type has no anonymous events at all, does it qualify? _(Tests edge case; zero anonymous events means identified count > 0, which satisfies the condition.)_
- The prompt asks for top 3 by total events. Is 'total' the sum of both identified and anonymous? _(Tests reading comprehension; total events = COUNT(*) regardless of user_id presence.)_
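
The edge cases raised in these questions can be explored with a few inline rows. This is only a sketch: the `sample` data and the `qualifies` column are invented for illustration, and the query reports the counts for every event type rather than filtering, so the effect of the strict `>` comparison is visible.

```sql
-- Hypothetical data covering: a clear pass, no anonymous events,
-- anonymous majority, and an exact tie.
WITH sample (event_type, user_id) AS (
    VALUES
        ('page_view', 10), ('page_view', 11), ('page_view', 12), ('page_view', NULL),
        ('signup', 10), ('signup', 11),
        ('click', 10), ('click', NULL), ('click', NULL),
        ('scroll', 10), ('scroll', NULL)
)
SELECT
    event_type,
    COUNT(*) AS total_events,
    SUM(CASE WHEN user_id IS NOT NULL THEN 1 ELSE 0 END) AS identified_events,
    SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END)     AS anonymous_events,
    SUM(CASE WHEN user_id IS NOT NULL THEN 1 ELSE 0 END)
        > SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) AS qualifies
FROM sample
GROUP BY event_type
ORDER BY total_events DESC, event_type;  -- event_type breaks ties deterministically
-- page_view | 4 | 3 | 1 | true
-- click     | 3 | 1 | 2 | false
-- scroll    | 2 | 1 | 1 | false   (a tie does not "outnumber")
-- signup    | 2 | 2 | 0 | true    (no anonymous events still qualifies)
```

With this data only `page_view` and `signup` would survive the `HAVING` filter, so a `LIMIT 3` query returns two rows rather than three.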

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/top_identified_event_types)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.