# The Conversion Story

> Signups are one thing. Paid subscriptions are another. Find the gap.

Canonical URL: <https://datadriven.io/problems/signup_to_subscription_rate>

Domain: SQL · Difficulty: medium · Seniority: L5

## Problem

Calculate the conversion rate from free signup to paid subscription, broken down by referral source. The growth team wants to decide where to increase ad spend next quarter. Round to 4 decimal places.

## Worked solution and explanation

### Why this problem exists in real interviews

Built around the `event_data` table, this challenge probes your ability to apply conditional aggregation via CASE in a subscription setting. Correctly referencing the `tags` column is essential to a working solution.

---

### Break down the requirements

#### Step 1: Filter to the target set

The `IN` list restricts the query to only the specified values, avoiding a full-table scan of irrelevant rows.

#### Step 2: Aggregate by `tags`

`GROUP BY tags` collapses rows to one per group. The aggregate functions (`SUM`, `COUNT`, `AVG`, etc.) compute the metric for each group.

#### Step 3: Deduplicate the result with DISTINCT

`SELECT DISTINCT` removes duplicate rows from the output. This is necessary when joins or subqueries can produce repeated combinations.

---

### The solution

**Case pivot for signup to subscription rate**

```sql
SELECT tags, COUNT(DISTINCT CASE WHEN event_type = 'signup' THEN user_id END) AS signup_count, COUNT(DISTINCT CASE WHEN event_type = 'subscription' THEN user_id END) AS subscription_count, ROUND(COUNT(DISTINCT CASE WHEN event_type = 'subscription' THEN user_id END) * 1.0 / COUNT(DISTINCT CASE WHEN event_type = 'signup' THEN user_id END), 4) AS conversion_rate
FROM event_data
WHERE event_type IN ('signup', 'subscription')
GROUP BY tags
```

> **Cost Analysis**
>
> With ~200M rows, the GROUP BY reduces the working set before any downstream operations. An index on the filter/join columns would reduce the scan to a seek.

> **Interviewers Watch For**
>
> Interviewers watch for whether you know when DISTINCT is needed and when it masks a logic error.

> **Common Pitfall**
>
> Using `COUNT(*)` instead of `COUNT(DISTINCT col)` counts duplicates. If the prompt says 'unique', you need DISTINCT inside the aggregate.

---

## Common follow-up questions

- If `tags` in `event_data` is NULL for some rows, how would your aggregation or join logic be affected? _(Probes understanding of NULL propagation through joins and aggregate functions on `event_data.tags`.)_
- `event_data.event_id` has roughly 200,000,000 distinct values. What index strategy would you use to avoid a full scan on `event_data`? _(Tests indexing knowledge specific to the high-cardinality `event_id` column in `event_data`.)_
- Your conditional CASE logic assumes the categories are exhaustive. What happens if a row in `event_data` falls into none of the branches? _(Tests awareness of the implicit ELSE NULL in CASE expressions.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/signup_to_subscription_rate)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.