# iOS Adoption by Age Bucket

> The install numbers don't match the hype.

Canonical URL: <https://datadriven.io/problems/ios_adoption_by_age_bucket>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

Compare the count of iOS users (running iOS 16, 17, or 18) to total users in each age bucket, but only buckets where at least one session exists. Return results by total users, highest first.

## Worked solution and explanation

### Why this problem exists in real interviews

This probes whether you can aggregate to the correct grain before filtering or ordering, and join semantics and their effect on result cardinality. It is a common at L4+ screen because it requires composing multiple SQL features correctly in a single pass, and small mistakes in logic produce silently wrong results.

> **Trick to Solving**
>
> Read the prompt carefully for implicit constraints. The phrase structure hints at the grain of the output: what each row represents.
> 
> 1. Identify the output grain from the prompt (one row per what?)
> 2. Work backward from the desired output columns
> 3. Build the query inside-out: innermost subquery first, then layer on filters and aggregates

---

### Break down the requirements

#### Step 1: Join tables with INNER JOIN

Connect `users` and `devices` and `user_sessions` on `user_id`, `device_id` to keep only matching rows.

#### Step 2: Aggregate with COUNT DISTINCT

Group by the output grain and apply `COUNT DISTINCT()` to compute the metric. The `GROUP BY` must match exactly what the output needs: one row per group key.

#### Step 3: Filter groups with HAVING

The `HAVING` clause filters after aggregation, unlike `WHERE` which filters before. This is necessary when the condition depends on an aggregate result.

#### Step 4: Order the final output

Apply `ORDER BY` as specified to produce the expected row sequence. When tied values exist, add a secondary sort column for determinism.

---

### The solution

**Conditional count with three-table join**

```sql
SELECT u.age_bucket,
    COUNT(DISTINCT CASE WHEN d.os_version IN ('16', '17', '18') THEN u.user_id END) AS ios_users,
    COUNT(DISTINCT u.user_id) AS total_users
FROM users u
JOIN user_sessions s ON u.user_id = s.user_id
JOIN devices d ON s.device_id = d.device_id
GROUP BY u.age_bucket
HAVING COUNT(s.session_id) >= 1
ORDER BY total_users DESC
```

> **Cost Analysis**
>
> The join touches `users` (12M rows) and `devices` (8M rows) and `user_sessions` (50M rows). `user_sessions` is partitioned by `session_start`, which the optimizer can exploit with a partition filter. The aggregation reduces the row count before any downstream processing, which is the key performance lever.

> **Interviewers Watch For**
>
> Interviewers expect you to articulate why you chose a specific join type and what happens to unmatched rows. Naming the output grain ("one row per X") before writing the GROUP BY shows you think about data shape, not just syntax.

> **Common Pitfall**
>
> Forgetting that a JOIN can multiply rows when the relationship is one-to-many. Always check whether the join key is unique on at least one side.

---

## Common follow-up questions

- What if the join key is not unique on one side? _(Tests anticipation of row multiplication from one-to-many joins.)_
- How would the query change if data volume increased 10x? _(Tests consideration of index strategy, partitioning, or materialized views.)_
- What if you needed to exclude outliers before aggregating? _(Tests knowledge of percentile cutoffs or z-score filtering.)_
- How would you validate that this query is correct on a new dataset? _(Tests approach to verification: spot checks, known aggregates, or row-count sanity.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/ios_adoption_by_age_bucket)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.