# Signups by Age Bucket Since April

> Recent signups by age.

Canonical URL: <https://datadriven.io/problems/signups_by_age_bucket_since_april>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

The marketing team is profiling the spring signup wave (April 1, 2026 onward) by age demographics. Show each age bucket alongside its signup count, largest groups first.

## Worked solution and explanation

### Why this problem exists in real interviews

Drawn from a marketing analytics domain, this question centers on grouped `COUNT` aggregation over the `users` table. The tricky part is filtering on `signup_date` before grouping by `age_bucket`, then returning exactly two columns sorted by count in descending order.

---

### Break down the requirements

#### Step 1: Apply the range filter

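The `WHERE` clause keeps only rows with `signup_date >= '2026-04-01'`. The lower bound is inclusive and there is no upper bound, because the prompt says "April 1, 2026 onward". Filtering before aggregation reduces the number of rows that reach the `GROUP BY`.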
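As a quick sanity check on the filter alone, the sketch below counts the rows that survive it before any grouping. It assumes `signup_date` is a `DATE` or a `TIMESTAMP` without time zone (the schema is not shown in the problem); the alias `spring_signups` is illustrative only.

```sql
-- Rows that pass the range filter, before any grouping.
-- Assumption: signup_date is DATE or TIMESTAMP (without time zone).
-- For a TIMESTAMP column the literal is read as midnight on April 1,
-- so every signup on that day is still included.
SELECT COUNT(*) AS spring_signups
FROM users
WHERE signup_date >= DATE '2026-04-01';
```

The per-bucket counts in the final result should add up to this number, which makes it a useful cross-check.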

#### Step 2: Aggregate by `age_bucket`

`GROUP BY age_bucket` collapses the filtered rows to one row per bucket. This problem needs only `COUNT(*)`, which counts every row in the group; other aggregates such as `SUM` or `AVG` do not apply here.
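The difference between `COUNT(*)` and `COUNT(age_bucket)` only shows up when the bucket is `NULL`. The following self-contained sketch uses inline sample rows (not the real `users` table, and the bucket labels are made up) to illustrate it in Postgres:

```sql
-- Inline sample data; the bucket labels are invented for illustration.
SELECT age_bucket,
       COUNT(*)          AS user_count,      -- counts every row in the group
       COUNT(age_bucket) AS non_null_count   -- skips NULL values
FROM (VALUES ('18-24'), ('18-24'), ('25-34'), (NULL)) AS sample(age_bucket)
GROUP BY age_bucket;
```

The `NULL` rows form their own group, where `COUNT(*)` reports 1 and `COUNT(age_bucket)` reports 0. Whether `age_bucket` can actually be `NULL` in `users` is not stated in the problem, so treat this as a point to raise with the interviewer rather than a required change.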

#### Step 3: Sort the final output

`ORDER BY user_count DESC` puts the largest buckets first, as the prompt requires. Interviewers check that the sort direction matches the prompt.
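When two buckets have the same count, their relative order is not guaranteed. The prompt does not ask for a tie-breaker, so the variant below is an optional sketch for cases where a deterministic order matters:

```sql
-- Optional variant: break ties on age_bucket so the order is stable.
-- The prompt only requires sorting by count, largest first.
SELECT age_bucket, COUNT(*) AS user_count
FROM users
WHERE signup_date >= DATE '2026-04-01'
GROUP BY age_bucket
ORDER BY user_count DESC, age_bucket;
```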

---

### The solution

**Apply the range filter to find signups by age bucket since April**

```sql
SELECT age_bucket, COUNT(*) AS user_count
FROM users
WHERE signup_date >= '2026-04-01'
GROUP BY age_bucket
ORDER BY user_count DESC
```

> **Cost Analysis**
>
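> With ~15M rows, the `WHERE` filter reduces the working set before the `GROUP BY` runs. There is no join in this query, so the column that matters for indexing is `signup_date`: an index on it can replace a full table scan with a range scan, provided the filter is selective enough for the planner to use it.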
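>
> As a sketch of the indexing idea, assuming `signup_date` is not already indexed (the schema is not shown, and the index name `idx_users_signup_date` is hypothetical):
>
> ```sql
> -- Hypothetical index on the filter column.
> CREATE INDEX IF NOT EXISTS idx_users_signup_date
>     ON users (signup_date);
>
> -- Inspect the plan to see whether the index is actually used.
> EXPLAIN
> SELECT age_bucket, COUNT(*) AS user_count
> FROM users
> WHERE signup_date >= DATE '2026-04-01'
> GROUP BY age_bucket
> ORDER BY user_count DESC;
> ```
>
> Whether the planner picks the index depends on how many rows fall on or after April 1; if most of the table qualifies, a sequential scan may still be the cheaper plan.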

> **Interviewers Watch For**
>
> Interviewers watch for two things: whether the query returns exactly the columns and ordering the prompt specifies, and how quickly you identify the core operation and write clean, minimal code.

> **Common Pitfall**
>
> Returning extra columns that the prompt did not ask for, or using the wrong column alias, causes a grading mismatch even when the logic is correct.

---

## Common follow-up questions

- What would happen to your result if `users.user_id` contained duplicate values that you did not expect? _(Tests whether the candidate considers data quality issues in `user_id` and uses DISTINCT or deduplication where needed.)_
- `users.user_id` has roughly 15,000,000 distinct values. What index strategy would you use to avoid a full scan on `users`? _(Tests indexing knowledge specific to the high-cardinality `user_id` column in `users`.)_
- If this query ran as a scheduled job, how would you add monitoring to detect when the result set is suspiciously empty? _(Tests operational awareness around scheduled query jobs.)_
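For the first follow-up, one possible answer is to count distinct users instead of rows. This is a sketch that assumes each `user_id` should contribute at most once per bucket; it is not part of the graded solution.

```sql
-- Variant that tolerates duplicate user_id rows.
-- Assumption: a user should be counted once per age_bucket.
SELECT age_bucket, COUNT(DISTINCT user_id) AS user_count
FROM users
WHERE signup_date >= DATE '2026-04-01'
GROUP BY age_bucket
ORDER BY user_count DESC;
```

If duplicate rows for the same `user_id` carry different `age_bucket` values, that user is still counted once in each bucket, so a fuller answer would deduplicate the rows before grouping.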

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/signups_by_age_bucket_since_april)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.