# Monthly Signup Counts

> Signups, month by month.

Canonical URL: <https://datadriven.io/problems/monthly_signup_counts>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

A downstream forecasting model requires monthly signup counts as floating-point values, but the dashboard also needs the raw integer. For each month (YYYY-MM format), show the signup count in both forms.

## Worked solution and explanation

### Why this problem exists in real interviews

Against the users table, grouping and date extraction on username values is the key operation. Interviewers favor this as a fundamentals check because it exposes whether candidates handle ties, NULLs, and ordering edge cases correctly.

---

### Break down the requirements

#### Step 1: Aggregate with COUNT

Group by the output grain and apply `COUNT()` to compute the metric. The `GROUP BY` must match exactly what the output needs: one row per group key.

#### Step 2: Order the final output

Apply `ORDER BY` as specified to produce the expected row sequence. When tied values exist, add a secondary sort column for determinism.

---

### The solution

**Dual-type output for downstream compatibility**

```sql
SELECT STRFTIME('%Y-%m', signup_date) AS month,
    COUNT(*) AS signup_count,
    CAST(COUNT(*) AS DOUBLE) AS signup_count_float
FROM users
GROUP BY STRFTIME('%Y-%m', signup_date)
ORDER BY month
```

> **Cost Analysis**
>
> The query scans 20M rows from `users`. The aggregation reduces the row count before any downstream processing, which is the key performance lever. CTEs in most engines are optimization fences. For production workloads, consider inlining or materializing the intermediate results.

> **Interviewers Watch For**
>
> Naming the output grain ("one row per X") before writing the GROUP BY shows you think about data shape, not just syntax. Breaking complex logic into named CTEs shows the interviewer you prioritize readability and debuggability.

> **Common Pitfall**
>
> Comparing dates stored as TEXT without casting can produce lexicographic instead of chronological ordering. Always confirm the column type.

---

## Common follow-up questions

- If users.user_id could contain unexpected NULL values, how would your query behave? _(Tests NULL awareness even when the schema does not currently allow NULLs in user_id.)_
- How would you verify that your aggregation on users.user_id is not double-counting due to duplicate rows? _(Tests data quality awareness and deduplication strategies.)_
- With millions of distinct values in users.user_id, what index strategy would you use to keep this query performant? _(Tests indexing knowledge specific to high-cardinality columns like user_id.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/monthly_signup_counts)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.