# April and May Active Users

> Spring cleaning for the user base. Who was actually around?

Canonical URL: <https://datadriven.io/problems/april_and_may_active_users>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

The growth team needs to identify users who were active during the spring. Pull a deduplicated list of user IDs for everyone who had at least one session in April or May.

## Worked solution and explanation

### Why this problem exists in real interviews

By forcing filtering and projection on `user_sessions`, this question separates candidates who filter on `session_start` and return only the user ID from those who drag in unneeded columns such as `session_duration_sec` and `pages_viewed`, or who reach for an aggregation the prompt never asks for.

---

### Break down the requirements

#### Step 1: Filter `user_sessions` to qualifying rows

Apply a WHERE clause on `session_start` to keep only sessions that began in April or May. The prompt names no year, so a month-based filter is the literal reading. This reduces the working set before deduplication.
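
A minimal sketch of two ways to write the April/May filter, run against inline sample rows rather than the real table. The `user_id` column name and the year 2024 are assumptions for illustration; the prompt specifies neither.

```sql
-- Inline sample rows stand in for user_sessions (hypothetical data).
WITH user_sessions (user_id, session_start) AS (
    VALUES
        (1, TIMESTAMP '2023-04-15 09:00:00'),
        (2, TIMESTAMP '2024-05-31 23:59:59'),
        (3, TIMESTAMP '2024-06-01 00:00:00')
)
SELECT
    user_id,
    session_start,
    -- Month-based test: April or May of any year.
    EXTRACT(MONTH FROM session_start) IN (4, 5) AS in_spring_any_year,
    -- Half-open range test: April 1 inclusive to June 1 exclusive, 2024 only.
    (session_start >= TIMESTAMP '2024-04-01'
     AND session_start < TIMESTAMP '2024-06-01') AS in_spring_2024
FROM user_sessions
ORDER BY user_id;
```

User 1 passes only the month-based test, user 2 passes both, and user 3 passes neither. Which reading is intended is worth confirming with the interviewer.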

#### Step 2: Deduplicate with DISTINCT

`SELECT DISTINCT user_id` (assuming the user identifier column is named `user_id`) removes duplicate values, producing one row per user no matter how many qualifying sessions that user had.
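
As an illustration of the deduplication step, the sketch below compares `DISTINCT` with the equivalent `GROUP BY` form on inline sample rows. The `user_id` column name and the data are assumptions.

```sql
-- Inline sample rows: user 7 has two sessions, user 9 has one.
WITH user_sessions (user_id, session_start) AS (
    VALUES
        (7, TIMESTAMP '2024-04-02 08:00:00'),
        (7, TIMESTAMP '2024-04-20 18:30:00'),
        (9, TIMESTAMP '2024-05-05 10:15:00')
),
via_distinct AS (
    SELECT DISTINCT user_id FROM user_sessions
),
via_group_by AS (
    SELECT user_id FROM user_sessions GROUP BY user_id
)
SELECT
    (SELECT COUNT(*) FROM via_distinct) AS distinct_rows,
    (SELECT COUNT(*) FROM via_group_by) AS group_by_rows;
-- Both counts are 2: user 7 appears once despite having two sessions.
```

Either form works here. `DISTINCT` states the intent more directly when no aggregate is being computed.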

#### Step 3: Sort the output

Order by the user ID column for deterministic, readable results.

---

### The solution

**Distinct-filter for April and May active users**

```sql
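-- Assumes the user identifier column is named user_id.
SELECT DISTINCT user_id
FROM user_sessions
WHERE EXTRACT(MONTH FROM session_start) IN (4, 5)  -- April or May, any year
ORDER BY user_id;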
```
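
To sanity-check a query for this prompt, one option is to run it against a handful of inline rows that cover the month boundaries. This is a sketch with made-up data; `user_id` is an assumed column name.

```sql
-- Inline sample rows shadow user_sessions for this statement only.
WITH user_sessions (user_id, session_start) AS (
    VALUES
        (1, TIMESTAMP '2024-03-31 23:59:59'),  -- March: excluded
        (1, TIMESTAMP '2024-04-01 00:00:00'),  -- April: included
        (2, TIMESTAMP '2024-05-10 12:00:00'),  -- May: included
        (2, TIMESTAMP '2024-05-31 23:30:00'),  -- May again: deduplicated
        (3, TIMESTAMP '2024-06-01 00:00:00')   -- June: excluded
)
SELECT DISTINCT user_id
FROM user_sessions
WHERE EXTRACT(MONTH FROM session_start) IN (4, 5)
ORDER BY user_id;
-- Returns two rows: user_id 1 and user_id 2.
```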

> **Cost Analysis**
>
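> The main table has 30M rows (12 GB). It is partitioned on `session_start`, so a query that filters on that column with a plain range predicate can skip most partitions. Wrapping the column in a function such as `EXTRACT` generally prevents pruning and forces a scan of every partition.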
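
One way to check whether a given filter allows partition pruning is to compare `EXPLAIN` output on a small partitioned table. The sketch below uses hypothetical temporary tables (`demo_sessions` and its two partitions) and an assumed year of 2024. It does not reflect the real table's partition layout.

```sql
-- Hypothetical demo table, range-partitioned on session_start.
CREATE TEMP TABLE demo_sessions (
    user_id       integer,
    session_start timestamp
) PARTITION BY RANGE (session_start);

CREATE TEMP TABLE demo_sessions_2024_q1 PARTITION OF demo_sessions
    FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
CREATE TEMP TABLE demo_sessions_2024_q2 PARTITION OF demo_sessions
    FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');

-- Range predicate on the raw partition key: the planner can prune.
EXPLAIN
SELECT DISTINCT user_id
FROM demo_sessions
WHERE session_start >= TIMESTAMP '2024-04-01'
  AND session_start <  TIMESTAMP '2024-06-01';

-- Function wrapped around the partition key: pruning is not expected.
EXPLAIN
SELECT DISTINCT user_id
FROM demo_sessions
WHERE EXTRACT(MONTH FROM session_start) IN (4, 5);
```

The first plan would be expected to reference only the `q2` partition, while the second would be expected to scan both. The trade-off is that the range form requires committing to a specific year, which the prompt does not state.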

> **Interviewers Watch For**
>
> Clean, readable SQL with correct column references signals production readiness. Candidates who verbalize their approach before coding score higher on communication.

> **Common Pitfall**
>
> Returning extra columns not asked for, or missing a required column, are both common mistakes that fail automated grading.

---

## Common follow-up questions

- What happens to your results if the user ID column in `user_sessions` is stored as text and contains trailing whitespace or mixed casing? _(Tests awareness of text normalization issues that silently fragment DISTINCT results.)_
- If `user_sessions` were partitioned by date, would your query need to scan all partitions or could it prune? How would you verify? _(Tests understanding of partition pruning and EXPLAIN output.)_
- `session_id` in `user_sessions` has ~30M distinct values. What index strategy keeps your query from doing a full table scan? _(Tests whether the candidate can design indexes for high-cardinality columns and understands selectivity.)_
- Could you express this same logic as a single query without CTEs or subqueries? What readability trade-off does that introduce? _(Tests whether the candidate can flatten nested logic and understands when decomposition aids maintainability.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/april_and_may_active_users)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.