# Repeat Buyers Across Halves

> First half buyer. Second half buyer. Same person.

Canonical URL: <https://datadriven.io/problems/repeat_buyers_across_halves>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

Find users who made at least 1 transaction in both the first half (Jan through Jun) and second half (Jul through Dec) of 2025. Return the user ID.

## Worked solution and explanation

### Why this problem exists in real interviews

The core skill being tested is conditional aggregation via CASE, applied to the `transactions` table in a user behavior context. Getting the `user_id` column right is where most candidates slip. The problem layers in HAVING for post-aggregation filtering, date extraction for time bucketing as well.

> **Trick to Solving**
>
> When the prompt asks for multiple metrics split by a condition (e.g., resolved vs. unresolved), conditional aggregation avoids multiple passes.
> 
> 1. Spot the split: two or more categories in one output row
> 2. Use `SUM(CASE WHEN condition THEN 1 ELSE 0 END)` for each bucket
> 3. Group by the common dimension

---

### Break down the requirements

#### Step 1: Use conditional aggregation with CASE

A `CASE` expression inside the aggregate function splits rows into buckets without multiple passes over the data. Each condition maps to one output column.

#### Step 2: Aggregate by `user_id`

`GROUP BY user_id` collapses rows to one per group. The aggregate functions (`SUM`, `COUNT`, `AVG`, etc.) compute the metric for each group.

#### Step 3: Filter groups with HAVING

HAVING applies after GROUP BY, filtering out groups that do not meet the threshold. This cannot be done in WHERE because the aggregate has not been computed yet.

#### Step 4: Sort the final output

The `ORDER BY` clause ensures the result appears in the expected sequence. Interviewers check that the sort direction matches the prompt.

---

### The solution

**Conditional filter-after-group for repeat buyers across halves**

```sql
SELECT user_id
FROM transactions
GROUP BY user_id
HAVING COUNT(CASE WHEN CAST(strftime('%m', transaction_date) AS INTEGER) <= 6 THEN 1 END) >= 1 AND COUNT(CASE WHEN CAST(strftime('%m', transaction_date) AS INTEGER) > 6 THEN 1 END) >= 1
ORDER BY user_id
```

> **Cost Analysis**
>
> With ~120M rows, the GROUP BY reduces the working set before any downstream operations. An index on the filter/join columns would reduce the scan to a seek.

> **Interviewers Watch For**
>
> Interviewers watch for whether you use HAVING (not WHERE) to filter after aggregation; whether you can pivot data with conditional aggregation in a single pass instead of multiple queries; how you handle date arithmetic and whether you account for edge cases like month boundaries.

> **Common Pitfall**
>
> Putting the aggregate condition in WHERE instead of HAVING causes a syntax error. WHERE runs before GROUP BY; HAVING runs after.

---

## Common follow-up questions

- What would happen to your result if `transactions.total_amount` contained duplicate values that you did not expect? _(Tests whether the candidate considers data quality issues in `total_amount` and uses DISTINCT or deduplication where needed.)_
- With 120,000,000 distinct values in `transactions.transaction_id`, how would a composite index on the GROUP BY columns change the execution plan? _(Probes understanding of how cardinality in `transaction_id` affects grouping and sort operations.)_
- If the date column in `transactions` spans multiple years, does your date extraction logic still produce correct time buckets? _(Tests whether the candidate accounts for year boundaries in date bucketing.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/repeat_buyers_across_halves)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.