# New Customers Per Day

> Count users whose first order falls on each date.

Canonical URL: <https://datadriven.io/problems/new_customers_per_day>

Domain: SQL · Difficulty: medium · Seniority: L5

## Problem

For each calendar date, how many customers placed their very first order on that day? A customer's first-order date is the earliest order they have in the table.

## Worked solution and explanation

### Why this problem exists in real interviews

This focuses on grouping and CTE composition within transactions, specifically around the transaction_date column. Interviewers present it in mid-level screens because the edge cases around NULL values and boundary conditions reveal depth of understanding.

---

### Break down the requirements

#### Step 1: Find each customer's first transaction date

Use `MIN(transaction_date)` grouped by `user_id` to identify the date each customer first appeared.

#### Step 2: Count new customers per day

Wrap the first-date CTE in an outer query: `GROUP BY first_date` with `COUNT(*)` to tally how many customers debuted each day.

---

### The solution

**MIN date per user, then count per day**

```sql
WITH first_purchase AS (
    SELECT user_id, MIN(transaction_date) AS first_date
    FROM transactions
    GROUP BY user_id
)
SELECT first_date, COUNT(*) AS new_customers
FROM first_purchase
GROUP BY first_date
ORDER BY first_date
```

> **Cost Analysis**
>
> With `transactions` (80,000,000 rows), the full scan reads significant data. A composite index on the filter columns pushes the predicate into the index layer. Pre-aggregation in a materialized view is worth considering at this scale.

> **Interviewers Watch For**
>
> Interviewers evaluate whether you translate the English requirements into the correct SQL clauses on the first attempt. They watch for clean syntax, correct column references, and whether you verify edge cases before declaring the query complete.

> **Common Pitfall**
>
> The most common mistake is misreading the prompt's filtering or grouping requirements. Double-check which columns to group by, which to aggregate, and whether the output should be filtered with `WHERE` (before grouping) or `HAVING` (after grouping).

---

## Common follow-up questions

- If transactions.transaction_id could contain unexpected NULL values, how would your query behave? _(Tests NULL awareness even when the schema does not currently allow NULLs in transaction_id.)_
- How would you verify that your aggregation on transactions.transaction_id is not double-counting due to duplicate rows? _(Tests data quality awareness and deduplication strategies.)_
- With millions of distinct values in transactions.transaction_id, what index strategy would you use to keep this query performant? _(Tests indexing knowledge specific to high-cardinality columns like transaction_id.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/new_customers_per_day)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.