# Who's Holding Up Traffic

> Some endpoints carry the product. Others are barely touched.

Canonical URL: <https://datadriven.io/problems/avg_daily_active_users_per_endpoint>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

The API product team wants to understand typical daily traffic per endpoint during June 2025. For each endpoint, what was the average number of unique users hitting it on a given day that month?

## Worked solution and explanation

### What this is really asking

`COUNT(DISTINCT user_id)` has to live inside a per-day bucket, not at the endpoint level. A user who hits `/search` on 30 days of June must count 30 times across the average, once per day.

---

### Break down the requirements

#### Step 1: Filter to June of last year

`strftime('%Y-%m', call_time) = 'YYYY-06'` keeps the predicate sargable on the partition column and restricts the 350M row scan.

#### Step 2: Aggregate per (endpoint, day)

Inner query computes `COUNT(DISTINCT user_id)` grouped by endpoint and `DATE(call_time)`. That gives one row per endpoint per active day.

#### Step 3: Average across days

Outer query takes `AVG(daily_users)` grouped by endpoint. Days where the endpoint had zero calls are absent, so the denominator is days-active, not 30.

---

### The solution

**AVG DAILY ACTIVE USERS PER ENDPOINT**

```sql
SELECT endpoint,
       AVG(daily_users) AS avg_daily_active_users
FROM (
  SELECT endpoint,
         DATE(call_time) AS call_day,
         COUNT(DISTINCT user_id) AS daily_users
  FROM api_calls
  WHERE strftime('%Y-%m', call_time) = '2026-06'
  GROUP BY endpoint, DATE(call_time)
) sub
GROUP BY endpoint;
```

> **Cost Analysis**
>
> Partition pruning on `call_time` cuts the 350M row scan to roughly June's slice. The inner GROUP BY hashes by (endpoint, day) with a distinct user_id set per bucket; memory scales with distinct users per endpoint per day, not total rows.

> **Interviewers Watch For**
>
> Whether you nest the aggregation. The phrase 'average on a given day' is the cue for a two-level group: distinct users per day inside, mean across days outside. Stating that out loud before writing wins the point.

> **Common Pitfall**
>
> Reusing month-wide distinct users in the numerator. A power user hitting the endpoint every day shows up once, not thirty times, so the average collapses toward the unique-visitor count divided by active days instead of the true daily mean.

> **The False Start**
>
> First instinct is `COUNT(DISTINCT user_id) * 1.0 / COUNT(DISTINCT DATE(call_time))` in one pass per endpoint. That gives month-wide uniques over active days, which understates traffic whenever users return. Pivot to a (endpoint, day) subquery and `AVG` outside.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- How would you change this if the team wanted the average over all 30 calendar days, including days with zero traffic? _(Cross join endpoints to a generated date series, left join the per-day counts, and `COALESCE` missing days to zero before averaging.)_
- What if `user_id` can be null for unauthenticated traffic? _(`COUNT(DISTINCT user_id)` already ignores NULLs, but the team may want anonymous sessions bucketed by IP or session_id as a separate identity column.)_
- How would you rank the endpoints by week-over-week change in this metric? _(Compute the daily-uniques aggregation per ISO week, then `LAG` the average per endpoint ordered by week to get the delta.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/avg_daily_active_users_per_endpoint)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.