# Average API Latency by Year

> Latency year over year. Is it getting better?

Canonical URL: <https://datadriven.io/problems/average_api_latency_by_year>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

During the weekly platform review, the VP of Engineering asked whether API latency has been trending up. Show the average latency for each endpoint in each year, ordered by year and then by endpoint.

## Worked solution and explanation

### Why this problem exists in real interviews

Interviewers use the `api_calls` table here to probe grouped aggregation. The columns `endpoint`, `method`, `status` force candidates to reason about the correct grain before writing any aggregation, a pattern seen at the medium level (variant 1).

---

### Break down the requirements

#### Step 1: Group by `STRFTIME('%Y'`

`GROUP BY` at the correct grain produces one row per group.

#### Step 2: Compute `AVG(latency)`

The AVG function computes the avg per group.

#### Step 3: Order by the metric

Sort by `avg_latency` desc for readability.

---

### The solution

**Group-aggregate for average api latency year**

```sql
SELECT
    STRFTIME('%Y', call_time) AS year, STRFTIME('%Y', call_time), endpoint,
    AVG(latency) AS avg_latency
FROM api_calls
GROUP BY STRFTIME('%Y', call_time), endpoint
ORDER BY avg_latency DESC
```

> **Cost Analysis**
>
> The main table has 800M rows (205 GB). Partitioned on `call_time`, so queries filtering on that column skip most partitions. The GROUP BY reduces the row count early, keeping downstream operations cheap.

> **Interviewers Watch For**
>
> Strong candidates state the correct `GROUP BY` grain before writing any SQL, showing they think about the output shape first.

> **Common Pitfall**
>
> Selecting a non-aggregated column without including it in `GROUP BY` is the most common error. Some engines reject it; others silently return arbitrary values.

---

## Common follow-up questions

- The `latency` column in `api_calls` has roughly 0% NULLs. How does your query handle those rows, and would the result change if NULLs were replaced with zeros? _(Tests whether the candidate understands how NULLs propagate through aggregation functions and whether their WHERE/JOIN conditions implicitly filter them out.)_
- Your GROUP BY aggregates `call_id` from `api_calls`. If two groups have the same aggregate value, how is the output ordered, and is that deterministic? _(Tests awareness that ORDER BY on a non-unique value produces non-deterministic row order without a tiebreaker.)_
- `call_id` in `api_calls` has ~800M distinct values. What index strategy keeps your query from doing a full table scan? _(Tests whether the candidate can design indexes for high-cardinality columns and understands selectivity.)_
- If `call_id` in `api_calls` contained negative values, would your query still produce correct results? _(Tests whether the candidate validated assumptions about the domain of numeric columns.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/average_api_latency_by_year)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.