# Most Active Servers by Log Volume

> The busiest servers by log volume.

Canonical URL: <https://datadriven.io/problems/most_active_servers_by_log_volume>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

To size log storage for next quarter, the observability team needs to know which servers generated the most log volume during 2025, ranked from highest to lowest.

## Worked solution and explanation

### Why this problem exists in real interviews

The `server_logs` table holds `server_name` and `log_level` values alongside a `log_timestamp`, and the answer requires grouping by server after extracting the year from the timestamp. The problem appears in mid-level screens to probe whether you settle on the correct aggregation grain before writing any window function or `GROUP BY` clause.

> **Trick to Solving**
>
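> Read the prompt for implicit constraints. Its phrasing gives away the grain of the output: "which servers" means one row per server, "during 2025" is a row filter on the log timestamp, and "ranked from highest to lowest" is a descending sort on the count.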
> 
> 1. Identify the output grain from the prompt (one row per what?)
> 2. Work backward from the desired output columns
> 3. Build the query inside-out: start with the filtered rows, then layer on the aggregate and the sort (no subquery is needed here)

---

### Break down the requirements

#### Step 1: Filter to the target rows

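Filter on the year with `STRFTIME('%Y', log_timestamp) = '2025'`. `STRFTIME` (a SQLite function) returns text, so compare against the string `'2025'`. The `WHERE` clause restricts rows before aggregation, so logs from other years never reach the count.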
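If `log_timestamp` is stored as a timestamp or as ISO-8601 text (an assumption; the prompt does not state the column type), a half-open range filter selects the same 2025 rows without wrapping the column in a function, which leaves an index on `log_timestamp` usable. A minimal sketch against a small hypothetical sample table:

```sql
-- Hypothetical sample data; the real server_logs schema may differ.
CREATE TABLE server_logs (
    log_id        INTEGER PRIMARY KEY,
    server_name   TEXT NOT NULL,
    log_timestamp TEXT NOT NULL  -- assumed ISO-8601 text, e.g. '2025-03-14 09:26:53'
);

INSERT INTO server_logs (log_id, server_name, log_timestamp) VALUES
    (1, 'web-01', '2024-12-31 23:59:59'),  -- excluded: before 2025
    (2, 'web-01', '2025-01-01 00:00:00'),  -- included: first instant of 2025
    (3, 'db-01',  '2025-12-31 23:59:59'),  -- included: last second of 2025
    (4, 'db-01',  '2026-01-01 00:00:00');  -- excluded: first instant of 2026

-- Half-open range: on or after the start of 2025, strictly before the start of 2026.
SELECT log_id, server_name, log_timestamp
FROM server_logs
WHERE log_timestamp >= '2025-01-01'
  AND log_timestamp <  '2026-01-01';
```

On this sample the filter keeps rows 2 and 3. The half-open upper bound avoids having to spell out the last representable instant of the year.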

#### Step 2: Aggregate with COUNT

Group by `server_name`, the output grain, and apply `COUNT(*)` to count log rows per server. The `GROUP BY` must match exactly what the output needs: one row per server.

#### Step 3: Order the final output

Apply `ORDER BY log_count DESC` so the busiest server comes first. The prompt does not define a tie-break; if two servers can have the same count, add a secondary sort column such as `server_name` to make the order deterministic.
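To illustrate the tie-break, here is a sketch in SQLite syntax with hypothetical sample rows in which two servers have the same count. Using `server_name` as the secondary key is an assumption for illustration, not a requirement stated by the prompt:

```sql
-- Hypothetical sample data; web-01 and web-02 tie at two rows each.
CREATE TABLE server_logs (
    log_id        INTEGER PRIMARY KEY,
    server_name   TEXT NOT NULL,
    log_timestamp TEXT NOT NULL  -- assumed ISO-8601 text
);

INSERT INTO server_logs (log_id, server_name, log_timestamp) VALUES
    (1, 'web-02', '2025-02-01 10:00:00'),
    (2, 'web-01', '2025-02-01 10:05:00'),
    (3, 'web-02', '2025-02-02 11:00:00'),
    (4, 'web-01', '2025-02-03 12:00:00'),
    (5, 'db-01',  '2025-02-04 13:00:00');

-- Primary sort: count descending. Secondary sort: name ascending,
-- so tied servers always come back in the same order.
SELECT server_name, COUNT(*) AS log_count
FROM server_logs
WHERE STRFTIME('%Y', log_timestamp) = '2025'
GROUP BY server_name
ORDER BY log_count DESC, server_name ASC;
```

Without the secondary key, the relative order of `web-01` and `web-02` is left to the engine and may change between runs.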

---

### The solution

**Year-filtered group count for volume ranking**

```sql
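-- One row per server, counting only log rows from 2025
SELECT server_name, COUNT(*) AS log_count
FROM server_logs
WHERE STRFTIME('%Y', log_timestamp) = '2025'  -- STRFTIME returns text
GROUP BY server_name
ORDER BY log_count DESC;  -- busiest server first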
```

> **Cost Analysis**
>
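> The query scans 80M rows from `server_logs`; because the filter wraps `log_timestamp` in `STRFTIME`, a plain index on that column cannot narrow the scan. No window function is involved. The only sort is the final `ORDER BY`, which is O(n log n) in the number of servers rather than the number of log rows, because `GROUP BY` has already reduced the input to one row per server.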

> **Interviewers Watch For**
>
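> Strong candidates note that the prompt only asks for an ordering, so `ORDER BY` is enough and no window function is required. If a follow-up asks for an explicit rank column, they explain their choice of `ROW_NUMBER`, `RANK`, or `DENSE_RANK` and why it matches the tie semantics.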

> **Common Pitfall**
>
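> Comparing the `STRFTIME` result with the integer `2025` instead of the string `'2025'`: in SQLite a text value does not compare equal to an integer literal, so the filter matches no rows. `ROWS` vs `RANGE` only becomes relevant if you add a window frame, for example for a running total over the ranked counts; in that case the two produce different results when ties exist, so default to `ROWS` unless you specifically need tie grouping.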

---

## Common follow-up questions

- What happens to your result if server_logs.response_time_ms contains NULLs for some rows? _(Tests whether the candidate accounts for NULL behavior in aggregates and comparisons on response_time_ms.)_
- How would you verify that your count is not inflated by duplicate rows, for example repeated `server_logs.log_id` values? _(Tests data quality awareness and deduplication strategies.)_
- With millions of distinct values in server_logs.log_id, what index strategy would you use to keep this query performant? _(Tests indexing knowledge specific to high-cardinality columns like log_id.)_
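
The first two follow-ups can be explored with a short sketch. The sample rows and the column types below are hypothetical; only the column names `log_id` and `response_time_ms` come from the questions above. It shows one way to surface duplicate `log_id` values and how `COUNT(*)`, `COUNT(DISTINCT ...)`, and `COUNT(column)` differ when duplicates and NULLs are present:

```sql
-- Hypothetical sample data with one duplicated row and one NULL response time.
CREATE TABLE server_logs (
    log_id           INTEGER,        -- no uniqueness constraint, so duplicates can occur
    server_name      TEXT NOT NULL,
    log_timestamp    TEXT NOT NULL,  -- assumed ISO-8601 text
    response_time_ms INTEGER         -- nullable
);

INSERT INTO server_logs (log_id, server_name, log_timestamp, response_time_ms) VALUES
    (1, 'web-01', '2025-02-01 10:00:00', 120),
    (1, 'web-01', '2025-02-01 10:00:00', 120),   -- duplicate of the row above
    (2, 'web-01', '2025-02-02 11:00:00', NULL),  -- missing response time
    (3, 'db-01',  '2025-02-03 12:00:00', 45);

-- Duplicate check: any log_id that appears more than once.
SELECT log_id, COUNT(*) AS copies
FROM server_logs
GROUP BY log_id
HAVING COUNT(*) > 1;

-- COUNT(*) counts every row, COUNT(DISTINCT log_id) ignores repeats,
-- and COUNT(response_time_ms) skips rows where the column is NULL.
SELECT server_name,
       COUNT(*)                AS all_rows,
       COUNT(DISTINCT log_id)  AS distinct_logs,
       COUNT(response_time_ms) AS rows_with_response_time
FROM server_logs
GROUP BY server_name
ORDER BY server_name;
```

On this sample, `web-01` has three rows but only two distinct `log_id` values, so `COUNT(*)` overstates its volume if the duplicate is a genuine ingestion error. NULLs in `response_time_ms` do not affect `COUNT(*)`, which is what the solution uses.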

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/most_active_servers_by_log_volume)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.