# The Status Board

> Make sense of a pile of raw Nginx access logs.

Canonical URL: <https://datadriven.io/problems/the_status_board>

Domain: Python · Difficulty: medium · Seniority: L4

## Problem

Parse Nginx combined-format log lines. Return a 2-element list: [status_counts_dict, top_3_paths_list]. status_counts maps each status code (as string) to its count. top_3_paths lists the three most-requested paths (query strings stripped) in descending count; tie-break by path alphabetically. If fewer than 3 distinct paths exist, return all of them.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests **string parsing** (regex or manual), **dict aggregation**, and **sorting with custom keys**. Parsing log formats is a common data engineering task that probes whether a candidate can extract structured data from semi-structured text.

---

### Break down the requirements

#### Step 1: Parse each log line to extract the status code and resource path

In the Nginx combined log format, the request line (method, path, and protocol) is wrapped in double quotes and is followed immediately by the three-digit status code. Use a regex or string splitting to extract the path and the status.
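As a rough illustration of this step, the sketch below pulls the request fields out of one line with a regex. The sample line, the `REQUEST_RE` name, and the `parse_line` helper are assumptions made for the example; they are not part of the problem's test data.

```python
import re

# Hypothetical sample line in combined format (illustration only).
SAMPLE = (
    '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] '
    '"GET /api/users?page=2 HTTP/1.1" 200 512 "-" "curl/8.0"'
)

# Quoted request line (method, path, protocol) followed by a 3-digit status.
REQUEST_RE = re.compile(r'"(\w+) (\S+) (\S+)" (\d{3})')


def parse_line(line):
    """Return the request fields as a dict, or None if the line does not match."""
    m = REQUEST_RE.search(line)
    if m is None:
        return None
    method, raw_path, protocol, status = m.groups()
    return {"method": method, "path": raw_path, "protocol": protocol, "status": status}
```

The status is kept as a string because the problem asks for string keys in the status counts.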

#### Step 2: Strip query parameters from the path

Split the path on `?` and keep only the portion before it.
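A minimal sketch of the stripping step, assuming a hypothetical helper named `strip_query`. Passing `1` as the split limit is an optional detail: only the part before the first `?` is needed.

```python
def strip_query(raw_path):
    """Drop everything from the first '?' onward; paths without '?' are unchanged."""
    return raw_path.split("?", 1)[0]
```

For instance, `/api/users?page=1` and `/api/users?page=2` both reduce to `/api/users`, so they are counted as one path.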

#### Step 3: Count status codes and path frequencies

Accumulate counts in two separate dicts.
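One possible way to express the two tallies is `collections.Counter`, shown here as an alternative to the plain dicts with `.get()` used in the solution. The `tally` function and its input shape (an iterable of `(status, path)` pairs) are assumptions for this sketch.

```python
from collections import Counter


def tally(records):
    """Count statuses and paths from an iterable of (status, path) pairs."""
    status_counts = Counter()
    path_counts = Counter()
    for status, path in records:
        status_counts[status] += 1
        path_counts[path] += 1
    # Convert to plain dicts so the result matches the expected output type.
    return dict(status_counts), dict(path_counts)
```

Both approaches do one pass over the parsed records; `Counter` simply removes the need for a default value on first sight of a key.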

#### Step 4: Return the status counts and top 3 paths

Sort paths by frequency descending, breaking ties alphabetically by path, and take the first 3. If fewer than 3 distinct paths exist, return all of them.
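The tie-break required by the problem statement can be handled with a composite sort key. The sketch below assumes a hypothetical `top_paths` helper that receives the path counts as a dict; negating the count gives descending frequency while the path itself sorts ascending.

```python
def top_paths(path_counts, k=3):
    """Return up to k paths ordered by count descending, then path ascending."""
    ranked = sorted(path_counts.items(), key=lambda item: (-item[1], item[0]))
    # Slicing handles the "fewer than k distinct paths" case without extra checks.
    return [path for path, _ in ranked[:k]]
```

With counts `{"/b": 2, "/a": 2, "/c": 5, "/d": 1}`, this returns `["/c", "/a", "/b"]`: `/c` wins on count, and `/a` precedes `/b` because they tie at 2.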

---

### The solution

**Regex parsing with dual aggregation**

```python
import re


def analyze_logs(lines: list) -> list:
    status_counts = {}
    path_counts = {}
    # Capture the request path and the 3-digit status after the quoted request line.
    pattern = re.compile(r'"\w+ (\S+) \S+" (\d{3})')
    for line in lines:
        match = pattern.search(line)
        if not match:
            continue  # Skip malformed lines instead of crashing.
        raw_path, status = match.groups()
        path = raw_path.split("?")[0]  # Strip the query string.
        status_counts[status] = status_counts.get(status, 0) + 1
        path_counts[path] = path_counts.get(path, 0) + 1
    # Count descending, then path ascending so ties break alphabetically.
    ranked = sorted(path_counts.items(), key=lambda item: (-item[1], item[0]))
    top_paths = [path for path, _ in ranked[:3]]
    # The problem asks for a 2-element list, not a tuple.
    return [status_counts, top_paths]
```

> **Time and Space Complexity**
>
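> **Time:** O(n * m + u log u), where n is the number of lines, m is the average line length, and u is the number of distinct paths. The parsing pass is O(n * m); sorting the distinct paths is O(u log u), which is O(n log n) in the worst case where every line has a different path.
>
> **Space:** O(u + s), where s is the number of distinct status codes.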

> **Interviewers Watch For**
>
> Robustness to malformed lines. The `if not match: continue` guard prevents crashes on unexpected input. Production log parsers must handle garbage gracefully.

> **Common Pitfall**
>
> Forgetting to strip query parameters. `/api/users?page=1` and `/api/users?page=2` should count as the same resource path.

---

## Common follow-up questions

- How would you handle gzip-compressed log files? _(Tests knowledge of `gzip.open()` for transparent decompression.)_
- What if the logs were streaming in real-time? _(Tests incremental processing and bounded memory usage.)_
- How would you detect anomalous spikes in 5xx errors? _(Tests sliding window rate calculation and threshold alerting.)_
- What if paths contained URL-encoded characters? _(Tests `urllib.parse.unquote` for normalization before counting.)_
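
As one possible answer to the first two follow-ups, the sketch below reads a gzip-compressed log line by line with `gzip.open()` in text mode, so only the running counts are held in memory. The `count_statuses_gz` name and the choice to count only status codes are assumptions for brevity.

```python
import gzip
import re
from collections import Counter

REQUEST_RE = re.compile(r'"\w+ (\S+) \S+" (\d{3})')


def count_statuses_gz(path):
    """Stream a gzip-compressed access log and count status codes."""
    counts = Counter()
    # "rt" decompresses and decodes on the fly; errors="replace" tolerates bad bytes.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:  # One line at a time, never the whole file.
            m = REQUEST_RE.search(line)
            if m:
                counts[m.group(2)] += 1
    return dict(counts)
```

The same loop body works for any iterable of lines, which is what makes the approach adaptable to a live stream.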

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_status_board)
- [Python Interview Questions](https://datadriven.io/python-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.