# Service Scorecard

> Deploys vs. alerts. One row per service tells the whole story.

Canonical URL: <https://datadriven.io/problems/service_scorecard>

Domain: SQL · Difficulty: hard · Seniority: L4

## Problem

The VP of Engineering wants a single dashboard row per service tying deploys to alerts. For each service in deploy_logs, count its deploys and then count how many rows in alert_events match on svc_name, keeping services with zero alerts. Return the svc_name, deploy count, and alert count.

## Worked solution and explanation

### Why this problem exists in real interviews

Drawn from a deployment domain, this question centers on grouped COUNT aggregation over the `deploy_logs` and `alert_events` tables. The tricky part is handling the `svc_name` and `alert_id` columns correctly under the given constraints.

> **Trick to Solving**
>
> Complex queries with multiple transformation stages are best solved by writing one CTE per stage.
> 
> 1. Identify the intermediate results needed
> 2. Write each as a named CTE
> 3. Chain them: each CTE reads from the previous one
> 4. The final SELECT assembles the output

---

### Break down the requirements

#### Step 1: Isolate the intermediate result in a CTE

The `deploy_counts` CTE computes the intermediate aggregation that the outer query builds on. This separation keeps each layer focused on a single task.

#### Step 2: Left join to preserve all base rows

A `LEFT JOIN` from `deploy_logs` ensures every row appears in the output even if there is no match in `deploy_counts`. Missing values become NULL.

---

### The solution

**Isolate the intermediate result in a cte to find service scorecard**

```sql
WITH deploy_counts AS (
    SELECT svc_name, COUNT(*) AS deploy_count
    FROM deploy_logs
    GROUP BY svc_name
)
SELECT dc.svc_name, dc.deploy_count, COUNT(a.alert_id) AS alert_count
FROM deploy_counts dc
LEFT
JOIN alert_events a ON dc.svc_name = a.svc_name
GROUP BY dc.svc_name, dc.deploy_count
```

> **Cost Analysis**
>
> With ~21M rows, the GROUP BY reduces the working set before any downstream operations; the join cost depends on the smaller table's cardinality; CTEs materialize intermediate results, which can be beneficial or costly depending on the engine. An index on the filter/join columns would reduce the scan to a seek.

> **Interviewers Watch For**
>
> Interviewers watch for whether you decompose the problem into named, testable stages rather than nesting everything; whether you choose the correct join type to avoid silently dropping rows; whether you use a subquery or self-join, and can explain the tradeoffs.

> **Common Pitfall**
>
> Using INNER JOIN instead of LEFT JOIN drops rows with no match, producing an incomplete result. The prompt usually hints at this with 'all' or 'even if no'.

---

## Common follow-up questions

- What result would you get if every value in `alert_events.resolved` were NULL? Would your query return an empty set or something unexpected? _(Tests extreme NULL scenarios and whether the candidate guards against edge cases in `resolved`.)_
- `deploy_logs.log_id` has roughly 1,500,000 distinct values. What index strategy would you use to avoid a full scan on `deploy_logs`? _(Tests indexing knowledge specific to the high-cardinality `log_id` column in `deploy_logs`.)_
- Would materializing the CTE as a temp table improve performance when joining `deploy_logs` and `alert_events`? _(Tests understanding of CTE materialization behavior across different database engines.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/service_scorecard)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.