# Deployment Failure Impact

> When deploys fail, how bad is the blast radius?

Canonical URL: <https://datadriven.io/problems/deployment_failure_impact>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

After a string of rollbacks, leadership asked how much deployment capacity each service loses to failures. For each service, show what percentage of its deployments failed and what percentage of its total deployment time was consumed by those failures.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests conditional aggregation for computing failure rates on two different dimensions (count and duration). The percentage-of-total pattern probes window function or conditional SUM skills.

---

### Break down the requirements

#### Step 1: Group by service

`GROUP BY svc_name` produces one row per service.

#### Step 2: Compute failure percentages

`100.0 * SUM(CASE WHEN status = 'failed' ...) / COUNT(*)` for count percentage, and similar for duration percentage.

---

### The solution

**Dual failure percentage computation**

```sql
SELECT
    svc_name,
    ROUND(100.0 * SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END) / COUNT(*), 2) AS failure_pct,
    ROUND(100.0 * SUM(CASE WHEN status = 'failed' THEN dur_secs ELSE 0 END) / NULLIF(SUM(dur_secs), 0), 2) AS failure_time_pct
FROM deploy_logs
GROUP BY svc_name
```

> **Cost Analysis**
>
> Single scan of 1.5M rows. Both percentages computed in one aggregation pass. NULLIF guards against division by zero.

> **Interviewers Watch For**
>
> Whether the candidate computes both percentages in a single pass using conditional aggregation, rather than two separate queries or self-joins.

> **Common Pitfall**
>
> Using SUM(dur_secs) in the denominator without NULLIF risks division by zero for services with all-NULL durations. Always guard division denominators.

---

## Common follow-up questions

- What if a service has zero deployments? _(It would not appear in the GROUP BY output. Tests whether to use a service dimension table.)_
- How would you identify the service with the worst failure impact? _(ORDER BY failure_time_pct DESC LIMIT 1.)_
- What other failure metrics would be useful? _(MTTR (mean time to recovery), failure frequency trends, etc.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/deployment_failure_impact)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.