# Service Uptime Turnaround

> It was down. Then it came back. Stronger.

Canonical URL: <https://datadriven.io/problems/service_uptime_turnaround>

Domain: SQL · Difficulty: hard · Seniority: L5

## Problem

Find services that experienced at least one month of declining uptime followed by at least one month of growth. For each qualifying service, return the service name, the month the decline started, the month growth resumed, and the growth ratio from lowest uptime to peak: (peak minus lowest) divided by lowest.

## Worked solution and explanation

### Why this problem exists in real interviews

This analytics problem uses the `svc_health` table to evaluate ROW_NUMBER for sequential numbering. Candidates must also demonstrate LAG for prior-row comparison, conditional aggregation via CASE. Watch how the `uptime` column interact in the grouping and filtering logic.

> **Trick to Solving**
>
> Whenever the prompt asks you to compare a row to its predecessor or successor, that is a `LAG`/`LEAD` signal.
> 
> 1. Identify the comparison direction (previous vs. next)
> 2. Partition by the grouping key (e.g., user, service)
> 3. Order by the time or sequence column
> 4. Compute the difference in the outer query

---

### Break down the requirements

#### Step 1: Structure the logic with 7 CTEs

Break the problem into named stages: `monthly`, `with_lag`, `trends`, `numbered`, `groups`, `streaks`, `decline_then_growth`. Each CTE isolates one transformation, making the query readable and debuggable.

#### Step 2: Self-join the table to pair rows

Join `svc_health` to itself to compare rows within the same table. The join condition controls which pairs are valid and prevents duplicate mirrors.

#### Step 3: Filter out null values

Exclude rows where `uptime` or `prev_uptime` is NULL. This prevents nulls from polluting aggregations or creating phantom groups.

---

### The solution

**Row-number for service uptime turnaround**

```sql
WITH monthly AS (
    SELECT svc_name, strftime('%Y-%m', checked) AS month, AVG(uptime) AS avg_uptime
    FROM svc_health
    WHERE uptime IS NOT NULL
    GROUP BY svc_name, strftime('%Y-%m', checked)
),
with_lag AS (
    SELECT svc_name, month, avg_uptime,
        LAG(avg_uptime) OVER (PARTITION BY svc_name ORDER BY month) AS prev_uptime
    FROM monthly
),
trends AS (
    SELECT svc_name, month, avg_uptime, prev_uptime,
        CASE
            WHEN avg_uptime < prev_uptime THEN 'decline'
            WHEN avg_uptime > prev_uptime THEN 'growth'
            ELSE 'flat'
        END AS trend
    FROM with_lag
    WHERE prev_uptime IS NOT NULL
),
numbered AS (
    SELECT svc_name, month, avg_uptime, trend,
        ROW_NUMBER() OVER (PARTITION BY svc_name ORDER BY month) AS rn
    FROM trends
),
groups AS (
    SELECT svc_name, month, avg_uptime, trend,
        rn - ROW_NUMBER() OVER (PARTITION BY svc_name, trend ORDER BY month) AS grp
    FROM numbered
),
streaks AS (
    SELECT svc_name, trend, grp, MIN(month) AS start_month, MAX(month) AS end_month,
        COUNT(*) AS streak_len, MIN(avg_uptime) AS min_uptime, MAX(avg_uptime) AS max_uptime
    FROM groups
    GROUP BY svc_name, trend, grp
),
decline_then_growth AS (
    SELECT d.svc_name, d.start_month AS decline_start, g.start_month AS growth_start,
        CAST((g.max_uptime - d.min_uptime) AS DOUBLE) / d.min_uptime AS growth_ratio
    FROM streaks d
    INNER JOIN streaks g ON d.svc_name = g.svc_name
        AND d.trend = 'decline' AND g.trend = 'growth'
        AND g.start_month > d.end_month
)
SELECT svc_name, decline_start, growth_start, growth_ratio
FROM decline_then_growth
```

> **Cost Analysis**
>
> With ~50M rows, the GROUP BY reduces the working set before any downstream operations; the window function runs on the reduced set after filtering and grouping; the join cost depends on the smaller table's cardinality; CTEs materialize intermediate results, which can be beneficial or costly depending on the engine. An index on the filter/join columns would reduce the scan to a seek.

> **Interviewers Watch For**
>
> Interviewers watch for whether you decompose the problem into named, testable stages rather than nesting everything; whether you reach for window functions or attempt a self-join for row-to-row comparison; how you handle NULL values and whether you account for them in filters and aggregations.

> **Common Pitfall**
>
> Forgetting to filter NULLs creates phantom groups or inflated counts. Always check `null_fraction` in the schema before assuming columns are clean.

---

## Common follow-up questions

- The `latency` column in `svc_health` has a 1% null rate. How does your query handle rows where `latency` is NULL, and could that silently change the result count? _(Tests whether the candidate accounts for NULLs in `svc_health.latency` and understands how aggregates skip NULL values.)_
- `svc_health.check_id` has roughly 50,000,000 distinct values. What index strategy would you use to avoid a full scan on `svc_health`? _(Tests indexing knowledge specific to the high-cardinality `check_id` column in `svc_health`.)_
- Your query uses LAG to compare adjacent rows. What happens if the partition has only one row, and how would you handle that? _(Tests edge-case handling when LAG returns NULL for partitions with insufficient rows.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/service_uptime_turnaround)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.