# Job Status Duration

> How long in each job state?

Canonical URL: <https://datadriven.io/problems/job_status_duration>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

Our pipeline tracks batch job state transitions: every time a job changes status, the system records the job ID, timestamp, and status (queued, running, completed). Calculate total hours all jobs spent in each status. Duration is the difference between the current status timestamp and the next status change. For each job's final status, assume it lasted 2 hours. Round to 2 decimal places.

## Worked solution and explanation

### Why this problem exists in real interviews

Against the batch_jobs table, lag/lead comparison and grouping on job_name values is the key operation. Interviewers favor this in mid-level screens because it exposes whether candidates handle ties, NULLs, and ordering edge cases correctly.

> **Trick to Solving**
>
> Period-over-period comparison is cleanest with `LAG`.
> 
> 1. Use `LAG(metric) OVER (ORDER BY period)` to get the prior value
> 2. Compute the difference or ratio in the outer query
> 3. Handle the first period where LAG returns NULL

---

### Break down the requirements

#### Step 1: Set up a CTE for the intermediate result

Wrap the first transformation in a `WITH` clause. This names the intermediate result set and keeps the outer query clean.

#### Step 2: Access the next row with LEAD

`LEAD(metric) OVER (ORDER BY ...)` pulls the next row's value into the current row. The last row returns NULL, which must be handled.

#### Step 3: Aggregate with SUM

Group by the output grain and apply `SUM()` to compute the metric. The `GROUP BY` must match exactly what the output needs: one row per group key.

#### Step 4: Order the final output

Apply `ORDER BY` as specified to produce the expected row sequence. When tied values exist, add a secondary sort column for determinism.

---

### The solution

**LEAD-based state duration calculation**

```sql
WITH transitions AS (
    SELECT job_id, status, started,
        LEAD(started) OVER (PARTITION BY job_id ORDER BY started) AS next_started
    FROM batch_jobs
)
SELECT status,
    ROUND(SUM(
        CASE
            WHEN next_started IS NOT NULL
            THEN (JULIANDAY(next_started) - JULIANDAY(started)) * 24
            ELSE 2.0
        END
    ), 2) AS total_hours
FROM transitions
GROUP BY status
```

> **Cost Analysis**
>
> The query scans 600K rows from `batch_jobs`.

> **Interviewers Watch For**
>
> Naming the output grain ("one row per X") before writing the GROUP BY shows you think about data shape, not just syntax. Walking through comparison logic step by step, rather than writing it in one pass, demonstrates structured thinking.

> **Common Pitfall**
>
> Comparing dates stored as TEXT without casting can produce lexicographic instead of chronological ordering. Always confirm the column type.

---

## Common follow-up questions

- What happens to your result if batch_jobs.ended contains NULLs for some rows? _(Tests whether the candidate accounts for NULL behavior in aggregates and comparisons on ended.)_
- What does your LAG or LEAD return for the first or last row in each partition of batch_jobs? _(Tests awareness of NULL defaults at partition boundaries and COALESCE usage.)_
- With millions of distinct values in batch_jobs.job_id, what index strategy would you use to keep this query performant? _(Tests indexing knowledge specific to high-cardinality columns like job_id.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/job_status_duration)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.