# Slow Batch Jobs

> Promised by noon. Delivered at midnight.

Canonical URL: <https://datadriven.io/problems/slow_batch_jobs>

Domain: SQL · Difficulty: easy · Seniority: L5

## Problem

Find every batch job where the ended timestamp is later than the started timestamp (meaning it actually completed). For each, compute how many minutes elapsed between start and end. Return job_id, job_name, and minutes_elapsed.

## Worked solution and explanation

### Why this problem exists in real interviews

This problem targets filtering and projection across the `batch_jobs` table. You need to work with the `ended` column to satisfy the requirements.

---

### Break down the requirements

#### Step 1: Filter out null values

Exclude rows where `ended` is NULL. This prevents nulls from polluting aggregations or creating phantom groups.

#### Step 2: Select the target columns

The SELECT clause picks exactly the columns the prompt asks for. Returning extra columns or missing a required alias would fail the grading check.

---

### The solution

**Filter out null values to find slow batch jobs**

```sql
SELECT job_id, job_name, CAST((julianday(ended) - julianday(started)) * 1440 AS INTEGER) AS minutes_elapsed
FROM batch_jobs
WHERE ended IS NOT NULL AND ended > started
```

> **Cost Analysis**
>
> With ~300K rows, the query performs a single sequential scan. An index on the filter/join columns would reduce the scan to a seek.

> **Interviewers Watch For**
>
> Interviewers watch for how you handle NULL values and whether you account for them in filters and aggregations; how you handle date arithmetic and whether you account for edge cases like month boundaries.

> **Common Pitfall**
>
> Forgetting to filter NULLs creates phantom groups or inflated counts. Always check `null_fraction` in the schema before assuming columns are clean.

---

## Common follow-up questions

- What result would you get if every value in `batch_jobs.ended` were NULL? Would your query return an empty set or something unexpected? _(Tests extreme NULL scenarios and whether the candidate guards against edge cases in `ended`.)_
- If `batch_jobs` grew to contain billions of rows, which part of your query would become the bottleneck given the cardinality of `ended`? _(Tests ability to identify performance hotspots related to `batch_jobs.ended` at scale.)_
- How would you modify this query if the business logic required grouping by both `job_id` and `job_name` instead of just one? _(Tests ability to adapt the query structure to changing requirements.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/slow_batch_jobs)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.