# Top Batch Job Under Priority 1

> Priority one. Top performer.

Canonical URL: <https://datadriven.io/problems/top_batch_job_under_priority_1>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

The data platform team is benchmarking throughput for the highest-priority batch jobs. Among priority-1 jobs, which one processed the most rows? If multiple jobs tie for the top value, include all of them.

## Worked solution and explanation

### Why this problem exists in real interviews

This problem probes filtering `batch_jobs` down to its top rows while keeping ties. The key signal is whether the candidate recognizes that the grain of the ranking must match the grain of the output.

---

### Break down the requirements

#### Step 1: Filter to priority 1

`WHERE priority = 1` restricts `batch_jobs` to the highest-priority jobs before any comparison is made.

#### Step 2: Keep every job at the maximum

Compute `MAX(rows_done)` over the priority-1 rows and keep every row whose `rows_done` equals it. A plain `ORDER BY ... DESC` with `LIMIT 1` returns a single row and drops tied jobs.
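
To see how ordering interacts with ties, here is a minimal sketch on made-up rows. The `sample_jobs` CTE and its values are hypothetical, not the real table contents; it assumes Postgres 13 or later, where `FETCH FIRST ... WITH TIES` is available.

```sql
-- Hypothetical sample data: two priority-1 jobs tie at 5000 rows,
-- and one priority-1 job has a NULL rows_done.
WITH sample_jobs (job_name, priority, rows_done) AS (
    VALUES
        ('load_orders',   1, 5000),
        ('load_users',    1, 5000),
        ('load_events',   1, 1200),
        ('backfill_dims', 1, NULL),
        ('archive_logs',  2, 9000)
)
SELECT
    job_name,
    rows_done
FROM sample_jobs
WHERE priority = 1
-- NULLS LAST matters: Postgres sorts NULLs first under DESC by default.
ORDER BY rows_done DESC NULLS LAST
-- WITH TIES keeps every row that ties with the last row returned,
-- so both 5000-row jobs come back; LIMIT 1 would return only one of them.
FETCH FIRST 1 ROW WITH TIES;
```

The priority-2 job has the largest `rows_done` overall but is excluded by the filter, which is why the filter has to run before the ranking.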

---

### The solution

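**Filter `batch_jobs` for priority 1, then find the max `rows_done` with tie inclusion**

```sql
-- Assumes one row per job in batch_jobs, with priority and rows_done columns.
SELECT
    job_name,
    rows_done
FROM batch_jobs
WHERE priority = 1
  -- Equality against the priority-1 maximum keeps every tied job.
  AND rows_done = (
      SELECT MAX(rows_done)  -- MAX skips NULL rows_done
      FROM batch_jobs
      WHERE priority = 1
  )
```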
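
Another way to keep ties is to rank the priority-1 rows with a window function and keep rank 1. The following is a sketch under the same schema assumptions (`job_name`, `priority`, and `rows_done` columns on `batch_jobs`); the alias names `rnk` and `ranked` are illustrative.

```sql
SELECT
    job_name,
    rows_done
FROM (
    SELECT
        job_name,
        rows_done,
        -- RANK assigns the same rank to equal rows_done values,
        -- so every job tied for the maximum gets rank 1.
        RANK() OVER (ORDER BY rows_done DESC) AS rnk
    FROM batch_jobs
    WHERE priority = 1
      -- Exclude NULLs so they cannot sort to the top under DESC.
      AND rows_done IS NOT NULL
) AS ranked
WHERE rnk = 1;
```

`ROW_NUMBER()` would not work here, because it breaks ties arbitrarily and leaves exactly one row at position 1.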

> **Cost Analysis**
>
> Filtering on `priority = 1` narrows the 400K-row `batch_jobs` table before the maximum is computed. An index on `(priority, rows_done)` lets the planner resolve both the `MAX` lookup and the equality match on `rows_done` without scanning the full table.

> **Interviewers Watch For**
>
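> Interviewers verify that the `priority = 1` filter applies both to the rows returned and to the maximum they are compared against, and that all tied jobs come back. Sorting by `rows_done` and taking `LIMIT 1` returns one row even when several jobs share the maximum.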

> **Common Pitfall**
>
> Using the wrong aggregate function. `SUM` gives totals, `COUNT` gives row counts, `AVG` gives averages, and `MAX` gives the largest value. Read the prompt to determine which metric is needed: "processed the most rows" points to `MAX(rows_done)`.

---

## Common follow-up questions

- If rows_done is NULL for some priority-1 jobs, does your MAX ignore them or does it affect the result? _(Tests knowledge that MAX skips NULLs; NULL rows_done will not become the maximum.)_
- Would WHERE priority = 1 AND rows_done = (SELECT MAX(rows_done) ...) handle ties correctly? _(Tests the subquery approach; this returns all rows matching the max, handling ties naturally.)_
- If the prompt changed from 'most rows' to 'best throughput (rows per second)', how would you compute that from started and ended columns? _(Tests derived metric calculation: rows_done / EXTRACT(EPOCH FROM ended - started).)_
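
The throughput follow-up can be sketched as below. This assumes `started` and `ended` are timestamp columns on `batch_jobs`, which the follow-up implies but the problem statement does not confirm; the `rows_per_second` alias is illustrative.

```sql
SELECT
    job_name,
    rows_done,
    -- EXTRACT(EPOCH FROM interval) yields the duration in seconds.
    -- NULLIF turns a zero-length run into NULL instead of a division error.
    rows_done / NULLIF(EXTRACT(EPOCH FROM (ended - started)), 0) AS rows_per_second
FROM batch_jobs
WHERE priority = 1
-- Jobs with NULL rows_done or zero duration sort to the end.
ORDER BY rows_per_second DESC NULLS LAST
-- Keeps every job tied for the best throughput (Postgres 13+).
FETCH FIRST 1 ROW WITH TIES;
```

Jobs that are still running (a NULL `ended`) also produce a NULL throughput here and fall to the end of the ordering.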

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/top_batch_job_under_priority_1)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.