# High Volume Batch Jobs

> Batch jobs that processed millions.

Canonical URL: <https://datadriven.io/problems/high_volume_batch_jobs>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

Surface all batch jobs that processed more than 5000 rows, showing each job's name, priority, and rows processed, ranked from most to fewest.

## Worked solution and explanation

### Why this problem exists in real interviews

Pipeline operations teams routinely ask 'which jobs are pushing the most volume so we know what to throttle, scale, or prioritize'. Interviewers reach for this prompt because it tests whether a candidate keeps it simple: the right answer is a single SELECT with a WHERE filter and an ORDER BY, no aggregation, no CTE, no window function. Candidates who over-engineer with GROUP BY or subqueries reveal that they jump to patterns instead of reading the prompt.

---

### Break down the requirements

#### Step 1: Filter individual jobs by volume

The threshold is per-row: `WHERE rows_done > 5000`. There is no aggregation here, each row in `batch_jobs` already represents one job execution and `rows_done` is the volume for that execution. A GROUP BY on `job_name` would change the grain and silently sum unrelated runs together.

#### Step 2: Project only the asked columns

Return `job_name`, `priority`, and `rows_done`. The other six columns (`job_id`, `status`, `started`, `ended`, `retries`) stay out, since the prompt asks for those three only.

#### Step 3: Order most to fewest

Sort by `rows_done DESC` so the heaviest jobs surface at the top. The prompt does not specify a tie-breaker, so a single ORDER BY column is enough.

---

### The solution

**Filter and rank by volume in one pass**

```sql
SELECT job_name, priority, rows_done
FROM batch_jobs
WHERE rows_done > 5000
ORDER BY rows_done DESC
```

> **Cost Analysis**
>
> `batch_jobs` is small (400K rows, ~205 MB). A full scan with predicate evaluation is the natural plan. If this query ran constantly an index on `(rows_done DESC)` would let the planner range-scan the qualifying tail, but for a 400K table the sequential scan is fine and adds no operational overhead.

> **Interviewers Watch For**
>
> Interviewers want to see that you do not reach for GROUP BY, SUM, MAX, or window functions when none are needed. Strong candidates state explicitly: 'each row is already one job run, so I just filter and sort'. Weak candidates produce something like `SELECT job_name, SUM(rows_done) GROUP BY job_name`, which silently changes the meaning of the result.

> **Common Pitfall**
>
> Writing `WHERE rows_done >= 5000` instead of `> 5000` includes the boundary value and inflates the result set. The prompt says 'more than 5000', which excludes 5000 exactly. Also, applying GROUP BY by `job_name` collapses 50 distinct job names and aggregates `rows_done` across runs, which answers a different question (cumulative throughput per job name, not high-volume runs).

---

## Common follow-up questions

- If product changed the question to 'jobs whose total rows processed across all runs exceeded 5000', how would your query change? _(Tests whether the candidate can pivot from per-row filtering to GROUP BY aggregation. The new query becomes `SELECT job_name, SUM(rows_done) AS total FROM batch_jobs GROUP BY job_name HAVING SUM(rows_done) > 5000`. The threshold moves from WHERE to HAVING because it now references an aggregate.)_
- `status` has 4 distinct values with a single_hot skew. If the team wanted only successful runs, where in the query does that filter belong, and why? _(Tests filter-placement intuition. `status = 'success'` is a row predicate, so it belongs in WHERE alongside `rows_done > 5000`, not in HAVING. Putting non-aggregate filters in HAVING is legal but wastes work and confuses readers.)_
- `priority` has only 5 distinct values with zipf skew. If the team wanted the high-volume jobs grouped by priority bucket, how would you structure that without losing the per-run detail? _(Tests whether the candidate knows when to use window functions. A `SUM(rows_done) OVER (PARTITION BY priority)` keeps every row visible while adding a per-priority total, which preserves the detail the original query exposes.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/high_volume_batch_jobs)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.