# Top 100 Batch Jobs Total Output

> The hundred biggest jobs. Combined output.

Canonical URL: <https://datadriven.io/problems/top_100_batch_jobs_total_output>

Domain: SQL · Difficulty: easy · Seniority: L4

## Problem

Sum the rows processed by the top 100 batch jobs ranked by rows completed. If jobs are tied, they share the same rank. Return that single total.

## Worked solution and explanation

### Why this problem exists in real interviews

Three ranking functions all look like they answer this prompt and only one does. `ROW_NUMBER` caps at 100 rows but breaks ties arbitrarily. `RANK` honors ties but skips numbers, so `rnk <= 100` may return fewer distinct `rows_done` levels. The prompt's tie rule fingerprints `DENSE_RANK`.

---

### Break down the requirements

#### Step 1: Rank, do not sort

Sort alone cannot express the tie rule. You need a window function so two jobs with equal `rows_done` share a rank value and the next distinct value moves to the next integer.

#### Step 2: Filter on dense_rank

`DENSE_RANK() OVER (ORDER BY rows_done DESC)` then `WHERE rnk <= 100` selects the top 100 distinct `rows_done` levels. Output row count is unbounded; it scales with how many jobs share those values.

#### Step 3: Sum what survives

Sum `rows_done` over every row that passes the rank filter. One scalar out. No `GROUP BY` needed since the outer query is a single aggregate.

---

### The solution

**TOP 100 BATCH JOB TOTAL**

```sql
SELECT SUM(rows_done) AS total_rows
FROM (
  SELECT rows_done,
         DENSE_RANK() OVER (ORDER BY rows_done DESC) AS rnk
  FROM batch_jobs
)
WHERE rnk <= 100
```

> **Cost Analysis**
>
> At 400k rows the window must materialize a sort by `rows_done DESC`. No `PARTITION BY` to parallelize across, so it is single-pass over the table. An index on `rows_done DESC` lets some engines skip the sort entirely.

> **Interviewers Watch For**
>
> Whether you ask which ranking flavor matches the tie rule before typing. `ROW_NUMBER` drops tied jobs silently. `RANK` skips numbers after every tie, so the top 100 by `RANK` may cover fewer distinct values than the prompt requires.

> **Common Pitfall**
>
> `SELECT SUM(rows_done) FROM (SELECT rows_done FROM batch_jobs ORDER BY rows_done DESC LIMIT 100)`. Returns exactly 100 rows with arbitrary tie-breaking. The prompt wants ties included, so this answer is wrong on any data with a popular `rows_done` value.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- Restrict the ranking to jobs where `status` is `succeeded`. _(Tests whether you place the filter inside the subquery so it applies before ranking, not after.)_
- Return the top 100 `rows_done` values per `priority` and sum each bucket. _(Probes `PARTITION BY priority` inside the window and group-level aggregation.)_
- How would the answer differ if we used `RANK()` instead of `DENSE_RANK()`? _(Checks understanding of rank-skipping behavior and how it shrinks the surviving row set.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/top_100_batch_jobs_total_output)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.