# First Contact

> Every pipeline has a first run. This is what it brought back.

Canonical URL: <https://datadriven.io/problems/first_run_row_count>

Domain: SQL · Difficulty: easy · Seniority: L4

## Problem

The data platform team wants to see how many rows each batch job processed on its very first run, as a baseline for measuring throughput improvements over time. Show the job name and rows done from that first execution.

## Worked solution and explanation

### Why this problem exists in real interviews

Interviewers use the `batch_jobs` table here to probe row numbering within partitions combined with a nested subquery. The table holds a row per run, with columns such as `job_name`, `status`, and `rows_done`, while the answer needs one row per job, so candidates have to settle the output grain before writing any query logic.

---

### Break down the requirements

#### Step 1: Partition by `job_name`

`PARTITION BY job_name` puts all runs of the same job into one partition. Within each partition, `ORDER BY started ASC` sorts the runs from earliest to latest, so the first run receives row number 1.
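
To make the numbering concrete, here is a minimal sketch that runs the window function over a few made-up rows. The inline `VALUES` data and the column types are assumptions for illustration only; the real `batch_jobs` schema is not shown in the problem.

```sql
-- Hypothetical sample rows standing in for batch_jobs (types are assumed).
WITH batch_jobs (job_name, started, rows_done) AS (
    VALUES
        ('load_orders', TIMESTAMP '2024-01-02 01:00', 1500),
        ('load_orders', TIMESTAMP '2024-01-01 01:00', 1200),
        ('load_users',  TIMESTAMP '2024-01-03 04:00',  300),
        ('load_users',  TIMESTAMP '2024-01-05 04:00',  450)
)
SELECT job_name,
       started,
       rows_done,
       -- Numbering restarts at 1 for every job_name partition.
       ROW_NUMBER() OVER (PARTITION BY job_name ORDER BY started ASC) AS rnk
FROM batch_jobs
ORDER BY job_name, rnk;
```

With this sample data, the 2024-01-01 run of `load_orders` and the 2024-01-03 run of `load_users` each get `rnk = 1`; the later runs get `rnk = 2`.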

#### Step 2: Filter to rank 1

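`WHERE rnk = 1` in the outer query keeps only the first run of each job. The filter has to live in an outer query because a window function cannot be referenced in the `WHERE` clause of the same `SELECT` that computes it.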

---

### The solution

**ROW_NUMBER to pick each job's first run**

```sql
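-- Return only the two columns the prompt asks for.
SELECT job_name, rows_done
FROM (
    SELECT job_name,
           rows_done,
           -- Number each job's runs from earliest to latest start time.
           ROW_NUMBER() OVER (PARTITION BY job_name ORDER BY started ASC) AS rnk
    FROM batch_jobs
) ranked
WHERE rnk = 1  -- the first run of each job
ORDER BY job_name;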
```

> **Cost Analysis**
>
> The window function sorts rows within each `job_name` partition. An index on `(job_name, started)` can supply rows already in that order, which lets the planner skip the explicit sort.
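
One way to check the indexing claim is to create the index and inspect the plan. The sketch below assumes Postgres and uses a hypothetical stand-in table, `batch_jobs_demo`, with guessed column types and a made-up index name; none of these come from the problem statement.

```sql
-- Hypothetical stand-in for batch_jobs; column types are assumptions.
CREATE TEMP TABLE batch_jobs_demo (
    job_name  text,
    status    text,
    rows_done bigint,
    started   timestamp,
    ended     timestamp
);

-- Index columns match the window's PARTITION BY and ORDER BY, in that order.
CREATE INDEX idx_batch_jobs_demo_job_started
    ON batch_jobs_demo (job_name, started);

-- Inspect the plan the optimizer picks for the first-run query.
EXPLAIN
SELECT job_name, rows_done
FROM (
    SELECT job_name,
           rows_done,
           ROW_NUMBER() OVER (PARTITION BY job_name ORDER BY started ASC) AS rnk
    FROM batch_jobs_demo
) ranked
WHERE rnk = 1;
```

On an empty or very small table the planner will likely still choose a sequential scan plus sort, so the index tends to show up in the plan only once the table is large enough for it to pay off.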

> **Interviewers Watch For**
>
> The interviewer checks whether you choose `ROW_NUMBER` (exactly one row per job, even when start times tie) or `RANK`/`DENSE_RANK` (every tied row) based on the prompt requirements.
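
The difference between the three ranking functions only appears when start times tie. A small sketch over made-up rows, where two runs of the same job share a `started` value (the data and types are assumptions for illustration):

```sql
-- Hypothetical sample rows; the first two share the same start time.
WITH batch_jobs (job_name, started, rows_done) AS (
    VALUES
        ('load_orders', TIMESTAMP '2024-01-01 01:00', 1200),
        ('load_orders', TIMESTAMP '2024-01-01 01:00', 1250),
        ('load_orders', TIMESTAMP '2024-01-02 01:00', 1500)
)
SELECT job_name,
       started,
       rows_done,
       ROW_NUMBER() OVER w AS row_num,    -- always unique within the partition
       RANK()       OVER w AS rnk,        -- ties share a rank, then a gap follows
       DENSE_RANK() OVER w AS dense_rnk   -- ties share a rank, no gap
FROM batch_jobs
WINDOW w AS (PARTITION BY job_name ORDER BY started ASC)
ORDER BY started, row_num;
```

Both tied rows get `rnk = 1` and `dense_rnk = 1`, so filtering on either would return two "first" runs. `ROW_NUMBER` assigns them 1 and 2, but which tied row gets 1 is arbitrary unless a tiebreaker column is added to the `ORDER BY`.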

> **Common Pitfall**
>
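> Using `GROUP BY job_name` with `MIN(started)` returns the earliest start time but not `rows_done` from that run; getting it requires joining the result back to the table. `ROW_NUMBER` exposes the full row directly.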
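
For comparison, a sketch of the join-back variant that the `GROUP BY` approach needs. The sample rows are made up and the `first_runs` CTE name is hypothetical:

```sql
-- Hypothetical sample rows standing in for batch_jobs (types are assumed).
WITH batch_jobs (job_name, started, rows_done) AS (
    VALUES
        ('load_orders', TIMESTAMP '2024-01-02 01:00', 1500),
        ('load_orders', TIMESTAMP '2024-01-01 01:00', 1200),
        ('load_users',  TIMESTAMP '2024-01-03 04:00',  300),
        ('load_users',  TIMESTAMP '2024-01-05 04:00',  450)
),
first_runs AS (
    -- The aggregate yields only the earliest timestamp per job.
    SELECT job_name, MIN(started) AS first_started
    FROM batch_jobs
    GROUP BY job_name
)
-- Join back to recover rows_done from the matching run.
SELECT b.job_name, b.rows_done
FROM batch_jobs AS b
JOIN first_runs AS f
  ON f.job_name = b.job_name
 AND f.first_started = b.started
ORDER BY b.job_name;
```

With this data the query returns 1200 for `load_orders` and 300 for `load_users`. Unlike `ROW_NUMBER`, this form returns more than one row per job if several runs share the earliest start time.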

---

## Common follow-up questions

- The `ended` column in `batch_jobs` has roughly 2% NULLs. How does your query handle those rows, and would the result change if NULLs were replaced with zeros? _(Tests whether the candidate understands how NULLs propagate through aggregation functions and whether their WHERE/JOIN conditions implicitly filter them out.)_
- Your window function uses a default frame. What is the implicit frame, and would switching to ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW change anything? _(Tests knowledge of default window frames (RANGE vs ROWS) and when the distinction matters.)_
- The `job_name` column in `batch_jobs` follows a Zipf distribution, meaning a few values dominate. How does that skew affect your query plan and parallelism? _(Tests understanding of data skew: the optimizer may choose a bad plan when histogram statistics are stale.)_
- If the business definition of `status` changed mid-quarter (e.g., a status value was renamed), how would you handle historical consistency? _(Tests awareness of slowly changing dimensions and backward-compatible query design.)_
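
For the window-frame question above, a short sketch can help show where the frame matters. `ROW_NUMBER` does not depend on the frame, so the example uses a running `SUM` instead, over made-up rows with a tied start time (data and types are assumptions):

```sql
-- Hypothetical sample rows; the first two share the same start time.
WITH batch_jobs (job_name, started, rows_done) AS (
    VALUES
        ('load_orders', TIMESTAMP '2024-01-01 01:00', 100),
        ('load_orders', TIMESTAMP '2024-01-01 01:00', 200),
        ('load_orders', TIMESTAMP '2024-01-02 01:00', 300)
)
SELECT started,
       rows_done,
       -- Default frame with ORDER BY: RANGE from partition start to the
       -- current row, which includes every peer with the same started value.
       SUM(rows_done) OVER (PARTITION BY job_name ORDER BY started) AS sum_default_range,
       -- ROWS frame: stops at the current physical row, peers are not pulled in.
       SUM(rows_done) OVER (PARTITION BY job_name ORDER BY started
                            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sum_rows
FROM batch_jobs;
```

Under the default frame both tied rows report 300, because each sees its peer. Under the `ROWS` frame the tied rows report different running totals, and which row comes first is arbitrary.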

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/first_run_row_count)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.