# Minimum Parallel Workers

> Too few workers and it stalls.

Canonical URL: <https://datadriven.io/problems/minimum_parallel_workers>

Domain: SQL · Difficulty: hard · Seniority: L5

## Problem

Determine the minimum number of parallel workers required to run all batch jobs without conflicts. Each job has a start and end timestamp and can overlap with others. Duplicate job entries should be counted once, and jobs missing start or end times should be excluded. Find the peak number of concurrently running jobs at any point.

## Worked solution and explanation

### Why this problem exists in real interviews

This focuses on CTE composition and set union within batch_jobs, specifically around the job_name column. Interviewers present it in senior-level rounds because the edge cases around NULL values and boundary conditions reveal depth of understanding.

> **Trick to Solving**
>
> NULL propagation silently corrupts aggregates and joins. The trick is explicit handling.
> 
> 1. Identify which columns can be NULL (check the schema)
> 2. Use `COALESCE(col, 0)` for numeric defaults or `LEFT JOIN` to preserve unmatched rows
> 3. Verify that your WHERE clause does not accidentally filter out NULLs

---

### Break down the requirements

#### Step 1: Structure the query as multi-step CTEs

This solution uses 2 CTEs to break the logic into readable stages. Each CTE produces an intermediate result that feeds the next.

#### Step 2: Filter to the target rows

Filter for NULL or non-NULL values in the `WHERE` clause. This must happen before aggregation to avoid corrupted results.

#### Step 3: Apply the window function

Use the window function with the correct `PARTITION BY` and `ORDER BY` to compute the required metric per group.

#### Step 4: Order the final output

Apply `ORDER BY` as specified to produce the expected row sequence. When tied values exist, add a secondary sort column for determinism.

---

### The solution

**Event-sweep algorithm for peak concurrency**

```sql
WITH events AS (
    SELECT started AS event_time, 1 AS delta
    FROM (
        SELECT DISTINCT job_id, started, ended
        FROM batch_jobs
        WHERE started IS NOT NULL AND ended IS NOT NULL
    )
    UNION ALL
    SELECT ended AS event_time, -1 AS delta
    FROM (
        SELECT DISTINCT job_id, started, ended
        FROM batch_jobs
        WHERE started IS NOT NULL AND ended IS NOT NULL
    )
),
running AS (
    SELECT event_time,
        SUM(delta) OVER (ORDER BY event_time, delta DESC) AS concurrent
    FROM events
)
SELECT MAX(concurrent) AS min_workers
FROM running
```

> **Cost Analysis**
>
> The query scans 1M rows from `batch_jobs`. CTEs in most engines are optimization fences. For production workloads, consider inlining or materializing the intermediate results.

> **Interviewers Watch For**
>
> Explicitly mentioning NULL handling before being asked signals production awareness. Naming the output grain ("one row per X") before writing the GROUP BY shows you think about data shape, not just syntax.

> **Common Pitfall**
>
> NULL values are silently excluded from `COUNT(column)` but included in `COUNT(*)`. Mixing these up produces incorrect totals.

---

## Common follow-up questions

- What happens to your result if batch_jobs.ended contains NULLs for some rows? _(Tests whether the candidate accounts for NULL behavior in aggregates and comparisons on ended.)_
- How would you verify that your aggregation on batch_jobs.job_id is not double-counting due to duplicate rows? _(Tests data quality awareness and deduplication strategies.)_
- With millions of distinct values in batch_jobs.job_id, what index strategy would you use to keep this query performant? _(Tests indexing knowledge specific to high-cardinality columns like job_id.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/minimum_parallel_workers)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.