# Busiest Pipeline Month

> One month, more pipeline runs than any other.

Canonical URL: <https://datadriven.io/problems/busiest_pipeline_month>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

The data platform team is planning maintenance windows and wants to avoid the busiest month. Which month of the year had the highest number of pipeline executions? Show the month number and the count.

## Worked solution and explanation

### Why this problem exists in real interviews

Maintenance windows on a data platform are scheduled around traffic, and traffic is almost never uniform. The platform team needs the calendar month with the most pipeline executions so they can avoid scheduling downtime then. Interviewers use this as a quick read on whether you can extract a date part, aggregate at that grain, and pick the top group without overthinking it. The prompt deliberately says "month of the year" (1-12), not year-month. Candidates who answer with `'%Y-%m'` are answering a different question.

> **Trick to Solving**
>
> Three things to verify before writing SQL: (1) the date part the prompt actually wants (`'%m'` for month-of-year, not `'%Y-%m'`), (2) the aggregate (`COUNT(*)` of executions, not `SUM(rows_in)` or `MAX(...)`), and (3) the tiebreaker if two months tie on count. Get the grain right and the query is four lines.
> 
> 1. Extract month from `start_at`
> 2. `COUNT(*)` per month
> 3. Order DESC and `LIMIT 1`

---

### Break down the requirements

#### Step 1: Extract month-of-year, not year-month

`start_at` is a TIMESTAMP/TEXT column on `data_pipes`. `STRFTIME('%m', start_at)` returns a two-digit string '01'-'12'. The prompt asks for the month number, so this format is exactly right. Using `'%Y-%m'` would aggregate March 2023 separately from March 2024 and answer a different question.

#### Step 2: Aggregate with COUNT(*) at the month grain

Each row in `data_pipes` represents one pipeline execution, so `COUNT(*)` is the right aggregate. `SUM(rows_in)` would answer "which month moved the most data" - related, but not what the prompt asks. The team is planning maintenance windows, and a window blocks executions, not bytes.

#### Step 3: Handle the 1% NULLs in start_at

`start_at` has a 1% NULL rate. SQLite's `STRFTIME('%m', NULL)` returns NULL, and `GROUP BY` collects all NULLs into a single group. That group will appear in the output unless you filter it out. For this prompt, drop NULLs in a WHERE clause: a NULL `start_at` is a row whose month you don't know, and including it would give a meaningless answer.

#### Step 4: Pick the top group, decide on ties

`ORDER BY pipeline_runs DESC LIMIT 1` returns the busiest month. If two months tie at the top, this returns one of them non-deterministically. In a real interview, ask whether ties matter; if yes, switch to `RANK() OVER (ORDER BY COUNT(*) DESC) = 1` and return all top-ranked rows.

---

### The solution

**Month-of-year with the most pipeline executions**

```sql
SELECT
    STRFTIME('%m', start_at) AS month,
    COUNT(*) AS pipeline_runs
FROM data_pipes
WHERE start_at IS NOT NULL
GROUP BY STRFTIME('%m', start_at)
ORDER BY pipeline_runs DESC
LIMIT 1
```

**Equivalent forms**

```sql
/* Tie-safe: return all months tied for the top */
SELECT month, pipeline_runs
FROM (
  SELECT STRFTIME('%m', start_at) AS month,
         COUNT(*) AS pipeline_runs,
         RANK() OVER (ORDER BY COUNT(*) DESC) AS rk
  FROM data_pipes
  WHERE start_at IS NOT NULL
  GROUP BY STRFTIME('%m', start_at)
)
WHERE rk = 1;

/* ANSI-portable form */
SELECT EXTRACT(MONTH FROM start_at) AS month,
       COUNT(*) AS pipeline_runs
FROM data_pipes
WHERE start_at IS NOT NULL
GROUP BY EXTRACT(MONTH FROM start_at)
ORDER BY pipeline_runs DESC
LIMIT 1;
```

> **Cost Analysis**
>
> `data_pipes` is 80K rows / ~30 MB - small enough that a sequential scan plus a hash aggregate over 12 groups is essentially free. The interesting cost question is what happens at 8B rows. There the scan dominates, and `STRFTIME('%m', start_at)` blocks any plain btree on `start_at` from being used. Mitigations: (a) a generated/computed column `month_of_year SMALLINT` with a btree, (b) range partitioning by `start_at` so the planner can prune to one partition per month, or (c) a pre-aggregated rollup table refreshed daily. Expected plan today: Seq Scan -> Filter (NOT NULL) -> HashAggregate -> Sort Top-N.

> **Interviewers Watch For**
>
> Strong candidates ask "month of the year, or year-month?" before writing anything, and explicitly handle the NULL `start_at` rows. They also call out that the answer can tie and either confirm with the interviewer that LIMIT 1 is fine or switch to RANK. Weak candidates jump to `'%Y-%m'` without reading the prompt, then have to rewrite the query when the interviewer asks for the month number.

> **Common Pitfall**
>
> The silent failure here is using `STRFTIME('%Y-%m', ...)` and answering "which year-month had the most runs." That returns one specific month in one specific year (e.g., '2023-05'), which is not what "month of the year" means. The other pitfall is forgetting that `STRFTIME('%m', NULL)` is NULL, so a NULL group will sit in your GROUP BY result unless you filter it out.

---

## Common follow-up questions

- The platform team comes back and says they actually care about which month moves the most data, not which has the most runs. How does your query change? _(Tests whether the candidate distinguishes execution count from data-volume. The maintenance-window framing in the prompt makes COUNT correct, but a real platform team often cares about both. The right answer is two queries, or one query with both metrics in the SELECT.)_
- If `start_at` is stored as UTC but the team is in PT, a run at 2025-03-01 02:00 UTC is still February locally. How does your query handle that, and where would you push the timezone conversion? _(Tests time-zone awareness. `start_at` stored as UTC TEXT vs. local TEXT changes which month a midnight-adjacent run falls into. Production answer: store UTC, convert at read time with `strftime` over an offset, document the convention.)_
- If `data_pipes` grew to 8 billion rows, what changes about your plan? How big does the hash aggregate state get? _(Tests cardinality intuition. The output is at most 12 rows regardless of input size. Strong candidates note that the cost is dominated by the scan and the hash table is trivial.)_
- If the team only cares about successful runs, where in the query does the status filter belong, and does that change the busiest month? _(Tests whether the candidate sees that `status` is a column with skew (most rows are 'success'). Filtering on status before the GROUP BY changes the answer; the prompt didn't ask for that filter, but a follow-up often does.)_
- Rewrite the query to return all months tied for the top, then explain why you'd use RANK instead of ROW_NUMBER here. _(Tests RANK vs DENSE_RANK vs ROW_NUMBER. Real interviews probe whether you reach for the right window function when ties matter.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/busiest_pipeline_month)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.