# The Merge Counter

> How many builds came from merges?

Canonical URL: <https://datadriven.io/problems/the_merge_counter>

Domain: SQL · Difficulty: easy · Seniority: L4

## Problem

The platform team wants to compare automated build activity on main against the release branch for 2026. Treat any build whose trigger isn't 'manual' as a merge build. For each branch, show how many merge builds happened, expressed both as a whole number and as a decimal, plus the average build duration.

## Worked solution and explanation

### The mental model

Platform teams use this to probe whether you reach for the idiomatic shape or the literal one. `UNION` works, but a colleague reading your code at 2am wants `GROUP BY branch` with `branch IN (...)` because it makes the categorical split obvious. The interviewer is checking whether you pick the more reviewable answer when both are correct.

---

### The three traps

#### Step 1: Trap 1: NULL versus the != filter

`trigger != 'manual'` silently drops rows where `trigger` is NULL. In three-valued logic `NULL != 'manual'` evaluates to NULL, which the WHERE clause treats as false. If the `ci_builds` data dictionary marks `trigger` as non-nullable you are fine, but on most real CI feeds the field is populated by an upstream webhook and the occasional NULL slips through. To count those as merge builds, write `(trigger IS NULL OR trigger != 'manual')`. Either choice is defensible; making it on purpose is the point.

#### Step 2: Trap 2: integer division and the CAST

The prompt asks for the count `both as a whole number and as a decimal`, which is unusual phrasing. The expected query satisfies it with `CAST(COUNT(*) AS REAL)`. The real lesson sits one step downstream: if you later divide one count by another to compute a merge ratio, SQLite and a few other engines will perform integer division on two `COUNT(*)` operands and silently return 0 for any ratio under 1. Casting at least one operand to REAL early is how you avoid the bug before it appears.

#### Step 3: Trap 3: strftime breaks partition pruning

`strftime('%Y', built_at) = '2026'` wraps the column in a function, which defeats any index or partition prune on `built_at`. On 2,000,000 rows the engine reads every row, formats every timestamp, and compares every string. A range predicate, `built_at >= '2026-01-01' AND built_at < '{{YEAR_NEXT}}-01-01'`, is sargable, plays well with partitioned CI logs, and reads exactly the year you want.

---

### The solution

**Branch-by-branch via UNION**

```sql
SELECT 'main' AS branch,
       COUNT(*) AS build_count,
       CAST(COUNT(*) AS REAL) AS build_count_decimal,
       AVG(dur_secs) AS avg_duration
FROM ci_builds
WHERE trigger != 'manual'
  AND strftime('%Y', built_at) = '2026'
  AND branch = 'main'
UNION
SELECT 'release' AS branch,
       COUNT(*) AS build_count,
       CAST(COUNT(*) AS REAL) AS build_count_decimal,
       AVG(dur_secs) AS avg_duration
FROM ci_builds
WHERE trigger != 'manual'
  AND strftime('%Y', built_at) = '2026'
  AND branch = 'release'
```

> **Two scans versus one**
>
> Two million `ci_builds` rows is moderate, not heavy. The UNION form runs the filter-and-aggregate twice and relies on the planner to notice the scans overlap; many planners will not. The single `GROUP BY` form runs the work once. With `built_at` partitioning and an index on `branch`, either shape finishes in milliseconds. Without those, the UNION pays the scan cost twice and the difference is real. Counting on optimizer cleverness to fix a clarity problem is risky in code review and worse on a whiteboard.

> **What to clarify with the interviewer**
>
> Three questions a senior candidate raises before writing SQL. First, `both as a whole number and as a decimal` is unusual phrasing, is the decimal column feeding a downstream ratio computation or just display? Second, is `branch = 'release'` a literal value or a family like `release/1.2`, `release/1.3`? If it is a family, you need `LIKE 'release/%'` or a prefix match. Third, should builds on feature branches count at all, or is this strictly a `main` versus `release` comparison? Asking surfaces requirements that the prompt assumes you will infer.

> **UNION versus UNION ALL**
>
> Writing `UNION` here instead of `UNION ALL` is harmless for correctness because the two rows are guaranteed distinct (the literal `'main'` and `'release'` never collide), but `UNION` still asks the engine to do a deduplication pass. On large result sets that pass is expensive. Default to `UNION ALL` and only reach for `UNION` when you actually need duplicate elimination.

> **The single GROUP BY rewrite**
>
> Half the length, one scan, and the branch list lives in one place. Adding `'staging'` is a one-character edit instead of a copy-paste of an entire SELECT.

**Same answer, one pass**

```sql
SELECT branch,
       COUNT(*) AS build_count,
       CAST(COUNT(*) AS REAL) AS build_count_decimal,
       AVG(dur_secs) AS avg_duration
FROM ci_builds
WHERE trigger != 'manual'
  AND strftime('%Y', built_at) = '2026'
  AND branch IN ('main', 'release')
GROUP BY branch
```

---

## Common follow-up questions

- Why does the expected query CAST(COUNT(*) AS REAL) when COUNT already returns a number? _(Forces you to articulate that the cast is defensive: it makes the second column safe for downstream ratio math even though COUNT itself already returns a number. Without the cast, dividing two counts in SQLite integer-divides to 0.)_
- Rewrite this with GROUP BY in a single pass. Is the result identical? _(Tests whether you can collapse the UNION into `GROUP BY branch` with `branch IN ('main', 'release')`. The row values are identical; only the shape and the scan count change.)_
- What if a build has NULL trigger? Does it count as a merge? _(Surfaces three-valued logic. `NULL != 'manual'` is NULL, so the build is dropped from both the merge count and the non-merge count. The fix is `(trigger IS NULL OR trigger != 'manual')` if NULLs should count as merges.)_
- The platform team adds a third branch `staging`. How many edits to your query? _(The UNION form needs a third copy-pasted SELECT. The GROUP BY form needs one literal added to the IN list. This is the cleanest argument for the GROUP BY shape.)_
- AVG(dur_secs) treats NULL durations how? What if 5 percent of builds have NULL duration? _(AVG skips NULLs entirely (it is AVG over non-null values, not AVG over all rows). With 5 percent NULL durations you understate or overstate depending on whether the missing rows are short or long builds. The honest fix is to filter to non-null durations explicitly or use COALESCE if you have a defensible default.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_merge_counter)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.