# First Half of Page Views

> Half the data. The first half.

Canonical URL: <https://datadriven.io/problems/first_half_of_page_views>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

Retrieve the first half of all page view records based on the total row count. Include each record's row number alongside all view details (view ID, page URL, user ID, referrer, duration in milliseconds, device, and viewed-at timestamp). Results should be ordered by view ID.

## Worked solution and explanation

### Why this problem exists in real interviews

Slicing the lower half of a 600M-row table by `view_id` order forces you to think about two things at once: a stable global ordering that survives partitioning, and a bound (`COUNT(*) / 2`) that has to come from somewhere. Both pieces touch the whole table. There is no way to fake it with a `LIMIT`.

---

### Break down the requirements

#### Step 1: Establish a deterministic rank

`ROW_NUMBER() OVER (ORDER BY view_id)` numbers every row globally. Plain `LIMIT n/2` is illegal in most engines when `n` is itself a subquery, and would not preserve a stable position alongside other columns.

#### Step 2: Derive the cutoff

`(SELECT COUNT(*) / 2 FROM page_views)` is a scalar subquery evaluated once. Integer division drops the middle row on odd totals, which is the implicit contract of `first half`.

#### Step 3: Filter and re-order

`WHERE rn <= cutoff` keeps the bottom half. The outer `ORDER BY view_id` is non-negotiable because the window's `ORDER BY` does not bind the result set's order after a filter.

---

### The solution

**FIRST HALF BY ROW NUMBER**

```sql
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (ORDER BY view_id) AS rn
    FROM page_views
) sub
WHERE rn <= (SELECT COUNT(*) / 2 FROM page_views)
ORDER BY view_id
```

> **Cost Analysis**
>
> Two full passes: the `ROW_NUMBER` window sorts 600M rows globally (partition pruning on `viewed_at` does not help, `view_id` cuts across partitions), and the scalar `COUNT(*)` is a second scan. On columnar engines the count is cheap from metadata; on row stores it is not.

> **Interviewers Watch For**
>
> Re-stating the outer `ORDER BY`. People assume the window's `ORDER BY view_id` carries through; once `WHERE rn <= k` runs, the engine is free to return rows in any order. The trailing sort is what makes the output reproducible.

> **Common Pitfall**
>
> Putting the `ROW_NUMBER` filter in the same `SELECT` where it is computed. `WHERE rn <= ...` cannot reference an alias defined in the same query level; you need the subquery or a CTE. Trying `QUALIFY` only works on Snowflake, Teradata, and BigQuery.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- If `view_id` has gaps from deleted rows, does `first half` still mean what the business wants? _(Probes whether you distinguish ordinal position from identifier value.)_
- How would you return the median row when the total count is odd? _(Checks understanding of integer division and the off-by-one in `n/2` versus `(n+1)/2`.)_
- Rewrite this so the engine only scans `page_views` once. _(Tests knowledge of windowed aggregates like `COUNT(*) OVER ()` to fold both passes into one.)_

> **One Pass Trick**
>
> `COUNT(*) OVER ()` inside the subquery gives you the total alongside `rn`, eliminating the scalar subquery. The window still sorts everything, but you avoid a second scan of the 600M-row table.

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/first_half_of_page_views)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.