# A New Column on a Billion Rows

> Add and backfill a new column to a billion-row production table with zero downtime.

Canonical URL: <https://datadriven.io/problems/a_new_column_on_a_billion_rows>

Domain: Pipeline Design · Difficulty: hard · Seniority: L6

## Problem

Our product team needs a new column added to the orders table - which has grown to a billion rows and receives 50K writes per minute. The migration must go live next week and production availability cannot be impacted. Design the migration approach and the backfill strategy.

## Worked solution and explanation

### Why this problem exists in real interviews

Adding a column to a billion-row OLTP table without slowing the application, breaking replication, overwriting application writes, or leaving the schema half-migrated. Each property is a known operational practice; the trap is treating any of them as optional and discovering on day three that backfill is overrunning the table lock or has caught the replication slot.

The default reach is to ALTER TABLE add a column with a default value and move on. On a billion rows, the default-value rewrite either takes a maintenance window the business won't approve or holds the table long enough to break the application's latency budget. Backfill running at full speed catches the replication slot and analytics replicas fall behind by hours; downstream consumers stall. When backfill writes to rows the application has already set, the application's value gets overwritten. When the team decides to back out, the column has been written across half the table and removing it isn't clean.

> **Trick to Solving**
>
> Add the column nullable, backfill in throttled chunks that respect replication lag, write only where empty, and rehearse the rollback before the migration starts.
> 
> 1. Add the column as NULLABLE without a default, so the schema change is a metadata-only operation that doesn't rewrite the table.
> 2. Backfill runs in small chunks under an orchestrator that watches replication lag and slows down or pauses when lag crosses a budget.
> 3. Each backfill UPDATE has a `WHERE new_column IS NULL` predicate so application-written values are never overwritten.
> 4. The rollback path (drop the column, drop any auxiliary state) is tested in a non-production environment before the migration begins, and runs cleanly within the rollback window.

---

### Walk the requirements

#### Step 1: Add the column nullable so the schema change doesn't rewrite the table

An ALTER TABLE that adds a column with a default value rewrites every row on most engines (or holds the table for longer than the application can tolerate); on a billion-row table that's the latency-budget breach. Adding the column as NULLABLE without a default is a metadata-only change on most modern engines, completes in seconds, and leaves the application's existing latency profile alone. The backfill that fills in the values runs separately, chunked and throttled.

#### Step 2: Throttle backfill so replication stays within budget

Analytics replicas and the warehouse's CDC consumer depend on replication staying inside an agreed lag budget. Backfill runs in small chunks under the orchestrator with throttling: a chunk runs, the orchestrator checks replication lag, and either continues or pauses until lag recovers. A 'run as fast as possible' backfill is what catches the replication slot and breaks the downstream consumers. The throttle is the contract that keeps replication healthy throughout.

#### Step 3: Backfill writes only where the column is still empty

Once the application starts writing the new column for new rows, the backfill must not overwrite those values. Each UPDATE includes `WHERE new_column IS NULL` so it touches only rows still waiting on the backfill. An application write that ran first wins; the backfill skips that row. Without the predicate, a slow backfill chunk can race the application and overwrite a value the user just wrote.

#### Step 4: Rehearse the rollback before the migration starts

If the migration is abandoned, the rollback removes the column cleanly within the rollback window. The rollback path runs in a non-production environment first to verify it completes inside the window without leaving auxiliary state behind. A rollback that's never been rehearsed is the version that takes longer than the window during a real abandonment, which means the team is committed to a half-migrated production table by the time leadership wants out.

---

### The shape that fits

> **What this design gives up**
>
> Chunked throttled backfill takes longer than running at full speed; the predicate-based UPDATE adds an index lookup per row; rollback rehearsal is upfront work that pays off only if the migration is abandoned. Implementation cost is the price; the win is the application's latency budget intact, replication consumers unbroken, application writes never lost to a backfill race, and a rollback that actually works inside the window.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - An orchestration layer runs backfill in chunks with throttling and checkpointing.
> - A CDC path keeps analytics replicas and the warehouse change feed in sync; backfill respects replication-lag budget.
> - Backfill UPDATEs include a 'where the column is empty' predicate so application writes are never overwritten.
> - A tested rollback path removes the column cleanly within the rollback window.

> **The mistake that ships**
>
> What gets shipped runs ALTER TABLE with a default value and a 'fast' backfill that runs straight through. The default-value rewrite holds the table long enough that the application's latency budget breaks; the on-call engineer kills the migration midway. The backfill catches the replication slot; analytics replicas fall behind by hours and the warehouse stops loading. The application starts writing the new column for new rows; a slow backfill chunk overwrites application values the user just wrote. The team decides to back out and discovers the rollback wasn't rehearsed and takes longer than the window. The eventual approach is the chunked, throttled, predicate-bound backfill with a rehearsed rollback.

---

## Common follow-up questions

- Replication lag breaches the budget mid-backfill. What does the orchestrator do, and how does the design recover without losing progress? _(Tests whether the candidate sees that throttling pauses chunk execution until lag recovers, the orchestrator's checkpoint marks where it left off, and resumption picks up from the last completed chunk. The pause is a property of the design, not an operator's call; the alert tells on-call about the pause but doesn't require their intervention to recover.)_
- Two weeks in, the product team decides the column is no longer needed but half the table has been backfilled. What does the rollback do, and how long does it take? _(Tests whether the candidate sees the rehearsed rollback as the path: drop the column (metadata-only on most engines), drop the orchestrator's backfill state, and confirm replication catches up. The rehearsal showed the rollback's runtime; the migration commits to the rollback fitting inside the window.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/a_new_column_on_a_billion_rows)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.