# The Agency That Changes the Columns

> The schema changed overnight. Again.

Canonical URL: <https://datadriven.io/problems/the_agency_that_changes_the_columns>

Domain: Pipeline Design · Difficulty: medium · Seniority: L6

## Problem

We receive data files from a government agency every week. The files contain regulatory reporting data that we need in our warehouse, but the agency changes the file format without warning. Sometimes columns are added, sometimes renamed, sometimes the whole layout shifts. Design an ETL pipeline that can handle these file structure changes dynamically.

## Worked solution and explanation

### Why this problem exists in real interviews

A weekly file from a regulator with a moving format and consequences if you load it wrong is half schema management, half operational discipline. The trap is treating it as a normal ETL where 'schema validation' is a checkbox. Adapting silently to changes corrupts downstream tables; failing on every change blocks the load. The right answer is: detect the change loudly, stop the load, route bad rows for review, replace originals when corrections arrive.

The whiteboard answer is an ETL job that infers the schema from each file, drops rows that fail validation, and appends to the warehouse. The first time the agency adds a column the job loads it cleanly, the warehouse table grows a column nobody noticed, and downstream queries silently exclude it. The first time the agency renames a column the inference takes the new name, the old column appears empty, and compliance reads a regulatory report with half the data missing. When the agency sends a corrected file, the appended rows duplicate the original load.

> **Trick to Solving**
>
> Validate against an expected schema, halt loudly on a change, route bad rows to compliance, replace by period when corrections arrive.
> 
> 1. An expected schema is part of the pipeline config. The load step compares the incoming file's structure to the expected schema and halts if they differ; humans triage before downstream sees the data.
> 2. The warehouse load uses partition-overwrite keyed on the reporting period. A correction for a given week replaces the original rows for that week, not appends next to them.
> 3. Bad rows go to a quarantine table compliance reviews. The good rows keep moving; compliance has the bad rows visible, with the rejection reason.
> 4. An orchestrator owns the Monday 9am SLA: file arrival sensor, validation, load, quality check, with alerts before 9am if any stage is at risk.

---

### Walk the requirements

#### Step 1: Land and validate before Monday 9am, with alerting before

An orchestrator runs the weekly DAG: a sensor watches for the file's arrival, then validation, load, quality checks, and publish. Each stage has its own SLA; alerts fire before 9am Monday if anything is at risk. On-call has hours to fix, not minutes. Without orchestration nothing owns the deadline; without a warehouse target the cleaned data has nowhere to land.

#### Step 2: Detect a format change on arrival, stop the load

An expected schema (column names, types, order) is part of the pipeline config. On arrival, the validator compares the incoming file structure against the expected schema; any drift (added column, renamed column, shifted layout) halts the load and pages on-call before any row touches the warehouse. Humans triage: update the expected schema, replay, or escalate to the agency. A 'we'll adapt automatically' design corrupts downstream the first time the agency renames a column; a 'we'll log a warning and continue' design is the same failure with paperwork.

#### Step 3: Corrections replace the original by reporting period

When the agency sends a corrected file for a previously loaded period, the warehouse load uses partition-overwrite (or a staging-table swap) keyed on the reporting period, so the corrected rows replace the originals atomically. The correction is loaded through the same DAG, with the same validation, against the same partition. An append-style load duplicates rows and turns reconciliation into a week-of-week diff exercise; partition-overwrite keeps the warehouse table the way the agency last said it should be.

#### Step 4: Bad rows quarantine for compliance review

Files routinely contain rows with missing required fields. Validation routes those rows to a quarantine table with a rejection reason, while the rest of the file continues to the warehouse. Compliance reviews the quarantine on its own schedule, fixes upstream issues with the agency, and replays. Silently dropping the bad rows is the move that loses count of how often this happens; failing the whole load on the first bad row blocks compliance over a row they would have ignored.

---

### The shape that fits

> **What this design gives up**
>
> An expected-schema check that halts on drift means more 'failed' runs the team has to triage, instead of a pipeline that quietly absorbs everything. Partition-overwrite on corrections costs more than appends. The quarantine table needs a triage workflow compliance actually uses. The self-adapting pipeline goes; what arrives is one that doesn't corrupt regulatory data and lets compliance see exactly what the agency is sending.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - An orchestrator sequences the weekly DAG with sensors and alerts before the Monday 9am SLA.
> - Cleaned regulatory data lands in the warehouse, with corrections written via partition-overwrite by reporting period.

> **The mistake that ships**
>
> What ends up in production infers the schema from each file, drops invalid rows, and appends to the warehouse. The agency adds a column and the warehouse silently grows it but downstream queries don't pick it up. The agency renames a column and half the regulatory report is empty. A corrected file arrives and gets appended, duplicating the period. Compliance flags the discrepancy two reviews later and the team has to backtrack and rebuild three weeks of warehouse state. The eventual fix is an expected-schema check, partition-overwrite on corrections, and a quarantine table. The cleanup runs three weeks while compliance reviews drift in parallel.

---

## Common follow-up questions

- The agency adds a column that's actually meaningful (a new field compliance wants). What does the path through this design look like, from arrival to publish? _(Tests whether the candidate sees the validator as a forcing function, not an obstacle: the schema-drift halt fires, on-call updates the expected schema after compliance signs off, the load replays, and the new column appears in the warehouse with full review. The pipeline doesn't auto-learn; humans approve the change.)_
- A corrected file arrives but only some of the rows in it differ from the original. How does this design avoid loading both copies of the unchanged rows? _(Tests whether the candidate sees that partition-overwrite is by reporting period, not by row, so the corrected file replaces all rows for that period in one transaction. Per-row dedup against the original is a more complex path that's only worth it if the corrected file is huge and mostly identical.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_agency_that_changes_the_columns)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.