# The Bad Row That Broke the Dashboard

> Bad records cannot reach the warehouse.

Canonical URL: <https://datadriven.io/problems/the_bad_row_that_broke_the_dashboard>

Domain: Pipeline Design · Difficulty: medium · Seniority: L6

## Problem

Our application generates a high volume of events that need to land in Snowflake for analytics. We've had quality issues in the past where bad data made it into production tables and broke dashboards. The platform team wants a streaming pipeline where data quality is enforced before anything reaches production. Design the pipeline.

## Worked solution and explanation

### Why this problem exists in real interviews

This problem tests streaming quality enforcement with four properties that have to fit together: validation before promotion, recoverable rejection so engineering can replay, sub-minute end-to-end latency, and a producer contract that catches schema changes upstream. The trap is letting bad records through 'just to keep the pipeline moving' or letting validation block all events on a single bad row.

The default reach is a streaming load that writes everything and 'fixes' bad rows in a downstream cleanup. The first malformed event corrupts a dashboard; the team adds a parser fix; another field changes a week later and the same thing happens. Rejected records get logged to a file nobody reads. Producers ship a breaking schema change because the wiki page about discipline didn't catch it.

> **Trick to Solving**
>
> Validate in-stream before promotion, route failures to a recoverable store, enforce the schema contract at publish, and do all of it inside the minute.
>
> 1. An in-stream quality gate validates each record (schema, types, business rules) before any write to the production warehouse; failures route to a recoverable store with the original payload and failure reason.
> 2. The producer publishes through a schema contract; an incompatible change is rejected at publish, before consumers see it.
> 3. End-to-end stays inside the minute: streaming consumer, validator, warehouse load all run in a continuous flow.
> 4. An orchestrator monitors validation pass-rate and rejection backlog; sustained spikes alert before they pile up.
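
The monitoring point in item 4 can be sketched as a small windowed check. This is only an illustration: the `PassRateMonitor` class, the 99% threshold, and the three-window rule for "sustained" are assumptions, not values from the problem statement.

```python
from collections import deque


class PassRateMonitor:
    """Alert only when the pass rate stays low for several windows in a row."""

    def __init__(self, min_pass_rate: float = 0.99, sustained_windows: int = 3):
        # Both defaults are hypothetical; tune them to the pipeline's history.
        self.min_pass_rate = min_pass_rate
        self.recent = deque(maxlen=sustained_windows)

    def observe_window(self, passed: int, rejected: int) -> bool:
        """Record one window's counts; return True if an alert should fire."""
        total = passed + rejected
        rate = passed / total if total else 1.0  # an empty window is not a failure
        self.recent.append(rate < self.min_pass_rate)
        # Fire only when every window in the lookback breached the threshold.
        return len(self.recent) == self.recent.maxlen and all(self.recent)


monitor = PassRateMonitor()
alerts = [monitor.observe_window(p, r) for p, r in [(90, 10), (80, 20), (70, 30)]]
```

Requiring several consecutive bad windows keeps a single burst of malformed events from paging anyone, while a producer regression that persists still surfaces within a few windows.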

---

### Walk the requirements

#### Step 1: Validate before promotion; bad records never reach the warehouse

An in-stream quality gate runs schema, type, and business-rule checks on each record before the warehouse load. Records that pass write to the production table; records that fail route to a rejection store with the failure reason. Dashboards read only from validated rows. Without a quality-check tier the validation has no place to live; without a streaming tier the validation can't keep up with sub-minute events.
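
A minimal sketch of such a gate, assuming a hypothetical event shape (`event_id`, `event_type`, `amount_cents`) and one made-up business rule; none of these names come from the problem statement.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical contract for illustration: field name -> expected Python type.
REQUIRED_FIELDS: dict[str, type] = {
    "event_id": str,
    "event_type": str,
    "amount_cents": int,
}


@dataclass
class Rejection:
    payload: dict[str, Any]  # the original record, untouched
    reason: str              # the first check that failed


def validate(record: dict[str, Any]) -> Optional[str]:
    """Return a failure reason, or None if the record passes every check."""
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            return f"missing field: {name}"
        value = record[name]
        # bool is a subclass of int, so reject bools explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return f"wrong type for {name}: expected {expected.__name__}"
    if record["amount_cents"] < 0:  # example business rule (assumed)
        return "business rule: amount_cents must be non-negative"
    return None


def gate(records):
    """Split records into (validated, rejected) without stopping on a bad row."""
    validated, rejected = [], []
    for record in records:
        reason = validate(record)
        if reason is None:
            validated.append(record)
        else:
            rejected.append(Rejection(payload=record, reason=reason))
    return validated, rejected
```

A bad record produces a `Rejection` and the loop moves on, so one malformed row neither reaches the production table nor blocks the events behind it.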

#### Step 2: Rejected records stay recoverable for replay

Each rejection writes the original payload and the failure reason to a rejection store keyed on event id and timestamp. Engineering queries the store to find rejected records, traces the root cause (parser bug, producer change, business-rule mismatch), fixes upstream, and replays the affected events back through the pipeline. Logging rejections to a file is the version where they disappear into ops; the rejection store is what makes recovery a workflow.
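
The store-then-replay workflow might look like the sketch below. It uses an in-memory SQLite database purely as a stand-in for whatever rejection store the team runs; the table layout, the `resolved` flag, and the function names are all assumptions made for illustration.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) rejection table keyed on event id and timestamp."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS rejections (
               event_id    TEXT NOT NULL,
               rejected_at TEXT NOT NULL,
               reason      TEXT NOT NULL,
               payload     TEXT NOT NULL,   -- original payload, stored verbatim
               resolved    INTEGER NOT NULL DEFAULT 0,
               PRIMARY KEY (event_id, rejected_at)
           )"""
    )
    return conn


def record_rejection(conn, event_id, rejected_at, reason, raw_payload):
    conn.execute(
        "INSERT INTO rejections (event_id, rejected_at, reason, payload) "
        "VALUES (?, ?, ?, ?)",
        (event_id, rejected_at, reason, raw_payload),
    )
    conn.commit()


def replay(conn, reason, resubmit) -> int:
    """Resubmit unresolved rejections with the given reason.

    A row is marked resolved only if resubmit(raw_payload) returns True,
    so a failed resubmission stays available for the next attempt.
    """
    rows = conn.execute(
        "SELECT event_id, rejected_at, payload FROM rejections "
        "WHERE resolved = 0 AND reason = ? ORDER BY rejected_at",
        (reason,),
    ).fetchall()
    replayed = 0
    for event_id, rejected_at, payload in rows:
        if resubmit(payload):
            conn.execute(
                "UPDATE rejections SET resolved = 1 "
                "WHERE event_id = ? AND rejected_at = ?",
                (event_id, rejected_at),
            )
            replayed += 1
    conn.commit()
    return replayed
```

Storing the payload verbatim matters here: a record rejected because it could not be parsed can still be replayed once the parser is fixed.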

#### Step 3: Sub-minute end-to-end including validation

Events flow through a streaming consumer, the validator, and into the warehouse within the minute. The validator is part of the streaming path, not a downstream batch; the warehouse load reads only validated rows. A 'load first, validate later' design is the version where bad data is briefly in production and downstream consumers see it. In-stream validation is what closes the gap.
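
One way to keep the load step inside the latency budget is to flush validated rows on whichever comes first, a row count or a maximum age. The sketch below assumes the warehouse load accepts small batches; the `MicroBatcher` class and its limits are hypothetical, and the clock is passed in so the behaviour is easy to test.

```python
class MicroBatcher:
    """Buffer validated rows and flush on size or age, whichever comes first."""

    def __init__(self, max_rows: int = 500, max_age_s: float = 5.0):
        # Hypothetical limits; max_age_s bounds how long a row waits before loading.
        self.max_rows = max_rows
        self.max_age_s = max_age_s
        self.buffer: list = []
        self.oldest: float | None = None  # arrival time of the oldest buffered row

    def add(self, row, now: float) -> list:
        """Add a validated row; return a batch to load, or [] if not due yet."""
        if not self.buffer:
            self.oldest = now
        self.buffer.append(row)
        return self.flush_if_due(now)

    def flush_if_due(self, now: float) -> list:
        """Call on a timer as well, so a quiet stream still flushes on age."""
        if not self.buffer:
            return []
        too_big = len(self.buffer) >= self.max_rows
        too_old = now - self.oldest >= self.max_age_s
        if too_big or too_old:
            batch, self.buffer = self.buffer, []
            self.oldest = None
            return batch
        return []
```

With these assumed limits a validated row waits at most about five seconds in the buffer, which leaves the rest of the minute for consumption, validation, and the load itself.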

#### Step 4: Schema contract at the bus catches producer changes at publish

Product teams add and remove fields weekly. The bus's schema-contract layer rejects publishes that don't conform to the contract; producers find out at publish time, not when consumers crash at midnight. An additive change (new optional field) is allowed through compatibility rules; a breaking change is rejected. Without the contract every producer release is a coordination tax on every consumer.
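
A simplified sketch of a publish-time compatibility check. The schema representation (a dict of field name to `type` and `required`) and the three rules are assumptions for illustration; a real schema registry has its own formats and richer compatibility modes.

```python
def breaking_changes(current: dict, proposed: dict) -> list[str]:
    """List the ways `proposed` would break consumers of `current`."""
    problems = []
    for name, spec in current.items():
        if name not in proposed:
            problems.append(f"removed field: {name}")
        elif proposed[name]["type"] != spec["type"]:
            problems.append(
                f"type changed for {name}: {spec['type']} -> {proposed[name]['type']}"
            )
    for name, spec in proposed.items():
        # A new field is additive only if existing producers may omit it.
        if name not in current and spec.get("required", False):
            problems.append(f"new required field: {name}")
    return problems


def publish_schema(current: dict, proposed: dict) -> dict:
    """Accept the proposed schema, or raise before any consumer sees it."""
    problems = breaking_changes(current, proposed)
    if problems:
        raise ValueError("incompatible schema change: " + "; ".join(problems))
    return proposed


v1 = {"event_id": {"type": "string", "required": True}}
v2 = {**v1, "coupon": {"type": "string", "required": False}}  # additive change
accepted = publish_schema(v1, v2)
```

Under these assumed rules, removing a field counts as breaking, so a team that wants to drop one has to coordinate with consumers first rather than discovering the dependency after release.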

---

### The shape that fits

> **What this design gives up**
>
> An in-stream quality gate runs validation on every record, which costs CPU; the rejection store grows with the failure rate; the schema contract requires producers to integrate with the registry. Implementation cost is the price; the win is dashboards that don't break, rejected records engineering can replay, and producer changes that don't surprise consumers.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - An in-stream quality gate validates each record before promotion to the production warehouse.
> - Rejected records preserve the original payload and the failure reason in a recoverable store that supports replay.
> - A streaming pipeline lands validated records and rejection decisions inside the minute.
> - A schema-contract layer at the bus rejects incompatible producer changes at publish time.

> **The mistake that ships**
>
> What gets shipped writes everything to the warehouse and runs a downstream cleanup for bad rows. The first dashboard breaks within a week. Rejection logs go to a file nobody reads. Producers ship a breaking schema change and downstream consumers crash at midnight. The eventual rebuild adds in-stream validation, a recoverable rejection store, and the schema contract; each was reachable up front if 'we won't have quality issues again' had been treated as a contract rather than a hope.

---

## Common follow-up questions

- A new business rule has to validate one new field; the rest of the pipeline can't change. What in this design lets the rule ship safely? _(Tests whether the candidate sees the quality gate as the extension point: a new validation rule deploys to the gate, runs against new and recent records, and the rejection store flags any failures. The schema contract, the streaming consumer, and the warehouse don't change. The rule can also run in shadow mode (logging without rejecting) before going live.)_
- Engineering replays a batch of rejected records after fixing the root cause. What does the design do to ensure the warehouse doesn't end up with duplicates or stale rows? _(Tests whether the candidate sees that the warehouse's load is idempotent on event id; the replay re-emits records and the upsert merges them in cleanly. The rejection store records the resolution so the same record isn't replayed twice. The dedup contract is at the warehouse, not 'whatever the replay tool remembers.')_
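
The idempotent load in the replay question can be illustrated with an upsert keyed on event id. The sketch uses in-memory SQLite as a stand-in for the warehouse, where the equivalent would typically be a merge on the same key; the table and column names are hypothetical.

```python
import sqlite3


def open_warehouse() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events ("
        "  event_id TEXT PRIMARY KEY,"
        "  amount_cents INTEGER NOT NULL,"
        "  event_ts TEXT NOT NULL)"  # ISO-8601 UTC, so string order is time order
    )
    return conn


def upsert(conn: sqlite3.Connection, rows: list[dict]) -> None:
    """Load rows so that re-sending the same event never creates a duplicate."""
    conn.executemany(
        """INSERT INTO events (event_id, amount_cents, event_ts)
           VALUES (:event_id, :amount_cents, :event_ts)
           ON CONFLICT(event_id) DO UPDATE SET
               amount_cents = excluded.amount_cents,
               event_ts     = excluded.event_ts
           -- Skip the update when the incoming row is older than the stored one.
           WHERE excluded.event_ts >= events.event_ts""",
        rows,
    )
    conn.commit()


wh = open_warehouse()
row = {"event_id": "e1", "amount_cents": 500, "event_ts": "2024-01-02T00:00:00Z"}
upsert(wh, [row])
upsert(wh, [row])  # a replay of the same event is a no-op
```

Because the key and the staleness guard live in the load statement itself, the result is the same whether the replay tool sends a record once or several times.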

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_bad_row_that_broke_the_dashboard)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.