# A Clean Number for Every Merchant

> Raw payment logs in. Clean merchant summaries out.

Canonical URL: <https://datadriven.io/problems/a_clean_number_for_every_merchant>

Domain: Pipeline Design · Difficulty: hard · Seniority: L7

## Problem

Our payments platform generates a daily log containing every transaction across all merchants. The finance and BI teams need a clean merchant-level summary - total volume, transaction count, average amount - but the raw logs have duplicates, schema inconsistencies, and no restart safety. Design a batch pipeline that reliably produces this summary.

## Worked solution and explanation

### Why this problem exists in real interviews

An L7 batch problem dressed as 'just produce a daily summary.' The hard work is the four properties that have to hold together: the 7am deadline, dedup against retried gateway events, an honest path for bad rows, and reruns that produce identical output. Any one of these is solvable; the trap is solving them in series and discovering on day one that the rerun changes the numbers because the dedup wasn't deterministic.

The default is a nightly job that reads the raw log, drops bad rows, groups by merchant, and writes a summary. The first morning finance points out that two merchants' totals are inflated because the gateway retried some events and the job aggregated them all. Rerunning the day produces a slightly different number than the original because the dropped rows weren't dropped consistently. Bad rows nobody saw silently shrank the total. By the second week the team is hand-reconciling against the gateway log and finance has stopped trusting the pipeline.

> **Trick to Solving**
>
> Dedup deterministically before aggregation, write through staging with partition-overwrite by date, route bad rows to a quarantine, let the orchestrator alert before 7am.
> 
> 1. Dedup runs before aggregation on a stable payment id, so a retried event collapses to one row regardless of the order events arrive.
> 2. The summary writes via partition-overwrite keyed on the report date, through a staging table, so a rerun replaces the day cleanly rather than appending or merging into a half-finished prior run.
> 3. Validation routes invalid rows to a quarantine; the rest of the file continues to aggregation. Finance sees both the totals and the quarantined rows.

---

### Walk the requirements

#### Step 1: Land the merchant summary before 7am, with alerting before

An orchestrator owns the daily DAG: ingest the raw log, dedup, validate, aggregate, write the summary. Each stage has its own SLA and its own sensor; the orchestrator alerts before 7am if any stage is at risk. Without an orchestration layer there's nothing watching the deadline; finance discovers the missed window when they open the dashboard. Without a warehouse tier the summary has nowhere to land.

#### Step 2: Dedup on a stable payment id, before aggregation

Gateway retries deliver the same payment more than once. Dedup runs before aggregation on the gateway's stable payment id; the same id collapses to one row regardless of how many times it arrived or in what order. Aggregating first and then trying to undo duplicates is the version where the totals look right on Monday and inflate by Tuesday. Dedup at the right boundary, on the key the gateway controls, is the contract that keeps merchant totals honest.

#### Step 3: Quarantine bad rows; the rest keeps moving

Records with missing merchant or amount can't go into the merchant total and can't be silently dropped either. Validation routes those rows to a quarantine table with the rejection reason; the rest of the file flows into aggregation. Finance reads both: the merchant summary for reconciliation, the quarantine for review. Silently filtering bad rows is the move that makes the pipeline's totals quietly diverge from the gateway's reported totals; failing the entire run on the first bad row blocks finance over a row they would have set aside anyway.

#### Step 4: Reruns produce the same numbers, by construction

When a stage fails midway, the operator reruns the day. The rerun has to land on the same numbers, not slightly different ones, or finance loses trust in the warehouse. Two properties get you there: dedup is deterministic on the payment id, and the summary writes via partition-overwrite keyed on the report date, through a staging table. Whatever stage failed, restarting the DAG for that date replaces the partition cleanly with the same end state. An append-style write is what makes the second run different from the first.

---

### The shape that fits

> **What this design gives up**
>
> Dedup before aggregation needs an index pass on the raw log. Staging-then-overwrite roughly doubles storage during the build window. Quarantine adds a triage workflow finance has to actually run. Pipeline simplicity is the cost; the win is a summary that lands on time, totals that match the gateway, bad rows that nobody loses, and reruns that produce the same numbers.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - An orchestration layer schedules the daily DAG with sensors and alerts before the 7am finance deadline.
> - Dedup happens before aggregation on a stable payment id so retried events collapse.
> - The merchant summary writes to a warehouse via partition-overwrite by date so reruns produce identical output.
> - Invalid rows route to a quarantine for finance review without affecting the merchant totals.

> **The mistake that ships**
>
> What gets shipped reads the raw log, drops anything that doesn't parse, groups by merchant, and writes the summary. Two merchants' totals inflate because retried gateway events were aggregated more than once; finance reconciles to the gateway and the gap is unexplained. Bad rows that quietly dropped account for the rest of the gap. The on-call engineer reruns to fix a partial failure and the rerun's numbers differ from the original by a few thousand dollars because dedup wasn't deterministic. Finance escalates to the head of data; the team rebuilds with dedup-on-payment-id, staging-and-overwrite, and a quarantine, but the trust takes longer to come back than the rebuild.

---

## Common follow-up questions

- A merchant's total looks higher today than yesterday's, even though the merchant says volume was flat. Where in this design do you start, and what would you query? _(Tests whether the candidate uses the quarantine table as a triage signal first (was a chunk of yesterday's bad rows fixed and replayed today?), then the dedup output (did dedup collapse fewer events than expected?), then the gateway log. The pipeline produces evidence; the candidate has to know where to read it.)_
- Finance asks for the merchant summary for a date six months ago, but the partition has been compacted in the warehouse. What in this design lets the answer still be the same? _(Tests whether the candidate sees that the summary is reproducible from the raw log: rerunning the DAG for that date through dedup, validate, aggregate produces the same partition. Compaction is irrelevant because partition-overwrite by date with deterministic logic is the contract.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/a_clean_number_for_every_merchant)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.