# The Firehose and the Ledger

Canonical URL: <https://datadriven.io/problems/the-firehose-and-the-ledger>

Domain: Pipeline Design · Difficulty: hard · Seniority: senior

## Problem

We ingest payment events from a few hundred partner APIs, some pushing webhooks and some we poll on a schedule, totaling around 2 billion events a day. Operations needs a near-real-time view of transaction volume and error rates within seconds, while finance needs an exactly-once daily ledger that reconciles to the cent. Partners go down and resend the same events when they recover, so the platform has to retry, deduplicate, and never lose or double-count a transaction.

## Worked solution and explanation

### Why this problem exists in real interviews

This is two correctness budgets fed by one unreliable firehose, dressed up as a platform design. Operations wants an approximate view in seconds; finance wants an exact ledger once a day. The trap is the single pipeline that aggregates everything into one table both teams read: it is too slow to feel live for ops and too approximate to bill on for finance, and the day a partner replays its backlog it double-counts the ledger. The skill being probed is whether you split the budgets and put exactness where it belongs, in batch, not in the stream.

The other half of the trap is the ingestion edge. A few hundred partner APIs, two thirds pushing webhooks and the rest polled, will burst and fail independently. If processing reads partners directly, one slow partner backs up everyone and a recovering partner's replay floods the system. The events that look like noise (retries, resends a day later) are exactly what the design has to absorb without losing or double-counting a cent.

> **Trick to Solving**
>
> One source, two budgets, decoupled at the edge.
> 
> 1. A durable queue at the front absorbs webhook bursts and scheduled polls, so unreliable partners are decoupled from processing.
> 2. Every raw arrival lands immutably in the lake before anything touches it; this is the replay source, not the partner APIs.
> 3. Ops reads a streaming path: approximate, sub-minute, fine to be briefly off.
> 4. Finance reads a batch path: dedup on the stable transaction id over the full day, exactly-once, into the warehouse.

---

### Walk the requirements

#### Step 1: Buffer the ingestion edge before processing reads anything

Webhook pushes and scheduled pulls land in a durable queue first. The queue is what makes one bursty or offline partner a non-event for everyone else: producers write at their own pace, consumers drain at theirs. Reading partner APIs straight into the stream processor is the version where a single partner's recovery backlog stalls the whole platform during business hours.

#### Step 2: Land raw events immutably as the replay source

From the queue, every event is written as-is into the raw lake before any dedup or curation. When your own logic is wrong (a dedup bug, a ledger miscount), you reprocess the affected day from the lake and overwrite the partition. Re-pulling from hundreds of rate-limited partner APIs, some of which have already aged the data out, is not a recovery plan.

#### Step 3: Give ops an approximate, seconds-fresh stream

A stream processor reads the queue and emits per-partner volume and error rate into a serving store the dashboard reads within seconds. Approximate is the budget: a transaction counted twice for a few seconds during a retry is invisible to someone watching trend lines. Putting the exact ledger logic on this path is what makes the ops view lag enough that operations stops trusting it.

#### Step 4: Make exactness a property of the daily batch, on a stable id

A batch job reads the day's raw events from the lake and deduplicates on the partner-assigned transaction id over the full day, because a resend can arrive a day after the original. The deduplicated, reconciled result is written idempotently into the warehouse: rerunning the day produces the same ledger. This is the only place exactly-once can actually be guaranteed, since the dedup window is the whole day, not a short streaming window.

---

### The shape that fits

**One pipeline both teams read**

A single stream aggregates everything into one table. Ops sees lag because the table carries the heavier exactness logic; finance double-counts when a partner replays its backlog because the streaming dedup window is too short to catch a day-late resend. Both teams are unhappy in different ways.

**Two paths, one source**

Ops reads a light streaming path tuned for seconds and approximate counts; finance reads a daily batch that dedups over the whole day and reconciles to the cent. The same raw events feed both; the budgets diverge after the lake.

> **Scale + Cost**
>
> At 2 billion events a day near 600 bytes each, that is roughly 1.2 TB/day landing in the lake, peaking around 5x average in business hours. Cost concentrates in two places: the queue partitioning that has to sustain peak ingest, and the batch dedup shuffle over a day of events. The bottleneck is the daily dedup, not the stream; keep the stream cheap and approximate and spend the compute where exactness is required.

> **Common Pitfall**
>
> Skew in the daily dedup. A handful of large partners contribute most of the volume, so partitioning the batch job by partner creates a few hour-long tasks while the rest finish in minutes. The Spark UI shows one straggler stage on a hot key. Partition by a hash of the transaction id, or salt the hot partner keys, so dedup spreads evenly. Adding executors does nothing here; it is a layout problem, not a compute problem.

> **Interviewers Watch For**
>
> Naming the dedup key and its window out loud (stable transaction id, full day, because resends arrive a day late). Treating the raw lake as the replay source rather than re-pulling partners. Saying explicitly that ops is approximate and finance is exact, and that this is why there are two paths. And instrumenting per-partner error and dedup rates so a bad partner is caught before finance reconciliation does it for you.

---

## Common follow-up questions

- A partner was offline all day and replays 400 million events at 2am, after your daily ledger already ran. How does the design produce a correct ledger? _(Tests whether the candidate uses the immutable lake plus idempotent partition-overwrite to reprocess the day, rather than trying to patch the warehouse in place or losing the late events.)_
- Operations now wants per-partner error alerts within 5 seconds, not just a dashboard. What changes? _(Tests pushing alerting onto the streaming path and an alert destination, while keeping the exact accounting on batch; the candidate should not move alerting into the daily job.)_
- The daily dedup job has started missing its morning SLA as volume grew. Where do you look first? _(Tests skew diagnosis (hot-partner partitioning), file layout in the lake, and shuffle sizing before reaching for a bigger cluster.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the-firehose-and-the-ledger)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.