# End of Day Is Too Late

> Every swipe tells a story.

Canonical URL: <https://datadriven.io/problems/end_of_day_is_too_late>

Domain: Pipeline Design · Difficulty: medium · Seniority: L6

## Problem

Our fraud and risk teams need visibility into card transactions as they happen. Right now there's no real-time view; everything is end-of-day batch. Design a data streaming pipeline.

## Worked solution and explanation

### Why this problem exists in real interviews

Fraud needs the next transaction within seconds; finance needs every transaction counted exactly once at end of day; PCI says the card number can't sit in any of those stores; and a single malformed event can't be allowed to halt the thousands of good ones behind it. Any one of these is solvable. Together they kill the obvious 'just put Kafka in front of the batch job' design.

The whiteboard answer is a stream that writes to a table the fraud dashboard reads, and the same stream forks to the lake at end of day. Bad event arrives, the parser throws, the consumer restarts, the offset gets replayed, fraud sees the same alert twice and finance's daily total comes out double-counted. Meanwhile the PAN is still sitting in the raw topic and the lake. Three of the four requirements have failed quietly.

> **Trick to Solving**
>
> A streaming pipeline isn't done when it's fast; it's done when it's exactly-once, PCI-clean, and survives a poison pill.
> 
> 1. Tokenize at the edge. The PAN never sits in the queue, the lake, or the fraud store; only a token does. PCI scope shrinks to the tokenization service.
> 2. Exactly-once is enforced at the sinks, not assumed by the stream. Idempotent writes keyed on transaction id, plus a checkpoint contract with the source.
> 3. Bad events go to a dead-letter queue, not to the floor. The main consumer keeps draining the topic; a separate process triages the DLQ.

---

### Walk the requirements

#### Step 1: Put fraud on a streaming path that lands in seconds

Card transactions flow into a queue and then a stream processor that updates the fraud online lookup tier and dashboard within seconds. The fraud team's view reads from the stream-fed store, not from yesterday's batch. Whatever stream tech you pick (Flink, Kafka Streams, Spark Streaming), the property that matters is sub-minute end-to-end latency. End-of-day batch is the named problem; if the path to fraud isn't streaming, the requirement is unaddressed.
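To make the latency property concrete, here is a minimal in-memory sketch of a processor loop that updates a lookup tier and records any event whose end-to-end lag exceeds a budget. The names `Txn`, `FraudLookup`, `process`, and the `event_time` field are hypothetical, and the 60-second budget is only an illustration of "sub-minute"; a real deployment would run this logic inside whichever stream engine you chose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    txn_id: str
    card_token: str
    amount_cents: int
    event_time: float  # epoch seconds at the terminal (assumed field)


class FraudLookup:
    """Stand-in for the online lookup tier: latest transaction per card token."""

    def __init__(self):
        self.by_card = {}

    def update(self, txn):
        # Overwrite, not append, so a replayed event leaves the same state.
        self.by_card[txn.card_token] = txn


def process(stream, lookup, clock, sla_seconds=60.0):
    """Apply each event to the lookup and return (txn_id, lag) for SLA breaches."""
    breaches = []
    for txn in stream:
        lookup.update(txn)
        lag = clock() - txn.event_time  # end-to-end lag at the moment of visibility
        if lag > sla_seconds:
            breaches.append((txn.txn_id, lag))
    return breaches
```

Passing the clock in as a function keeps the sketch testable; in production the breach list would more likely be a metric or an alert than a return value.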

#### Step 2: Make exactly-once a property of the sinks, not a hope

Each transaction has a stable transaction id from the issuer. Both sinks (the fraud online lookup tier and the lake) write idempotently keyed on that id: an upsert into the online lookup tier, partition-overwrite into the lake. The stream processor uses checkpointed offsets against the source so a restart doesn't replay events into a non-idempotent sink. Lose one and fraud misses a signal; double-count one and finance's volume number is wrong. Both stakeholders care about the same property from opposite directions.
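The interaction between checkpoints and idempotent writes is easiest to see with a simulated crash. The sketch below is an in-memory illustration, not a real connector: `IdempotentSink`, `AppendSink`, `run`, and the dict-based checkpoint are hypothetical stand-ins. The processor writes first and commits the offset second, so a crash between the two replays one record on restart.

```python
class IdempotentSink:
    """Stand-in for an upsert-capable store keyed on transaction id."""

    def __init__(self):
        self.rows = {}

    def write(self, record):
        # Writing the same id twice leaves exactly one row.
        self.rows[record["txn_id"]] = record

    def total_cents(self):
        return sum(r["amount_cents"] for r in self.rows.values())


class AppendSink:
    """Contrast case: every write adds a row, so a replay double-counts."""

    def __init__(self):
        self.rows = []

    def write(self, record):
        self.rows.append(record)

    def total_cents(self):
        return sum(r["amount_cents"] for r in self.rows)


def run(log, sink, checkpoint, crash_at=None):
    """Process from the committed offset: write first, then commit."""
    for offset in range(checkpoint["next_offset"], len(log)):
        sink.write(log[offset])
        if offset == crash_at:
            raise RuntimeError("simulated crash between write and commit")
        checkpoint["next_offset"] = offset + 1


log = [
    {"txn_id": "t1", "amount_cents": 500},
    {"txn_id": "t2", "amount_cents": 700},
    {"txn_id": "t3", "amount_cents": 900},
]

for sink in (IdempotentSink(), AppendSink()):
    checkpoint = {"next_offset": 0}
    try:
        run(log, sink, checkpoint, crash_at=1)  # t2 is written but not committed
    except RuntimeError:
        pass  # restart from the last committed offset
    run(log, sink, checkpoint)
    print(type(sink).__name__, sink.total_cents())
# prints "IdempotentSink 2100" then "AppendSink 2800"
```

The checkpoint bounds how much gets replayed; the keyed write is what makes the replay harmless. Either one alone is not enough.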

#### Step 3: Tokenize before anything else writes

The PAN has to be replaced with a token at the very front of the pipeline, before the queue, the stream processor, the online lookup tier, or the lake ever sees it. A tokenization service holds the only mapping; everything downstream sees only the token. PCI scope shrinks to one service instead of every store the data touches. Stripping in the warehouse at the end is too late; by then the PAN has been written to durable storage and the auditor has a finding.
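A minimal sketch of the edge step, assuming an in-memory vault and a hypothetical raw event with a `pan` field. `TokenVault` and `tokenize_at_edge` are illustrative names; a real tokenization service would sit behind its own network boundary with encrypted, access-controlled storage, none of which is modeled here.

```python
import secrets


class TokenVault:
    """Illustrative stand-in for the tokenization service (the only PAN holder)."""

    def __init__(self):
        self._token_by_pan = {}
        self._pan_by_token = {}

    def tokenize(self, pan):
        # Reuse the token for a known PAN so downstream can still group by card.
        token = self._token_by_pan.get(pan)
        if token is None:
            token = "tok_" + secrets.token_hex(16)  # random, not derived from the PAN
            self._token_by_pan[pan] = token
            self._pan_by_token[token] = pan
        return token

    def detokenize(self, token):
        # Only callers inside PCI scope should ever reach this.
        return self._pan_by_token[token]


def tokenize_at_edge(raw_event, vault):
    """Return a copy of the event with the PAN replaced by a token."""
    event = {k: v for k, v in raw_event.items() if k != "pan"}
    event["card_token"] = vault.tokenize(raw_event["pan"])
    return event


vault = TokenVault()
queue = []  # stand-in for the topic; only tokenized events are ever appended

raw = {"txn_id": "t1", "pan": "4111111111111111", "amount_cents": 500}
queue.append(tokenize_at_edge(raw, vault))
```

Because the token is random rather than a hash of the PAN, nothing downstream can recover the card number without calling the vault, which is the property that keeps the queue and the lake out of scope.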

#### Step 4: Send poison pills to a DLQ, keep draining

Validation lives on the stream processor: if a record can't be parsed or fails schema validation, the processor routes it to a dead-letter queue and acks the source so the main path keeps moving. A separate consumer triages the DLQ on its own schedule. If the main path halts on a bad event, the entire fraud team is blind for as long as that event takes to fix, which can be hours. The DLQ pattern is what lets the pipeline keep its SLA while still preserving the bad records for later review.
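The routing rule can be sketched in a few lines. This assumes JSON-encoded messages and a hypothetical three-field schema (`REQUIRED_FIELDS`); `parse` and `consume` are illustrative names, and the list standing in for the DLQ would be a separate topic or queue in practice.

```python
import json

# Hypothetical minimal schema: field name -> expected Python type.
REQUIRED_FIELDS = {"txn_id": str, "card_token": str, "amount_cents": int}


def parse(raw):
    """Decode and validate one message; raise ValueError on any failure."""
    record = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(record.get(field), expected):
            raise ValueError(f"field {field!r} missing or not {expected.__name__}")
    return record


def consume(messages, handle, dead_letters):
    """Drain the batch; bad records go to the DLQ and the offset still advances."""
    committed = 0
    for offset, raw in enumerate(messages):
        try:
            record = parse(raw)
        except ValueError as exc:
            # Keep the raw bytes and the reason so triage can replay or discard.
            dead_letters.append({"offset": offset, "raw": raw, "error": str(exc)})
        else:
            handle(record)
        committed = offset + 1  # ack either way so the main path keeps moving
    return committed
```

Only parse and validation failures are diverted. An exception raised by `handle` (for example, the sink being down) is deliberately left to propagate, since dead-lettering good records during an outage would hide a different problem.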

---

### The shape that fits

> **What this design gives up**
>
> Tokenization at the edge adds a hop and a service that everything depends on; if it goes down, the pipeline stops. Idempotent sinks cost more than append-only ones because you need a key index. A DLQ means triage work: someone has to actually look at it. You give up the simpler pipeline; what you get is one that finance, fraud, and the PCI auditor will all sign off on.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - A streaming path delivers card transactions to the fraud online lookup tier and dashboard within seconds.
> - Tokenization happens before the queue, so no PAN ever lands in the lake or online lookup tier.
> - Validation failures route to a DLQ that doesn't block the main consumer.

> **The mistake that ships**
>
> The team's first cut uses Kafka in front of Spark Streaming, masks the PAN in a transformation step inside the stream, lets bad events kill the consumer (which restarts and replays), and writes to the fraud store with append. By week two, the fraud team is seeing duplicate alerts on the same card, finance has reconciliation tickets every Monday because volume is inflated, and the PCI assessor finds raw PANs in the Kafka topic and the S3 raw zone. Then an unparseable event from a misconfigured terminal halts the entire fraud pipeline during business hours, and the team is blind until somebody patches the parser.

---

## Common follow-up questions

- An issuer integration starts sending occasional duplicate transactions with the same id. What in this design protects you, and what doesn't? _(Tests whether the candidate sees that the upsert sinks silently absorb an exact duplicate in the source feed (same id, same data, sent twice), while a duplicate that mutates the data (same id, different amount) is a real problem the dedup key alone doesn't solve.)_
- Auditors want to see every PAN that flowed through the system in the last 90 days. Where do you go? _(Tests whether the candidate keeps PCI scope tight: the only place the PAN exists is in the tokenization service's mapping. The lake and the fraud store have only tokens. Auditing PANs is a query against the tokenization service, not a scan of the lake.)_
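For the duplicate-id follow-up, one way to separate the harmless case from the harmful one is to store a content fingerprint next to each key. The sketch below is an assumption-laden illustration: `fingerprint` and `ConflictAwareSink` are hypothetical names, and the "keep the first version, quarantine the second" policy is a choice made for the example, not something the problem prescribes.

```python
import hashlib
import json


def fingerprint(record):
    """Stable hash of the record's content, independent of key order."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ConflictAwareSink:
    """Upsert sink that absorbs exact duplicates and flags mutated ones."""

    def __init__(self):
        self.rows = {}
        self.fingerprints = {}
        self.conflicts = []

    def upsert(self, record):
        txn_id = record["txn_id"]
        fp = fingerprint(record)
        seen = self.fingerprints.get(txn_id)
        if seen is None:
            self.rows[txn_id] = record
            self.fingerprints[txn_id] = fp
            return "inserted"
        if seen == fp:
            return "duplicate"  # same id, same data: safe to ignore
        # Same id, different data: keep the first version (assumed policy)
        # and set the second aside for a human or a reconciliation job.
        self.conflicts.append(
            {"txn_id": txn_id, "kept": self.rows[txn_id], "rejected": record}
        )
        return "conflict"


sink = ConflictAwareSink()
print(sink.upsert({"txn_id": "t1", "amount_cents": 500}))  # inserted
print(sink.upsert({"amount_cents": 500, "txn_id": "t1"}))  # duplicate
print(sink.upsert({"txn_id": "t1", "amount_cents": 900}))  # conflict
```

A plain last-write-wins upsert would have accepted the third record without complaint and changed finance's total, which is the gap the question is probing.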

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/end_of_day_is_too_late)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io).