# Stores and the Site, Together

> The registers never stop ringing.

Canonical URL: <https://datadriven.io/problems/stores_and_the_site_together>

Domain: Pipeline Design · Difficulty: hard · Seniority: L7

## Problem

We're a global retailer with physical stores and e-commerce. The merchandising team needs near-real-time visibility into sales performance across all channels. Design an architecture to process real-time sales data in a data lake.

## Worked solution and explanation

### Why this problem exists in real interviews

This is an L7 multi-consumer retail pipeline. Merchandising wants data within minutes; finance wants daily reconciliation; data science wants stable training data on a weekly cadence; supply chain reads inventory triggers. PCI says card numbers can't sit in the lake, and both ingestion paths (POS and e-commerce) retry. The trap is forcing all four consumers onto one storage shape and treating PCI as a downstream filter.

The default reach is one streaming pipeline that lands every transaction in one warehouse table all four teams read. Merchandising is happy. Finance reconciles and the daily total disagrees with the gateway because retries from POS and e-commerce both produced duplicates. Data science queries the same table and gets a moving target every week. PCI auditors find raw PANs in the lake because masking happened in a downstream view. Supply chain's inventory triggers fire late because the table they query is sized for analyst scans, not point-lookups.

> **Trick to Solving**
>
> Tokenize at ingest, dedup once on transaction id, fan out to four stores tuned to four query shapes.
> 
> 1. Tokenization runs at the ingest boundary; the lake and every consumer store hold tokens, not card numbers. PCI scope shrinks to the tokenization service.
> 2. Dedup runs once at ingest on a stable transaction id from each source; both POS and e-commerce paths feed the same dedup before fan-out.
> 3. Four downstream stores tuned to four workloads: a streaming merchandising store for minutes-fresh dashboards, a warehouse for finance's daily reconciliation, a feature/training surface in the lake for data science, an inventory-trigger store for supply chain.

---

### Walk the requirements

#### Step 1: Sales reach merchandising in minutes; finance reads slower

POS and e-commerce events flow through tokenization and dedup into a streaming merchandising store within minutes; the same events also land in the lake for finance, data science, and supply chain on slower cadences. Merchandising's dashboard reads from the streaming store. Without a streaming tier, the near-real-time visibility the problem asks for goes unaddressed; without the lake, the four consumers can't share a long-retention source of truth.
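
As a minimal sketch of what "minutes-fresh" could mean for the merchandising store, the aggregation below buckets sales into one-minute windows per channel. The event fields (`channel`, `sold_at`, `amount`) and the one-minute window are assumptions for illustration; a real deployment would run this in a stream processor rather than over an in-memory list.

```python
from collections import defaultdict
from datetime import datetime, timezone
from decimal import Decimal


def minute_bucket(ts: datetime) -> datetime:
    """Truncate an event timestamp to the start of its one-minute window."""
    return ts.replace(second=0, microsecond=0)


def aggregate_sales(events):
    """Sum sale amounts per (minute window, channel)."""
    totals = defaultdict(Decimal)  # Decimal() starts each bucket at 0
    for event in events:
        key = (minute_bucket(event["sold_at"]), event["channel"])
        totals[key] += event["amount"]
    return dict(totals)


# Hypothetical sample events: two POS sales in one minute, one e-commerce sale in the next.
events = [
    {"channel": "pos", "sold_at": datetime(2024, 1, 1, 9, 0, 5, tzinfo=timezone.utc), "amount": Decimal("19.99")},
    {"channel": "pos", "sold_at": datetime(2024, 1, 1, 9, 0, 40, tzinfo=timezone.utc), "amount": Decimal("5.01")},
    {"channel": "ecom", "sold_at": datetime(2024, 1, 1, 9, 1, 2, tzinfo=timezone.utc), "amount": Decimal("42.00")},
]
totals = aggregate_sales(events)
```

The dashboard then reads small pre-aggregated rows instead of scanning raw transactions, which is what keeps the freshness budget in minutes.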

#### Step 2: Four consumers, four stores, one source

Each consumer gets a store whose layout matches its query pattern:

- Merchandising runs interactive dashboards: a streaming store for minutes-fresh aggregates.
- Finance reconciles daily: a warehouse for daily totals against the gateway.
- Data science trains weekly: a lake-backed feature surface with stable history.
- Supply chain reads inventory triggers: a low-latency point-lookup store keyed on SKU.

All four are fed from the deduped, tokenized stream. Forcing all four onto one shared store means at least three suffer.
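
The fan-out can be sketched as one write path that shapes each event four ways. The `FanOut` class, its attribute names, and the event fields are hypothetical in-memory stand-ins, and the sketch assumes events arrive already tokenized and deduped.

```python
from collections import defaultdict
from datetime import date, datetime, timezone
from decimal import Decimal


class FanOut:
    """Hypothetical in-memory stand-ins for the four consumer stores."""

    def __init__(self):
        self.merch_by_minute = defaultdict(Decimal)  # minutes-fresh aggregates
        self.finance_rows = {}                       # one row per transaction
        self.lake_partitions = defaultdict(list)     # append-only, by sale date
        self.units_by_sku = defaultdict(int)         # point lookups keyed on SKU

    def write(self, event: dict) -> None:
        """Write one deduped, tokenized event into all four store shapes."""
        sold_at = event["sold_at"]
        minute = sold_at.replace(second=0, microsecond=0)
        key = (event["source"], event["transaction_id"])
        self.merch_by_minute[(minute, event["source"])] += event["amount"]
        self.finance_rows[key] = (sold_at.date(), event["amount"])
        self.lake_partitions[sold_at.date()].append(event)
        self.units_by_sku[event["sku"]] += event["quantity"]

    def finance_daily_total(self, day: date) -> Decimal:
        """Daily total that finance would reconcile against the gateway."""
        amounts = (amt for sale_day, amt in self.finance_rows.values() if sale_day == day)
        return sum(amounts, Decimal("0"))


fan_out = FanOut()
fan_out.write({"source": "pos", "transaction_id": "t-1", "sku": "SKU-1", "quantity": 2,
               "amount": Decimal("10.00"), "sold_at": datetime(2024, 1, 1, 9, 0, 5, tzinfo=timezone.utc)})
fan_out.write({"source": "ecom", "transaction_id": "t-9", "sku": "SKU-1", "quantity": 1,
               "amount": Decimal("15.50"), "sold_at": datetime(2024, 1, 1, 9, 3, 0, tzinfo=timezone.utc)})
```

Each attribute answers one consumer's question cheaply: `units_by_sku["SKU-1"]` is a single lookup, while `finance_daily_total` works from per-transaction rows that can be checked against the gateway.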

#### Step 3: Tokenize at ingest, before any store sees card data

PCI says card numbers can't sit in the lake. Tokenization runs at the ingest boundary; the tokenization service holds the only mapping. Every downstream store (lake, warehouse, merchandising store, supply chain store) sees only tokens. PCI scope shrinks to the tokenization service rather than every store the data touches. Masking at the warehouse view is the version where raw PANs still sit in underlying tables an auditor will query.
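
A minimal sketch of tokenize-at-ingest, assuming a vault-style service that maps each PAN to a random token. `TokenVault`, `tokenize_event`, the `card_number` field, and the six-digit BIN slice are all hypothetical; a production tokenization service would be a hardened, separately audited system, not an in-memory dict.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory stand-in for the tokenization service."""

    def __init__(self):
        self._token_by_pan = {}
        self._pan_by_token = {}  # the only place the PAN-to-token mapping lives

    def tokenize(self, pan: str) -> str:
        """Return a stable random token for a PAN, creating one on first sight."""
        token = self._token_by_pan.get(pan)
        if token is None:
            token = "tok_" + secrets.token_hex(16)
            self._token_by_pan[pan] = token
            self._pan_by_token[token] = pan
        return token


def tokenize_event(raw_event: dict, vault: TokenVault) -> dict:
    """Copy the event, replacing the PAN with a token plus PAN-derived fields."""
    pan = raw_event["card_number"]
    event = {k: v for k, v in raw_event.items() if k != "card_number"}
    event["card_token"] = vault.tokenize(pan)
    event["card_bin"] = pan[:6]  # assumption: a 6-digit BIN is enough for fraud features
    return event


vault = TokenVault()
raw = {"source": "pos", "transaction_id": "t-1", "card_number": "4111111111111111"}
safe_event = tokenize_event(raw, vault)
repeat_event = tokenize_event(raw, vault)
```

Only `safe_event` travels downstream. Because the same PAN always maps to the same token, consumers can still group by card without ever holding the card number.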

#### Step 4: Dedup once at ingest on transaction id

Both ingestion paths can retry. A single dedup step runs at ingest on the stable (source, transaction_id) key so each transaction collapses to one row before any consumer sees it. The downstream stores write idempotently on the same key. Leaving each consumer to handle duplicates on its own is the version where retries silently inflate different views differently and finance's reconciliation diverges from merchandising's count.
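
The dedup-then-idempotent-write pattern can be sketched as follows. `IngestDeduplicator` and `upsert` are hypothetical names, and the unbounded in-memory set is an assumption made for brevity; a real pipeline would keep this state in a durable keyed store, typically with a retention window.

```python
from decimal import Decimal


class IngestDeduplicator:
    """Tracks (source, transaction_id) keys already admitted at ingest."""

    def __init__(self):
        self._seen = set()

    def is_new(self, event: dict) -> bool:
        """Return True the first time a key is seen, False for retries."""
        key = (event["source"], event["transaction_id"])
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def upsert(store: dict, event: dict) -> None:
    """Idempotent write: replaying the same event overwrites the same row."""
    store[(event["source"], event["transaction_id"])] = event


incoming = [
    {"source": "pos", "transaction_id": "t-1", "amount": Decimal("10.00")},
    {"source": "pos", "transaction_id": "t-1", "amount": Decimal("10.00")},  # retry
    {"source": "ecom", "transaction_id": "t-1", "amount": Decimal("7.25")},  # same id, other source
]

dedup = IngestDeduplicator()
warehouse = {}
accepted = [event for event in incoming if dedup.is_new(event)]
for event in accepted:
    upsert(warehouse, event)
    upsert(warehouse, event)  # a replayed write leaves the store unchanged

total = sum((event["amount"] for event in warehouse.values()), Decimal("0"))
```

Keying on the pair rather than on `transaction_id` alone keeps a POS transaction and an e-commerce transaction that happen to share an id from collapsing into one row.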

---

### The shape that fits

> **What this design gives up**
>
> Four consumer stores is more operational machinery than one shared store; tokenization at ingest adds a service every event passes through; dedup at the boundary needs a unique-id index. Implementation cost is the price; the win is merchandising in minutes, finance reconciliation that doesn't require manual cleanup, data science training that's stable, supply chain triggers that fire on time, and PCI scope that stays inside the tokenization service.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - A streaming path delivers POS and e-commerce sales to merchandising within minutes; finance reads from a slower batch path.
> - Card numbers are tokenized at ingest; the lake and downstream stores hold only tokens.
> - Dedup runs once on transaction id before fan-out; retries from either source collapse.
> - Four downstream stores tuned to four query patterns rather than one shared shape.

> **The mistake that ships**
>
> The version that ships streams every transaction into one warehouse table that all four teams query. PCI auditors find raw PANs in the underlying tables because masking lived in a view downstream of the warehouse. Retries from either source produce duplicates that inflate finance's totals; the gateway reconciliation finds the gap. Supply chain reads from a table sized for analyst scans, so inventory triggers fire late. The eventual rebuild is tokenize-at-ingest, dedup-once, and four stores fed from one stream.

---

## Common follow-up questions

- Data science wants to train against last quarter's transactions including the original card numbers (just for fraud features). What in this design lets them, and what doesn't? _(Tests whether the candidate keeps PCI scope tight: data science cannot pull raw PANs; the lake holds only tokens. If a fraud feature needs PAN-derived signals (BIN, country), those derive at the tokenization boundary and flow through as fields, not as raw PANs. Reaching back to the tokenization service for raw PANs is a separate audited path with stricter access.)_
- Merchandising wants the same minutes-fresh view for a new region. What changes in this design, and what doesn't? _(Tests whether the candidate sees the streaming merchandising store as the extension surface: the new region's events flow through the same tokenization and dedup, land in the same store, and merchandising's dashboard adds the region as a filter. The four-store layout, the tokenization, and the dedup are unchanged.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/stores_and_the_site_together)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.