# Analysts Are Slowing the Store Down

> Orders placed. Data warehouse hungry.

Canonical URL: <https://datadriven.io/problems/analysts_are_slowing_the_store_down>

Domain: Pipeline Design · Difficulty: medium · Seniority: L5

## Problem

We run an e-commerce platform with hundreds of thousands of merchants and millions of daily orders, but our analytics team currently queries the production database directly, which is creating performance problems for the application. We need to move analytics onto a proper warehouse without disrupting the live platform. Design the pipeline that replicates order and product data from the OLTP system into the analytics layer with sufficient freshness for business reporting.

## Worked solution and explanation

### Why this problem exists in real interviews

Analytics is hitting the read replica that the application also depends on, lag spikes, customers feel it. The fix sounds obvious: move analytics off the replica. The trap is the obvious fix is incomplete: a warehouse with nightly batch satisfies analytics but breaks the merchant dashboard's freshness; a warehouse merchants share without per-merchant policy is one missing filter from a leak.

The natural shape is a nightly batch from the replica into a new warehouse. Analytics is happy, the replica recovers, and merchants discover their dashboards are showing yesterday's orders. Somebody adds a 'realtime view' over the replica, which puts the load right back. Meanwhile the new warehouse exposes orders to merchants without row-level security, and a query without a `WHERE merchant_id = ?` returns another merchant's data the first time someone forgets it.

> **Trick to Solving**
>
> CDC off the log into a warehouse, streaming path for merchant dashboards, row-level security enforced by the platform, no queries against the OLTP.
> 
> 1. Log-based CDC reads the binlog/WAL the database is already writing for replication. Zero query load on either the primary or the replica.
> 2. Two consumer paths from the same change stream: a streaming path that updates a merchant-facing store within minutes, and a batch path that lands in the warehouse for analytics.
> 3. Row-level security is a property of the warehouse, applied per merchant. The policy follows the user, not the query.

---

### Walk the requirements

#### Step 1: Move analytics off the replica with log-based CDC

Replication has to add zero load to the production system. Log-based CDC (Debezium, Datastream, built-in connectors) reads the binlog or WAL that the database is already writing, runs as a separate process, and emits change events to a stream. Analytics reads from the warehouse fed by that stream; nothing analytics-shaped touches the OLTP or the replica. Without a CDC capture path the replication is a query against production, which is the named cause of the problem.

#### Step 2: Streaming path serves merchant dashboards in minutes

Merchants expect their orders within minutes. The same change stream feeds a streaming consumer that updates a merchant-facing serving store; the dashboard reads from there. Analytics runs on a slower batch path into the warehouse. One stream, two consumer paths sized to the consumer; merchants don't pay the latency of the analytics batch and analytics doesn't pay the cost of a streaming serving tier. A 'we'll just make analytics fast and let merchants read it too' design ends up paying streaming prices for analytics.

#### Step 3: Per-merchant row-level security at the warehouse

Merchants see only their own orders. The warehouse enforces row-level security on the orders table: when a merchant queries, the warehouse appends `merchant_id = <this user's merchant>` automatically, no matter what query the merchant wrote. A 'we'll filter in the BI tool' approach is one forgotten clause from an exposure that travels to support. Policy lives on the table, enforced at query time; merchants and analytics readers get different views from the same physical table.

---

### The shape that fits

> **What this design gives up**
>
> Log-based CDC plus a streaming serving store plus a warehouse is more pieces than one nightly batch from the replica. Row-level security adds a query-rewrite cost on every merchant read. Operational simplicity is the cost; in return, the application stops being slowed by analytics, merchants believe their dashboards, and the row boundary survives any forgotten filter.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - A change-data-capture path off the production database log adds zero query load.
> - Two consumer paths off the change stream serve merchant dashboards within minutes and analytics in the warehouse.
> - Row-level security on the warehouse enforces per-merchant visibility regardless of how the query is written.

> **The mistake that ships**
>
> The design the team ships does nightly batch from the replica into a new warehouse and exposes the orders table to merchants without row-level security. Replica lag stops spiking; merchants discover their dashboards lag a day. Somebody runs a query without `WHERE merchant_id` and another merchant's orders show up in a CSV export. The team adds a streaming path and row-level security and switches replication to log-based CDC after the fact. The rebuild walks back the three properties the original cut decided to skip.

---

## Common follow-up questions

- An analyst asks for a cross-merchant aggregate ('sum of revenue across all merchants this week'). What in this design lets them, and what doesn't? _(Tests whether the candidate sees row-level security as merchant-scoped and analyst-scoped differently: analysts have an internal role that sees all rows, while merchants have per-merchant policies. The cross-merchant query is allowed because the analyst's role flips the policy.)_
- The CDC stream falls behind during a maintenance window on the production database. What does the merchant dashboard show, and what does the analytics warehouse do? _(Tests whether the candidate has thought about staleness: the dashboard either freezes at the last update or surfaces a 'stale' indicator; the warehouse keeps catching up when the stream resumes; the lag alert fires before either consumer is meaningfully impacted.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/analysts_are_slowing_the_store_down)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.