# Doubling Every Six Months

> Tuesdays are quiet. Black Friday is not.

Canonical URL: <https://datadriven.io/problems/doubling_every_six_months>

Domain: Pipeline Design · Difficulty: hard · Seniority: L7

## Problem

We're a fast-growing marketplace and our data volume has been doubling every six months. We keep throwing more servers at the problem but it doesn't scale and the costs are exploding. Three teams read this data on completely different cadences (operations live, analytics hourly, data science weekly), flash sales drive 10x traffic spikes that take hours to recover from, and three new enterprise customers signed contracts requiring their data be kept separate from everyone else's. Design a pipeline that scales automatically without manual provisioning, brings cost per unit down as we grow, and holds all three constraints at once.

## Worked solution and explanation

### Why this problem exists in real interviews

Volume doubles every six months. The CTO wants cost per unit going down, not up. Three teams read the same data on completely different clocks, flash sales drive 10x spikes, and three new enterprise contracts say their data has to stay separate from everyone else's. Any one of these is a normal pipeline problem; together they kill any design that picks 'one cluster, one table, one schedule.'

The first instinct is to keep the existing fixed cluster, make it bigger, and turn on auto-scaling for the worker count. The cluster still idles at its floor capacity 80% of the time, the storage layer still looks like one giant partition because nobody changed it, queries still scan the world, and tenant isolation still depends on a `WHERE customer_id = ?` filter in the BI tool. Cost goes up, query time gets worse, and the enterprise audit fails because the data is still commingled.

> **Trick to Solving**
>
> When data doubles, the answer isn't a bigger cluster; it's letting compute be zero when nobody's reading.
> 
> 1. Anchor on cheap, partitioned object storage. The economics only bend down if storage is cold by default and compute spins up on demand.
> 2. Three consumers want sub-minute, hourly, and weekly freshness. That calls for separate paths (streaming for ops, batch for the rest), not three priorities on one tier. Forcing a shared tier sized for the fastest consumer is what made the bill explode.
> 3. Tenant isolation lives in the layout, not in the query. Per-customer prefixes / partitions take the question out of the application layer where it's been failing audits.

---

### Walk the requirements

#### Step 1: Anchor on cheap, partitioned storage so compute can be zero

Cost per unit only falls if storage is cheap and compute doesn't run when nobody's reading. That points at object storage such as S3, GCS, or ADLS, partitioned by date and tenant, with an on-demand query engine on top: Athena, BigQuery, or Trino on a cluster that scales down when idle. The shape is 'pay for storage, pay for queries, idle is free.' A fixed cluster running 24/7 is the opposite of that bargain.
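
As a rough illustration of why the layout matters for cost, the sketch below builds Hive-style `key=value` prefixes and lists the ones a single-tenant, date-bounded query would have to read. The `events/` root, the `tenant=` and `dt=` key names, and the tenant name `acme` are assumptions made for the example, not part of the problem statement.

```python
from datetime import date, timedelta


def partition_prefix(tenant: str, day: date) -> str:
    """Build a Hive-style prefix: tenant first, then date (hypothetical layout)."""
    return f"events/tenant={tenant}/dt={day.isoformat()}/"


def prefixes_for_query(tenant: str, start: date, end: date) -> list[str]:
    """List only the prefixes a query over [start, end] has to read."""
    days = (end - start).days + 1
    return [partition_prefix(tenant, start + timedelta(days=i)) for i in range(days)]


# A weekly data-science query for one tenant touches seven prefixes,
# no matter how many months of history sit in the bucket.
week = prefixes_for_query("acme", date(2024, 11, 25), date(2024, 12, 1))
for prefix in week:
    print(prefix)
```

With pay-per-scan engines, the bytes read follow the query window and the tenant rather than the total retained history, which is what lets cost per query stay bounded as the archive grows.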

#### Step 2: Split paths by freshness need, not by consumer

Operations needs sub-minute freshness, analytics is hourly, and data science is weekly. That's three cadences but only two paths: a streaming path for ops, and a batch path that lands in cold storage and is queried on demand by analytics and data science. The streaming path is the expensive one whichever stream processor you reach for (Flink, Spark Structured Streaming, Kafka Streams), so keep it narrow. Anything that doesn't need sub-minute freshness shouldn't be on it.
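
One way to picture "keep it narrow" is a router that sends every event to the batch path and only an allow-listed subset to the streaming path. This is a minimal in-memory sketch; the event type names and the `Router` class are hypothetical, and a real pipeline would write to a log and an object store rather than Python lists.

```python
from dataclasses import dataclass, field

# Hypothetical event types that operations needs within a minute.
STREAMING_EVENT_TYPES = frozenset({"order_failed", "payment_declined", "inventory_low"})


@dataclass
class Router:
    stream: list = field(default_factory=list)  # stands in for the streaming path
    batch: list = field(default_factory=list)   # stands in for cold storage

    def route(self, event: dict) -> None:
        # Every event lands on the batch path, so cold storage stays complete.
        self.batch.append(event)
        # Only the ops-critical subset also takes the expensive streaming path.
        if event["type"] in STREAMING_EVENT_TYPES:
            self.stream.append(event)


router = Router()
for event_type in ["page_view", "order_failed", "page_view", "inventory_low"]:
    router.route({"type": event_type})

print(len(router.batch), len(router.stream))
```

The allow-list makes the cost decision explicit: adding an event type to the streaming path is a deliberate change, not a default.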

#### Step 3: Absorb spikes in a buffer, not in compute

Flash-sale spikes of 10x were taking the cluster hours to recover from because there was nowhere for events to wait. Put a queue or log between producers and processing (Kafka, Kinesis, and Pub/Sub all do this) so the spike's worth of data sits in the buffer while the consumers catch up. Recovery time becomes 'how long until the buffer drains' instead of 'how long until the cluster stops failing.'
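
The drain time can be estimated with simple arithmetic: the backlog is whatever arrived above consumer capacity during the spike, and it drains at the rate by which capacity exceeds normal traffic. The sketch below assumes constant rates, and every number in it is hypothetical; only the 10x spike multiplier comes from the problem.

```python
def backlog_after_spike(spike_rate: float, capacity: float, spike_seconds: float) -> float:
    """Events left waiting in the buffer when the spike ends."""
    return max(0.0, spike_rate - capacity) * spike_seconds


def drain_seconds(backlog: float, capacity: float, baseline_rate: float) -> float:
    """Time to clear the backlog once traffic returns to baseline."""
    headroom = capacity - baseline_rate
    if headroom <= 0:
        raise ValueError("consumers must be faster than baseline traffic to ever drain")
    return backlog / headroom


# Hypothetical numbers: 1,000 events/s baseline, a 10x spike for 10 minutes,
# and consumers that can process 2,000 events/s.
baseline = 1_000.0
backlog = backlog_after_spike(spike_rate=10 * baseline, capacity=2_000.0, spike_seconds=600)
drain = drain_seconds(backlog, capacity=2_000.0, baseline_rate=baseline)
print(backlog, drain / 60)
```

Under these assumed numbers the buffer has to hold 4.8 million events and takes 80 minutes to drain, so the buffer's retention and the consumers' headroom are the two knobs that set recovery time.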

#### Step 4: Put tenant isolation in the layout, not the query

Three enterprise customers signed contracts requiring isolation. A `WHERE` clause in BI is not isolation, and it's the kind of thing that fails an audit when someone in the wrong tenant runs the wrong query. Per-tenant prefixes (or buckets) make the boundary a property of where the data lives, enforced by storage permissions, not by every consumer remembering to filter.
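
To make "enforced by storage permissions" concrete, here is a sketch that generates an S3-style IAM policy document scoped to one tenant's prefix. It assumes AWS S3 as the object store and a `events/tenant=<name>/` layout; the bucket name `marketplace-lake` and the tenant name are hypothetical, and other clouds have their own equivalents.

```python
import json


def tenant_read_policy(bucket: str, tenant: str) -> dict:
    """Build a read-only policy limited to a single tenant prefix (assumed layout)."""
    prefix = f"events/tenant={tenant}/"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is allowed only under the tenant's own prefix.
                "Sid": "ListOwnPrefixOnly",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                # Object reads are allowed only for keys under that prefix.
                "Sid": "ReadOwnObjectsOnly",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


policy = tenant_read_policy("marketplace-lake", "acme")
print(json.dumps(policy, indent=2))
```

A role carrying only this policy cannot read another tenant's objects even if a query forgets its filter, which is the property an audit can verify without reading application code.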

---

### The shape that fits

> **What this design gives up**
>
> On-demand compute is cheaper but cold-starts hurt. The first query of the morning is slower than the tenth. Per-tenant layout costs you cross-tenant query convenience: if marketing wants 'all customers in the east region', they're now reading across many partitions. You're choosing predictable cost and clean isolation over the convenience of one big table.

> **What reviewers check**
>
> A reviewer checks that the design on the canvas addresses these requirements:
> - Data is doubling every six months and the CTO wants cost per unit to fall as we grow.
> - Operations needs sub-minute freshness, analytics is hourly, and data science runs weekly.

> **The mistake that ships**
>
> The version that ships keeps one cluster, one big table, makes the cluster auto-scale on workers, and adds a `customer_id` filter in BI. The bill goes up because the cluster never goes to zero, the spike still saturates compute because nothing buffers, and during an audit a marketing analyst's query returns rows from an enterprise customer because they forgot the filter. The team scrambles to add a code-review rule about filters. The actual fix is that compute has to be on-demand and isolation has to be in the layout, not in queries.

---

## Common follow-up questions

- When a flash sale doubles traffic for an hour, what saturates first and what catches up first? _(Tests where back-pressure lives. Buffer depth, stream-processor scaling speed, on-demand query cold-start, and which consumer notices the spike last.)_
- How do you give an enterprise customer a hard ceiling on their query cost without rewriting the pipeline? _(Tests per-tenant compute isolation: separate query slots, per-tenant credentials, or a chargeback mechanism off the partition layout you already have.)_
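
For the cost-ceiling question, one possible shape is an admission check that tracks bytes scanned per tenant and rejects queries that would exceed a budget. The `TenantScanBudget` class and the limits below are hypothetical; managed engines offer comparable controls (for example BigQuery's maximum bytes billed setting or Athena workgroup data usage limits), which would normally be preferred over hand-rolled accounting.

```python
class TenantScanBudget:
    """Tracks scanned bytes for one tenant against a hard monthly ceiling."""

    def __init__(self, monthly_limit_bytes: int) -> None:
        self.monthly_limit_bytes = monthly_limit_bytes
        self.used_bytes = 0

    def admit(self, estimated_scan_bytes: int) -> bool:
        """Admit the query only if it fits in the remaining budget."""
        if self.used_bytes + estimated_scan_bytes > self.monthly_limit_bytes:
            return False
        self.used_bytes += estimated_scan_bytes
        return True


# Hypothetical ceiling of 1 TiB per month for one enterprise tenant.
TIB = 1024 ** 4
budget = TenantScanBudget(monthly_limit_bytes=TIB)
first = budget.admit(TIB // 2)    # fits: half the budget
second = budget.admit(TIB // 2)   # fits: exactly reaches the ceiling
third = budget.admit(1)           # rejected: would exceed the ceiling
print(first, second, third)
```

Because the data is already partitioned by tenant, the scan estimate for a query can be derived from the sizes of the prefixes it touches.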

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/doubling_every_six_months)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io).