# Every Region Exports Its Own Way

> Sales data, BigQuery, Dataflow. Make it all sing.

Canonical URL: <https://datadriven.io/problems/every_region_exports_its_own_way>

Domain: Pipeline Design · Difficulty: medium · Seniority: L5

## Problem

Our sales organization runs entirely on GCP and needs an end-to-end data pipeline to move transactional sales data from multiple regional databases into a centralized analytics layer. Right now every regional team runs its own ad-hoc exports. Design a unified pipeline that handles ingestion, transformation, and storage and delivers query performance that meets business SLAs.

## Worked solution and explanation

### Why this problem exists in real interviews

Three constraints from three different rooms. Sales wants same-day actuals; the CFO wants the bill down; legal wants region isolation enforced. The trap is that ad-hoc per-region exports already 'work' the way each region wants and consolidating into one warehouse without a deliberate layout is expensive in all three dimensions: it can be slow, expensive, and leak across regions all at once.

The default move is one nightly job that pulls every region into a single warehouse table. Same-day actuals slip when the slowest region runs long. Common queries scan the full table because nothing's been laid out for the actual access pattern, and the bill grows. A regional sales rep runs a query missing the region filter and gets another region's deals in their result, which is the kind of thing that shows up in a Slack screenshot before the team finds out.

> **Trick to Solving**
>
> Per-region orchestration into a unified warehouse, partitioned and clustered for the common query, row-level security on the deal table.
> 
> 1. Each region runs its own ingest task on its own schedule, gated by a sensor for that region's data. The orchestrator runs them independently; one slow region doesn't block the others.
> 2. Warehouse layout matches the common query. Partition by date and cluster by region (or product) so a regional dashboard reads a slice, not the world.
> 3. Row-level security on the deal table enforces the region boundary at the warehouse, regardless of how the query is written.

---

### Walk the requirements

#### Step 1: Per-region ingest, sized to land before each region's morning deadline

An orchestrator schedules each region's ingest as its own task, with sensors that fire before the region's morning deadline if any region is at risk. Each region's data lands in the warehouse independently; a global refresh follows once the regions it depends on have completed. One slow region doesn't block the others; on-call sees the issue with hours to fix, not minutes. Without an orchestration layer there's nothing watching the per-region SLA; without a warehouse tier the unified view has nowhere to live.

#### Step 2: Lay out the warehouse so common queries scan a slice

The CFO wants the bill down without slowing reports. Partition the deal fact by date so a 'last week' query reads only the relevant days; cluster by region (or product family) so a regional dashboard scans only its slice. Common queries become cheap because they read few partitions and few clusters; rare queries pay for what they scan. A 'one big flat table' approach pays warehouse compute for every query at full table scan, which is the cost shape the CFO is reacting to.

#### Step 3: Row-level security on deals at the warehouse

Sales reps see their region's deals; finance sees everything. The warehouse owns the policy: when a sales rep queries the deal table, the warehouse appends the region filter automatically based on who they are. A 'we'll filter in the BI tool' approach is one forgotten clause from a leak; row-level security enforces the boundary regardless of how the query is written. The policy lives on the table, not on the screen.

---

### The shape that fits

> **What this design gives up**
>
> Per-region orchestration is more DAG configuration than one nightly job. Partition and cluster keys constrain query patterns: a query that doesn't filter by date or region won't get the cost benefit and will look slower than expected. Row-level security adds query-rewrite cost on every read. Some pipeline simplicity is the cost; the win is same-day per-region freshness, a bill that scales with what's actually queried, and a region boundary that survives a forgotten filter.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - An orchestrator schedules per-region ingest with per-region SLA alerting before each region's morning deadline.
> - The warehouse layout matches the common query (date partition, region cluster) so dashboards scan a slice.
> - Row-level security on the deal table enforces region visibility regardless of how the query is written.

> **The mistake that ships**
>
> What goes out the door first does one nightly job over all regions, lands in a flat warehouse table, and applies region filters in the BI tool. One slow region pushes the morning report past its deadline. The CFO asks why the bill is climbing and the answer is full-table scans on every dashboard. A sales rep exports a CSV without the region filter and another region's deals end up in a screenshot. The team rebuilds with per-region orchestration, partition+cluster layout, and row-level security. The CFO's bill question stays open across two quarters while the layout is rebuilt.

---

## Common follow-up questions

- A regional dashboard query is filtering by date but still reading too many bytes. What do you look at first, and what changes? _(Tests whether the candidate sees that partition + cluster only help when the filter touches the partition key (date) and the cluster key (region). If the slow query filters by, say, sales rep id without region, the cluster doesn't help; the fix is either adding region to the filter, adding rep id to the cluster keys, or building a derived table for that pattern.)_
- A finance analyst is moved into a regional role. What changes in this design, and where? _(Tests whether the candidate sees row-level security as a property of the user's role: changing the role flips visibility platform-wide; no query rewrites, no exports to redistribute.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/every_region_exports_its_own_way)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.