# Greenfield Build for Six Sources

> Infrastructure as code. Meaning as a service.

Canonical URL: <https://datadriven.io/problems/greenfield_build_for_six_sources>

Domain: Pipeline Design · Difficulty: hard · Seniority: L5

## Problem

Our company is starting fresh with Databricks as the core data platform. We have six data sources that need to be ingested, transformed, and exposed through a consistent semantic layer for business analysts. Design the end-to-end platform architecture - including infrastructure-as-code configuration for each source, the orchestration DAG, and how the semantic layer sits on top.

## Worked solution and explanation

### Why this problem exists in real interviews

A greenfield platform is the rare interview where you can put the right shape down without legacy fighting you. The trap is treating it as 'wire the six sources to the warehouse' and stopping. The real questions are upstream of plumbing: how does the team agree on what MRR means, who has the authority to deny finance access to a column, and how does anyone know at 7am that the 8am dashboards are at risk?

Most candidates draw each source straight to the warehouse with its own job, let analysts write their own SQL on top, and add row-level security if and when somebody asks. Each analyst's MRR slowly diverges from every other analyst's MRR; leadership sees three different numbers in three meetings. A finance analyst opens a payments table and sees raw card data because nothing enforced column access. Two sources fail overnight and nobody notices until the dashboards are blank at 8am.

> **Trick to Solving**
>
> An orchestrator owns the schedule, a semantic layer owns the definitions, the platform owns the access; analysts write queries, not policy.
> 
> 1. Orchestration is the spine: every source ingest, every transform, every quality check is a task with declared dependencies, sized to land before the 8am deadline with margin for the slowest source.
> 2. A semantic layer (dbt models, Databricks Unity, LookML) owns the canonical KPI definitions; dashboards read named metrics from the layer, not raw tables.
> 3. Column-level access is enforced at the warehouse, not by which table an analyst happens to open. The platform owns the policy; queries don't.
> 4. Each source has IaC for its connector, its tables, its access rules; promoting a change is a PR, not a runbook.

---

### Walk the requirements

#### Step 1: Orchestrator owns the cadence and the alerting

An orchestrator schedules each source's ingest, the dependent transforms, and the quality checks, with the deadline being 'dashboards land before 8am.' The DAG is sized so the slowest source has margin; sensors fire on each source's expected landing time and alert before 8am if anything is at risk, not at 8am when the analyst sees a blank dashboard. Without an orchestration layer there's nothing responsible for the sequence or the SLA.

#### Step 2: Semantic layer holds one definition per KPI

Leadership wants one MRR, one Net Revenue Retention, one number across every dashboard. The fix is a semantic layer between the warehouse tables and the dashboards: dbt models or a metrics layer that publishes named metrics with canonical SQL, version-controlled, reviewable. Dashboards consume the named metric, not raw SQL. An analyst who writes their own MRR has to either use the semantic-layer definition or get the new variant added to the layer with a code review; either way, leadership stops seeing three numbers.

#### Step 3: Column-level access enforced by the platform

Finance analysts can't see raw card data, full stop. Column-level access policies sit on the payments tables and the gold-layer views built from them; when finance queries the table, the warehouse returns the column with the sensitive values masked or omitted regardless of how the query is written. A 'we'll just not give finance a SELECT on the column' approach via permissions on the underlying view leaks the moment somebody builds a derived table. Policy lives on the column, not on the role's table list.

---

### The shape that fits

> **What this design gives up**
>
> An orchestrator and a semantic layer are upfront infrastructure that doesn't show ROI on day one; six sources running ad-hoc would land faster. Column-level masking adds a query-time cost on every read of those columns. IaC for every source means PRs to ship a connector change instead of a console click. Speed-to-first-dashboard is the cost; the win is a platform that holds together when six sources become twelve, three KPIs become thirty, and a fourth team needs access to half the warehouse.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - An orchestrator schedules each source's ingest and the dependent transforms, with sensors and alerts before the 8am dashboard SLA.
> - A semantic layer publishes one canonical KPI definition; dashboards read named metrics through it.
> - Column-level access policies enforce sensitive-field masking at the warehouse, regardless of query.

> **The mistake that ships**
>
> The design the team ships wires six sources to the warehouse and stops there. Three months in, leadership sees different MRR numbers in three meetings and asks what's going on; the answer is six analysts each wrote their own SQL. A finance analyst opens a payments table looking for invoice data and sees raw card numbers because nothing enforced column access at query time. A source fails overnight and dashboards are blank at 8am because no sensor was watching. The team rebuilds with an orchestrator, a semantic layer, and column policies. Each was reachable in the original conversation if it had gone past 'connect the sources.'

---

## Common follow-up questions

- Two analysts disagree on the right MRR definition. What does the semantic layer do, and what does it not do? _(Tests whether the candidate sees the semantic layer as a forcing function for the conversation, not a tiebreaker. The two analysts argue and one definition wins (or both are added as named variants); the layer makes the choice visible and reviewable, but the choice is still a human one.)_
- A new source is added by an external vendor that delivers data on an unpredictable schedule. What in this design needs to change? _(Tests whether the candidate sees orchestration sensors and the dependency graph as the extension point: the unpredictable source becomes a sensor with a wider window, downstream transforms gate on it, and dashboards that depend on it carry their own SLA. Nothing in the warehouse layout or the semantic layer changes.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/greenfield_build_for_six_sources)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.