Design Patterns

There is no single right way to organize a data warehouse. Different architectures optimize for different tradeoffs: flexibility vs simplicity, auditability vs query speed, team autonomy vs consistency. This lesson covers the major architectural patterns you will encounter in production: Medallion (bronze/silver/gold), Data Vault, One Big Table, semantic layers, and pipeline DAG design. For each pattern, you will learn when it fits, when it does not, and what breaks when you choose wrong.

Medallion Architecture

Daily Life

Interviews

Organize data into bronze, silver, gold layers

Bronze, Silver, Gold: Progressive Refinement

Medallion architecture organizes a data lake into three layers, each with a clear purpose. Raw data enters at bronze. Cleaned and validated data lives in silver. Business-ready aggregates and curated datasets live in gold. The architectural principle is separation of concerns: raw ingestion should not depend on business logic, and business transformations should not depend on source system quirks.

This is the most common architecture at companies using Databricks, Delta Lake, or any modern lakehouse platform. If you work with a data lake, you will encounter this pattern.

BronzeSilverGold

Bronze

Raw Ingestion Layer

Data exactly as received from the source. No transformations, no cleaning, no type conversions. If the API sends malformed JSON, bronze stores it as-is. This layer is about capture, not interpretation.

Silver

Cleaned and Validated

Parse JSON into typed columns. Deduplicate records. Handle nulls consistently. Convert timestamps to UTC. Silver is the single source of truth for validated, queryable data.

Gold

Business-Ready Aggregates

Pre-computed metrics, curated datasets, and business-specific models. Gold tables are what dashboards, ML features, and reports consume. Optimized for specific use cases.

Layer	Contents	Who Consumes It	Update Pattern
Bronze	Raw JSON/CSV/Parquet from sources	Silver ETL jobs only	Append-only, never modified
Silver	Cleaned, typed, deduplicated tables	Gold ETL jobs, analysts for ad-hoc	Merge/overwrite as new data arrives
Gold	Business metrics, curated datasets	Dashboards, ML, reports, APIs	Refreshed by scheduled pipeline

Bronze is append-only and never cleaned. This is intentional. If your silver transformation has a bug, you can fix it and reprocess from bronze. If bronze was already cleaned, you cannot go back. Bronze is your insurance policy against ETL bugs.

When Medallion Does Not Fit

Medallion adds operational complexity: three layers means three sets of tables, three ETL jobs, and three failure points. For small teams with simple data flows, a two-layer architecture (raw + curated) is often sufficient. Do not adopt three layers because Databricks recommends it. Adopt it when your data sources are messy enough to justify a separate cleaning layer.

TIP

The most common mistake: putting business logic in the bronze-to-silver step. Silver should be generic cleaning (dedup, type conversion, null handling). Business logic (metric definitions, entity resolution, domain-specific rules) belongs in silver-to-gold.

Data Vault

Daily Life

Interviews

Build a flexible, audit-ready data warehouse

Hubs, Links, and Satellites

Data Vault is a modeling methodology designed for agility and auditability in enterprise data warehouses. It decomposes data into three types of tables: Hubs (business keys), Links (relationships between hubs), and Satellites (descriptive attributes with history). The structure is more complex than star schemas but handles change more gracefully.

Hubs store only the business key (customer_id, product_sku) and a hash key. Links store the relationship between two hubs (customer_X_ordered_product_Y). Satellites store the descriptive attributes (customer_name, customer_address) with effective dates for full history.

hub_customer

PKhash_key

customer_id

load_date

source

sat_customer_details

FKhash_key

name

city

effective_from

effective_to

link_customer_order

PKlink_hash

FKcustomer_hash

FKorder_hash

load_date

Data Vault separates structure (hubs and links) from content (satellites). Adding a new attribute means adding a satellite, not modifying the hub or link. Adding a new relationship means adding a link, not modifying any existing table. This makes Data Vault highly agile for evolving requirements.

Data Vault vs Star Schema

✓Star Schema

Optimized for query performance
Simpler structure (fact + dimensions)
Fewer JOINs per query
Best for: stable, well-understood domains
Analysts can query directly

✓Data Vault

Optimized for change agility
More complex structure (hubs, links, satellites)
More JOINs per query
Best for: evolving, multi-source enterprises
Typically queries go through a star-schema presentation layer

In practice, Data Vault is used as a staging/integration layer, not as the query layer. Most teams build Data Vault in silver and then project star schemas in gold for analyst consumption. This gives you the agility of Data Vault for ETL and the simplicity of star schemas for queries.

TIP

Do not adopt Data Vault unless you have a specific need: multiple conflicting source systems, frequent schema changes, or regulatory audit requirements. For most teams, a well-designed star schema is simpler and faster.

One Big Table (OBT)

Daily Life

Interviews

Flatten data for simple analytics queries

Flatten Everything Into One Table

One Big Table (OBT) is exactly what it sounds like: pre-join all your dimensions into the fact table and serve everything from a single wide table. No JOINs at query time. Every column an analyst might need is on the same row. Simple, fast, and easy to understand.

OBT is a consumption pattern, not a source-of-truth pattern. You build the OBT from normalized source tables. The OBT is a derived, denormalized view optimized for a specific set of queries. If the source data changes, you rebuild the OBT.

Characteristic	OBT	Star Schema
JOINs at query time	Zero	1-5 per query
Table count	1	5-15 (fact + dimensions)
Query complexity	Simple: SELECT ... WHERE ... GROUP BY	Moderate: multiple JOINs
Flexibility to change	Low: adding a column requires rebuilding the entire table	High: add a column to a dimension, all facts benefit
Storage	High: dimension attributes duplicated on every fact row	Lower: dimension attributes stored once

OBT works well for BI tools that struggle with JOINs (some Tableau configurations, Google Sheets connected to BigQuery). It also works for ML feature tables where the model needs a flat input. It does not work well as a general-purpose analytical model because every schema change requires a full rebuild.

When OBT Is the Right Choice

The consumer is a BI tool or notebook that cannot handle multi-table JOINs effectively.

The query pattern is well-known and stable (always the same columns, filters, and groupings).

The table is a leaf-node consumption layer, not a source of truth.

Do not use OBT as your only table. Always maintain the normalized source tables alongside it.

Adding a new dimension attribute requires rebuilding and backfilling the entire OBT.

TIP

OBT is a presentation pattern, not a modeling pattern. Build it as the final step in your pipeline, derived from properly modeled star schema tables. If the OBT breaks, you can always rebuild it from the star schema. If the OBT IS your only table, you cannot.

Semantic Layers

Daily Life

Interviews

Define business metrics once for all tools

Define Metrics Once, Use Everywhere

A semantic layer is a logical layer that defines business metrics in one place and makes them available to every consumer: dashboards, notebooks, APIs, and ad-hoc queries. Without one, every team defines 'revenue' slightly differently. Marketing counts refunds. Sales does not. Finance uses a different exchange rate. The numbers never match.

The semantic layer sits between the physical tables and the consumers. It defines: what 'revenue' means (SUM(amount) WHERE status != 'refunded'), what dimensions are valid for grouping, and what filters are available. Every consumer uses the same definitions.

Tools and Approaches

Define metrics in YAML alongside your dbt models. Metrics are SQL-generated at query time. Tight integration with dbt's transformation layer.
Define dimensions, measures, and joins in LookML. All Looker queries go through the semantic layer. Tightly coupled to Looker's BI platform.
Open-source semantic layer that defines metrics in JavaScript/YAML. Supports multiple BI tools. Pre-aggregation built in.
Enterprise semantic layers that sit between the warehouse and any BI tool. Universal metric definitions across tools.

The semantic layer solves the 'which number is right?' problem. When the CEO asks for revenue and gets three different answers from three teams, the root cause is almost always: no shared metric definition. A semantic layer eliminates this by making the definition authoritative.

When You Need One

If your company has one analyst and one dashboard, you do not need a semantic layer. If you have 10 analysts, 5 dashboards, and 3 teams all computing 'revenue,' you need one. The inflection point is when metric inconsistency starts causing confusion or incorrect business decisions.

Pipeline DAG Design

Daily Life

Interviews

Structure pipeline dependencies correctly

Structuring Pipeline Dependencies

A pipeline DAG (Directed Acyclic Graph) defines the execution order of your data transformations. Table A depends on Table B. Table B depends on Table C. The DAG ensures C runs before B, and B runs before A. Getting the DAG wrong means either: tables run before their dependencies are ready (wrong data), or the entire pipeline serializes unnecessarily (slow).

The DAG should mirror the medallion layers. Bronze tasks (ingestion) have no dependencies on each other. Silver tasks depend on their bronze source. Gold tasks depend on their silver inputs. No gold task should depend directly on bronze. No task should run backward through the layers.

DAG Design Principles

Layer isolation: bronze tasks never depend on silver or gold. Silver never depends on gold. Dependencies flow downward only.

Idempotent tasks: every task can be re-run safely. If a task fails, retry it without side effects.

Quality gates between layers: run data quality checks after silver before triggering gold. If checks fail, gold does not run.

Failure isolation: a failed bronze task for source A should not block silver tasks for source B. Independent sources run independently.

Common DAG Anti-Patterns

✗Anti-Pattern

One giant DAG with 500 tasks
Cross-layer dependencies (gold reads bronze)
Serial execution of independent sources
No quality gates between layers

✓Best Practice

Separate DAGs per domain or layer
Strict layer ordering: bronze -> silver -> gold
Parallel ingestion of independent sources
dbt tests or Great Expectations between silver and gold

The practical test: draw your DAG on a whiteboard. If it looks like a tangled web with arrows going in every direction, you have a dependency management problem. If it looks like clean layers with arrows flowing top to bottom, your architecture is sound.

Cross-layer dependencies are the most common DAG anti-pattern. A gold table that reads directly from bronze bypasses the cleaning and validation in silver. When the source schema changes, the gold table breaks. If it read from silver, the silver-to-gold contract would catch the change.

TIP

Never let gold read from bronze. The entire point of the layered architecture is that each layer provides a clean, stable interface to the layer below it. Skipping layers is like accessing a database directly from the frontend: it works until it does not, and then it is very expensive to fix.

❯❯❯PUTTING IT ALL TOGETHER

> You are building a data platform for a mid-size e-commerce company with 3 data sources (PostgreSQL, Shopify API, Google Analytics), 2 analysts, and a growing BI need.

You choose Medallion architecture: bronze tables for raw ingestion from each source, silver for cleaned/deduped tables, gold for business-ready models (daily revenue, customer cohorts).

You skip Data Vault (too complex for 3 sources and 2 analysts) and build star schemas directly in gold. An OBT for the main revenue dashboard where the BI tool struggles with JOINs.

You define 'revenue' and 'active customer' in dbt metrics so both analysts use the same definitions. DAG: 3 parallel bronze ingestions > silver cleaning > quality gate > gold models.

KEY TAKEAWAYS

Medallion (Bronze/Silver/Gold): separate raw ingestion from cleaning from business logic; bronze is your insurance policy

Data Vault: hubs, links, satellites for audit-ready, change-agile enterprise warehouses; use as integration layer

One Big Table: pre-joined consumption layer for BI tools; always derived from properly modeled source tables

Semantic layers: define metrics once so every consumer gets the same number; eliminate the 'which revenue?' problem

DAG design: layer isolation, idempotent tasks, quality gates between layers, never let gold read bronze

Blueprints for building data systems

Category: Data Modeling
Duration: 21 minutes
Challenges: 12 hands-on challenges

Topics covered: Medallion Architecture, Data Vault, One Big Table (OBT), Semantic Layers, Pipeline DAG Design

Lesson Sections

Medallion Architecture (concepts: dmMedallion)
Bronze, Silver, Gold: Progressive Refinement Medallion architecture organizes a data lake into three layers, each with a clear purpose. Raw data enters at bronze. Cleaned and validated data lives in silver. Business-ready aggregates and curated datasets live in gold. The architectural principle is separation of concerns: raw ingestion should not depend on business logic, and business transformations should not depend on source system quirks. This is the most common architecture at companies usin
Data Vault (concepts: dmDataVault)
Hubs, Links, and Satellites Data Vault is a modeling methodology designed for agility and auditability in enterprise data warehouses. It decomposes data into three types of tables: Hubs (business keys), Links (relationships between hubs), and Satellites (descriptive attributes with history). The structure is more complex than star schemas but handles change more gracefully. Hubs store only the business key (customer_id, product_sku) and a hash key. Links store the relationship between two hubs (
One Big Table (OBT) (concepts: dmOneBigTable)
Flatten Everything Into One Table One Big Table (OBT) is exactly what it sounds like: pre-join all your dimensions into the fact table and serve everything from a single wide table. No JOINs at query time. Every column an analyst might need is on the same row. Simple, fast, and easy to understand. OBT is a consumption pattern, not a source-of-truth pattern. You build the OBT from normalized source tables. The OBT is a derived, denormalized view optimized for a specific set of queries. If the sou
Semantic Layers (concepts: dmSemanticLayer)
Define Metrics Once, Use Everywhere A semantic layer is a logical layer that defines business metrics in one place and makes them available to every consumer: dashboards, notebooks, APIs, and ad-hoc queries. Without one, every team defines 'revenue' slightly differently. Marketing counts refunds. Sales does not. Finance uses a different exchange rate. The numbers never match. The semantic layer sits between the physical tables and the consumers. It defines: what 'revenue' means (SUM(amount) WHER
Pipeline DAG Design (concepts: dmMedallion)
Structuring Pipeline Dependencies A pipeline DAG (Directed Acyclic Graph) defines the execution order of your data transformations. Table A depends on Table B. Table B depends on Table C. The DAG ensures C runs before B, and B runs before A. Getting the DAG wrong means either: tables run before their dependencies are ready (wrong data), or the entire pipeline serializes unnecessarily (slow). The DAG should mirror the medallion layers. Bronze tasks (ingestion) have no dependencies on each other.