Design Patterns

There is no single right way to organize a data warehouse. Different architectures optimize for different tradeoffs: flexibility vs simplicity, auditability vs query speed, team autonomy vs consistency. This lesson covers the major architectural patterns you will encounter in production: Medallion (bronze/silver/gold), Data Vault, One Big Table, semantic layers, and pipeline DAG design. For each pattern, you will learn when it fits, when it does not, and what breaks when you choose wrong.

Medallion Architecture

Daily Life
Interviews

Organize data into bronze, silver, gold layers

Bronze, Silver, Gold: Progressive Refinement

Medallion architecture organizes a data lake into three layers, each with a clear purpose. Raw data enters at bronze. Cleaned and validated data lives in silver. Business-ready aggregates and curated datasets live in gold. The architectural principle is separation of concerns: raw ingestion should not depend on business logic, and business transformations should not depend on source system quirks.
This is the most common architecture at companies using Databricks, Delta Lake, or any modern lakehouse platform. If you work with a data lake, you will encounter this pattern.
BronzeSilverGold
Bronze
Raw Ingestion Layer
Data exactly as received from the source. No transformations, no cleaning, no type conversions. If the API sends malformed JSON, bronze stores it as-is. This layer is about capture, not interpretation.
Silver
Cleaned and Validated
Parse JSON into typed columns. Deduplicate records. Handle nulls consistently. Convert timestamps to UTC. Silver is the single source of truth for validated, queryable data.
Gold
Business-Ready Aggregates
Pre-computed metrics, curated datasets, and business-specific models. Gold tables are what dashboards, ML features, and reports consume. Optimized for specific use cases.
LayerContentsWho Consumes ItUpdate Pattern
BronzeRaw JSON/CSV/Parquet from sourcesSilver ETL jobs onlyAppend-only, never modified
SilverCleaned, typed, deduplicated tablesGold ETL jobs, analysts for ad-hocMerge/overwrite as new data arrives
GoldBusiness metrics, curated datasetsDashboards, ML, reports, APIsRefreshed by scheduled pipeline

Bronze is append-only and never cleaned. This is intentional. If your silver transformation has a bug, you can fix it and reprocess from bronze. If bronze was already cleaned, you cannot go back. Bronze is your insurance policy against ETL bugs.

When Medallion Does Not Fit

Medallion adds operational complexity: three layers means three sets of tables, three ETL jobs, and three failure points. For small teams with simple data flows, a two-layer architecture (raw + curated) is often sufficient. Do not adopt three layers because Databricks recommends it. Adopt it when your data sources are messy enough to justify a separate cleaning layer.
TIP
The most common mistake: putting business logic in the bronze-to-silver step. Silver should be generic cleaning (dedup, type conversion, null handling). Business logic (metric definitions, entity resolution, domain-specific rules) belongs in silver-to-gold.

Data Vault

Daily Life
Interviews

Build a flexible, audit-ready data warehouse

Hubs, Links, and Satellites

Data Vault is a modeling methodology designed for agility and auditability in enterprise data warehouses. It decomposes data into three types of tables: Hubs (business keys), Links (relationships between hubs), and Satellites (descriptive attributes with history). The structure is more complex than star schemas but handles change more gracefully.
Hubs store only the business key (customer_id, product_sku) and a hash key. Links store the relationship between two hubs (customer_X_ordered_product_Y). Satellites store the descriptive attributes (customer_name, customer_address) with effective dates for full history.
hub_customerPKhash_keycustomer_idload_datesourcesat_customer_detailsFKhash_keynamecityeffective_fromeffective_tolink_customer_orderPKlink_hashFKcustomer_hashFKorder_hashload_date

Data Vault separates structure (hubs and links) from content (satellites). Adding a new attribute means adding a satellite, not modifying the hub or link. Adding a new relationship means adding a link, not modifying any existing table. This makes Data Vault highly agile for evolving requirements.

Data Vault vs Star Schema

Star Schema
  • Optimized for query performance
  • Simpler structure (fact + dimensions)
  • Fewer JOINs per query
  • Best for: stable, well-understood domains
  • Analysts can query directly
Data Vault
  • Optimized for change agility
  • More complex structure (hubs, links, satellites)
  • More JOINs per query
  • Best for: evolving, multi-source enterprises
  • Typically queries go through a star-schema presentation layer
In practice, Data Vault is used as a staging/integration layer, not as the query layer. Most teams build Data Vault in silver and then project star schemas in gold for analyst consumption. This gives you the agility of Data Vault for ETL and the simplicity of star schemas for queries.
TIP
Do not adopt Data Vault unless you have a specific need: multiple conflicting source systems, frequent schema changes, or regulatory audit requirements. For most teams, a well-designed star schema is simpler and faster.

One Big Table (OBT)

Daily Life
Interviews

Flatten data for simple analytics queries

Flatten Everything Into One Table

One Big Table (OBT) is exactly what it sounds like: pre-join all your dimensions into the fact table and serve everything from a single wide table. No JOINs at query time. Every column an analyst might need is on the same row. Simple, fast, and easy to understand.
OBT is a consumption pattern, not a source-of-truth pattern. You build the OBT from normalized source tables. The OBT is a derived, denormalized view optimized for a specific set of queries. If the source data changes, you rebuild the OBT.
CharacteristicOBTStar Schema
JOINs at query timeZero1-5 per query
Table count15-15 (fact + dimensions)
Query complexitySimple: SELECT ... WHERE ... GROUP BYModerate: multiple JOINs
Flexibility to changeLow: adding a column requires rebuilding the entire tableHigh: add a column to a dimension, all facts benefit
StorageHigh: dimension attributes duplicated on every fact rowLower: dimension attributes stored once

OBT works well for BI tools that struggle with JOINs (some Tableau configurations, Google Sheets connected to BigQuery). It also works for ML feature tables where the model needs a flat input. It does not work well as a general-purpose analytical model because every schema change requires a full rebuild.

When OBT Is the Right Choice

check
The consumer is a BI tool or notebook that cannot handle multi-table JOINs effectively.
check
The query pattern is well-known and stable (always the same columns, filters, and groupings).
check
The table is a leaf-node consumption layer, not a source of truth.
alert
Do not use OBT as your only table. Always maintain the normalized source tables alongside it.
alert
Adding a new dimension attribute requires rebuilding and backfilling the entire OBT.
TIP
OBT is a presentation pattern, not a modeling pattern. Build it as the final step in your pipeline, derived from properly modeled star schema tables. If the OBT breaks, you can always rebuild it from the star schema. If the OBT IS your only table, you cannot.

Semantic Layers

Daily Life
Interviews

Define business metrics once for all tools

Define Metrics Once, Use Everywhere

A semantic layer is a logical layer that defines business metrics in one place and makes them available to every consumer: dashboards, notebooks, APIs, and ad-hoc queries. Without one, every team defines 'revenue' slightly differently. Marketing counts refunds. Sales does not. Finance uses a different exchange rate. The numbers never match.
The semantic layer sits between the physical tables and the consumers. It defines: what 'revenue' means (SUM(amount) WHERE status != 'refunded'), what dimensions are valid for grouping, and what filters are available. Every consumer uses the same definitions.

Tools and Approaches

  • Define metrics in YAML alongside your dbt models. Metrics are SQL-generated at query time. Tight integration with dbt's transformation layer.
  • Define dimensions, measures, and joins in LookML. All Looker queries go through the semantic layer. Tightly coupled to Looker's BI platform.
  • Open-source semantic layer that defines metrics in JavaScript/YAML. Supports multiple BI tools. Pre-aggregation built in.
  • Enterprise semantic layers that sit between the warehouse and any BI tool. Universal metric definitions across tools.

The semantic layer solves the 'which number is right?' problem. When the CEO asks for revenue and gets three different answers from three teams, the root cause is almost always: no shared metric definition. A semantic layer eliminates this by making the definition authoritative.

When You Need One

If your company has one analyst and one dashboard, you do not need a semantic layer. If you have 10 analysts, 5 dashboards, and 3 teams all computing 'revenue,' you need one. The inflection point is when metric inconsistency starts causing confusion or incorrect business decisions.

Pipeline DAG Design

Daily Life
Interviews

Structure pipeline dependencies correctly

Structuring Pipeline Dependencies

A pipeline DAG (Directed Acyclic Graph) defines the execution order of your data transformations. Table A depends on Table B. Table B depends on Table C. The DAG ensures C runs before B, and B runs before A. Getting the DAG wrong means either: tables run before their dependencies are ready (wrong data), or the entire pipeline serializes unnecessarily (slow).
The DAG should mirror the medallion layers. Bronze tasks (ingestion) have no dependencies on each other. Silver tasks depend on their bronze source. Gold tasks depend on their silver inputs. No gold task should depend directly on bronze. No task should run backward through the layers.

DAG Design Principles

pipeline
Layer isolation: bronze tasks never depend on silver or gold. Silver never depends on gold. Dependencies flow downward only.
partition
Idempotent tasks: every task can be re-run safely. If a task fails, retry it without side effects.
check
Quality gates between layers: run data quality checks after silver before triggering gold. If checks fail, gold does not run.
alert
Failure isolation: a failed bronze task for source A should not block silver tasks for source B. Independent sources run independently.

Common DAG Anti-Patterns

Anti-Pattern
  • One giant DAG with 500 tasks
  • Cross-layer dependencies (gold reads bronze)
  • Serial execution of independent sources
  • No quality gates between layers
Best Practice
  • Separate DAGs per domain or layer
  • Strict layer ordering: bronze -> silver -> gold
  • Parallel ingestion of independent sources
  • dbt tests or Great Expectations between silver and gold
The practical test: draw your DAG on a whiteboard. If it looks like a tangled web with arrows going in every direction, you have a dependency management problem. If it looks like clean layers with arrows flowing top to bottom, your architecture is sound.

Cross-layer dependencies are the most common DAG anti-pattern. A gold table that reads directly from bronze bypasses the cleaning and validation in silver. When the source schema changes, the gold table breaks. If it read from silver, the silver-to-gold contract would catch the change.

TIP
Never let gold read from bronze. The entire point of the layered architecture is that each layer provides a clean, stable interface to the layer below it. Skipping layers is like accessing a database directly from the frontend: it works until it does not, and then it is very expensive to fix.
PUTTING IT ALL TOGETHER

> You are building a data platform for a mid-size e-commerce company with 3 data sources (PostgreSQL, Shopify API, Google Analytics), 2 analysts, and a growing BI need.

You choose Medallion architecture: bronze tables for raw ingestion from each source, silver for cleaned/deduped tables, gold for business-ready models (daily revenue, customer cohorts).
You skip Data Vault (too complex for 3 sources and 2 analysts) and build star schemas directly in gold. An OBT for the main revenue dashboard where the BI tool struggles with JOINs.
You define 'revenue' and 'active customer' in dbt metrics so both analysts use the same definitions. DAG: 3 parallel bronze ingestions > silver cleaning > quality gate > gold models.
KEY TAKEAWAYS
Medallion (Bronze/Silver/Gold): separate raw ingestion from cleaning from business logic; bronze is your insurance policy
Data Vault: hubs, links, satellites for audit-ready, change-agile enterprise warehouses; use as integration layer
One Big Table: pre-joined consumption layer for BI tools; always derived from properly modeled source tables
Semantic layers: define metrics once so every consumer gets the same number; eliminate the 'which revenue?' problem
DAG design: layer isolation, idempotent tasks, quality gates between layers, never let gold read bronze

Blueprints for building data systems

Category
Data Modeling
Duration
21 minutes
Challenges
12 hands-on challenges

Topics covered: Medallion Architecture, Data Vault, One Big Table (OBT), Semantic Layers, Pipeline DAG Design

Lesson Sections

  1. Medallion Architecture (concepts: dmMedallion)

    Bronze, Silver, Gold: Progressive Refinement Medallion architecture organizes a data lake into three layers, each with a clear purpose. Raw data enters at bronze. Cleaned and validated data lives in silver. Business-ready aggregates and curated datasets live in gold. The architectural principle is separation of concerns: raw ingestion should not depend on business logic, and business transformations should not depend on source system quirks. This is the most common architecture at companies usin

  2. Data Vault (concepts: dmDataVault)

    Hubs, Links, and Satellites Data Vault is a modeling methodology designed for agility and auditability in enterprise data warehouses. It decomposes data into three types of tables: Hubs (business keys), Links (relationships between hubs), and Satellites (descriptive attributes with history). The structure is more complex than star schemas but handles change more gracefully. Hubs store only the business key (customer_id, product_sku) and a hash key. Links store the relationship between two hubs (

  3. One Big Table (OBT) (concepts: dmOneBigTable)

    Flatten Everything Into One Table One Big Table (OBT) is exactly what it sounds like: pre-join all your dimensions into the fact table and serve everything from a single wide table. No JOINs at query time. Every column an analyst might need is on the same row. Simple, fast, and easy to understand. OBT is a consumption pattern, not a source-of-truth pattern. You build the OBT from normalized source tables. The OBT is a derived, denormalized view optimized for a specific set of queries. If the sou

  4. Semantic Layers (concepts: dmSemanticLayer)

    Define Metrics Once, Use Everywhere A semantic layer is a logical layer that defines business metrics in one place and makes them available to every consumer: dashboards, notebooks, APIs, and ad-hoc queries. Without one, every team defines 'revenue' slightly differently. Marketing counts refunds. Sales does not. Finance uses a different exchange rate. The numbers never match. The semantic layer sits between the physical tables and the consumers. It defines: what 'revenue' means (SUM(amount) WHER

  5. Pipeline DAG Design

    Structuring Pipeline Dependencies A pipeline DAG (Directed Acyclic Graph) defines the execution order of your data transformations. Table A depends on Table B. Table B depends on Table C. The DAG ensures C runs before B, and B runs before A. Getting the DAG wrong means either: tables run before their dependencies are ready (wrong data), or the entire pipeline serializes unnecessarily (slow). The DAG should mirror the medallion layers. Bronze tasks (ingestion) have no dependencies on each other.