Design Patterns
Medallion Architecture
Organize data into bronze, silver, gold layers
Bronze, Silver, Gold: Progressive Refinement
| Layer | Contents | Who Consumes It | Update Pattern |
|---|---|---|---|
| Bronze | Raw JSON/CSV/Parquet from sources | Silver ETL jobs only | Append-only, never modified |
| Silver | Cleaned, typed, deduplicated tables | Gold ETL jobs, analysts for ad-hoc | Merge/overwrite as new data arrives |
| Gold | Business metrics, curated datasets | Dashboards, ML, reports, APIs | Refreshed by scheduled pipeline |
Bronze is append-only and never cleaned. This is intentional. If your silver transformation has a bug, you can fix it and reprocess from bronze. If bronze was already cleaned, you cannot go back. Bronze is your insurance policy against ETL bugs.
When Medallion Does Not Fit
Data Vault
Build a flexible, audit-ready data warehouse
Hubs, Links, and Satellites
Data Vault separates structure (hubs and links) from content (satellites). Adding a new attribute means adding a satellite, not modifying the hub or link. Adding a new relationship means adding a link, not modifying any existing table. This makes Data Vault highly agile for evolving requirements.
Data Vault vs Star Schema
- Optimized for query performance
- Simpler structure (fact + dimensions)
- Fewer JOINs per query
- Best for: stable, well-understood domains
- Analysts can query directly
- Optimized for change agility
- More complex structure (hubs, links, satellites)
- More JOINs per query
- Best for: evolving, multi-source enterprises
- Typically queries go through a star-schema presentation layer
One Big Table (OBT)
Flatten data for simple analytics queries
Flatten Everything Into One Table
| Characteristic | OBT | Star Schema |
|---|---|---|
| JOINs at query time | Zero | 1-5 per query |
| Table count | 1 | 5-15 (fact + dimensions) |
| Query complexity | Simple: SELECT ... WHERE ... GROUP BY | Moderate: multiple JOINs |
| Flexibility to change | Low: adding a column requires rebuilding the entire table | High: add a column to a dimension, all facts benefit |
| Storage | High: dimension attributes duplicated on every fact row | Lower: dimension attributes stored once |
OBT works well for BI tools that struggle with JOINs (some Tableau configurations, Google Sheets connected to BigQuery). It also works for ML feature tables where the model needs a flat input. It does not work well as a general-purpose analytical model because every schema change requires a full rebuild.
When OBT Is the Right Choice
Semantic Layers
Define business metrics once for all tools
Define Metrics Once, Use Everywhere
Tools and Approaches
- Define metrics in YAML alongside your dbt models. Metrics are SQL-generated at query time. Tight integration with dbt's transformation layer.
- Define dimensions, measures, and joins in LookML. All Looker queries go through the semantic layer. Tightly coupled to Looker's BI platform.
- Open-source semantic layer that defines metrics in JavaScript/YAML. Supports multiple BI tools. Pre-aggregation built in.
- Enterprise semantic layers that sit between the warehouse and any BI tool. Universal metric definitions across tools.
The semantic layer solves the 'which number is right?' problem. When the CEO asks for revenue and gets three different answers from three teams, the root cause is almost always: no shared metric definition. A semantic layer eliminates this by making the definition authoritative.
When You Need One
Pipeline DAG Design
Structure pipeline dependencies correctly
Structuring Pipeline Dependencies
DAG Design Principles
Common DAG Anti-Patterns
- One giant DAG with 500 tasks
- Cross-layer dependencies (gold reads bronze)
- Serial execution of independent sources
- No quality gates between layers
- Separate DAGs per domain or layer
- Strict layer ordering: bronze -> silver -> gold
- Parallel ingestion of independent sources
- dbt tests or Great Expectations between silver and gold
Cross-layer dependencies are the most common DAG anti-pattern. A gold table that reads directly from bronze bypasses the cleaning and validation in silver. When the source schema changes, the gold table breaks. If it read from silver, the silver-to-gold contract would catch the change.
> You are building a data platform for a mid-size e-commerce company with 3 data sources (PostgreSQL, Shopify API, Google Analytics), 2 analysts, and a growing BI need.
Blueprints for building data systems
- Category
- Data Modeling
- Duration
- 21 minutes
- Challenges
- 12 hands-on challenges
Topics covered: Medallion Architecture, Data Vault, One Big Table (OBT), Semantic Layers, Pipeline DAG Design
Lesson Sections
- Medallion Architecture (concepts: dmMedallion)
Bronze, Silver, Gold: Progressive Refinement Medallion architecture organizes a data lake into three layers, each with a clear purpose. Raw data enters at bronze. Cleaned and validated data lives in silver. Business-ready aggregates and curated datasets live in gold. The architectural principle is separation of concerns: raw ingestion should not depend on business logic, and business transformations should not depend on source system quirks. This is the most common architecture at companies usin
- Data Vault (concepts: dmDataVault)
Hubs, Links, and Satellites Data Vault is a modeling methodology designed for agility and auditability in enterprise data warehouses. It decomposes data into three types of tables: Hubs (business keys), Links (relationships between hubs), and Satellites (descriptive attributes with history). The structure is more complex than star schemas but handles change more gracefully. Hubs store only the business key (customer_id, product_sku) and a hash key. Links store the relationship between two hubs (
- One Big Table (OBT) (concepts: dmOneBigTable)
Flatten Everything Into One Table One Big Table (OBT) is exactly what it sounds like: pre-join all your dimensions into the fact table and serve everything from a single wide table. No JOINs at query time. Every column an analyst might need is on the same row. Simple, fast, and easy to understand. OBT is a consumption pattern, not a source-of-truth pattern. You build the OBT from normalized source tables. The OBT is a derived, denormalized view optimized for a specific set of queries. If the sou
- Semantic Layers (concepts: dmSemanticLayer)
Define Metrics Once, Use Everywhere A semantic layer is a logical layer that defines business metrics in one place and makes them available to every consumer: dashboards, notebooks, APIs, and ad-hoc queries. Without one, every team defines 'revenue' slightly differently. Marketing counts refunds. Sales does not. Finance uses a different exchange rate. The numbers never match. The semantic layer sits between the physical tables and the consumers. It defines: what 'revenue' means (SUM(amount) WHER
- Pipeline DAG Design
Structuring Pipeline Dependencies A pipeline DAG (Directed Acyclic Graph) defines the execution order of your data transformations. Table A depends on Table B. Table B depends on Table C. The DAG ensures C runs before B, and B runs before A. Getting the DAG wrong means either: tables run before their dependencies are ready (wrong data), or the entire pipeline serializes unnecessarily (slow). The DAG should mirror the medallion layers. Bronze tasks (ingestion) have no dependencies on each other.