Loading lesson...
Blueprints for building data systems
Blueprints for building data systems
Topics covered: Medallion Architecture, Data Vault, One Big Table (OBT), Semantic Layers, Pipeline DAG Design
Bronze, Silver, Gold: Progressive Refinement Medallion architecture organizes a data lake into three layers, each with a clear purpose. Raw data enters at bronze. Cleaned and validated data lives in silver. Business-ready aggregates and curated datasets live in gold. The architectural principle is separation of concerns: raw ingestion should not depend on business logic, and business transformations should not depend on source system quirks. This is the most common architecture at companies usin
Hubs, Links, and Satellites Data Vault is a modeling methodology designed for agility and auditability in enterprise data warehouses. It decomposes data into three types of tables: Hubs (business keys), Links (relationships between hubs), and Satellites (descriptive attributes with history). The structure is more complex than star schemas but handles change more gracefully. Hubs store only the business key (customer_id, product_sku) and a hash key. Links store the relationship between two hubs (
Flatten Everything Into One Table One Big Table (OBT) is exactly what it sounds like: pre-join all your dimensions into the fact table and serve everything from a single wide table. No JOINs at query time. Every column an analyst might need is on the same row. Simple, fast, and easy to understand. OBT is a consumption pattern, not a source-of-truth pattern. You build the OBT from normalized source tables. The OBT is a derived, denormalized view optimized for a specific set of queries. If the sou
Define Metrics Once, Use Everywhere A semantic layer is a logical layer that defines business metrics in one place and makes them available to every consumer: dashboards, notebooks, APIs, and ad-hoc queries. Without one, every team defines 'revenue' slightly differently. Marketing counts refunds. Sales does not. Finance uses a different exchange rate. The numbers never match. The semantic layer sits between the physical tables and the consumers. It defines: what 'revenue' means (SUM(amount) WHER
Structuring Pipeline Dependencies A pipeline DAG (Directed Acyclic Graph) defines the execution order of your data transformations. Table A depends on Table B. Table B depends on Table C. The DAG ensures C runs before B, and B runs before A. Getting the DAG wrong means either: tables run before their dependencies are ready (wrong data), or the entire pipeline serializes unnecessarily (slow). The DAG should mirror the medallion layers. Bronze tasks (ingestion) have no dependencies on each other.