Data Modeling

Medallion Architecture: Bronze, Silver, Gold Layers

Medallion architecture organizes a data lakehouse into three progressive layers of refinement. Interviewers use it to test whether you understand pipeline design, data quality boundaries, and the separation between raw ingestion and business logic.

This pattern appears in system design rounds, pipeline architecture discussions, and data modeling interviews. Even when the interviewer does not say "medallion," they are often testing the same concept: how do you organize data from source to analytics?

What Is Medallion Architecture?

Medallion architecture is a data organization pattern that divides a data lakehouse into three layers, each named after a medal: bronze, silver, and gold. Data enters the platform raw in bronze, gets cleaned and standardized in silver, and is shaped into business-ready datasets in gold. Each layer acts as a quality boundary where specific guarantees are enforced before data progresses to the next stage.

The pattern was popularized by Databricks as part of the lakehouse model, but the underlying idea (progressive refinement with replay capability) predates any single vendor. Teams running on Snowflake, BigQuery, or open-source stacks use the same layered approach, sometimes calling the layers "raw / staging / mart" or "landing / curated / consumption." The names differ; the principle is identical.

The core benefit is isolation. A bug in your aggregation logic corrupts gold tables, but silver and bronze are untouched. A source system changes its schema, and bronze absorbs the change while silver transformation logic is updated deliberately. Without this separation, a single bad deployment can corrupt every layer simultaneously, turning a fixable bug into a multi-day recovery.

The Three Layers Explained

Each layer has a distinct responsibility, a specific set of transformations it performs, and quality guarantees it provides. Interviewers want to hear you articulate these boundaries clearly.

Bronze Layer (Raw Ingestion)

Bronze is the landing zone. Data arrives exactly as it looked in the source system, with no transformations applied. JSON blobs, CSV dumps, CDC change events, API responses: all of it lands in bronze unchanged. You typically add ingestion metadata (source system, load timestamp, batch ID) but never alter the payload itself.

What belongs here

Raw event streams from Kafka, full database extracts, third-party API responses, log files, CDC change records. Everything is append-only. You never update or delete rows in bronze.
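To make the append-only contract concrete, here is a minimal Python sketch of a bronze ingestion step. It uses an in-memory list as a stand-in for bronze storage; the function name, field names, and metadata keys (load_timestamp, batch_id, source_system) are illustrative, not a specific tool's API. The key property is that the payload is stored exactly as received and records are only ever appended.

```python
import uuid
from datetime import datetime, timezone

def land_in_bronze(raw_payload: str, source_system: str, bronze_log: list) -> dict:
    """Append a raw record to bronze, wrapping it in ingestion metadata.

    The payload itself is stored byte-for-byte as received; only
    metadata is added around it.
    """
    record = {
        "payload": raw_payload,  # never parsed, cleaned, or altered here
        "source_system": source_system,
        "load_timestamp": datetime.now(timezone.utc).isoformat(),
        "batch_id": str(uuid.uuid4()),
    }
    bronze_log.append(record)  # append-only: no updates, no deletes
    return record

# Hypothetical usage: a raw CDC payload lands unchanged.
bronze = []
rec = land_in_bronze('{"order_id": 1, "amt": "19.99"}', "orders_db", bronze)
```

Note that even an obviously malformed payload would be landed as-is; deciding what is malformed is silver's job.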

Why it matters

Bronze is your insurance policy. When a transformation bug corrupts silver or gold, you replay from bronze. When a new business question requires data you previously filtered out, it is still in bronze. Without a clean bronze layer, every pipeline error becomes a re-extraction from source, which may be impossible for event data that has already aged out of the source system's retention window.

Interview tip

When an interviewer asks 'what goes in bronze,' emphasize immutability and completeness. Bronze should be a faithful mirror of the source. If you mention cleaning or filtering data in bronze, the interviewer will flag that as a misunderstanding of the pattern.

Silver Layer (Cleaned and Conformed)

Silver is where raw data becomes usable. You parse nested JSON into flat columns. You deduplicate CDC events to get the latest state per key. You cast data types, rename columns to a consistent naming convention, filter out corrupted or test records, and apply basic data quality checks. Silver tables have defined schemas and are queryable by analysts who need clean, row-level data.

What belongs here

Deduplicated, typed, and schema-enforced versions of bronze data. One silver table per logical entity (customers, orders, events). Joins between source systems to create conformed entities happen here. Data quality rules (NOT NULL checks, referential integrity, range validation) are enforced at the silver boundary.
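A rough sketch of a bronze-to-silver transformation, assuming CDC-style order events with hypothetical fields (order_id, amount, updated_at): it keeps only the latest event per primary key, drops rows that fail a basic quality rule, and casts string payload values to proper types. Real pipelines would do this in Spark, dbt, or similar; plain Python is used here only to show the logic.

```python
def to_silver(bronze_events: list) -> list:
    """Deduplicate CDC events per key (latest wins), enforce types,
    and drop rows that fail basic quality checks."""
    # Keep only the most recent event per primary key.
    latest = {}
    for ev in bronze_events:
        key = ev["order_id"]
        if key not in latest or ev["updated_at"] > latest[key]["updated_at"]:
            latest[key] = ev

    silver = []
    for ev in latest.values():
        if ev.get("amount") is None:  # quality gate: drop corrupted rows
            continue
        silver.append({
            "order_id": int(ev["order_id"]),    # cast to defined schema types
            "amount": float(ev["amount"]),
            "updated_at": ev["updated_at"],
        })
    return sorted(silver, key=lambda r: r["order_id"])

# Two CDC events for order 1 (the later one wins) and one corrupted row.
events = [
    {"order_id": "1", "amount": "19.99", "updated_at": "2024-05-01T10:00:00"},
    {"order_id": "1", "amount": "24.99", "updated_at": "2024-05-01T12:00:00"},
    {"order_id": "2", "amount": None, "updated_at": "2024-05-01T11:00:00"},
]
rows = to_silver(events)
```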

Why it matters

Silver is the layer that absorbs source system complexity so downstream consumers do not have to. If every gold table independently parses raw JSON from bronze, you have N implementations of the same cleaning logic, and they will inevitably diverge. Silver centralizes that logic.

Interview tip

The interviewer is testing whether you understand that silver is not just 'bronze with a few columns removed.' Silver involves real engineering: deduplication strategies, schema evolution handling, data quality gates, and conforming disparate sources into a single namespace.

Gold Layer (Business Aggregates)

Gold is the analytics layer. Tables here are designed for specific business use cases: dashboards, ML feature stores, reporting, reverse ETL. Gold tables are often aggregated (daily revenue by region), denormalized (wide tables joining multiple silver entities), or structured as dimensional models (star schemas with facts and dimensions). Gold is where business logic lives.

What belongs here

KPI tables, dimensional models, materialized aggregates, feature tables for ML, and any dataset shaped for a specific consumer. Gold tables have clear owners, documented business logic, and SLAs.
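As an illustrative example, here is a sketch of a gold-layer aggregation producing "daily revenue by region" from silver order rows. The field names and the in-memory representation are assumptions; in practice this would typically be a SQL or dbt model materialized as a table.

```python
from collections import defaultdict

def daily_revenue_by_region(silver_orders: list) -> list:
    """Aggregate clean silver rows into a gold KPI table."""
    totals = defaultdict(float)
    for row in silver_orders:
        totals[(row["order_date"], row["region"])] += row["amount"]
    # One output row per (date, region), shaped for direct dashboard use.
    return [
        {"order_date": d, "region": r, "revenue": round(v, 2)}
        for (d, r), v in sorted(totals.items())
    ]

orders = [
    {"order_date": "2024-05-01", "region": "EU", "amount": 10.0},
    {"order_date": "2024-05-01", "region": "EU", "amount": 5.5},
    {"order_date": "2024-05-01", "region": "US", "amount": 7.0},
]
gold = daily_revenue_by_region(orders)
```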

Why it matters

Gold tables are the contract between the data team and the business. Their schemas should be stable, their definitions should be documented, and their refresh cadence should match the business need. Unstable gold tables erode trust in the data platform faster than any other failure.

Interview tip

When describing gold, name specific output shapes: 'daily revenue by product category for the finance dashboard,' 'user feature vectors for the churn prediction model.' Vague answers like 'aggregated data' do not demonstrate real-world experience.

How Data Flows Through the Layers

In an interview, you will often be asked to describe data flow from ingestion to analytics. Here is the progression, step by step.

1. Source systems (databases, APIs, event streams, files) emit raw data.

2. Ingestion pipelines land that data in bronze with no transformation, only appending metadata like load_timestamp and source_id.

3. Cleaning pipelines read from bronze, apply deduplication, type casting, schema enforcement, and data quality checks, then write to silver.

4. Business logic pipelines read from silver (sometimes joining multiple silver tables), apply aggregation, denormalization, or dimensional modeling, and write to gold.

5. Consumers (dashboards, ML models, reverse ETL) read exclusively from gold. They never query bronze or silver directly.

When Interviewers Ask About Medallion Architecture

Interviewers rarely say "explain medallion architecture." Instead, they ask questions like "how do you organize your data lake?" or "describe the layers in your pipeline." The underlying test is always the same: do you understand progressive data refinement?

What they are really evaluating: Can you explain why raw data should be preserved? Do you know where cleaning logic belongs? Can you articulate the boundary between transformation and business logic? Do you understand the trade-off between pipeline complexity and operational resilience?

A weak answer lists the three layers by name. A strong answer explains the design rationale: isolation of failures, replay capability, separation of concerns between data engineering (bronze to silver) and analytics engineering (silver to gold). The strongest answers include specific examples from real pipelines, naming technologies, data volumes, and failure scenarios they have handled.

Common Interview Questions

These questions come up in system design and pipeline architecture rounds. For each one, the guidance shows how a strong candidate structures the answer.

Walk me through how data flows from source to analytics in your pipeline.

This is the most common medallion question, often asked without explicitly naming the pattern. Structure your answer as: ingestion to bronze (raw, immutable), transformation to silver (cleaned, conformed), aggregation to gold (business-ready). Name specific technologies at each layer. For example: 'Kafka lands CDC events in bronze as Parquet on S3. A Spark job deduplicates by primary key and writes typed tables to silver in Delta Lake. dbt models aggregate silver tables into daily KPI tables in gold that Looker queries directly.'

What goes in bronze vs silver? How do you decide?

Bronze stores what the source system sent, unchanged. Silver stores the cleaned, typed, deduplicated version of that data. The decision boundary is: if you are altering the payload (parsing JSON, casting types, deduplicating, filtering), it belongs in the silver transformation, not in bronze ingestion. A common mistake candidates make is describing bronze as 'slightly cleaned' data. That defeats the purpose of having a raw layer you can replay from.

How do you handle schema evolution across layers?

In bronze, schema evolution is a non-issue because you store raw payloads. A new field in the source just appears in the JSON. The challenge is in silver, where you have defined schemas. Approaches: (1) Use a format like Delta Lake or Iceberg that supports schema evolution (ADD COLUMN). (2) Have the silver pipeline detect new fields and automatically add them. (3) Use a schema registry to version changes and alert when breaking changes (removed or renamed columns) arrive. In gold, schema changes require coordination with consumers since dashboards and ML models depend on stable column names.
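Approach (2) above, automatic column addition, can be sketched in a few lines. This is a simplified stand-in for what Delta Lake's mergeSchema option or an equivalent auto-evolution step does; the schema is modeled as a plain dict mapping column names to type names, and all field names are hypothetical.

```python
def evolve_silver_schema(silver_schema: dict, bronze_record: dict) -> list:
    """Detect fields present in a bronze payload but missing from the
    silver schema, and add them (the equivalent of ADD COLUMN)."""
    added = []
    for field, value in bronze_record.items():
        if field not in silver_schema:
            silver_schema[field] = type(value).__name__
            added.append(field)
    return added

schema = {"order_id": "int", "amount": "float"}
new_cols = evolve_silver_schema(
    schema, {"order_id": 1, "amount": 9.5, "coupon_code": "SPRING"}
)
```

Note this only handles additive changes; removed or renamed columns are breaking changes and should alert rather than auto-apply, which is where a schema registry earns its keep.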

When would you NOT use medallion architecture?

When the overhead is not justified. A startup with one data source feeding one dashboard does not need three layers. When latency requirements are sub-second, the multi-hop pattern adds too much delay. When data arrives already clean and structured from a well-governed internal system, the silver layer adds minimal value. The interviewer wants to see that you do not treat medallion as a universal solution. The right answer demonstrates pragmatism: 'I would skip it when the cost of maintaining three layers exceeds the value of the separation.'

How does medallion architecture relate to data quality?

Each layer boundary is a quality gate. Bronze-to-silver checks: Are the expected fields present? Do data types parse correctly? Are there duplicate primary keys? Silver-to-gold checks: Do referential integrity constraints hold? Are aggregation inputs complete (no missing partitions)? Do metric values fall within expected ranges? If a quality check fails, the pipeline halts and the downstream layer keeps its last good state. This is the core benefit of the layered approach: failures are isolated and do not cascade to business-facing tables.
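A minimal sketch of a bronze-to-silver quality gate, assuming hypothetical order rows: if any check fails, it raises instead of writing, so silver keeps its last good state. Frameworks like Great Expectations or dbt tests implement the same idea declaratively.

```python
class QualityGateError(Exception):
    """Raised when a layer-boundary check fails; the pipeline halts and
    the downstream layer keeps its last good state."""

def check_bronze_to_silver(rows: list, required_fields=("order_id", "amount")) -> bool:
    seen_keys = set()
    for row in rows:
        for field in required_fields:
            if field not in row or row[field] is None:
                raise QualityGateError(f"missing {field!r} in {row}")
        if row["order_id"] in seen_keys:  # duplicate primary key check
            raise QualityGateError(f"duplicate primary key {row['order_id']}")
        seen_keys.add(row["order_id"])
    return True

ok = check_bronze_to_silver([{"order_id": 1, "amount": 2.0}])

halted = False
try:
    check_bronze_to_silver([
        {"order_id": 1, "amount": 2.0},
        {"order_id": 1, "amount": 3.0},  # duplicate key: gate should halt
    ])
except QualityGateError:
    halted = True
```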

How do you handle late-arriving data in a medallion architecture?

Late data lands in bronze like any other data. The silver pipeline must handle it by doing upserts (MERGE) rather than append-only writes, keyed on the natural key plus event timestamp. Gold tables that are partitioned by date may need to recompute affected partitions. The key design decision is whether to reprocess all downstream layers or only the affected partition. In Delta Lake, you can use MERGE with a watermark to efficiently update only the rows affected by the late arrival.
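The upsert logic can be sketched as follows, with a dict standing in for the silver table and hypothetical fields (order_id, event_ts, amount). It keeps the row with the newest event timestamp per key and returns the affected date partitions so gold can recompute only those, mirroring what a Delta Lake MERGE with a watermark achieves.

```python
def merge_late_arrivals(silver: dict, late_rows: list) -> set:
    """MERGE-style upsert of late-arriving rows into silver, keyed on the
    natural key; newest event_ts wins. Returns affected date partitions."""
    affected_partitions = set()
    for row in late_rows:
        key = row["order_id"]
        current = silver.get(key)
        if current is None or row["event_ts"] > current["event_ts"]:
            silver[key] = row
            affected_partitions.add(row["event_ts"][:10])  # date prefix
    return affected_partitions

silver = {1: {"order_id": 1, "event_ts": "2024-05-02T08:00:00", "amount": 10.0}}
late = [
    {"order_id": 1, "event_ts": "2024-05-01T23:00:00", "amount": 9.0},  # stale: ignored
    {"order_id": 2, "event_ts": "2024-05-01T22:00:00", "amount": 4.0},  # new key: inserted
]
partitions = merge_late_arrivals(silver, late)
```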

Medallion Architecture vs Other Patterns

Interviewers sometimes ask how medallion architecture compares to other data modeling approaches. The key insight: medallion is a data organization pattern, while most alternatives are modeling methodologies. They operate at different levels and often coexist.

vs Data Vault

Data Vault and medallion architecture solve different problems and can coexist. Data Vault (hubs, links, satellites) is a modeling methodology for the integration layer. Medallion is a data organization pattern for the entire lakehouse. In practice, you might use medallion layers with Data Vault modeling in silver (raw vault) and dimensional modeling in gold (business vault). The interviewer may ask you to compare them, but the sophisticated answer is that they operate at different levels of abstraction.

vs Star Schema (Kimball)

Star schema is a modeling pattern for the gold layer, not an alternative to medallion. A well-designed medallion architecture often produces star schemas in gold: fact tables with foreign keys to dimension tables, optimized for BI tool consumption. Saying 'we use medallion instead of star schema' is a category error that will concern the interviewer.

vs One Big Table (OBT)

OBT is a gold-layer design choice where you pre-join everything into a single wide, denormalized table. It trades storage and update complexity for query simplicity. OBT works well for specific dashboards with predictable access patterns. It does not replace the medallion pattern; it is one possible output shape in gold.

vs Lambda Architecture

Lambda separates batch and streaming paths, then merges results. Medallion layers can exist within either path. Some teams use a streaming bronze (real-time event landing) and a batch bronze (daily full extracts) that both feed into the same silver layer. The medallion pattern is more about progressive refinement than about batch vs stream.

Frequently Asked Questions

What is medallion architecture?
Medallion architecture is a data organization pattern that structures a data lakehouse into three layers: bronze (raw ingestion), silver (cleaned and conformed), and gold (business-level aggregates). Each layer progressively refines the data, with quality gates at each boundary. The pattern was popularized by Databricks but applies to any lakehouse or data platform. It provides clear separation of concerns, makes pipelines easier to debug, and allows replay from raw data when transformations need to change.
Is medallion architecture the same as ETL?
No. ETL (extract, transform, load) describes a process. Medallion architecture describes a data organization pattern. ETL pipelines are the mechanism that moves data between medallion layers, but the architecture itself is about how you structure your storage and define quality boundaries. You could implement medallion layers using ETL, ELT, or streaming pipelines.
Do I need all three layers?
Not always. Some teams skip bronze and land partially cleaned data directly into silver, especially when the source is already well-structured. Some teams collapse silver and gold when the business logic is simple enough that cleaning and aggregation can happen in one step. The three-layer pattern is a guideline, not a rule. What matters is that you can articulate why you chose to include or skip a layer.
How does medallion architecture work with Delta Lake or Iceberg?
Table formats like Delta Lake and Apache Iceberg provide the transactional guarantees that make medallion architecture practical. ACID transactions mean a failed silver write does not leave the table in a corrupted state. Time travel means you can query previous versions of silver or gold tables without maintaining separate snapshots. Schema evolution support means silver tables can absorb new source columns without manual DDL changes. These features are not required for medallion architecture, but they remove significant operational friction.
Is medallion architecture asked about in interviews?
Yes, frequently. It appears in system design rounds ('design a data pipeline for X') and in data modeling discussions. Interviewers may not use the term 'medallion' explicitly. They might ask 'how do you organize your data lake' or 'walk me through your pipeline layers.' The underlying concept of progressive data refinement with quality gates at each boundary is what they are testing.

Practice Pipeline Architecture Questions

DataDriven covers pipeline design, data modeling, SQL, and Python with hands-on challenges at interview difficulty. Build fluency in the concepts interviewers actually test.
