Data Modeling

Medallion Architecture: Bronze, Silver, Gold Layers

Medallion architecture organizes a data lakehouse into three progressive layers of refinement. Interviewers use it to test whether you understand pipeline design, data quality boundaries, and the separation between raw ingestion and business logic.

This pattern appears in system design rounds, pipeline architecture discussions, and data modeling interviews. Even when the interviewer does not say "medallion," they are often testing the same concept: how do you organize data from source to analytics?

What Is Medallion Architecture?

Medallion architecture is a data organization pattern that divides a data lakehouse into three layers, each named after a medal: bronze, silver, and gold. Data enters the platform raw in bronze, gets cleaned and standardized in silver, and is shaped into business-ready datasets in gold. Each layer acts as a quality boundary where specific guarantees are enforced before data progresses to the next stage.

The pattern was popularized by Databricks as part of the lakehouse model, but the underlying idea (progressive refinement with replay capability) predates any single vendor. Teams running on Snowflake, BigQuery, or open-source stacks use the same layered approach, sometimes calling the layers "raw / staging / mart" or "landing / curated / consumption." The names differ; the principle is identical.

The core benefit is isolation. A bug in your aggregation logic corrupts gold tables, but silver and bronze are untouched. A source system changes its schema, and bronze absorbs the change while silver transformation logic is updated deliberately. Without this separation, a single bad deployment can corrupt every layer simultaneously, turning a fixable bug into a multi-day recovery.

The Three Layers Explained

Each layer has a distinct responsibility, a specific set of transformations it performs, and quality guarantees it provides. Interviewers want to hear you articulate these boundaries clearly.

Bronze Layer (Raw Ingestion)

Bronze is the landing zone. Data arrives exactly as it looked in the source system, with no transformations applied. JSON blobs, CSV dumps, CDC change events, API responses: all of it lands in bronze unchanged. You typically add ingestion metadata (source system, load timestamp, batch ID) but never alter the payload itself.

What belongs here

Raw event streams from Kafka, full database extracts, third-party API responses, log files, CDC change records. Everything is append-only. You never update or delete rows in bronze.
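To make the append-only contract concrete, here is a minimal Python sketch of a bronze ingestion step. It uses an in-memory list as a stand-in for bronze storage; the function name, field names, and metadata keys (load_timestamp, batch_id, source_system) are illustrative, not a specific tool's API. The key property is that the payload is stored exactly as received and records are only ever appended.

```python
import uuid
from datetime import datetime, timezone

def land_in_bronze(raw_payload: str, source_system: str, bronze_log: list) -> dict:
    """Append a raw record to bronze, wrapping it in ingestion metadata.

    The payload itself is stored byte-for-byte as received; only
    metadata is added around it.
    """
    record = {
        "payload": raw_payload,  # never parsed, cleaned, or altered here
        "source_system": source_system,
        "load_timestamp": datetime.now(timezone.utc).isoformat(),
        "batch_id": str(uuid.uuid4()),
    }
    bronze_log.append(record)  # append-only: no updates, no deletes
    return record

# Hypothetical usage: a raw CDC payload lands unchanged.
bronze = []
rec = land_in_bronze('{"order_id": 1, "amt": "19.99"}', "orders_db", bronze)
```

Note that even an obviously malformed payload would be landed as-is; deciding what is malformed is silver's job.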

Why it matters

Bronze is your insurance policy. When a transformation bug corrupts silver or gold, you replay from bronze. When a new business question requires data you previously filtered out, it is still in bronze. Without a clean bronze layer, every pipeline error becomes a re-extraction from source, which may be impossible for event data that has already aged out of the source system's retention window.

Interview tip

When an interviewer asks 'what goes in bronze,' emphasize immutability and completeness. Bronze should be a faithful mirror of the source. If you mention cleaning or filtering data in bronze, the interviewer will flag that as a misunderstanding of the pattern.

Silver Layer (Cleaned and Conformed)

Silver is where raw data becomes usable. You parse nested JSON into flat columns. You deduplicate CDC events to get the latest state per key. You cast data types, rename columns to a consistent naming convention, filter out corrupted or test records, and apply basic data quality checks. Silver tables have defined schemas and are queryable by analysts who need clean, row-level data.

What belongs here

Deduplicated, typed, and schema-enforced versions of bronze data. One silver table per logical entity (customers, orders, events). Joins between source systems to create conformed entities happen here. Data quality rules (NOT NULL checks, referential integrity, range validation) are enforced at the silver boundary.
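A rough sketch of a bronze-to-silver transformation, assuming CDC-style order events with hypothetical fields (order_id, amount, updated_at): it keeps only the latest event per primary key, drops rows that fail a basic quality rule, and casts string payload values to proper types. Real pipelines would do this in Spark, dbt, or similar; plain Python is used here only to show the logic.

```python
def to_silver(bronze_events: list) -> list:
    """Deduplicate CDC events per key (latest wins), enforce types,
    and drop rows that fail basic quality checks."""
    # Keep only the most recent event per primary key.
    latest = {}
    for ev in bronze_events:
        key = ev["order_id"]
        if key not in latest or ev["updated_at"] > latest[key]["updated_at"]:
            latest[key] = ev

    silver = []
    for ev in latest.values():
        if ev.get("amount") is None:  # quality gate: drop corrupted rows
            continue
        silver.append({
            "order_id": int(ev["order_id"]),    # cast to defined schema types
            "amount": float(ev["amount"]),
            "updated_at": ev["updated_at"],
        })
    return sorted(silver, key=lambda r: r["order_id"])

# Two CDC events for order 1 (the later one wins) and one corrupted row.
events = [
    {"order_id": "1", "amount": "19.99", "updated_at": "2024-05-01T10:00:00"},
    {"order_id": "1", "amount": "24.99", "updated_at": "2024-05-01T12:00:00"},
    {"order_id": "2", "amount": None, "updated_at": "2024-05-01T11:00:00"},
]
rows = to_silver(events)
```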

Why it matters

Silver is the layer that absorbs source system complexity so downstream consumers do not have to. If every gold table independently parses raw JSON from bronze, you have N implementations of the same cleaning logic, and they will inevitably diverge. Silver centralizes that logic.

Interview tip

The interviewer is testing whether you understand that silver is not just 'bronze with a few columns removed.' Silver involves real engineering: deduplication strategies, schema evolution handling, data quality gates, and conforming disparate sources into a single namespace.

Gold Layer (Business Aggregates)

Gold is the analytics layer. Tables here are designed for specific business use cases: dashboards, ML feature stores, reporting, reverse ETL. Gold tables are often aggregated (daily revenue by region), denormalized (wide tables joining multiple silver entities), or structured as dimensional models (star schemas with facts and dimensions). Gold is where business logic lives.

What belongs here

KPI tables, dimensional models, materialized aggregates, feature tables for ML, and any dataset shaped for a specific consumer. Gold tables have clear owners, documented business logic, and SLAs.
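As an illustrative example, here is a sketch of a gold-layer aggregation producing "daily revenue by region" from silver order rows. The field names and the in-memory representation are assumptions; in practice this would typically be a SQL or dbt model materialized as a table.

```python
from collections import defaultdict

def daily_revenue_by_region(silver_orders: list) -> list:
    """Aggregate clean silver rows into a gold KPI table."""
    totals = defaultdict(float)
    for row in silver_orders:
        totals[(row["order_date"], row["region"])] += row["amount"]
    # One output row per (date, region), shaped for direct dashboard use.
    return [
        {"order_date": d, "region": r, "revenue": round(v, 2)}
        for (d, r), v in sorted(totals.items())
    ]

orders = [
    {"order_date": "2024-05-01", "region": "EU", "amount": 10.0},
    {"order_date": "2024-05-01", "region": "EU", "amount": 5.5},
    {"order_date": "2024-05-01", "region": "US", "amount": 7.0},
]
gold = daily_revenue_by_region(orders)
```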

Why it matters

Gold tables are the contract between the data team and the business. Their schemas should be stable, their definitions should be documented, and their refresh cadence should match the business need. Unstable gold tables erode trust in the data platform faster than any other failure.

Interview tip

When describing gold, name specific output shapes: 'daily revenue by product category for the finance dashboard,' 'user feature vectors for the churn prediction model.' Vague answers like 'aggregated data' do not demonstrate real-world experience.

How Data Flows Through the Layers

In an interview, you will often be asked to describe data flow from ingestion to analytics. Here is the progression, step by step.

1. Source systems (databases, APIs, event streams, files) emit raw data.

2. Ingestion pipelines land that data in bronze with no transformation, only appending metadata like load_timestamp and source_id.

3. Cleaning pipelines read from bronze, apply deduplication, type casting, schema enforcement, and data quality checks, then write to silver.

4. Business logic pipelines read from silver (sometimes joining multiple silver tables), apply aggregation, denormalization, or dimensional modeling, and write to gold.

5. Consumers (dashboards, ML models, reverse ETL) read exclusively from gold. They never query bronze or silver directly.

When Interviewers Ask About Medallion Architecture

Interviewers rarely say "explain medallion architecture." Instead, they ask questions like "how do you organize your data lake?" or "describe the layers in your pipeline." The underlying test is always the same: do you understand progressive data refinement?

What they are really evaluating: Can you explain why raw data should be preserved? Do you know where cleaning logic belongs? Can you articulate the boundary between transformation and business logic? Do you understand the trade-off between pipeline complexity and operational resilience?

A weak answer lists the three layers by name. A strong answer explains the design rationale: isolation of failures, replay capability, separation of concerns between data engineering (bronze to silver) and analytics engineering (silver to gold). The strongest answers include specific examples from real pipelines, naming technologies, data volumes, and failure scenarios they have handled.

Common Interview Questions

These questions come up in system design and pipeline architecture rounds. For each one, the guidance shows how a strong candidate structures the answer.

Walk me through how data flows from source to analytics in your pipeline.

This is the most common medallion question, often asked without explicitly naming the pattern. Structure your answer as: ingestion to bronze (raw, immutable), transformation to silver (cleaned, conformed), aggregation to gold (business-ready). Name specific technologies at each layer. For example: 'Kafka lands CDC events in bronze as Parquet on S3. A Spark job deduplicates by primary key and writes typed tables to silver in Delta Lake. dbt models aggregate silver tables into daily KPI tables in gold that Looker queries directly.'

What goes in bronze vs silver? How do you decide?

Bronze stores what the source system sent, unchanged. Silver stores the cleaned, typed, deduplicated version of that data. The decision boundary is: if you are altering the payload (parsing JSON, casting types, deduplicating, filtering), it belongs in the silver transformation, not in bronze ingestion. A common mistake candidates make is describing bronze as 'slightly cleaned' data. That defeats the purpose of having a raw layer you can replay from.

How do you handle schema evolution across layers?

In bronze, schema evolution is a non-issue because you store raw payloads. A new field in the source just appears in the JSON. The challenge is in silver, where you have defined schemas. Approaches: (1) Use a format like Delta Lake or Iceberg that supports schema evolution (ADD COLUMN). (2) Have the silver pipeline detect new fields and automatically add them. (3) Use a schema registry to version changes and alert when breaking changes (removed or renamed columns) arrive. In gold, schema changes require coordination with consumers since dashboards and ML models depend on stable column names.
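Approach (2) above, automatic column addition, can be sketched in a few lines. This is a simplified stand-in for what Delta Lake's mergeSchema option or an equivalent auto-evolution step does; the schema is modeled as a plain dict mapping column names to type names, and all field names are hypothetical.

```python
def evolve_silver_schema(silver_schema: dict, bronze_record: dict) -> list:
    """Detect fields present in a bronze payload but missing from the
    silver schema, and add them (the equivalent of ADD COLUMN)."""
    added = []
    for field, value in bronze_record.items():
        if field not in silver_schema:
            silver_schema[field] = type(value).__name__
            added.append(field)
    return added

schema = {"order_id": "int", "amount": "float"}
new_cols = evolve_silver_schema(
    schema, {"order_id": 1, "amount": 9.5, "coupon_code": "SPRING"}
)
```

Note this only handles additive changes; removed or renamed columns are breaking changes and should alert rather than auto-apply, which is where a schema registry earns its keep.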

When would you NOT use medallion architecture?

When the overhead is not justified. A startup with one data source feeding one dashboard does not need three layers. When latency requirements are sub-second, the multi-hop pattern adds too much delay. When data arrives already clean and structured from a well-governed internal system, the silver layer adds minimal value. The interviewer wants to see that you do not treat medallion as a universal solution. The right answer demonstrates pragmatism: 'I would skip it when the cost of maintaining three layers exceeds the value of the separation.'

How does medallion architecture relate to data quality?

Each layer boundary is a quality gate. Bronze-to-silver checks: Are the expected fields present? Do data types parse correctly? Are there duplicate primary keys? Silver-to-gold checks: Do referential integrity constraints hold? Are aggregation inputs complete (no missing partitions)? Do metric values fall within expected ranges? If a quality check fails, the pipeline halts and the downstream layer keeps its last good state. This is the core benefit of the layered approach: failures are isolated and do not cascade to business-facing tables.
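A minimal sketch of a bronze-to-silver quality gate, assuming hypothetical order rows: if any check fails, it raises instead of writing, so silver keeps its last good state. Frameworks like Great Expectations or dbt tests implement the same idea declaratively.

```python
class QualityGateError(Exception):
    """Raised when a layer-boundary check fails; the pipeline halts and
    the downstream layer keeps its last good state."""

def check_bronze_to_silver(rows: list, required_fields=("order_id", "amount")) -> bool:
    seen_keys = set()
    for row in rows:
        for field in required_fields:
            if field not in row or row[field] is None:
                raise QualityGateError(f"missing {field!r} in {row}")
        if row["order_id"] in seen_keys:  # duplicate primary key check
            raise QualityGateError(f"duplicate primary key {row['order_id']}")
        seen_keys.add(row["order_id"])
    return True

ok = check_bronze_to_silver([{"order_id": 1, "amount": 2.0}])

halted = False
try:
    check_bronze_to_silver([
        {"order_id": 1, "amount": 2.0},
        {"order_id": 1, "amount": 3.0},  # duplicate key: gate should halt
    ])
except QualityGateError:
    halted = True
```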

How do you handle late-arriving data in a medallion architecture?

Late data lands in bronze like any other data. The silver pipeline must handle it by doing upserts (MERGE) rather than append-only writes, keyed on the natural key plus event timestamp. Gold tables that are partitioned by date may need to recompute affected partitions. The key design decision is whether to reprocess all downstream layers or only the affected partition. In Delta Lake, you can use MERGE with a watermark to efficiently update only the rows affected by the late arrival.
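The upsert logic can be sketched as follows, with a dict standing in for the silver table and hypothetical fields (order_id, event_ts, amount). It keeps the row with the newest event timestamp per key and returns the affected date partitions so gold can recompute only those, mirroring what a Delta Lake MERGE with a watermark achieves.

```python
def merge_late_arrivals(silver: dict, late_rows: list) -> set:
    """MERGE-style upsert of late-arriving rows into silver, keyed on the
    natural key; newest event_ts wins. Returns affected date partitions."""
    affected_partitions = set()
    for row in late_rows:
        key = row["order_id"]
        current = silver.get(key)
        if current is None or row["event_ts"] > current["event_ts"]:
            silver[key] = row
            affected_partitions.add(row["event_ts"][:10])  # date prefix
    return affected_partitions

silver = {1: {"order_id": 1, "event_ts": "2024-05-02T08:00:00", "amount": 10.0}}
late = [
    {"order_id": 1, "event_ts": "2024-05-01T23:00:00", "amount": 9.0},  # stale: ignored
    {"order_id": 2, "event_ts": "2024-05-01T22:00:00", "amount": 4.0},  # new key: inserted
]
partitions = merge_late_arrivals(silver, late)
```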

Medallion Architecture vs Other Patterns

Interviewers sometimes ask how medallion architecture compares to other data modeling approaches. The key insight: medallion is a data organization pattern, while most alternatives are modeling methodologies. They operate at different levels and often coexist.

vs Data Vault

Data Vault and medallion architecture solve different problems and can coexist. Data Vault (hubs, links, satellites) is a modeling methodology for the integration layer. Medallion is a data organization pattern for the entire lakehouse. In practice, you might use medallion layers with Data Vault modeling in silver (raw vault) and dimensional modeling in gold (business vault). The interviewer may ask you to compare them, but the sophisticated answer is that they operate at different levels of abstraction.

vs Star Schema (Kimball)

Star schema is a modeling pattern for the gold layer, not an alternative to medallion. A well-designed medallion architecture often produces star schemas in gold: fact tables with foreign keys to dimension tables, optimized for BI tool consumption. Saying 'we use medallion instead of star schema' is a category error that will concern the interviewer.

vs One Big Table (OBT)

OBT is a gold-layer design choice where you pre-join everything into a single wide, denormalized table. It trades storage and update complexity for query simplicity. OBT works well for specific dashboards with predictable access patterns. It does not replace the medallion pattern; it is one possible output shape in gold.

vs Lambda Architecture

Lambda separates batch and streaming paths, then merges results. Medallion layers can exist within either path. Some teams use a streaming bronze (real-time event landing) and a batch bronze (daily full extracts) that both feed into the same silver layer. The medallion pattern is more about progressive refinement than about batch vs stream.

Frequently Asked Questions

What is medallion architecture?
Medallion architecture is a data organization pattern that structures a data lakehouse into three layers: bronze (raw ingestion), silver (cleaned and conformed), and gold (business-level aggregates). Each layer progressively refines the data, with quality gates at each boundary. The pattern was popularized by Databricks but applies to any lakehouse or data platform. It provides clear separation of concerns, makes pipelines easier to debug, and allows replay from raw data when transformations need to change.
Is medallion architecture the same as ETL?
No. ETL (extract, transform, load) describes a process. Medallion architecture describes a data organization pattern. ETL pipelines are the mechanism that moves data between medallion layers, but the architecture itself is about how you structure your storage and define quality boundaries. You could implement medallion layers using ETL, ELT, or streaming pipelines.
Do I need all three layers?
Not always. Some teams skip bronze and land partially cleaned data directly into silver, especially when the source is already well-structured. Some teams collapse silver and gold when the business logic is simple enough that cleaning and aggregation can happen in one step. The three-layer pattern is a guideline, not a rule. What matters is that you can articulate why you chose to include or skip a layer.
How does medallion architecture work with Delta Lake or Iceberg?
Table formats like Delta Lake and Apache Iceberg provide the transactional guarantees that make medallion architecture practical. ACID transactions mean a failed silver write does not leave the table in a corrupted state. Time travel means you can query previous versions of silver or gold tables without maintaining separate snapshots. Schema evolution support means silver tables can absorb new source columns without manual DDL changes. These features are not required for medallion architecture, but they remove significant operational friction.
Is medallion architecture asked about in interviews?
Yes, frequently. It appears in system design rounds ('design a data pipeline for X') and in data modeling discussions. Interviewers may not use the term 'medallion' explicitly. They might ask 'how do you organize your data lake' or 'walk me through your pipeline layers.' The underlying concept of progressive data refinement with quality gates at each boundary is what they are testing.

Practice Pipeline Architecture Questions

DataDriven covers pipeline design, data modeling, SQL, and Python with hands-on challenges at interview difficulty. Build fluency in the concepts interviewers actually test.
