Canonical Data Models

The integration pattern that turns N-by-M mappings into N-plus-M. When it's worth the governance overhead, when it isn't, and the version interviewers actually want to hear.

A canonical data model is the single agreed-upon definition of a business entity inside an organization. One Customer schema. One Order schema. One Employee schema. Every source system maps into it once. Every downstream consumer reads from it. The point is not the diagram. The point is the math: ten sources and eight consumers stop needing eighty point-to-point mappings and start needing eighteen.

The term is older than data engineering, going back to enterprise integration patterns in the 2000s, but the problem it solves is the one every growing data team eventually hits: four teams have four definitions of "active customer," and the CMO's dashboard and the CFO's board deck disagree by fourteen percent. A canonical model is what gets built after that meeting.

The math, in one paragraph

Without a canonical model, every source talks to every consumer. With N sources and M consumers, that's N times M mappings, and every new source means M new mappings, every new consumer means N new ones. With a canonical model, sources map in once and consumers map out once: N plus M mappings, growing linearly. At 3 sources and 2 consumers the math is irrelevant; at 10 sources and 8 consumers it's the difference between 80 and 18.
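The arithmetic above can be checked with a short sketch. The function names here are illustrative, not part of any library.

```python
def point_to_point(sources: int, consumers: int) -> int:
    """Mappings needed when every source maps directly to every consumer."""
    return sources * consumers


def canonical(sources: int, consumers: int) -> int:
    """Mappings needed when everything routes through one canonical model."""
    return sources + consumers


for n, m in [(3, 2), (10, 8), (10, 9)]:
    print(f"{n} sources, {m} consumers: "
          f"{point_to_point(n, m)} point-to-point vs {canonical(n, m)} canonical")
```

The last row shows the growth rate: going from eight consumers to nine adds ten point-to-point mappings but only one canonical mapping.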

This is the entire reason the pattern exists. Everything else is implementation detail. If your data platform will never grow past 3 sources, you don't need a canonical model. If you can already see 6 sources on the roadmap, you do.

The shape of a canonical Customer model

Three source systems map their version of Customer into one canonical entity. Three downstream consumers read from the canonical entity. Every source-side mapping lives in one place; every consumer change requires zero source changes. This is the picture interviewers want to see when you describe an integration layer.

Inbound mapping: CRM's first_name + last_name parses to given_name + family_name. Billing's full_name splits on the last space. Support's email_address normalizes to lowercase. The parsing lives in the mapping, not the canonical entity.

Canonical contract: The middle entity is the single agreed-upon definition. Schema lives in a registry. Producers can't publish a payload that fails validation; consumers can't read a payload they don't understand.

Outbound interface: Warehouse, ML feature store, reverse-ETL each read from canonical. Adding a 4th consumer is 1 new mapping, not N. Changing CRM's name format is 1 mapping change, not 4.
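A minimal sketch of the inbound side, assuming each source row arrives as a plain dict. The CanonicalCustomer class, the function names, and the email key on CRM and billing rows are assumptions for illustration; only first_name, last_name, full_name, contact_name, and email_address come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonicalCustomer:
    given_name: str
    family_name: str
    email: str


def split_on_last_space(full_name: str) -> tuple[str, str]:
    # rpartition returns ("", "", name) when there is no space, so a
    # single-word name lands in family_name with an empty given_name.
    given, _, family = full_name.strip().rpartition(" ")
    return given, family


def from_crm(row: dict) -> CanonicalCustomer:
    # CRM already stores the name in two parts; only the field names change.
    return CanonicalCustomer(
        given_name=row["first_name"].strip(),
        family_name=row["last_name"].strip(),
        email=row["email"].strip().lower(),  # "email" key is assumed
    )


def from_billing(row: dict) -> CanonicalCustomer:
    # Billing stores one full_name; the split lives here, not in the entity.
    given, family = split_on_last_space(row["full_name"])
    return CanonicalCustomer(given, family, row["email"].strip().lower())


def from_support(row: dict) -> CanonicalCustomer:
    # Support stores contact_name and email_address.
    given, family = split_on_last_space(row["contact_name"])
    return CanonicalCustomer(given, family, row["email_address"].strip().lower())
```

Splitting on the last space is a deliberate simplification: it misfiles multi-word family names, which is exactly the kind of source-specific weirdness the mapping layer is meant to contain.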

When it earns its overhead

Multi-source data integration. Five or more source systems with overlapping entities. CRM stores first_name and last_name. Billing stores full_name. Support stores contact_name. Without a canonical model, every analytics tool, every ML feature pipeline, every reverse-ETL job writes its own parsing logic for the same field. The canonical model is one place where given_name and family_name are defined once, and the billing-to-canonical mapping is the only place that knows about full_name parsing.

Event-driven microservices. Services publish events to Kafka or a similar bus. Without a shared contract, the order service emits orders in its format, payment emits in its format, and the analytics consumer parses both plus the inevitable schema drift. The canonical model is the contract: published as an Avro or Protobuf schema in the schema registry, enforced at publish time. In domain-driven design terms, the mapping layer in each service is the anti-corruption layer. The schema registry is the enforcement mechanism. The canonical model is the design decision.
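To make "enforced at publish time" concrete, here is a toy sketch in plain Python. It is not a real schema registry client: ORDER_SCHEMA, SchemaViolation, validate, and publish are hypothetical names, the Order fields are invented for illustration, and a list stands in for the bus.

```python
# Hypothetical canonical Order contract: field name -> expected Python type.
ORDER_SCHEMA = {
    "order_id": int,
    "customer_id": int,
    "total_cents": int,
    "currency": str,
}


class SchemaViolation(ValueError):
    """Raised when an event does not match the canonical contract."""


def validate(event: dict, schema: dict) -> None:
    missing = schema.keys() - event.keys()
    if missing:
        raise SchemaViolation(f"missing fields: {sorted(missing)}")
    for name, expected in schema.items():
        if not isinstance(event[name], expected):
            raise SchemaViolation(
                f"{name}: expected {expected.__name__}, "
                f"got {type(event[name]).__name__}"
            )


def publish(event: dict, bus: list) -> None:
    # Validation runs before the event reaches the bus, so consumers
    # never see a payload that breaks the contract.
    validate(event, ORDER_SCHEMA)
    bus.append(event)
```

In a real deployment the schema would be an Avro or Protobuf definition held in the registry, and the serializer would do this check; the sketch only shows where the check sits.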

Enterprise data platforms. Workday, Salesforce, Jira, NetSuite, HubSpot, each with its own definition of Employee. Building a unified org chart or access control system requires a canonical Employee entity: employee_id, given_name, family_name, department, hire_date, manager_id, cost_center. Workday's worker_id maps to employee_id. Salesforce's owner_name parses to given_name plus family_name. The canonical entity becomes the single definition every downstream tool consumes.
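The canonical Employee entity described above might look like the following sketch. The attribute names come from the list above, as do worker_id and owner_name; every other source-side key, the choice of which fields are optional, and the employee_id argument to from_salesforce (identity resolution is out of scope here) are assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Employee:
    employee_id: str
    given_name: str
    family_name: str
    department: Optional[str] = None
    hire_date: Optional[date] = None
    manager_id: Optional[str] = None
    cost_center: Optional[str] = None


def from_workday(row: dict) -> Employee:
    # Workday's worker_id becomes the canonical employee_id.
    raw_hire_date = row.get("hire_date")  # assumed to be an ISO date string
    return Employee(
        employee_id=row["worker_id"],
        given_name=row["first_name"],
        family_name=row["last_name"],
        department=row.get("department"),
        hire_date=date.fromisoformat(raw_hire_date) if raw_hire_date else None,
        manager_id=row.get("manager_worker_id"),
        cost_center=row.get("cost_center"),
    )


def from_salesforce(row: dict, employee_id: str) -> Employee:
    # Salesforce only carries owner_name; split it on the last space.
    given, _, family = row["owner_name"].strip().rpartition(" ")
    return Employee(employee_id=employee_id, given_name=given, family_name=family)
```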

When it doesn't earn its overhead

One or two source systems. The math doesn't favor it: with so few mappings, the governance overhead costs more than it saves. Start with point-to-point and revisit when a third source lands.

A small team where the canonical schema would have no owner. Without an owner who reviews change requests and enforces compliance, the model drifts and turns back into point-to-point with extra steps. A canonical model without governance is a worse version of no canonical model.

Sources that all come from the same SaaS family (Salesforce ecosystem, Microsoft ecosystem), with entity definitions that are already standardized. Building a canonical layer on top of an already-canonical stack is paperwork.

How to build one

There is no shortcut on step one. The teams who skip it end up redoing the work after the canonical model collides with reality.

01
Inventory the sources
List every source system that will feed in. Document the entities, the attributes, the types, the naming conventions, and the known quality issues. The inventory reveals which entities are universal (everyone has a customer) and which are source-specific (only billing has a payment method). Expect this to take longer than you think because the documentation in source systems is wrong about a third of the time.
02
Pick the canonical entities
From the inventory, extract the entities that span sources. Customer, Product, Employee, Transaction, Location are the usual suspects. Use business-neutral names. Not salesforce_contact, not hubspot_lead, just person. The right people to define these entities are the teams that own each source plus one tiebreaker who can say no when teams disagree about which definition wins.
03
Define attributes and types
More granular is better. If CRM has given_name and family_name and billing has full_name, the canonical entity has given_name and family_name, and the billing-to-canonical mapping does the parsing. Define required versus optional fields explicitly. Document in a schema registry, a shared repository, or a Protobuf/Avro definition. The format matters less than that there is one place to look.
04
Build the inbound mappings
Each source gets a mapping that transforms source data into the canonical format. This is where the unglamorous work is. Parsing full_name into given_name and family_name. Reconciling status enums (CRM uses active/inactive, billing uses 1/0). Handling the source's NULLs and the times when a required field isn't present. Each mapping is its own pipeline component, tested independently. The mapping layer absorbs all the source-specific weirdness so the canonical model stays clean.
05
Build the outbound interfaces
Downstream consumers each get a mapping from canonical to whatever shape they need. This layer stays thin because the canonical model is already clean. A new consumer needs exactly one new mapping. This is where the N-plus-M math pays off in practice: adding the ninth consumer to a platform with ten sources is one mapping, not ten.
06
Decide who owns it
The canonical model fails without an owner. Pick a team, give them the change-review process and the schema registry permissions, and write down the version policy. Most teams use schema-registry-style compatibility rules: adding an optional field with a default is a compatible change, while removing a field requires a deprecation window. Without this, you'll have a canonical model in name and a point-to-point system in practice within six months.
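The version policy in step six can be sketched as a check the owning team runs on every proposed schema change. This is a simplified stand-in for what a schema registry's compatibility mode does, not its actual algorithm; the schema layout (field name mapped to a dict with "type" and "required") and the function name are assumptions.

```python
def breaking_changes(old: dict, new: dict) -> list[str]:
    """Return the reasons a proposed schema version should be rejected."""
    problems = []
    for name in old.keys() - new.keys():
        problems.append(f"removed field '{name}' (needs a deprecation window)")
    for name in new.keys() - old.keys():
        if new[name]["required"]:
            problems.append(f"new field '{name}' is required (add it as optional)")
    for name in old.keys() & new.keys():
        if old[name]["type"] != new[name]["type"]:
            problems.append(f"field '{name}' changed type")
    return sorted(problems)


v1 = {
    "given_name": {"type": "string", "required": True},
    "family_name": {"type": "string", "required": True},
}
# Adding an optional field passes the check.
v2 = {**v1, "middle_name": {"type": "string", "required": False}}
# Removing a field and adding a required one are both flagged.
v3 = {
    "given_name": {"type": "string", "required": True},
    "tax_id": {"type": "string", "required": True},
}

print(breaking_changes(v1, v2))
print(breaking_changes(v1, v3))
```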

What interviewers actually ask

What is a canonical data model and why would you use one?

It's a standardized data representation that sits between sources and consumers. Sources map in once, consumers read out, and the whole thing exists to avoid the N-by-M mapping explosion that point-to-point integration produces. Use it when you have five or more sources with overlapping entities and you expect to add more. Don't use it when you have two sources and no roadmap to add a third.

How does a canonical model relate to a schema registry?

Schema registry is the enforcement tool. Canonical model is the design decision. The canonical model says 'an Order has these fields, these types, these constraints.' The schema registry stores that as an Avro or Protobuf definition and validates every event against it. You can have a canonical model without a schema registry (Google Docs and discipline) but you'll be sorry. You can't have a schema registry without something like a canonical model, because the schemas have to come from somewhere.

Ten microservices, no canonical model, inconsistent data. What do you do?

Start with the entity that hurts most. Almost always Customer or User. Pull the product and engineering leads from each service into one room and define the canonical version. Publish to a schema registry. Require new inter-service events involving that entity to conform. Build the anti-corruption layer in each service to translate between internal model and canonical. Roll out incrementally, entity by entity. Don't try to canonicalize everything at once.
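A rough sketch of one service's anti-corruption layer, assuming a billing service whose internal model stores full_name and a 1/0 flag. BillingAccount, BillingTranslator, and the CanonicalCustomer fields are hypothetical; the point is that translation in both directions lives in one class and nowhere else in the service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonicalCustomer:
    customer_id: str
    given_name: str
    family_name: str
    status: str  # "active" or "inactive"


@dataclass(frozen=True)
class BillingAccount:
    # Hypothetical internal model of the billing service.
    acct_no: int
    full_name: str
    enabled: int  # 1 or 0


class BillingTranslator:
    """Anti-corruption layer: the only code that knows both shapes."""

    @staticmethod
    def to_canonical(acct: BillingAccount) -> CanonicalCustomer:
        given, _, family = acct.full_name.strip().rpartition(" ")
        return CanonicalCustomer(
            customer_id=str(acct.acct_no),
            given_name=given,
            family_name=family,
            status="active" if acct.enabled == 1 else "inactive",
        )

    @staticmethod
    def from_canonical(customer: CanonicalCustomer) -> BillingAccount:
        full_name = f"{customer.given_name} {customer.family_name}".strip()
        return BillingAccount(
            acct_no=int(customer.customer_id),
            full_name=full_name,
            enabled=1 if customer.status == "active" else 0,
        )
```

The rest of the billing service keeps working with BillingAccount, which is what allows an incremental rollout: the internal model does not have to change on the day the canonical event is introduced.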

What are the actual downsides?

Three. Governance overhead: someone has to own it. Mapping maintenance: every source schema change is a mapping change. Lowest-common-denominator risk: if you define the canonical model as the intersection of source schemas, you lose useful source-specific data. The mitigation is to define it as the union of meaningful attributes, with optional fields tagged for source-specific cases. Senior candidates name all three downsides; mid candidates only name one.
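The intersection-versus-union point can be shown with plain sets. The attribute inventories below are invented for illustration (lead_score and last_ticket_at in particular are hypothetical).

```python
# Attributes each source can supply for Customer (illustrative inventory).
sources = {
    "crm": {"given_name", "family_name", "email", "lead_score"},
    "billing": {"given_name", "family_name", "email", "payment_method"},
    "support": {"given_name", "family_name", "email", "last_ticket_at"},
}

# Lowest common denominator: only what every source has.
intersection = set.intersection(*sources.values())

# Union of meaningful attributes: shared ones are required, the rest are
# optional and tagged with the sources that can populate them.
union = set.union(*sources.values())
required = intersection
optional = {
    attr: sorted(name for name, attrs in sources.items() if attr in attrs)
    for attr in sorted(union - intersection)
}

print(sorted(required))
print(optional)
```

Defining the model as the intersection would drop payment_method entirely; defining it as the union keeps the field, marked optional and attributed to billing.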

Common questions

Canonical model vs data warehouse schema?

Different layers. Canonical model is an integration pattern that standardizes data in transit. A warehouse schema (star, snowflake, data vault) is a storage pattern optimized for analytical queries. They compose: sources map to canonical, canonical feeds the warehouse, the warehouse organizes for queries. Don't confuse them in an interview; they sit at different points in the pipeline.

Do I need one with a single source?

No. The pattern solves multi-source integration. With one source you have one mapping, and a canonical model is paperwork. Revisit when a second source is on the roadmap. The cost of building it later is lower than people assume because the second source forces the abstraction anyway.

Canonical model vs master data management?

Related, not the same. Canonical model defines the schema. MDM goes further: deduplicates records, establishes golden records, manages lifecycle across systems. You can have a canonical model without MDM. You can't do MDM without something canonical-model-shaped to match records against. In practice the canonical model is step one and MDM is step three or four.

Is this the same as a data contract?

Overlapping but not identical. Data contracts are the broader concept: a producer-consumer agreement that includes schema, semantics, SLAs, ownership, and lifecycle. A canonical model is one piece of that, specifically the schema portion. Modern data contract platforms (Gable, Snowplow's Iglu, Confluent's data contract product) implement canonical models as the structural backbone of broader contract enforcement.
