Data Modeling
Data Vault is the enterprise integration methodology that handles what Kimball was never designed for: dozens of source systems, frequent schema changes, and full auditability. Interviewers test whether you understand hubs, links, and satellites, and whether you know when Data Vault is the right choice and when it is overkill.
Data Vault questions appear at enterprise-scale companies and in roles that involve integrating data from many source systems. Even if your target company uses Kimball, knowing Data Vault as a contrast strengthens your modeling answers.
Data Vault is a data modeling methodology created by Dan Linstedt for building enterprise data warehouses that integrate data from many source systems. Unlike Kimball dimensional modeling (which optimizes for query performance and business user comprehension), Data Vault optimizes for agility, auditability, and resilience to change.
The model is built from three types of tables: hubs (business keys), links (relationships), and satellites (descriptive attributes). This decomposition means that when a source system changes its schema, adds new attributes, or is replaced entirely, the impact is isolated to specific satellites. Hubs and links remain stable because business keys and their relationships transcend any single source system.
In practice, Data Vault is used as the integration layer (sometimes called the "raw vault") of a multi-layer warehouse. Most teams build a "business vault" on top, which consists of Kimball-style star schemas materialized from the raw vault for BI consumption. This layered approach gives you the integration benefits of Data Vault and the query simplicity of Kimball, at the cost of more tables and more ETL jobs.
These three table types are the building blocks of every Data Vault model. Understanding what goes into each, and what does not, is the foundation of every Data Vault interview question.
Business keys, the anchor of the model
A hub table represents a core business concept: Customer, Product, Order, Account. Each hub contains exactly four columns: a hash key (a surrogate key derived from the business key), the business key itself, a load timestamp, and a record source. Nothing else. No descriptive attributes, no foreign keys to other hubs. The hub is intentionally minimal because its job is to be the stable anchor that never changes, regardless of how source systems evolve.
Hubs are the most stable objects in a Data Vault model. Once a business key is loaded, the hub row never changes. This stability is what makes Data Vault resilient to source system changes. If a source system is replaced, the hub remains untouched because business keys (customer ID, product SKU) transcend any single system.
When explaining hubs, emphasize that they contain only the business key, not descriptive attributes. A common mistake is putting customer_name or product_category in the hub. Those belong in satellites. If you make this error in an interview, the interviewer will know your Data Vault understanding is superficial.
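The hub rules above can be sketched in a few lines. This is an illustrative in-memory model, not a real warehouse implementation: the table name `hub_customer` and helpers `hash_key` and `load_hub` are assumptions for the example, with a dict standing in for the hub table.

```python
import hashlib
from datetime import datetime, timezone

def hash_key(business_key: str) -> str:
    """Deterministic surrogate key: MD5 of the business key (a DV 2.0 convention)."""
    return hashlib.md5(business_key.encode("utf-8")).hexdigest()

# The hub holds ONLY: hash key, business key, load timestamp, record source.
hub_customer: dict[str, dict] = {}  # keyed by hash key; stands in for the hub table

def load_hub(business_key: str, record_source: str) -> str:
    """Idempotent hub load: insert the business key once, never update it."""
    hk = hash_key(business_key)
    if hk not in hub_customer:  # once loaded, the hub row never changes
        hub_customer[hk] = {
            "customer_hash_key": hk,
            "customer_id": business_key,  # the business key itself; no attributes here
            "load_date": datetime.now(timezone.utc),
            "record_source": record_source,
        }
    return hk

hk1 = load_hub("CUST-1001", "crm")
hk2 = load_hub("CUST-1001", "billing")  # same key from a second source: no new row
assert hk1 == hk2 and len(hub_customer) == 1
```

Note what is absent: no `customer_name`, no `product_category`. Replacing the CRM with a new system would add nothing to this structure as long as the business key survives.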
Relationships between business concepts
A link table captures a relationship between two or more hubs. An Order-Product link connects the Order hub to the Product hub, representing the fact that a specific order contained a specific product. Links contain hash keys of the participating hubs, a hash key for the link itself, a load timestamp, and a record source. Like hubs, links carry no descriptive attributes.
Links model relationships as many-to-many by default. This is a deliberate design choice. In traditional dimensional modeling, you must decide upfront whether a relationship is one-to-many or many-to-many. In Data Vault, every relationship is modeled as many-to-many, which means you never have to restructure when a business rule changes (for example, when a product that was previously in one category can now be in multiple).
Interviewers test whether you understand that links are always many-to-many. If asked 'how does Data Vault handle a one-to-many relationship,' the answer is: the same way it handles many-to-many. The link table structure does not change. Cardinality is enforced by the data, not the schema.
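A minimal sketch of the link structure makes the many-to-many point concrete. The names `link_order_product`, `hk`, and `load_link` are illustrative assumptions, with dicts standing in for tables.

```python
import hashlib

def hk(value: str) -> str:
    """Deterministic hash key (MD5, per DV 2.0 convention)."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()

# The link holds ONLY: its own hash key, the hash keys of the hubs it joins,
# and load metadata. No descriptive attributes, no cardinality constraints.
link_order_product: dict[str, dict] = {}

def load_link(order_id: str, product_id: str, record_source: str) -> str:
    # The link's hash key derives from the combined business keys, so the same
    # relationship resolves to the same row no matter which pipeline loads it.
    link_hk = hk(f"{order_id}||{product_id}")
    link_order_product.setdefault(link_hk, {
        "link_hash_key": link_hk,
        "order_hash_key": hk(order_id),
        "product_hash_key": hk(product_id),
        "record_source": record_source,
    })
    return link_hk

# Many-to-many falls out of the structure: nothing stops an order containing
# many products, or a product appearing in many orders.
load_link("ORD-1", "SKU-A", "oms")
load_link("ORD-1", "SKU-B", "oms")
load_link("ORD-2", "SKU-A", "oms")
assert len(link_order_product) == 3
```

If the business later decides an order line can only hold one product, this schema does not change; the restriction, if wanted, is enforced by the loading logic or the data, not the table.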
Descriptive attributes with full history
Satellites store all descriptive attributes and their history. A Customer satellite contains name, email, address, phone, and any other attributes. Each change creates a new row with a load timestamp, preserving full history automatically. Satellites attach to either a hub or a link. A hub can have multiple satellites, often split by source system or rate of change.
Satellites are where all the descriptive richness lives, and where Data Vault gets its auditability. Because every change is a new row with a timestamp and source reference, you have a complete, immutable audit trail. You can answer 'what did system X say about customer Y on date Z?' without any special SCD machinery. History tracking is built into the structure.
Two things to get right about satellites: (1) They attach to a single parent (one hub or one link), never to multiple parents. (2) Splitting satellites by source system or by rate of change is a key design decision. Attributes that change daily (like a customer's session count) should not be in the same satellite as attributes that change yearly (like their address), because each change to either group would create a new row for both.
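The append-only history mechanism can be sketched with a change-detection hash (often called a "hashdiff" in DV 2.0 practice). The satellite name `sat_customer_crm` and the helper functions are assumptions for this example, with a list standing in for the table.

```python
import hashlib
from datetime import datetime, timezone

sat_customer_crm: list[dict] = []  # append-only: every change is a new row

def hashdiff(attrs: dict) -> str:
    """Hash of all descriptive attributes, used to detect real changes."""
    payload = "||".join(f"{k}={attrs[k]}" for k in sorted(attrs))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def load_satellite(customer_hash_key: str, attrs: dict, record_source: str) -> None:
    """Insert a new row only if the attributes actually changed."""
    latest = next((r for r in reversed(sat_customer_crm)
                   if r["customer_hash_key"] == customer_hash_key), None)
    if latest and latest["hashdiff"] == hashdiff(attrs):
        return  # identical payload: no new row, history stays clean
    sat_customer_crm.append({
        "customer_hash_key": customer_hash_key,
        "hashdiff": hashdiff(attrs),
        "load_date": datetime.now(timezone.utc),
        "record_source": record_source,
        **attrs,
    })

load_satellite("hk1", {"name": "Ada", "email": "ada@example.com"}, "crm")
load_satellite("hk1", {"name": "Ada", "email": "ada@example.com"}, "crm")  # no-op
load_satellite("hk1", {"name": "Ada", "email": "ada@new.example"}, "crm")  # new row
assert len(sat_customer_crm) == 2
```

Answering "what did the CRM say about this customer on date Z?" is then a filter on `load_date`, with no SCD Type 2 machinery required.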
Data Vault 2.0 introduced four practical improvements that made the methodology viable for modern distributed data platforms. These are worth knowing for interviews because they address the most common criticisms of the original approach.
Data Vault 2.0 replaces sequence-based surrogate keys with deterministic hash keys (typically MD5 or SHA-1 of the business key). This enables parallel loading: multiple pipelines can compute the same hash independently without coordinating on a sequence generator. Hash keys also make it possible to load hubs, links, and satellites in any order, because the foreign key values are deterministic.
Hash keys eliminate the sequence bottleneck that made Data Vault 1.0 difficult to scale. In a distributed processing environment (Spark, Snowflake), this is essential. The trade-off is that hash collisions are theoretically possible, though vanishingly rare with MD5 on business keys of reasonable length.
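The coordination-free property is easy to demonstrate: two loaders that never communicate derive the same key. A minimal sketch (the pipeline names are hypothetical):

```python
import hashlib

def hash_key(business_key: str) -> str:
    # DV 2.0 convention: the surrogate key is a deterministic hash of the
    # business key, so no central sequence generator is needed.
    return hashlib.md5(business_key.encode("utf-8")).hexdigest()

# Two pipelines running on different workers, in any order, with no shared
# state, derive the identical key for the same customer. Satellite rows
# written by pipeline B therefore join correctly to the hub row written by
# pipeline A, even if B finishes first.
pipeline_a_key = hash_key("CUST-1001")  # e.g. the hub loader
pipeline_b_key = hash_key("CUST-1001")  # e.g. a satellite loader
assert pipeline_a_key == pipeline_b_key

# A sequence-based key could not do this: the satellite loader would have to
# look up (or wait for) whatever value the hub loader was assigned.
```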
A ghost record is a placeholder row in a hub or link that represents 'unknown' or 'not yet arrived.' Instead of using NULL foreign keys (which break joins and complicate queries), Data Vault 2.0 uses a well-known ghost hash key (typically all zeros) that every satellite and link can reference. This simplifies query logic and eliminates outer join requirements for missing relationships.
Ghost records solve the 'early-arriving fact' problem cleanly. When a transaction references a customer that has not yet been loaded into the hub, the transaction link points to the ghost record. When the customer eventually arrives, the link is not updated; the hub simply gets a new row, and the ghost reference in historical data remains as an accurate record of what was known at load time.
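A sketch of the ghost-record lookup, assuming the common all-zeros convention for the ghost key (the function and table names are illustrative):

```python
import hashlib

GHOST_KEY = "0" * 32  # well-known "unknown" member: an all-zeros MD5-length key

# The hub is seeded with the ghost row at creation time.
hub_customer = {GHOST_KEY: {"customer_id": None, "record_source": "SYSTEM"}}

def resolve_customer(customer_id) -> str:
    """Return the hub hash key, or the ghost key for early-arriving facts."""
    if customer_id is None:
        return GHOST_KEY
    hk = hashlib.md5(customer_id.encode("utf-8")).hexdigest()
    # Customer not yet loaded into the hub: point at the ghost, not at NULL,
    # so downstream inner joins still return the row.
    return hk if hk in hub_customer else GHOST_KEY

link_row = {"customer_hash_key": resolve_customer("CUST-999")}
assert link_row["customer_hash_key"] == GHOST_KEY
```

When CUST-999 is eventually loaded, the hub gains a normal row; the historical link row keeps its ghost reference as an accurate record of what was known at load time.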
PIT (point-in-time) tables are pre-computed lookup tables that, for each hub key and each snapshot date, record which row in each satellite was in effect at that time. Without PIT tables, querying the current state of an entity requires joining to every satellite and filtering by the latest timestamp, which is expensive. PIT tables pre-resolve these temporal joins.
PIT tables are what make Data Vault queryable at production speed. Raw Data Vault queries (joining hubs to multiple satellites with temporal filtering) are complex and slow. PIT tables reduce these to simple equi-joins. Think of PIT tables as a query optimization layer, not a core modeling concept.
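The temporal pre-resolution can be sketched as follows. This is a toy in-memory construction, not warehouse DDL: the satellite data and the `build_pit` helper are assumptions for the example.

```python
from datetime import date

# Satellite rows as (hub key, load date) pairs: two satellites with very
# different rates of change (toy data standing in for the real tables).
sat_profile = [("hk1", date(2024, 1, 1)), ("hk1", date(2024, 6, 1))]
sat_activity = [("hk1", date(2024, 1, 1)), ("hk1", date(2024, 1, 2)),
                ("hk1", date(2024, 1, 3))]

def build_pit(hub_keys, snapshot_date, sats):
    """One PIT row per hub key: the effective load_date in each satellite as
    of the snapshot, so consuming queries become simple equi-joins."""
    pit = []
    for hub_key in hub_keys:
        row = {"hub_hash_key": hub_key, "snapshot_date": snapshot_date}
        for name, rows in sats.items():
            # Latest satellite row at or before the snapshot date.
            effective = max((ld for k, ld in rows
                             if k == hub_key and ld <= snapshot_date), default=None)
            row[f"{name}_load_date"] = effective
        pit.append(row)
    return pit

pit = build_pit(["hk1"], date(2024, 3, 1),
                {"sat_profile": sat_profile, "sat_activity": sat_activity})
assert pit[0]["sat_profile_load_date"] == date(2024, 1, 1)   # June row not yet in effect
assert pit[0]["sat_activity_load_date"] == date(2024, 1, 3)  # latest of the January rows
```

A query for "state as of March 1" joins the PIT row to each satellite on hub key plus the stored load date, with no per-query window functions or temporal filtering.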
Bridge tables pre-compute multi-hop link traversals. If you frequently need to query from Customer through Order through Product, a bridge table pre-joins these links into a single lookup. Like PIT tables, bridges are a performance optimization for the query layer.
Without bridges, every analytical query requires joining through the full hub-link-hub chain, which can involve 5 to 10 joins for a single business question. Bridges collapse these into 1 to 2 joins, making Data Vault practical for BI tool consumption.
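The multi-hop collapse can be shown with two toy links. The `build_bridge` helper and the readable stand-in keys are assumptions for this example; in a real vault these would be hash keys.

```python
# Two raw-vault links (hash keys abbreviated to readable strings here):
link_customer_order = [("cust1", "ord1"), ("cust1", "ord2")]
link_order_product = [("ord1", "skuA"), ("ord1", "skuB"), ("ord2", "skuA")]

def build_bridge(co_link, op_link):
    """Pre-compute the customer -> order -> product traversal so analytical
    queries join once to the bridge instead of walking hub-link-hub chains."""
    return [(cust, order, product)
            for cust, order in co_link
            for o, product in op_link if o == order]

bridge_customer_product = build_bridge(link_customer_order, link_order_product)
assert ("cust1", "ord1", "skuB") in bridge_customer_product
assert len(bridge_customer_product) == 3
```

Like a PIT table, the bridge is derived and disposable: it can be rebuilt from the raw vault at any time, which is why both are treated as a query layer rather than core model structures.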
This is the comparison interviewers ask about most. The answer is not "one is better." It is "they solve different problems and often coexist."
Data Vault questions in interviews fall into three categories. First, definitional: "what are hubs, links, and satellites?" This is table stakes. If you cannot answer this clearly, the interviewer moves on.
Second, comparative: "when would you choose Data Vault over Kimball?" This tests judgment. The interviewer wants to hear specific criteria (number of sources, schema stability, auditability requirements, team size) rather than generic preferences. Saying "Data Vault is better because it is more modern" is a red flag.
Third, practical: "you have a Data Vault model and queries are slow. What do you do?" This tests whether you have actually worked with Data Vault or just read about it. The answer involves PIT tables, bridges, and business vault star schemas. Candidates who have only read the theory often do not know these query optimization patterns exist.
These questions cover the range of what interviewers ask about Data Vault, from definitions to design decisions to performance optimization.
Hubs store business keys (the anchor, never changes). Links store relationships between hubs (always many-to-many). Satellites store descriptive attributes with full history (attached to one hub or one link). The key point: hubs and links carry no descriptive data. All attributes live in satellites. This separation is what gives Data Vault its resilience to change: when a source system is replaced, only satellites change. Hubs and links remain stable.
Data Vault when: many source systems, frequent schema changes, auditability requirements, large team. Kimball when: few sources, stable schemas, BI-focused consumers, small team, speed of delivery matters. The sophisticated answer is that they are not mutually exclusive: many enterprises use Data Vault for the integration/raw vault layer and build Kimball star schemas on top as the presentation layer (business vault). If you frame it as an either-or choice, the interviewer may push back.
When a source system adds a new attribute, you add a new satellite (or add a column to an existing satellite if the rate of change is similar). Hubs and links are unaffected. When a source system is replaced entirely, you create new satellites from the new source but the same hubs and links remain, because business keys transcend source systems. This is the core architectural benefit: the structural layer (hubs and links) is insulated from source volatility.
Four key changes: (1) Hash keys replace sequences, enabling parallel and order-independent loading. (2) Ghost records replace NULLs for unknown references. (3) PIT tables pre-compute temporal joins for query performance. (4) Bridge tables pre-compute multi-hop joins. The underlying philosophy is the same; DV 2.0 makes it practical for modern distributed processing and BI consumption.
Raw Data Vault queries are complex, yes. A simple 'show me customer orders with product details' requires joining Customer Hub to Order Link to Product Hub, plus satellites for each. That is 5 to 7 joins for one question. In practice, you build a query layer on top: PIT tables for temporal resolution, bridge tables for multi-hop traversal, and often a 'business vault' of star schemas materialized from the raw vault. End users and BI tools never query the raw vault directly.
Split by source system and by rate of change. Example: (1) sat_customer_crm: attributes from the CRM (name, email, phone), changes infrequently. (2) sat_customer_web: attributes from the web platform (last_login, session_count, preferred_language), changes frequently. (3) sat_customer_billing: attributes from the billing system (payment_method, credit_score), changes on subscription events. This split prevents high-frequency changes in one source from creating unnecessary rows in satellites that track stable attributes from another source.
Three approaches: (1) Build PIT tables for frequently queried hubs so temporal joins are pre-computed. (2) Build bridge tables for common multi-hop query paths. (3) Create a business vault layer with star schemas materialized from the raw vault, and point BI tools at the business vault instead of the raw vault. The root cause is that raw Data Vault is not designed for direct analytical querying. It is designed for integration and auditability. The query layer must be built on top.
DataDriven covers data modeling, SQL, and system design with hands-on challenges at interview difficulty. Practice the concepts interviewers test, from Kimball to Data Vault to pipeline architecture.
Start Practicing