Data Modeling
Data Vault is the enterprise integration methodology that handles what Kimball was never designed for: dozens of source systems, frequent schema changes, and full auditability. Interviewers test whether you understand hubs, links, and satellites, and whether you know when Data Vault is the right choice and when it is overkill.
Data Vault questions appear at enterprise-scale companies and in roles that involve integrating data from many source systems. Even if your target company uses Kimball, knowing Data Vault as a contrast strengthens your modeling answers.
Data Vault is a data modeling methodology created by Dan Linstedt for building enterprise data warehouses that integrate data from many source systems. Unlike Kimball dimensional modeling (which optimizes for query performance and business user comprehension), Data Vault optimizes for agility, auditability, and resilience to change.
The model is built from three types of tables: hubs (business keys), links (relationships), and satellites (descriptive attributes). This decomposition means that when a source system changes its schema, adds new attributes, or is replaced entirely, the impact is isolated to specific satellites. Hubs and links remain stable because business keys and their relationships transcend any single source system.
In practice, Data Vault is used as the integration layer (sometimes called the "raw vault") of a multi-layer warehouse. Most teams build a "business vault" on top, which consists of Kimball-style star schemas materialized from the raw vault for BI consumption. This layered approach gives you the integration benefits of Data Vault and the query simplicity of Kimball, at the cost of more tables and more ETL jobs.
These three table types are the building blocks of every Data Vault model. Understanding what goes into each, and what does not, is the foundation of every Data Vault interview question.
Business keys, the anchor of the model
A hub table represents a core business concept: Customer, Product, Order, Account. Each hub contains exactly four columns: a hash key (a surrogate key derived from the business key), the business key itself, a load timestamp, and a record source. Nothing else. No descriptive attributes, no foreign keys to other hubs. The hub is intentionally minimal because its job is to be the stable anchor that never changes, regardless of how source systems evolve.
Hubs are the most stable objects in a Data Vault model. Once a business key is loaded, the hub row never changes. This stability is what makes Data Vault resilient to source system changes. If a source system is replaced, the hub remains untouched because business keys (customer ID, product SKU) transcend any single system.
When explaining hubs, emphasize that they contain only the business key, not descriptive attributes. A common mistake is putting customer_name or product_category in the hub. Those belong in satellites. If you make this error in an interview, the interviewer will know your Data Vault understanding is superficial.
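The hub rules above can be sketched in a few lines. This is an illustrative in-memory model, not a real warehouse implementation: the table name `hub_customer` and helpers `hash_key` and `load_hub` are assumptions for the example, with a dict standing in for the hub table.

```python
import hashlib
from datetime import datetime, timezone

def hash_key(business_key: str) -> str:
    """Deterministic surrogate key: MD5 of the business key (a DV 2.0 convention)."""
    return hashlib.md5(business_key.encode("utf-8")).hexdigest()

# The hub holds ONLY: hash key, business key, load timestamp, record source.
hub_customer: dict[str, dict] = {}  # keyed by hash key; stands in for the hub table

def load_hub(business_key: str, record_source: str) -> str:
    """Idempotent hub load: insert the business key once, never update it."""
    hk = hash_key(business_key)
    if hk not in hub_customer:  # once loaded, the hub row never changes
        hub_customer[hk] = {
            "customer_hash_key": hk,
            "customer_id": business_key,  # the business key itself; no attributes here
            "load_date": datetime.now(timezone.utc),
            "record_source": record_source,
        }
    return hk

hk1 = load_hub("CUST-1001", "crm")
hk2 = load_hub("CUST-1001", "billing")  # same key from a second source: no new row
assert hk1 == hk2 and len(hub_customer) == 1
```

Note what is absent: no `customer_name`, no `product_category`. Replacing the CRM with a new system would add nothing to this structure as long as the business key survives.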
Relationships between business concepts
A link table captures a relationship between two or more hubs. An Order-Product link connects the Order hub to the Product hub, representing the fact that a specific order contained a specific product. Links contain hash keys of the participating hubs, a hash key for the link itself, a load timestamp, and a record source. Like hubs, links carry no descriptive attributes.
Links model relationships as many-to-many by default. This is a deliberate design choice. In traditional dimensional modeling, you must decide upfront whether a relationship is one-to-many or many-to-many. In Data Vault, every relationship is modeled as many-to-many, which means you never have to restructure when a business rule changes (for example, when a product that was previously in one category can now be in multiple).
Interviewers test whether you understand that links are always many-to-many. If asked 'how does Data Vault handle a one-to-many relationship,' the answer is: the same way it handles many-to-many. The link table structure does not change. Cardinality is enforced by the data, not the schema.
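A minimal sketch of the link structure makes the many-to-many point concrete. The names `link_order_product`, `hk`, and `load_link` are illustrative assumptions, with dicts standing in for tables.

```python
import hashlib

def hk(value: str) -> str:
    """Deterministic hash key (MD5, per DV 2.0 convention)."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()

# The link holds ONLY: its own hash key, the hash keys of the hubs it joins,
# and load metadata. No descriptive attributes, no cardinality constraints.
link_order_product: dict[str, dict] = {}

def load_link(order_id: str, product_id: str, record_source: str) -> str:
    # The link's hash key derives from the combined business keys, so the same
    # relationship resolves to the same row no matter which pipeline loads it.
    link_hk = hk(f"{order_id}||{product_id}")
    link_order_product.setdefault(link_hk, {
        "link_hash_key": link_hk,
        "order_hash_key": hk(order_id),
        "product_hash_key": hk(product_id),
        "record_source": record_source,
    })
    return link_hk

# Many-to-many falls out of the structure: nothing stops an order containing
# many products, or a product appearing in many orders.
load_link("ORD-1", "SKU-A", "oms")
load_link("ORD-1", "SKU-B", "oms")
load_link("ORD-2", "SKU-A", "oms")
assert len(link_order_product) == 3
```

If the business later decides an order line can only hold one product, this schema does not change; the restriction, if wanted, is enforced by the loading logic or the data, not the table.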
Descriptive attributes with full history
Satellites store all descriptive attributes and their history. A Customer satellite contains name, email, address, phone, and any other attributes. Each change creates a new row with a load timestamp, preserving full history automatically. Satellites attach to either a hub or a link. A hub can have multiple satellites, often split by source system or rate of change.
Satellites are where all the descriptive richness lives, and where Data Vault gets its auditability. Because every change is a new row with a timestamp and source reference, you have a complete, immutable audit trail. You can answer 'what did system X say about customer Y on date Z?' without any special SCD machinery. History tracking is built into the structure.
Two things to get right about satellites: (1) They attach to a single parent (one hub or one link), never to multiple parents. (2) Splitting satellites by source system or by rate of change is a key design decision. Attributes that change daily (like a customer's session count) should not be in the same satellite as attributes that change yearly (like their address), because each change to either group would create a new row for both.
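The append-only history mechanism can be sketched with a change-detection hash (often called a "hashdiff" in DV 2.0 practice). The satellite name `sat_customer_crm` and the helper functions are assumptions for this example, with a list standing in for the table.

```python
import hashlib
from datetime import datetime, timezone

sat_customer_crm: list[dict] = []  # append-only: every change is a new row

def hashdiff(attrs: dict) -> str:
    """Hash of all descriptive attributes, used to detect real changes."""
    payload = "||".join(f"{k}={attrs[k]}" for k in sorted(attrs))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def load_satellite(customer_hash_key: str, attrs: dict, record_source: str) -> None:
    """Insert a new row only if the attributes actually changed."""
    latest = next((r for r in reversed(sat_customer_crm)
                   if r["customer_hash_key"] == customer_hash_key), None)
    if latest and latest["hashdiff"] == hashdiff(attrs):
        return  # identical payload: no new row, history stays clean
    sat_customer_crm.append({
        "customer_hash_key": customer_hash_key,
        "hashdiff": hashdiff(attrs),
        "load_date": datetime.now(timezone.utc),
        "record_source": record_source,
        **attrs,
    })

load_satellite("hk1", {"name": "Ada", "email": "ada@example.com"}, "crm")
load_satellite("hk1", {"name": "Ada", "email": "ada@example.com"}, "crm")  # no-op
load_satellite("hk1", {"name": "Ada", "email": "ada@new.example"}, "crm")  # new row
assert len(sat_customer_crm) == 2
```

Answering "what did the CRM say about this customer on date Z?" is then a filter on `load_date`, with no SCD Type 2 machinery required.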
Data Vault 2.0 introduced four practical improvements that made the methodology viable for modern distributed data platforms. These are worth knowing for interviews because they address the most common criticisms of the original approach.
Data Vault 2.0 replaces sequence-based surrogate keys with deterministic hash keys (typically MD5 or SHA-1 of the business key). This enables parallel loading: multiple pipelines can compute the same hash independently without coordinating on a sequence generator. Hash keys also make it possible to load hubs, links, and satellites in any order, because the foreign key values are deterministic.
Hash keys eliminate the sequence bottleneck that made Data Vault 1.0 difficult to scale. In a distributed processing environment (Spark, Snowflake), this is essential. The trade-off is that hash collisions are theoretically possible, though vanishingly rare with MD5 on business keys of reasonable length.
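The coordination-free property is easy to demonstrate: two loaders that never communicate derive the same key. A minimal sketch (the pipeline names are hypothetical):

```python
import hashlib

def hash_key(business_key: str) -> str:
    # DV 2.0 convention: the surrogate key is a deterministic hash of the
    # business key, so no central sequence generator is needed.
    return hashlib.md5(business_key.encode("utf-8")).hexdigest()

# Two pipelines running on different workers, in any order, with no shared
# state, derive the identical key for the same customer. Satellite rows
# written by pipeline B therefore join correctly to the hub row written by
# pipeline A, even if B finishes first.
pipeline_a_key = hash_key("CUST-1001")  # e.g. the hub loader
pipeline_b_key = hash_key("CUST-1001")  # e.g. a satellite loader
assert pipeline_a_key == pipeline_b_key

# A sequence-based key could not do this: the satellite loader would have to
# look up (or wait for) whatever value the hub loader was assigned.
```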
A ghost record is a placeholder row in a hub or link that represents 'unknown' or 'not yet arrived.' Instead of using NULL foreign keys (which break joins and complicate queries), Data Vault 2.0 uses a well-known ghost hash key (typically all zeros) that every satellite and link can reference. This simplifies query logic and eliminates outer join requirements for missing relationships.
Ghost records solve the 'early-arriving fact' problem cleanly. When a transaction references a customer that has not yet been loaded into the hub, the transaction link points to the ghost record. When the customer eventually arrives, the link is not updated; the hub simply gets a new row, and the ghost reference in historical data remains as an accurate record of what was known at load time.
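A sketch of the ghost-record lookup, assuming the common all-zeros convention for the ghost key (the function and table names are illustrative):

```python
import hashlib

GHOST_KEY = "0" * 32  # well-known "unknown" member: an all-zeros MD5-length key

# The hub is seeded with the ghost row at creation time.
hub_customer = {GHOST_KEY: {"customer_id": None, "record_source": "SYSTEM"}}

def resolve_customer(customer_id) -> str:
    """Return the hub hash key, or the ghost key for early-arriving facts."""
    if customer_id is None:
        return GHOST_KEY
    hk = hashlib.md5(customer_id.encode("utf-8")).hexdigest()
    # Customer not yet loaded into the hub: point at the ghost, not at NULL,
    # so downstream inner joins still return the row.
    return hk if hk in hub_customer else GHOST_KEY

link_row = {"customer_hash_key": resolve_customer("CUST-999")}
assert link_row["customer_hash_key"] == GHOST_KEY
```

When CUST-999 is eventually loaded, the hub gains a normal row; the historical link row keeps its ghost reference as an accurate record of what was known at load time.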
PIT (point-in-time) tables are pre-computed lookup tables that, for each hub key and each snapshot date, record which row in each satellite was in effect at that time. Without PIT tables, querying the current state of an entity requires joining to every satellite and filtering by the latest timestamp, which is expensive. PIT tables pre-resolve these temporal joins.
PIT tables are what make Data Vault queryable at production speed. Raw Data Vault queries (joining hubs to multiple satellites with temporal filtering) are complex and slow. PIT tables reduce these to simple equi-joins. Think of PIT tables as a query optimization layer, not a core modeling concept.
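The temporal pre-resolution can be sketched as follows. This is a toy in-memory construction, not warehouse DDL: the satellite data and the `build_pit` helper are assumptions for the example.

```python
from datetime import date

# Satellite rows as (hub key, load date) pairs: two satellites with very
# different rates of change (toy data standing in for the real tables).
sat_profile = [("hk1", date(2024, 1, 1)), ("hk1", date(2024, 6, 1))]
sat_activity = [("hk1", date(2024, 1, 1)), ("hk1", date(2024, 1, 2)),
                ("hk1", date(2024, 1, 3))]

def build_pit(hub_keys, snapshot_date, sats):
    """One PIT row per hub key: the effective load_date in each satellite as
    of the snapshot, so consuming queries become simple equi-joins."""
    pit = []
    for hub_key in hub_keys:
        row = {"hub_hash_key": hub_key, "snapshot_date": snapshot_date}
        for name, rows in sats.items():
            # Latest satellite row at or before the snapshot date.
            effective = max((ld for k, ld in rows
                             if k == hub_key and ld <= snapshot_date), default=None)
            row[f"{name}_load_date"] = effective
        pit.append(row)
    return pit

pit = build_pit(["hk1"], date(2024, 3, 1),
                {"sat_profile": sat_profile, "sat_activity": sat_activity})
assert pit[0]["sat_profile_load_date"] == date(2024, 1, 1)   # June row not yet in effect
assert pit[0]["sat_activity_load_date"] == date(2024, 1, 3)  # latest of the January rows
```

A query for "state as of March 1" joins the PIT row to each satellite on hub key plus the stored load date, with no per-query window functions or temporal filtering.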
Bridge tables pre-compute multi-hop link traversals. If you frequently need to query from Customer through Order through Product, a bridge table pre-joins these links into a single lookup. Like PIT tables, bridges are a performance optimization for the query layer.
Without bridges, every analytical query requires joining through the full hub-link-hub chain, which can involve 5 to 10 joins for a single business question. Bridges collapse these into 1 to 2 joins, making Data Vault practical for BI tool consumption.
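The multi-hop collapse can be shown with two toy links. The `build_bridge` helper and the readable stand-in keys are assumptions for this example; in a real vault these would be hash keys.

```python
# Two raw-vault links (hash keys abbreviated to readable strings here):
link_customer_order = [("cust1", "ord1"), ("cust1", "ord2")]
link_order_product = [("ord1", "skuA"), ("ord1", "skuB"), ("ord2", "skuA")]

def build_bridge(co_link, op_link):
    """Pre-compute the customer -> order -> product traversal so analytical
    queries join once to the bridge instead of walking hub-link-hub chains."""
    return [(cust, order, product)
            for cust, order in co_link
            for o, product in op_link if o == order]

bridge_customer_product = build_bridge(link_customer_order, link_order_product)
assert ("cust1", "ord1", "skuB") in bridge_customer_product
assert len(bridge_customer_product) == 3
```

Like a PIT table, the bridge is derived and disposable: it can be rebuilt from the raw vault at any time, which is why both are treated as a query layer rather than core model structures.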
This is the comparison interviewers ask about most. The answer is not "one is better." It is "they solve different problems and often coexist."
Data Vault questions in interviews fall into three categories. First, definitional: "what are hubs, links, and satellites?" This is table stakes. If you cannot answer this clearly, the interviewer moves on.
Second, comparative: "when would you choose Data Vault over Kimball?" This tests judgment. The interviewer wants to hear specific criteria (number of sources, schema stability, auditability requirements, team size) rather than generic preferences. Saying "Data Vault is better because it is more modern" is a red flag.
Third, practical: "you have a Data Vault model and queries are slow. What do you do?" This tests whether you have actually worked with Data Vault or just read about it. The answer involves PIT tables, bridges, and business vault star schemas. Candidates who have only read the theory often do not know these query optimization patterns exist.
These questions cover the range of what interviewers ask about Data Vault, from definitions to design decisions to performance optimization.
Hubs store business keys (the anchor, never changes). Links store relationships between hubs (always many-to-many). Satellites store descriptive attributes with full history (attached to one hub or one link). The key point: hubs and links carry no descriptive data. All attributes live in satellites. This separation is what gives Data Vault its resilience to change: when a source system is replaced, only satellites change. Hubs and links remain stable.
Data Vault when: many source systems, frequent schema changes, auditability requirements, large team. Kimball when: few sources, stable schemas, BI-focused consumers, small team, speed of delivery matters. The sophisticated answer is that they are not mutually exclusive: many enterprises use Data Vault for the integration/raw vault layer and build Kimball star schemas on top as the presentation layer (business vault). If you frame it as an either-or choice, the interviewer may push back.
When a source system adds a new attribute, you add a new satellite (or add a column to an existing satellite if the rate of change is similar). Hubs and links are unaffected. When a source system is replaced entirely, you create new satellites from the new source but the same hubs and links remain, because business keys transcend source systems. This is the core architectural benefit: the structural layer (hubs and links) is insulated from source volatility.
Four key changes: (1) Hash keys replace sequences, enabling parallel and order-independent loading. (2) Ghost records replace NULLs for unknown references. (3) PIT tables pre-compute temporal joins for query performance. (4) Bridge tables pre-compute multi-hop joins. The underlying philosophy is the same; DV 2.0 makes it practical for modern distributed processing and BI consumption.
Raw Data Vault queries are complex, yes. A simple 'show me customer orders with product details' requires joining Customer Hub to Order Link to Product Hub, plus satellites for each. That is 5 to 7 joins for one question. In practice, you build a query layer on top: PIT tables for temporal resolution, bridge tables for multi-hop traversal, and often a 'business vault' of star schemas materialized from the raw vault. End users and BI tools never query the raw vault directly.
Split by source system and by rate of change. Example: (1) sat_customer_crm: attributes from the CRM (name, email, phone), changes infrequently. (2) sat_customer_web: attributes from the web platform (last_login, session_count, preferred_language), changes frequently. (3) sat_customer_billing: attributes from the billing system (payment_method, credit_score), changes on subscription events. This split prevents high-frequency changes in one source from creating unnecessary rows in satellites that track stable attributes from another source.
Three approaches: (1) Build PIT tables for frequently queried hubs so temporal joins are pre-computed. (2) Build bridge tables for common multi-hop query paths. (3) Create a business vault layer with star schemas materialized from the raw vault, and point BI tools at the business vault instead of the raw vault. The root cause is that raw Data Vault is not designed for direct analytical querying. It is designed for integration and auditability. The query layer must be built on top.
DataDriven covers data modeling, SQL, and system design with hands-on challenges at interview difficulty. Practice the concepts interviewers test, from Kimball to Data Vault to pipeline architecture.
Start Practicing