Data vault 2.0 interview questions for data engineer roles. Hubs for business keys, links for relationships between hubs, satellites for descriptive attributes with full versioning. The case for vault is governance, audit trail, and parallel ingestion. The case against vault is query complexity (every business question becomes a 6-table join).

Data vault 2.0 is an alternative to Kimball star schemas in data warehouse design. The vault structure separates the model into three concepts. Hubs hold business keys: hub_customer with customer_hash, customer_id, load_timestamp, record_source; hub_product with product_hash, product_id, load_timestamp, record_source. One row per business key, ever. Links hold relationships between hubs: link_order with hub_customer_hash, hub_product_hash, hub_order_hash, load_timestamp, record_source. One row per unique combination of hub keys, ever. Satellites hold descriptive attributes with full versioning: sat_customer_address with customer_hash, load_timestamp, record_source, address fields. Every change to an attribute is a new satellite row; existing rows are never updated.
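The three structures can be sketched as tables. This is a minimal illustration using Python's built-in sqlite3, not a reference implementation: any column not named in the text above (for example the link's own hash key and the street and city fields) is an assumption, and load_hub_customer is a hypothetical helper showing the insert-only rule for hubs.

```python
import sqlite3

# Hypothetical minimal raw-vault schema. Columns not named in the article
# (link_order_hash, street, city) are assumptions for illustration.
DDL = """
CREATE TABLE hub_customer (
    customer_hash  TEXT PRIMARY KEY,      -- hash of the business key
    customer_id    TEXT NOT NULL UNIQUE,  -- the business key itself
    load_timestamp TEXT NOT NULL,
    record_source  TEXT NOT NULL
);
CREATE TABLE hub_product (
    product_hash   TEXT PRIMARY KEY,
    product_id     TEXT NOT NULL UNIQUE,
    load_timestamp TEXT NOT NULL,
    record_source  TEXT NOT NULL
);
CREATE TABLE hub_order (
    order_hash     TEXT PRIMARY KEY,
    order_id       TEXT NOT NULL UNIQUE,
    load_timestamp TEXT NOT NULL,
    record_source  TEXT NOT NULL
);
CREATE TABLE link_order (
    link_order_hash   TEXT PRIMARY KEY,
    hub_customer_hash TEXT NOT NULL REFERENCES hub_customer(customer_hash),
    hub_product_hash  TEXT NOT NULL REFERENCES hub_product(product_hash),
    hub_order_hash    TEXT NOT NULL REFERENCES hub_order(order_hash),
    load_timestamp    TEXT NOT NULL,
    record_source     TEXT NOT NULL,
    UNIQUE (hub_customer_hash, hub_product_hash, hub_order_hash)
);
CREATE TABLE sat_customer_address (
    customer_hash  TEXT NOT NULL REFERENCES hub_customer(customer_hash),
    load_timestamp TEXT NOT NULL,
    record_source  TEXT NOT NULL,
    street         TEXT,
    city           TEXT,
    PRIMARY KEY (customer_hash, load_timestamp)  -- one row per version
);
"""


def load_hub_customer(conn, customer_hash, customer_id, load_ts, source):
    """Insert-only hub load: a business key already present is left untouched."""
    conn.execute(
        "INSERT OR IGNORE INTO hub_customer VALUES (?, ?, ?, ?)",
        (customer_hash, customer_id, load_ts, source),
    )


conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

# The same customer arrives from two sources; only the first load creates a row.
load_hub_customer(conn, "h42", "42", "2025-01-01T00:00:00", "crm")
load_hub_customer(conn, "h42", "42", "2025-02-01T00:00:00", "billing")

hub_rows = conn.execute(
    "SELECT customer_id, record_source FROM hub_customer"
).fetchall()
print(hub_rows)
```

The second load is ignored because the hub already holds that business key, which is what "one row per business key, ever" means in practice.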

The case for data vault. Governance: every value in the warehouse can be traced to a load_timestamp and a record_source, supporting full audit. Compliance: regulated industries (banking, pharma, healthcare) need provable history of every value, which vault provides natively. Parallel ingestion: multiple teams can load into different hubs, links, and satellites independently because the structure is append-only and there are no cross-team joins at ingest time. Schema evolution: adding a new source system means adding new satellites (and new hubs or links only where the source introduces new business keys or relationships) without modifying existing structures.

The case against data vault. Query complexity: a business question that spans a relationship typically becomes a join across roughly six tables (hub + link + satellite for both sides of a relationship, plus point-in-time logic to select the right satellite version). Analysts rarely query the raw vault directly. Storage cost: the append-only structure can produce 10-100x the storage of a Kimball star that updates dimensions in place; the multiplier depends on how often attributes change. Operational complexity: hash key generation, load_timestamp standardization, and satellite splitting (one satellite per source system per attribute group) are non-trivial.
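To make the join cost concrete, here is a hedged sketch of one such query in sqlite3. The table sat_product_details, the sample rows, and the choice to use the link's load_timestamp as the "as of" moment are all assumptions for illustration; hub columns are trimmed to the minimum needed for the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Trimmed, hypothetical raw-vault tables (audit columns omitted on hubs).
CREATE TABLE hub_customer (customer_hash TEXT PRIMARY KEY, customer_id TEXT);
CREATE TABLE hub_product  (product_hash  TEXT PRIMARY KEY, product_id  TEXT);
CREATE TABLE hub_order    (order_hash    TEXT PRIMARY KEY, order_id    TEXT);
CREATE TABLE link_order (
    hub_customer_hash TEXT, hub_product_hash TEXT,
    hub_order_hash TEXT, load_timestamp TEXT
);
CREATE TABLE sat_customer_address (customer_hash TEXT, load_timestamp TEXT, city TEXT);
CREATE TABLE sat_product_details  (product_hash  TEXT, load_timestamp TEXT, name TEXT);

INSERT INTO hub_customer VALUES ('h42', '42');
INSERT INTO hub_product  VALUES ('hp1', 'P1');
INSERT INTO hub_order    VALUES ('ho1', 'O1'), ('ho2', 'O2');
INSERT INTO link_order VALUES
    ('h42', 'hp1', 'ho1', '2025-03-01T00:00:00'),
    ('h42', 'hp1', 'ho2', '2025-06-01T00:00:00');
INSERT INTO sat_customer_address VALUES
    ('h42', '2025-01-10T00:00:00', 'Austin'),
    ('h42', '2025-04-02T00:00:00', 'Denver');
INSERT INTO sat_product_details VALUES
    ('hp1', '2025-01-01T00:00:00', 'Widget'),
    ('hp1', '2025-05-01T00:00:00', 'Widget Pro');
""")

# "Which city and product name applied to each order?" touches six tables:
# three hubs, one link, and one satellite per side, each satellite filtered
# to the version in effect when the link row was loaded (an assumption).
QUERY = """
SELECT o.order_id, c.customer_id, ca.city, pd.name
FROM link_order l
JOIN hub_customer c ON c.customer_hash = l.hub_customer_hash
JOIN hub_product  p ON p.product_hash  = l.hub_product_hash
JOIN hub_order    o ON o.order_hash    = l.hub_order_hash
JOIN sat_customer_address ca
  ON ca.customer_hash = c.customer_hash
 AND ca.load_timestamp = (
        SELECT MAX(x.load_timestamp) FROM sat_customer_address x
        WHERE x.customer_hash = c.customer_hash
          AND x.load_timestamp <= l.load_timestamp)
JOIN sat_product_details pd
  ON pd.product_hash = p.product_hash
 AND pd.load_timestamp = (
        SELECT MAX(y.load_timestamp) FROM sat_product_details y
        WHERE y.product_hash = p.product_hash
          AND y.load_timestamp <= l.load_timestamp)
ORDER BY o.order_id
"""

order_rows = conn.execute(QUERY).fetchall()
for row in order_rows:
    print(row)
```

The same question against a star schema would usually be a fact table joined to two dimensions, which is the gap the downstream marts are meant to close.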

The pragmatic solution most data vault shops use is to build a downstream business vault or Kimball-style data mart on top of the raw vault. The raw vault sits in the silver layer with full audit. The downstream marts sit in the gold layer with star schemas optimized for query. This pattern shows up at large enterprises (banking, healthcare, government) and at companies with regulatory requirements that Kimball alone cannot satisfy.
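A downstream mart can be as simple as a view that flattens a hub and its satellite into a dimension. The sketch below assumes a hypothetical dim_customer view that exposes only the current satellite version; a real mart would more likely be a materialized table, possibly with SCD Type 2 history.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Raw vault (silver): hub plus append-only satellite. Sample rows are invented.
CREATE TABLE hub_customer (
    customer_hash TEXT PRIMARY KEY, customer_id TEXT,
    load_timestamp TEXT, record_source TEXT
);
CREATE TABLE sat_customer_address (
    customer_hash TEXT, load_timestamp TEXT, city TEXT
);
INSERT INTO hub_customer VALUES
    ('h42', '42', '2025-01-01T00:00:00', 'crm'),
    ('h43', '43', '2025-01-01T00:00:00', 'crm');
INSERT INTO sat_customer_address VALUES
    ('h42', '2025-01-10T00:00:00', 'Austin'),
    ('h42', '2025-04-02T00:00:00', 'Denver'),
    ('h43', '2025-02-01T00:00:00', 'Boston');

-- Mart (gold): one row per customer, carrying the latest satellite version.
CREATE VIEW dim_customer AS
SELECT h.customer_id, s.city, s.load_timestamp AS valid_from
FROM hub_customer h
JOIN sat_customer_address s ON s.customer_hash = h.customer_hash
WHERE s.load_timestamp = (
    SELECT MAX(s2.load_timestamp)
    FROM sat_customer_address s2
    WHERE s2.customer_hash = h.customer_hash
);
""")

# Analysts query the dimension and never see hash keys or satellite versions.
dim_rows = conn.execute(
    "SELECT customer_id, city FROM dim_customer ORDER BY customer_id"
).fetchall()
print(dim_rows)
```

The raw tables keep every version for audit, while the view hides the versioning logic from analysts.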

When a data engineer interviewer asks "Kimball or vault" for a specific domain, the default answer in 2026 is Kimball unless the domain has explicit audit, compliance, or multi-team parallel-ingest requirements. Banking warehouses often use vault (regulatory audit). Healthcare claims warehouses often use vault (HIPAA audit). Pharma trial data often uses vault (FDA audit). Most tech-company analytics warehouses use Kimball (no regulatory audit, single team ingest, query-first optimization). The senior data engineer signal is naming the trade-off explicitly and picking based on the domain's requirements rather than reaching for a default.

Data Vault Interview Questions

Data vault 2.0 modeling problems for data engineer interview prep.


Common questions

What are the three components of a data vault model?
Hubs hold business keys (hub_customer with customer_hash, customer_id, load_timestamp, record_source). One row per business key. Links hold relationships between hubs (link_order with hub_customer_hash, hub_product_hash, hub_order_hash). One row per unique combination. Satellites hold descriptive attributes with full versioning (sat_customer_address with customer_hash, load_timestamp, address fields). Every change is a new satellite row.
When does a data engineer pick data vault over Kimball star schema?
When the domain has explicit audit, compliance, or multi-team-parallel-ingest requirements. Banking warehouses (regulatory audit). Healthcare claims warehouses (HIPAA). Pharma trial data (FDA). Multi-source enterprise warehouses with many ingest teams loading in parallel. Most tech-company analytics warehouses do not need vault and pick Kimball.
What is the main downside of data vault?
Query complexity. A business question that spans a relationship typically becomes a join across roughly six tables (hub plus link plus satellite for both sides, plus point-in-time logic to pick the right satellite version). Analysts rarely query the raw vault directly. The pragmatic solution is to build a downstream business vault or Kimball-style data mart on top of the raw vault for query. The raw vault sits in silver; the marts sit in gold.
Why does data vault produce 10-100x storage versus Kimball?
Append-only satellites mean every attribute change becomes a new row instead of an in-place update. A customer's address changing 5 times produces 5 satellite rows in vault and 1 dim_customer row updated in place in Kimball (or 5 rows with SCD Type 2). Multiply across all attributes and customers and the storage multiplier grows. Vault accepts this cost in exchange for governance and audit.
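The five-address example can be reproduced with a small append-only loader. This is a sketch under stated assumptions: load_address is a hypothetical helper, the single address column stands in for the address fields, and change detection compares against the latest stored row (many implementations use a hash of the attributes instead).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sat_customer_address (
        customer_hash  TEXT NOT NULL,
        load_timestamp TEXT NOT NULL,
        record_source  TEXT NOT NULL,
        address        TEXT NOT NULL,
        PRIMARY KEY (customer_hash, load_timestamp)
    )
""")


def load_address(conn, customer_hash, load_ts, source, address):
    """Append a new version only when the address differs from the latest one.

    Returns True if a row was inserted. Existing rows are never updated.
    """
    latest = conn.execute(
        "SELECT address FROM sat_customer_address "
        "WHERE customer_hash = ? ORDER BY load_timestamp DESC LIMIT 1",
        (customer_hash,),
    ).fetchone()
    if latest is not None and latest[0] == address:
        return False
    conn.execute(
        "INSERT INTO sat_customer_address VALUES (?, ?, ?, ?)",
        (customer_hash, load_ts, source, address),
    )
    return True


# Invented sample data: one customer, five successive addresses.
addresses = ["1 Elm St", "2 Oak Ave", "3 Pine Rd", "4 Birch Ln", "5 Cedar Ct"]
for day, addr in enumerate(addresses, start=1):
    load_address(conn, "h42", f"2025-01-{day:02d}T00:00:00", "crm", addr)

# Re-delivering the current address does not create a sixth version.
inserted_again = load_address(conn, "h42", "2025-01-06T00:00:00", "crm", "5 Cedar Ct")

row_count = conn.execute("SELECT COUNT(*) FROM sat_customer_address").fetchone()[0]
print(row_count, inserted_again)
```

Five changes leave five rows in the satellite, where an in-place update would leave one.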
What is a business vault and how does it differ from raw vault?
Business vault is a derived layer on top of the raw vault that applies business rules and pre-computes derived satellites and bridge tables, but stays in the hub-link-satellite structure. It sits between the raw vault and any downstream Kimball mart. The business vault separates raw source data (raw vault) from derived business interpretations (business vault) so changes to business rules do not corrupt the audit trail of source data.
How does point-in-time correctness work in data vault?
Satellites use load_timestamp as the version key. A query for 'customer 42's address as of 2025-05-15' filters sat_customer_address on that customer's hash key with WHERE load_timestamp <= '2025-05-15' ORDER BY load_timestamp DESC LIMIT 1. PIT (point-in-time) tables are often built as a derived layer that pre-computes the latest version per entity per snapshot date, avoiding the filter-and-sort at query time.
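A minimal sketch of both ideas, assuming ISO-8601 text timestamps (which sort correctly as strings) and invented sample rows. address_as_of, build_pit, and the pit_customer table are hypothetical names, not part of any standard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sat_customer_address "
    "(customer_hash TEXT, load_timestamp TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO sat_customer_address VALUES (?, ?, ?)",
    [
        ("h42", "2025-01-10T00:00:00", "Austin"),
        ("h42", "2025-04-02T00:00:00", "Denver"),
        ("h42", "2025-06-20T00:00:00", "Boston"),
    ],
)


def address_as_of(conn, customer_hash, as_of):
    """Return (city, load_timestamp) of the version in effect at as_of, or None."""
    return conn.execute(
        "SELECT city, load_timestamp FROM sat_customer_address "
        "WHERE customer_hash = ? AND load_timestamp <= ? "
        "ORDER BY load_timestamp DESC LIMIT 1",
        (customer_hash, as_of),
    ).fetchone()


def build_pit(conn, snapshot_timestamps):
    """Pre-compute, per snapshot, which satellite version each entity points to."""
    conn.execute(
        "CREATE TABLE pit_customer "
        "(customer_hash TEXT, snapshot_ts TEXT, sat_load_timestamp TEXT)"
    )
    for ts in snapshot_timestamps:
        conn.execute(
            "INSERT INTO pit_customer "
            "SELECT customer_hash, ?, MAX(load_timestamp) "
            "FROM sat_customer_address WHERE load_timestamp <= ? "
            "GROUP BY customer_hash",
            (ts, ts),
        )


as_of_row = address_as_of(conn, "h42", "2025-05-15T23:59:59")
before_first = address_as_of(conn, "h42", "2025-01-01T00:00:00")

build_pit(conn, ["2025-03-31T23:59:59", "2025-05-15T23:59:59"])
pit_rows = conn.execute(
    "SELECT * FROM pit_customer ORDER BY snapshot_ts"
).fetchall()
print(as_of_row, before_first, pit_rows)
```

With the PIT table in place, a query joins on (customer_hash, sat_load_timestamp) with plain equality instead of sorting satellite versions each time.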
What is a hash key in data vault?
A hash of the business key (SHA-256 or MD5 of customer_id, for example) used as the join key between hubs, links, and satellites. Hash keys enable parallel ingestion: any team can compute the hash from the business key without coordinating with other teams. Hash keys also enable schema-agnostic joins: the same hash key works whether the business key is a single column or a composite.
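A hash key function might look like the sketch below. The normalization steps (trimming, upper-casing) and the "||" delimiter for composite keys are assumptions reflecting common practice rather than anything stated above; what matters is that every loading team applies the same rules.

```python
import hashlib


def hash_key(*business_key_parts, delimiter="||"):
    """Derive a deterministic hash key from one or more business key columns.

    Assumed normalization: cast to str, strip whitespace, upper-case, then
    join composite parts with a delimiter so ("a", "b") differs from ("ab",).
    """
    normalized = delimiter.join(str(part).strip().upper() for part in business_key_parts)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Two teams computing the key independently get the same value,
# even if one source pads the identifier with whitespace.
crm_key = hash_key("42")
billing_key = hash_key(" 42 ")

# Composite business keys use the same function.
order_line_key = hash_key("O1", "P1")

print(crm_key == billing_key, len(order_line_key))
```

Because the function depends only on the business key, no lookup against a shared sequence or surrogate-key table is needed at load time.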
Is data vault appropriate for most 2026 data engineer roles?
No. Most 2026 data engineer roles at tech companies use Kimball star schemas. Data vault is the right answer when the domain requires governance, audit, and multi-team parallel ingestion. The senior data engineer signal is knowing when to pick vault and when not to. The wrong signal is defaulting to vault for everything (over-engineering) or defaulting to Kimball for everything (under-considering compliance needs).