Question 1

What are the three components of a data vault model?

Accepted Answer

Hubs hold business keys (hub_customer with customer_hash, customer_id, load_timestamp, record_source). One row per natural key. Links hold relationships between hubs (link_order with hub_customer_hash, hub_product_hash, hub_order_hash). One row per unique combination of hub keys. Satellites hold descriptive attributes with full versioning (sat_customer_address with customer_hash, load_timestamp, address fields). Every change is a new satellite row.
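The three table types can be sketched as a small schema. This is a minimal illustration using Python's built-in sqlite3 module, not a production design: the table names come from the answer above, while the city column, the helper functions load_customer and add_address_version, and the choice of primary keys are assumptions made for the example.

```python
import sqlite3

# Hypothetical minimal schema; real vaults usually carry more columns.
DDL = """
CREATE TABLE hub_customer (
    customer_hash  TEXT PRIMARY KEY,      -- hash of the business key
    customer_id    TEXT NOT NULL UNIQUE,  -- natural (business) key
    load_timestamp TEXT NOT NULL,
    record_source  TEXT NOT NULL
);
CREATE TABLE link_order (
    hub_customer_hash TEXT NOT NULL,
    hub_product_hash  TEXT NOT NULL,
    hub_order_hash    TEXT NOT NULL,
    load_timestamp    TEXT NOT NULL,
    record_source     TEXT NOT NULL,
    PRIMARY KEY (hub_customer_hash, hub_product_hash, hub_order_hash)
);
CREATE TABLE sat_customer_address (
    customer_hash  TEXT NOT NULL,
    load_timestamp TEXT NOT NULL,
    city           TEXT,                  -- stand-in for the address fields
    record_source  TEXT NOT NULL,
    PRIMARY KEY (customer_hash, load_timestamp)
);
"""


def build_vault():
    """Create an in-memory database holding one hub, one link, one satellite."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    return conn


def load_customer(conn, customer_hash, customer_id, load_ts, source):
    """Insert a hub row; a repeat of the same business key is ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO hub_customer VALUES (?, ?, ?, ?)",
        (customer_hash, customer_id, load_ts, source),
    )


def add_address_version(conn, customer_hash, load_ts, city, source):
    """Append a new satellite row; existing rows are never updated."""
    conn.execute(
        "INSERT INTO sat_customer_address VALUES (?, ?, ?, ?)",
        (customer_hash, load_ts, city, source),
    )
```

Loading the same customer twice leaves one hub row, while each address change adds a satellite row keyed by (customer_hash, load_timestamp).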

Question 2

When does a data engineer pick data vault over Kimball star schema?

Accepted Answer

When the domain has explicit audit, compliance, or multi-team parallel ingestion requirements. Typical examples: banking warehouses (regulatory audit), healthcare claims warehouses (HIPAA), pharma trial data (FDA), and multi-source enterprise warehouses where many ingest teams load in parallel. Most tech-company analytics warehouses do not need vault and pick Kimball.

Question 3

What is the main downside of data vault?

Accepted Answer

Query complexity. Even a simple question relating two entities typically joins five or six tables (a hub and a satellite on each side, the link between them, and often a link satellite), plus point-in-time logic to pick the right satellite version. Analysts rarely query the raw vault directly. The pragmatic solution is to build a downstream business vault or Kimball-style data mart on top of the raw vault for querying. The raw vault sits in silver; the marts sit in gold.
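To make the join cost concrete, here is a rough sketch of a "which customer, in which city, bought which product" query against a tiny vault, using Python's sqlite3 module. The schema is simplified and hypothetical: hub_product, sat_product_detail, the two-column link, and all sample rows are assumptions for illustration. The query joins a hub and a satellite on each side plus the link, and needs a subquery per satellite to select the latest version.

```python
import sqlite3

# Simplified, hypothetical schema: two hubs, one link, two satellites.
SCHEMA = """
CREATE TABLE hub_customer (customer_hash TEXT PRIMARY KEY, customer_id TEXT);
CREATE TABLE hub_product  (product_hash  TEXT PRIMARY KEY, product_id  TEXT);
CREATE TABLE link_order   (customer_hash TEXT, product_hash TEXT);
CREATE TABLE sat_customer_address (customer_hash TEXT, load_timestamp TEXT, city TEXT);
CREATE TABLE sat_product_detail   (product_hash  TEXT, load_timestamp TEXT, name TEXT);

INSERT INTO hub_customer VALUES ('c1', '42');
INSERT INTO hub_product  VALUES ('p1', 'SKU-1');
INSERT INTO link_order   VALUES ('c1', 'p1');
INSERT INTO sat_customer_address VALUES ('c1', '2025-01-01', 'Austin');
INSERT INTO sat_customer_address VALUES ('c1', '2025-03-01', 'Denver');
INSERT INTO sat_product_detail   VALUES ('p1', '2025-01-01', 'Widget');
INSERT INTO sat_product_detail   VALUES ('p1', '2025-02-01', 'Widget v2');
"""

# Five tables plus two "latest version" subqueries for one simple question.
QUERY = """
SELECT hc.customer_id, sc.city, hp.product_id, sp.name
FROM link_order l
JOIN hub_customer hc ON hc.customer_hash = l.customer_hash
JOIN hub_product  hp ON hp.product_hash  = l.product_hash
JOIN sat_customer_address sc ON sc.customer_hash = hc.customer_hash
JOIN sat_product_detail   sp ON sp.product_hash  = hp.product_hash
WHERE sc.load_timestamp = (SELECT MAX(load_timestamp)
                           FROM sat_customer_address
                           WHERE customer_hash = hc.customer_hash)
  AND sp.load_timestamp = (SELECT MAX(load_timestamp)
                           FROM sat_product_detail
                           WHERE product_hash = hp.product_hash)
"""


def latest_order_view():
    """Build the sample vault and return the current customer/product rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn.execute(QUERY).fetchall()
```

A downstream mart would typically materialize a result like this once, so analysts query a single wide table instead of repeating the join.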

Question 4

Why does data vault produce 10-100x storage versus Kimball?

Accepted Answer

Append-only satellites mean every attribute change becomes a new row instead of an in-place update. A customer with 5 address versions produces 5 satellite rows in vault, versus 1 dim_customer row updated in place in Kimball with SCD Type 1 (SCD Type 2 would also keep 5 rows). Multiply across all attributes and customers and the storage multiplier grows; the actual factor depends on how often attributes change. Vault accepts this cost in exchange for governance and audit.
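The row-count difference can be shown with a toy model. This is a sketch in plain Python, with in-memory structures standing in for tables; the function name apply_address_changes, the fixed key "c42", and the sample addresses are assumptions for illustration only.

```python
def apply_address_changes(versions):
    """Apply (load_timestamp, address) pairs to both storage styles.

    Returns (satellite_rows, dim_customer): an append-only list that keeps
    every version, and a dict that is overwritten in place (SCD Type 1).
    """
    satellite_rows = []  # vault satellite: one row per change
    dim_customer = {}    # Kimball SCD Type 1: one row per customer
    for load_ts, address in versions:
        # Only append when the value actually changed since the last version.
        if not satellite_rows or satellite_rows[-1]["address"] != address:
            satellite_rows.append(
                {"customer_hash": "c42", "load_timestamp": load_ts, "address": address}
            )
        dim_customer["c42"] = {"address": address}
    return satellite_rows, dim_customer


history = [
    ("2025-01-01", "1 Elm St"),
    ("2025-02-01", "2 Oak Ave"),
    ("2025-03-01", "3 Pine Rd"),
    ("2025-04-01", "4 Maple Ln"),
    ("2025-05-01", "5 Birch Ct"),
]
sat_rows, dim = apply_address_changes(history)
```

With the five versions above, sat_rows holds five rows and dim holds one, which is the trade the answer describes: more storage in exchange for a complete history.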

Question 5

What is a business vault and how does it differ from raw vault?

Accepted Answer

Business vault is a derived layer on top of the raw vault that pre-computes business rules, derived satellites, and bridges, but stays in the hub-link-satellite structure. Sits between raw vault and any downstream Kimball mart. The business vault separates raw source data (raw vault) from derived business interpretations (business vault) so changes to business rules do not corrupt the audit trail of source data.

Question 6

How does point-in-time correctness work in data vault?

Accepted Answer

Satellites use load_timestamp as the version key. A query for 'customer 42's address as of 2025-05-15' filters sat_customer_address on that customer's hash key with WHERE load_timestamp <= '2025-05-15' ORDER BY load_timestamp DESC LIMIT 1. PIT (point-in-time) tables are often built as a derived layer that pre-computes the latest version per entity per snapshot date, avoiding the filter-and-sort at query time.
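The as-of lookup described above can be sketched with Python's sqlite3 module. The table name comes from the answer; the city column, the hash value 'c42', the sample rows, and the helper names build_satellite and address_as_of are hypothetical. Timestamps are stored as ISO-8601 text, which is an assumption that makes string comparison match chronological order.

```python
import sqlite3


def build_satellite():
    """Create an in-memory satellite with three address versions."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sat_customer_address ("
        "customer_hash TEXT, load_timestamp TEXT, city TEXT)"
    )
    conn.executemany(
        "INSERT INTO sat_customer_address VALUES (?, ?, ?)",
        [
            ("c42", "2025-01-10T00:00:00", "Austin"),
            ("c42", "2025-04-02T00:00:00", "Denver"),
            ("c42", "2025-06-20T00:00:00", "Boston"),
        ],
    )
    return conn


def address_as_of(conn, customer_hash, as_of):
    """Return the city that was current at as_of, or None if none existed yet."""
    row = conn.execute(
        """
        SELECT city
        FROM sat_customer_address
        WHERE customer_hash = ? AND load_timestamp <= ?
        ORDER BY load_timestamp DESC
        LIMIT 1
        """,
        (customer_hash, as_of),
    ).fetchone()
    return row[0] if row else None
```

For example, address_as_of(build_satellite(), "c42", "2025-05-15T00:00:00") picks the April version, because the June row was loaded after the as-of date. A PIT table would store the result of this lookup per entity and snapshot date so it does not have to be recomputed per query.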

Question 7

What is a hash key in data vault?

Accepted Answer

A hash of the business key (SHA-256 or MD5 of customer_id, for example) used as the join key between hubs, links, and satellites. Hash keys enable parallel ingestion: any team can compute the hash from the business key without coordinating with other teams. Hash keys also enable schema-agnostic joins: the same hash key works whether the business key is a single column or a composite.
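A hash key computation might look like the following sketch, using Python's standard hashlib. The function name hash_key, the "||" delimiter, and the trim-and-uppercase normalization are assumptions for illustration; whatever convention is chosen, every loading team has to apply the same one for the hashes to match.

```python
import hashlib


def hash_key(*business_key_parts, delimiter="||"):
    """Compute a deterministic SHA-256 hash key from one or more key parts.

    Parts are trimmed and uppercased (an assumed normalization rule) and
    joined with a delimiter so that ("a", "b") and ("ab",) hash differently.
    """
    normalized = delimiter.join(
        str(part).strip().upper() for part in business_key_parts
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Single-column business key, e.g. customer_id.
customer_hash = hash_key("42")

# Composite business key, e.g. (order_number, line_number).
order_line_hash = hash_key("SO-1001", 3)
```

Because the function depends only on the business key, two teams loading the same customer from different sources arrive at the same customer_hash without sharing a sequence or a lookup table.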

Question 8

Is data vault appropriate for most 2026 data engineer roles?

Accepted Answer

No. Most 2026 data engineer roles at tech companies use Kimball star schemas. Data vault is the right answer when the domain requires governance, audit, and multi-team parallel ingestion. The senior data engineer signal is knowing when to pick vault and when not to. The wrong signal is defaulting to vault for everything (over-engineering) or defaulting to Kimball for everything (under-considering compliance needs).

Data Vault Interview Questions
