Data Vault Interview Questions
Data Vault Interview Questions
Data vault 2.0 modeling problems for data engineer interview prep.
Data vault 2.0 interview questions for data engineer roles. Hubs for business keys, links for relationships between hubs, satellites for descriptive attributes with full versioning. The case for vault is governance, audit trail, and parallel ingestion. The case against vault is query complexity (every business question becomes a 6-table join).
Data vault 2.0 is an alternative to Kimball star schemas in data warehouse design. The vault structure separates the model into three concepts. Hubs hold business keys: hub_customer with customer_id, load_timestamp, record_source; hub_product with product_id, load_timestamp, record_source. One row per natural key, ever. Links hold relationships between hubs: link_order with hub_customer_hash, hub_product_hash, hub_order_hash, load_timestamp. One row per unique combination of hubs, ever. Satellites hold descriptive attributes with full versioning: sat_customer_address with customer_hash, load_timestamp, address fields. Every change to an attribute is a new satellite row; nothing is ever updated.
The case for data vault. Governance: every value in the warehouse can be traced to a load_timestamp and a record_source, supporting full audit. Compliance: regulated industries (banking, pharma, healthcare) need provable history of every value, which vault provides natively. Parallel ingestion: multiple teams can load into different hubs, links, and satellites independently because the structure is append-only and there are no cross-team joins at ingest time. Schema evolution: adding a new source system means adding new hubs, links, and satellites without modifying existing structures.
The case against data vault. Query complexity: every business question becomes a 6-table join (hub + link + satellite for both sides of a relationship, plus point-in-time logic to select the right satellite version). Analysts cannot query the raw vault directly. Storage cost: the append-only structure produces 10-100x the storage of a Kimball star for the same data. Operational complexity: hash key generation, load_timestamp standardization, satellite splitting (one satellite per source system per attribute group) are non-trivial.
The pragmatic solution most data vault shops use is to build a downstream business vault or Kimball-style data mart on top of the raw vault. The raw vault sits in the silver layer with full audit. The downstream marts sit in the gold layer with star schemas optimized for query. This pattern shows up at large enterprises (banking, healthcare, government) and at companies with regulatory requirements that Kimball alone cannot satisfy.
When a data engineer interviewer asks "Kimball or vault" for a specific domain, the default answer in 2026 is Kimball unless the domain has explicit audit, compliance, or multi-team-parallel-ingest requirements. Banking warehouses often use vault (regulatory audit). Healthcare claims warehouses often use vault (HIPAA audit). Pharma trial data often uses vault (FDA audit). Most tech-company analytics warehouses use Kimball (no regulatory audit, single team ingest, query-first optimization). The senior data engineer signal is naming the trade-off explicitly and picking based on the domain's requirements, not the default.
- What are the three components of a data vault model?
- Hubs hold business keys (hub_customer with customer_id, load_timestamp, record_source). One row per natural key. Links hold relationships between hubs (link_order with hub_customer_hash, hub_product_hash, hub_order_hash). One row per unique combination. Satellites hold descriptive attributes with full versioning (sat_customer_address with customer_hash, load_timestamp, address fields). Every change is a new satellite row.
- When does a data engineer pick data vault over Kimball star schema?
- When the domain has explicit audit, compliance, or multi-team-parallel-ingest requirements. Banking warehouses (regulatory audit). Healthcare claims warehouses (HIPAA). Pharma trial data (FDA). Multi-source enterprise warehouses with many ingest teams loading in parallel. Most tech-company analytics warehouses do not need vault and pick Kimball.
- What is the main downside of data vault?
- Query complexity. Every business question becomes a 6-table join (hub plus link plus satellite for both sides, plus point-in-time logic to pick the right satellite version). Analysts cannot query the raw vault directly. The pragmatic solution is to build a downstream business vault or Kimball-style data mart on top of the raw vault for query. The raw vault sits in silver; the marts sit in gold.
- Why does data vault produce 10-100x storage versus Kimball?
- Append-only satellites mean every attribute change becomes a new row instead of an in-place update. A customer's address changing 5 times produces 5 satellite rows in vault and 1 dim_customer row updated in place in Kimball (or 5 rows with SCD Type 2). Multiply across all attributes and customers and the storage multiplier grows. Vault accepts this cost in exchange for governance and audit.
- What is a business vault and how does it differ from raw vault?
- Business vault is a derived layer on top of the raw vault that pre-computes business rules, derived satellites, and bridges, but stays in the hub-link-satellite structure. Sits between raw vault and any downstream Kimball mart. The business vault separates raw source data (raw vault) from derived business interpretations (business vault) so changes to business rules do not corrupt the audit trail of source data.
- How does point-in-time correctness work in data vault?
- Satellites use load_timestamp as the version key. A query for 'customer 42's address as of 2025-05-15' joins sat_customer_address with WHERE load_timestamp less-than-or-equal-to '2025-05-15' ORDER BY load_timestamp DESC LIMIT 1. PIT (point-in-time) tables are often built as a derived layer that pre-computes the latest version per entity per snapshot date, avoiding the join-and-sort at query time.
- What is a hash key in data vault?
- A hash of the business key (SHA-256 or MD5 of customer_id, for example) used as the join key between hubs, links, and satellites. Hash keys enable parallel ingestion: any team can compute the hash from the business key without coordinating with other teams. Hash keys also enable schema-agnostic joins: the same hash key works whether the business key is a single column or a composite.
- Is data vault appropriate for most 2026 data engineer roles?
- No. Most 2026 data engineer roles at tech companies use Kimball star schemas. Data vault is the right answer when the domain requires governance, audit, and multi-team parallel ingestion. The senior data engineer signal is knowing when to pick vault and when not to. The wrong signal is defaulting to vault for everything (over-engineering) or defaulting to Kimball for everything (under-considering compliance needs).
62 practice problems matching this filter. Difficulty: medium (35), easy (8), hard (19).
Data Modeling (62)
- Split Decision - medium - One user, one experiment, one variant. No exceptions.
- Where They Used to Live - medium - They moved. The data stayed behind.
- Airline Flight Operations Schema - medium - Flights, passengers, and routes. Before you draw a single table, tell me the grain.
- A Number for the Seller - easy - They want a total. Give them the right schema first.
- B2B Invoicing Data Model - easy - Invoices go out, partial payments trickle in, and some customers are three months overdue.
- Clickstream and Session Schema - medium - Millions of clicks, mostly anonymous.
- Cloud File Storage Metadata Schema - hard - A file is also a folder. A folder is also a file.
- Content Engagement Data Model - hard - Post published. Now measure everything that happens next.
- Content Search and Discovery Schema - hard - Searchable from every angle. Design it so nothing gets lost.
- Customer Address History - easy - People move. Sometimes twice in a month. How do you remember where everyone was, and when?
- E-Commerce Supply Chain Tracking - hard - A package splits, reroutes, and (maybe) arrives.
- EdTech Classroom Engagement Schema - medium - They opened the assignment. Did they actually read it?
- Employee Application Time Tracking - medium - Every minute tracked. Every app accounted for.
- Employee Transfer Tracking System - medium - People switch teams. HR loses track.
- Event Ticketing System Data Model - easy - JSON in. Reporting warehouse out. Design both ends.
- Held to Account - medium
- Financial Trading Warehouse - hard - Every trade, every tick, every fraction of a share. The regulators want receipts.
- Fitness App Data Model - easy - Reps, sets, streaks, and personal bests. Gym rats love their stats.
- The No-Show - easy - Every reserved seat ends one of five ways. Build the model that can tell them apart.
- Food Truck Operations Data Model - medium - Mobile vendor, fixed menu, unpredictable locations.
- Housing Marketplace Analytics - medium - Sellers want buyers. Buyers want deals.
- Insurance Claims Lifecycle - hard - A claim gets filed. Then it gets complicated. Then it gets reassigned. Then it loops back.
- Livestream Analytics Schema - medium - Someone goes live, thousands tune in, chat explodes, and virtual gifts start flying.
- Loan Application Reporting Schema - medium - Approved, declined, or pending. Design the tables that say so.
- The Balance Always Reconciles - easy - Money out, payments back. The balance has to be exact.
- Log Parsing Pipeline Schema - medium - Raw text files, terabytes of them, full of buried signals and cryptic error codes.
- The Shape of a Run - medium - Two log lines bracket every process. Pair them and the fleet's rhythm appears.
- Marketplace Sales Warehouse - hard - No schema given. The interviewer is watching.
- Metric Definition Reverse Engineering - hard - Five numbers on a dashboard. Your job: figure out where they come from.
- Movie Streaming Analytics Schema - medium - They pressed play. What happened next is the whole question.
- Multiplayer Game Match History - medium - Millions of matches. The leaderboard refreshes in fifteen minutes.
- Online Marketplace - Seller Payouts - hard - The buyer paid one number. The seller got a different one.
- The Retail Blueprint - medium - One business. A thousand transactions. Only one layout survives the analytics layer.
- The Last Mile - medium - Order placed. Now track it to the door.
- POS Sales Data Warehouse - medium - Every beep at the register. Coupons, returns, all of it.
- Property Booking Platform - hard - Five-star listing. Three-star reality.
- Retailer Data Warehouse Design - medium - Queries are crawling. The analysts are not happy.
- Ride-Sharing Platform Schema - medium - Riders, drivers, and fares. Everyone takes a cut.
- The Sales Architecture - medium - Numbers are easy. Making them queryable at scale is the real job.
- The Customer Who Changed - hard - She moved. She upgraded. She became someone new. The record has to keep up.
- Social Platform Data Model - medium - Follows, likes, replies to replies. It never stops.
- Two Wallets - medium - Two user types. Multiple payment methods. One messy billing table.
- Subscription Churn Analysis Model - medium - Subscribers are leaving. The data knows why.
- The Heat of the Map - hard
- Telecom Network Connectivity Warehouse - hard - One device goes down. The ripple keeps going.
- The Celebrity Problem - medium - One post. A million notifications. Something has to give.
- The Churner Who Came Back - hard - They cancelled. They came back. The report has to tell both stories correctly.
- The Handoff - hard
- The JSON Files That Became a Data Mart - medium - Three semi-structured inputs. One queryable warehouse.
- The League With Too Many Loyalties - hard - A player can belong to many teams. The schema must agree.
- The Plan That Changed Twice This Month - medium - Subscribers come, go, downgrade, and share. The schema has to keep up.
- The Retail Tables That Need a New Home - medium - A working system. Now redesign it so the analysts can actually use it.
- The Schema That Could Not Answer Back - hard - Forty columns in. Zero useful answers out.
- The Table That Lies - medium - Every query comes out wrong. The data is all there.
- The Talent Funnel - medium - Thousands applied. One accepted. Where did the rest go?
- The Territory That Keeps Moving - hard - Reps get reassigned. The receipts have to survive.
- The Transfer Request - medium - Apply, wait, get approved or denied. Track all of it.
- Three-Sided Marketplace Delivery Schema - hard - One order. Two deliveries. Revenue counted twice. Where is the bug in your schema?
- Toll Road Sensor Analytics - easy - Cars enter, cars exit. Except when they don't.
- Trending Dishes Dashboard - medium - What's everyone eating? The answer changes hourly.
- What the Clock Remembers - medium
- What the Script Remembers - medium