# Data Modeling Practice Problems

> Hands-on schema design practice for data engineer interview prep.

Canonical URL: <https://datadriven.io/data-modeling-practice-problems>

Breadcrumb: [Home](https://datadriven.io/) > [Data Modeling Practice Problems](https://datadriven.io/data-modeling-practice-problems)

## Summary

Data modeling practice problems for data engineer interview prep. Whiteboard schema design with rubric-scored verdicts covering grain selection, dimension type, SCD choice, and trade-off articulation across e-commerce, marketplace, rideshare, payments, ad tech, and content platform domains.

## What this page covers

Data modeling practice for data engineer interviews looks different from SQL or Python practice. There is no auto-grader for "is this schema correct" because multiple valid designs exist for any domain. What the rubric scores instead is the decision-making process: did the data engineer state the grain in one sentence, choose appropriate SCD types per attribute with justification, build conformed dimensions across facts, defend star versus snowflake versus OBT in the specific domain, and articulate the trade-offs the interviewer would ask about next.

Six domains anchor the practice catalog. E-commerce: order-line-item grain fact with dim_customer (Type 2), dim_product (Type 2), dim_store (Type 1 for typos, Type 2 for re-branding), dim_date. Marketplace: a two-sided fact joining buyer and seller with dim_user (Type 2 with role attribute), dim_listing (Type 2 for price changes and active status), dim_geo. Rideshare: trip-grain fact with dim_rider (Type 2 for rating history), dim_driver (Type 2 for car and rating), dim_pickup_location and dim_dropoff_location (both FK to the same dim_location, usually an H3 cell), dim_date_time. Payments ledger: transaction-grain fact with dim_merchant (Type 2 for payout schedule and MCC), dim_customer (Type 2 for risk profile), dim_payment_method (Type 1 because PMs are immutable once issued). Ad tech: impression-conversion bridge table with dim_advertiser (Type 2 for billing relationship), dim_campaign (Type 2 for budget and targeting), dim_creative, dim_publisher. Content platform: viewership-event grain fact with dim_content (Type 2 for title and category), dim_user (Type 2 for subscription tier), dim_device (Type 1 because device attributes do not change), dim_date_time.

Rubric weights for each practice problem. Grain correctness (one row per X stated explicitly and consistent across measures): 25 percent. Dimension design (SCD types correct, conformed dimensions used appropriately): 25 percent. Trade-off articulation (star vs snowflake vs OBT defended in the specific domain): 20 percent. Fact additivity (additive vs semi-additive vs non-additive measures identified): 15 percent. Edge case handling (late-arriving dimensions, bridge tables for many-to-many, slowly-changing facts): 15 percent. The L4 candidate hits 60-70 percent; the L5 candidate hits 80-90 percent; the L6 candidate adds platform-level concerns (multi-region replication, audit trail, schema evolution without downtime).

Common failure modes the rubric explicitly fails. Drawing before stating the grain. Mixed-grain fact table (some rows at order-level, some at line-item-level). Missing conformed dimension (separate dim_customer schema in orders versus returns facts). Forgotten bridge table on a many-to-many (impression-conversion joined directly producing duplication). Choosing snowflake on a small dim that should be star. Choosing star on a high-cardinality dim that needs snowflake or a separate bridge. Picking Type 1 for an attribute where downstream queries need point-in-time correctness (the customer address case, where ship-to history matters for a year-old order). The catalog catches each of these in practice scenarios.

## Frequently asked questions

### How are data modeling practice problems graded without a single correct answer?

The rubric scores the decision process, not a specific schema. Grain stated in one sentence (25 percent), dimension design with SCD justification (25 percent), trade-off articulation between star/snowflake/OBT (20 percent), fact additivity identification (15 percent), and edge case handling like late-arriving dims (15 percent). Multiple valid schemas score well if the data engineer can defend each choice.

### What is the most common mistake in data modeling practice?

Starting to draw before stating the grain. The candidate jumps to dim_customer and dim_product before saying 'one row per order line item' and ends up with a mixed-grain fact table. The fix is to write the grain on the whiteboard first, then build outward.

### Do these practice problems test specific warehouse vendors?

Most stay vendor-neutral and test pure Kimball-style design. Vendor-specific practice exists for Amazon (Redshift DISTKEY and SORTKEY decisions), Snowflake (clustering keys, micro-partition awareness), BigQuery (partitioning and clustering), Databricks (Delta MERGE INTO patterns). The vendor-specific variants are tagged on the problems where they apply.

### How many data modeling practice problems should I solve before an interview?

Six well-designed schemas across six different domains beats twenty rushed schemas on similar domains. The signal interviewers test is whether you can transfer the pattern to a new domain. Aim for e-commerce, marketplace, rideshare, payments, ad tech, content platform. Each takes 30-45 minutes; finish all six over 2 weeks with rubric review after each.

### What domains do data modeling rounds use in 2026?

Most often: marketplace (two-sided fact with buyer and seller dims), rideshare (trip-grain with location dim), payments (transaction-grain with merchant SCD2), ad tech (impression-conversion bridge), content platform (viewership-event grain). Less often: traditional e-commerce, supply chain, healthcare claims. The fundamental patterns transfer across domains; the practice value is in defending design choices in a new context.

### Should I memorize specific schemas for the interview?

No. Memorized schemas fail under follow-up. Practice the decision-making process. State grain, choose SCD type per attribute with reason, build conformed dimensions, defend star vs snowflake. When the interviewer asks 'why Type 2 here', the memorizer has no answer; the practiced data engineer recalls the reason because they made the choice from first principles.

### What is the bar for senior data engineer vs entry-level on modeling problems?

L4 (entry-mid): produce a working schema with correct grain and reasonable dimensions. L5 (senior): defend two alternatives, name conformed-dimension benefits, handle the mid-round pivot when the interviewer changes a requirement. L6 (staff): add platform-level concerns like multi-region replication, audit trail, schema evolution without downtime, and the data contract with upstream and downstream teams.

## How a data engineer practices data modeling problems

Step-by-step framework for the 30-45 minute modeling practice session.

### Step 1: Read the domain prompt fully

Note the analytical questions the model needs to support. 'How many orders per customer in the last 30 days' implies daily date grain in dim_date; 'lifetime value' implies aggregating across multiple orders per customer.

### Step 2: State the fact grain in one sentence, write it down

'One row per order line item'. Stop. Do not draw yet. Re-read the prompt and verify the grain matches the analytical questions.

### Step 3: List required dimensions and SCD type for each

dim_customer Type 2 (address history matters for shipping records), dim_product Type 2 (category and price change), dim_date Type 1 (canonical calendar), dim_store Type 1 for typos or Type 2 for re-branding.

### Step 4: Draw the fact table with measures

FKs to each dim, then numeric measures: quantity, unit_price, discount, total_amount. Mark additive vs semi-additive vs non-additive.

### Step 5: Defend the design choices out loud

Why star over snowflake (small dims, columnar storage broadcasts them). Why Type 2 on dim_customer (point-in-time queries need historical address). Why a bridge table on product-category (many-to-many).

### Step 6: Handle the mid-round pivot

When the interviewer changes a requirement (real-time freshness, multi-region replication, late-arriving conversions), modify the existing model in place. Do not throw it out and restart.

## Related practice catalogs

- [Data modeling interview questions catalog](https://datadriven.io/data-modeling-interview-questions): Full catalog with rubric-scored verdicts.
- [Data modeling interview prep guide](https://datadriven.io/data-modeling-interview-prep): Round-by-round prep covering grain selection and dimension design.
- [Star schema interview questions](https://datadriven.io/star-schema-interview-questions): Conformed dimensions, additive facts, defending star over snowflake.
- [Dimensional modeling interview questions](https://datadriven.io/dimensional-modeling-interview-questions): Kimball-style design across multiple domains.
- [SCD merge logic practice problems](https://datadriven.io/scd-interview-questions): Type 2 merge in SQL and pandas with composite tiebreakers.
- [Fact table interview problems](https://datadriven.io/fact-table-interview-questions): Grain selection, additivity, snapshot vs transaction.
- [Data warehouse interview problems](https://datadriven.io/data-warehouse-interview-questions): Warehouse design, fact granularity, conformed dimensions.
- [Full data engineer coding practice](https://datadriven.io/data-engineer-coding-practice): Pair modeling with SQL, Python, and PySpark.

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.