Data Engineering Interview Prep

Data Modeling Interview Questions (2026)

Roughly a third of data engineering interviews include a dedicated data modeling round. The most frequent topics: entity identification, primary keys, attributes, foreign keys, star schema, fact tables, dimension tables, and medallion architecture.

Based on DataDriven's analysis of verified interview data. DataDriven is the only platform with interactive schema building and validation.

Why Data Modeling Is the Hardest Interview Round

Data modeling appears in roughly a third of all data engineering interviews, yet it is the round candidates prepare for the least. Unlike SQL or Python, there is no single correct answer. Every schema design is a series of decisions with consequences, and the interviewer's job is to probe whether you understand those consequences.

There is no single right schema. Two candidates can produce different schemas for the same problem, and both can pass. What matters is whether you can explain why you chose one approach over another. If the interviewer asks "why not snowflake?" and you cannot articulate the trade-off, that is a fail.

Requirements change mid-interview. Interviewers deliberately add new requirements after you finish your initial design. "Now the product team wants real-time updates." "Now we need to track 3 years of history." Your schema needs to bend without breaking.

You must think out loud. Unlike SQL where you can silently write and submit, data modeling is a conversation. Interviewers are scoring your reasoning process. Verbalize your assumptions, state your trade-offs, and ask clarifying questions before you draw a single table.

Topic 1

Normalization (1NF through 3NF)

Difficulty: Medium

Frequency: Very High

Normalization is the foundation of relational design. Interviewers test whether you can identify violations, explain why they matter, and know when denormalization is the right call.

Q1

Given a table with repeating groups in a single column (comma-separated tags), normalize it to 1NF. Explain what problem the original design creates for queries.

Q2

Identify the partial dependency in a table with a composite primary key and decompose it into 2NF.

Q3

A table stores employee_name, department_name, and department_location. Identify the transitive dependency and normalize to 3NF.

Q4

When would you intentionally denormalize a 3NF schema? Give a specific scenario with the trade-offs.

Topic 2

Star vs Snowflake Schemas

Difficulty: Medium

Frequency: Star schema: 4.7% of modeling questions

Star schema appears in 4.7% of data modeling questions, with fact tables (4.7%) and dimension tables (4.2%) closely behind. Interviewers want you to design fact and dimension tables, choose the right grain, and justify star vs snowflake trade-offs.

Q1

Design a star schema for an e-commerce platform. Define the fact table grain and list 4 dimension tables with their key attributes.

Q2

Convert a star schema into a snowflake schema by normalizing the product dimension. What queries become harder? What storage do you save?

Q3

Your fact table has 500M rows. A dimension table has 50M rows. Should you use a star or snowflake schema? Explain your reasoning based on query patterns.

Q4

What is a degenerate dimension? Give an example and explain why it lives in the fact table instead of a separate dimension.

Topic 3

Slowly Changing Dimensions (SCD Types 1-3)

Difficulty: Medium-Hard

Frequency: High

Tracking historical changes in dimension data is a classic interview topic. Interviewers test whether you understand the trade-offs between SCD types and can implement them correctly.

Q1

A customer changes their address. Show how SCD Type 1, Type 2, and Type 3 each handle this change. Which columns does each approach add or modify?

Q2

Design an SCD Type 2 implementation for a product dimension where price changes frequently. Include the surrogate key, natural key, effective dates, and current flag.

Q3

Your team uses SCD Type 2 for a dimension with 10M rows. After 3 years, the table has grown to 80M rows. What strategies can reduce the size without losing history?

Q4

When would you choose SCD Type 3 over Type 2? Describe a real scenario where storing only the previous value is sufficient.

Topic 4

Entity-Relationship Modeling

Difficulty: Medium

Frequency: Entity ID: 6.6% of modeling questions

Entity identification is the most common data modeling topic at 6.6%. ER diagrams are the starting point for most schema design interviews. You draw entities, relationships, and cardinalities. Then the interviewer asks follow-up questions about edge cases.

Q1

Draw an ER model for a university system: students, courses, professors, and enrollments. Define cardinalities for each relationship.

Q2

Model a social network where users can follow other users, create posts, and like posts. How do you handle the self-referencing follow relationship?

Q3

Design an ER model for a hotel booking system. Handle the case where a single booking can span multiple rooms and multiple nights.

Q4

Your ER model has a many-to-many relationship between orders and products. The intersection table needs to store quantity and unit_price. Why is unit_price stored here instead of on the product table?

Topic 5

Keys and Constraints

Difficulty: Easy-Medium

Frequency: Primary keys 5.9%, Foreign keys 4.7%

Primary keys (5.9%) and foreign keys (4.7%) are core modeling topics. Interviewers probe whether you understand why each constraint exists and what breaks when you skip one.

Q1

When should you use a surrogate key (auto-increment ID) vs a natural key (email, SSN)? Give one scenario for each.

Q2

A table has a composite primary key of (user_id, event_date). Can you add a foreign key from another table to this composite key? What are the implications?

Q3

Your schema has no foreign key constraints for performance reasons. How do you ensure referential integrity? What is the cost of skipping enforcement?

Topic 6

Trade-off Reasoning

Difficulty: Hard

Frequency: Very High

The hardest part of a data modeling interview is not the design itself. It is defending your choices when the interviewer pushes back. Every design has trade-offs, and you need to articulate them clearly.

Q1

You denormalized a dimension to speed up a critical dashboard query. The interviewer asks: what happens when the source data changes? How do you keep the denormalized copy in sync?

Q2

Your schema uses a single wide fact table with 200 columns. The interviewer suggests splitting it into multiple fact tables. Argue for and against each approach.

Q3

A stakeholder wants real-time data in the warehouse. Your current design uses daily batch loads. What schema changes would you make to support streaming ingestion without breaking existing queries?

Q4

You chose a star schema, but a new requirement needs a query that joins 5 dimension tables. The query is slow. What are your options? Which one do you recommend and why?

How to Defend Your Schema Design Decisions

Name the trade-off explicitly. "I chose to denormalize the product category onto the fact table. This costs storage and creates update complexity, but it eliminates a JOIN on every dashboard query, which matters at our query volume."

Reference the query patterns. Every design decision should tie back to how the data will be queried. "Analysts will filter by date range and group by region, so I put both on the fact table as foreign keys to keep those queries single-scan."

Acknowledge what you are giving up. Interviewers respect honesty. "This design does not handle the case where a product belongs to multiple categories. If that requirement appears later, I would add a bridge table. But for now, the simpler design serves the stated requirements."

Have a fallback ready. When the interviewer pushes back, do not defend your original design to the death. Say: "If the requirement changes to X, I would restructure this table by Y. Here is what that migration looks like."

Data Modeling Interview FAQ

Why is data modeling the hardest interview round?+
Because there is no single correct answer. SQL and Python have objectively right or wrong output. Data modeling has trade-offs. The interviewer is evaluating your reasoning process, not just the final schema. You need to articulate why you made each choice and what you would change if requirements shifted.
How do I practice data modeling without a study partner?+
Design a schema for a business you know well: your favorite app, your company, an e-commerce site. Write down 3 trade-offs for each design decision. Then change a requirement (add real-time, add a new entity, increase scale 100x) and see how your schema holds up. DataDriven offers interactive schema building where you create tables, define relationships, and validate your design.
Do I need to memorize all SCD types for interviews?+
Know Types 1, 2, and 3 cold. Type 1 overwrites, Type 2 adds rows with date ranges, Type 3 adds a column for the previous value. Types 4-7 exist but are rarely asked. If you can explain the trade-offs between Types 1-3 with a concrete example, you are well prepared.
Star schema or snowflake schema: which should I default to in interviews?+
Start with star schema unless you have a specific reason to snowflake. Star schemas are simpler, faster for queries, and easier to explain. If the interviewer asks about storage efficiency or dimension table normalization, that is your cue to discuss snowflake. Always frame it as a trade-off, not a rule.
How long should a data modeling interview answer take?+
Spend 5 minutes asking clarifying questions about the business requirements. Spend 15 minutes drawing the schema. Spend the remaining 20-25 minutes defending your design and discussing alternatives. Interviewers value the discussion more than the diagram.

Practice Data Modeling Interactively

Build schemas, define relationships, and validate your designs. The only platform with hands-on data modeling practice.