Data Engineering Interview Prep
Roughly a third of data engineering interviews include a dedicated data modeling round. The most frequent topics: entity identification, primary keys, attributes, foreign keys, star schema, fact tables, dimension tables, and medallion architecture.
Based on DataDriven's analysis of verified interview data. DataDriven is the only platform with interactive schema building and validation.
Data modeling appears in roughly a third of all data engineering interviews, yet it is the round candidates prepare for the least. Unlike SQL or Python, there is no single correct answer. Every schema design is a series of decisions with consequences, and the interviewer's job is to probe whether you understand those consequences.
There is no single right schema. Two candidates can produce different schemas for the same problem, and both can pass. What matters is whether you can explain why you chose one approach over another. If the interviewer asks "why not snowflake?" and you cannot articulate the trade-off, that is a fail.
Requirements change mid-interview. Interviewers deliberately add new requirements after you finish your initial design. "Now the product team wants real-time updates." "Now we need to track 3 years of history." Your schema needs to bend without breaking.
You must think out loud. Unlike SQL where you can silently write and submit, data modeling is a conversation. Interviewers are scoring your reasoning process. Verbalize your assumptions, state your trade-offs, and ask clarifying questions before you draw a single table.
Difficulty: Medium
Frequency: Very High
Normalization is the foundation of relational design. Interviewers test whether you can identify violations, explain why they matter, and know when denormalization is the right call.
Given a table with repeating groups in a single column (comma-separated tags), normalize it to 1NF. Explain what problem the original design creates for queries.
Identify the partial dependency in a table with a composite primary key and decompose it into 2NF.
A table stores employee_name, department_name, and department_location. Identify the transitive dependency and normalize to 3NF.
When would you intentionally denormalize a 3NF schema? Give a specific scenario with the trade-offs.
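A minimal sketch of the first question above, using Python's built-in sqlite3 (table and column names are illustrative, not from the source): a comma-separated tags column violates 1NF because every filter requires string parsing; moving each tag to its own row in a child table makes the value atomic and queryable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Original design: a repeating group in one column violates 1NF.
# Tags cannot be indexed, joined, or filtered without string parsing.
cur.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, tags TEXT)")
cur.execute("INSERT INTO article VALUES (1, 'sql,python,etl')")

# 1NF fix: one row per atomic value in a child table.
cur.execute("""CREATE TABLE article_tag (
    article_id INTEGER REFERENCES article(id),
    tag        TEXT,
    PRIMARY KEY (article_id, tag))""")

for article_id, tag_csv in cur.execute("SELECT id, tags FROM article").fetchall():
    for tag in tag_csv.split(","):
        cur.execute("INSERT INTO article_tag VALUES (?, ?)", (article_id, tag))

# The filter is now a plain equality predicate instead of a LIKE scan.
rows = cur.execute(
    "SELECT article_id FROM article_tag WHERE tag = 'python'").fetchall()
print(rows)  # [(1,)]
```

With the original design, the same lookup would be `WHERE tags LIKE '%python%'`, which cannot use an index and also false-matches tags like "pythonic".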
Difficulty: Medium
Frequency: Star schema 4.7% of modeling questions
Star schema appears in 4.7% of data modeling questions, tied with fact tables (4.7%), with dimension tables (4.2%) close behind. Interviewers want you to design fact and dimension tables, choose the right grain, and justify star vs snowflake trade-offs.
Design a star schema for an e-commerce platform. Define the fact table grain and list 4 dimension tables with their key attributes.
Convert a star schema into a snowflake schema by normalizing the product dimension. What queries become harder? What storage do you save?
Your fact table has 500M rows. A dimension table has 50M rows. Should you use a star or snowflake schema? Explain your reasoning based on query patterns.
What is a degenerate dimension? Give an example and explain why it lives in the fact table instead of a separate dimension.
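The e-commerce design and the degenerate-dimension questions can be sketched together. This is one possible answer, not the only one; all names are hypothetical. The fact table's grain is one row per order line, and order_id sits on the fact table as a degenerate dimension: it identifies the transaction but carries no attributes of its own, so a separate dimension table would be empty.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables: descriptive attributes, one surrogate key each.
cur.execute("CREATE TABLE dim_date (date_sk INTEGER PRIMARY KEY, full_date TEXT, month TEXT)")
cur.execute("CREATE TABLE dim_product (product_sk INTEGER PRIMARY KEY, name TEXT, category TEXT)")
cur.execute("CREATE TABLE dim_customer (customer_sk INTEGER PRIMARY KEY, region TEXT)")

# Fact table at the order-line grain. order_id is a degenerate
# dimension: kept on the fact table, no dimension table needed.
cur.execute("""CREATE TABLE fact_sales (
    date_sk     INTEGER REFERENCES dim_date(date_sk),
    product_sk  INTEGER REFERENCES dim_product(product_sk),
    customer_sk INTEGER REFERENCES dim_customer(customer_sk),
    order_id    TEXT,
    quantity    INTEGER,
    revenue     REAL)""")

cur.execute("INSERT INTO dim_date VALUES (1, '2024-06-01', '2024-06')")
cur.execute("INSERT INTO dim_product VALUES (1, 'widget', 'hardware')")
cur.execute("INSERT INTO dim_customer VALUES (1, 'EMEA')")
cur.execute("INSERT INTO fact_sales VALUES (1, 1, 1, 'ORD-42', 3, 29.97)")

# Typical star query: one join per dimension, grouped by attributes.
row = cur.execute("""SELECT d.month, c.region, SUM(f.revenue)
                     FROM fact_sales f
                     JOIN dim_date d ON f.date_sk = d.date_sk
                     JOIN dim_customer c ON f.customer_sk = c.customer_sk
                     GROUP BY d.month, c.region""").fetchone()
print(row)
```

Snowflaking would split dim_product into product and category tables, saving storage on repeated category strings at the cost of an extra join in every category rollup.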
Difficulty: Medium-Hard
Frequency: High
Tracking historical changes in dimension data is a classic interview topic. Interviewers test whether you understand the trade-offs between SCD types and can implement them correctly.
A customer changes their address. Show how SCD Type 1, Type 2, and Type 3 each handle this change. Which columns does each approach add or modify?
Design an SCD Type 2 implementation for a product dimension where price changes frequently. Include the surrogate key, natural key, effective dates, and current flag.
Your team uses SCD Type 2 for a dimension with 10M rows. After 3 years, the table has grown to 80M rows. What strategies can reduce the size without losing history?
When would you choose SCD Type 3 over Type 2? Describe a real scenario where storing only the previous value is sufficient.
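One way to answer the Type 2 implementation question is a sketch like the following (schema and function names are hypothetical): a price change closes the current row by stamping its end date and clearing the current flag, then opens a new row with an open-ended validity range. The surrogate key lets fact rows keep pointing at the version that was current when they loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# SCD Type 2 dimension: surrogate key, natural key,
# effective-date range, and a current-row flag.
cur.execute("""CREATE TABLE dim_product (
    product_sk INTEGER PRIMARY KEY,   -- surrogate key, auto-assigned
    product_id TEXT,                  -- natural key
    price      REAL,
    valid_from TEXT,
    valid_to   TEXT,                  -- NULL = still in effect
    is_current INTEGER)""")

cur.execute("INSERT INTO dim_product VALUES (1, 'P-100', 9.99, '2024-01-01', NULL, 1)")

def apply_price_change(cur, product_id, new_price, change_date):
    """Type 2 change: close the current row, open a new one."""
    cur.execute("""UPDATE dim_product
                   SET valid_to = ?, is_current = 0
                   WHERE product_id = ? AND is_current = 1""",
                (change_date, product_id))
    cur.execute("""INSERT INTO dim_product
                   (product_id, price, valid_from, valid_to, is_current)
                   VALUES (?, ?, ?, NULL, 1)""",
                (product_id, new_price, change_date))

apply_price_change(cur, "P-100", 12.49, "2024-06-01")

rows = cur.execute("""SELECT price, valid_from, valid_to, is_current
                      FROM dim_product ORDER BY product_sk""").fetchall()
print(rows)  # [(9.99, '2024-01-01', '2024-06-01', 0), (12.49, '2024-06-01', None, 1)]
```

Type 1 would simply UPDATE the price in place (no history); Type 3 would add a previous_price column, keeping exactly one prior value.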
Difficulty: Medium
Frequency: Entity identification 6.6% of modeling questions
Entity identification is the most common data modeling topic at 6.6%. ER diagrams are the starting point for most schema design interviews. You draw entities, relationships, and cardinalities. Then the interviewer asks follow-up questions about edge cases.
Draw an ER model for a university system: students, courses, professors, and enrollments. Define cardinalities for each relationship.
Model a social network where users can follow other users, create posts, and like posts. How do you handle the self-referencing follow relationship?
Design an ER model for a hotel booking system. Handle the case where a single booking can span multiple rooms and multiple nights.
Your ER model has a many-to-many relationship between orders and products. The intersection table needs to store quantity and unit_price. Why is unit_price stored here instead of on the product table?
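The self-referencing follow relationship from the social-network question can be modeled as a table whose two foreign keys both point back to users (names here are illustrative). A composite primary key prevents duplicate follows, and a CHECK constraint blocks users from following themselves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("PRAGMA foreign_keys = ON")

cur.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, handle TEXT)")

# Self-referencing many-to-many: both foreign keys reference users.
cur.execute("""CREATE TABLE follows (
    follower_id INTEGER NOT NULL REFERENCES users(user_id),
    followee_id INTEGER NOT NULL REFERENCES users(user_id),
    PRIMARY KEY (follower_id, followee_id),
    CHECK (follower_id <> followee_id))""")

cur.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "alan")])
cur.execute("INSERT INTO follows VALUES (1, 2)")  # ada follows alan

# Followers of a user: join back to users on the follower side.
rows = cur.execute("""SELECT u.handle
                      FROM follows f
                      JOIN users u ON f.follower_id = u.user_id
                      WHERE f.followee_id = 2""").fetchall()
print(rows)  # [('ada',)]
```

The same pattern answers the unit_price question: an order_items intersection table stores unit_price because the price belongs to the moment of sale; the product table only knows the current price, which changes over time.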
Difficulty: Easy-Medium
Frequency: Primary keys 5.9%, Foreign keys 4.7%
Primary keys (5.9%) and foreign keys (4.7%) are core modeling topics. Interviewers probe whether you understand why each constraint exists and what breaks when you skip one.
When should you use a surrogate key (auto-increment ID) vs a natural key (email, SSN)? Give one scenario for each.
A table has a composite primary key of (user_id, event_date). Can you add a foreign key from another table to this composite key? What are the implications?
Your schema has no foreign key constraints for performance reasons. How do you ensure referential integrity? What is the cost of skipping enforcement?
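When the last question comes up, one common answer is a periodic anti-join audit: without a foreign key constraint, nothing stops an orphan row at write time, so a scheduled query has to find violations after the fact. A sketch with hypothetical tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# No FK constraint on orders.customer_id -- integrity is verified
# out-of-band instead of being enforced on every write.
cur.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY)")
cur.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER)")

cur.execute("INSERT INTO customers VALUES (1)")
cur.execute("INSERT INTO orders VALUES (10, 1)")
cur.execute("INSERT INTO orders VALUES (11, 999)")  # orphan -- nothing stops it

# Anti-join audit: child rows with no matching parent.
orphans = cur.execute("""SELECT o.order_id
                         FROM orders o
                         LEFT JOIN customers c ON o.customer_id = c.customer_id
                         WHERE c.customer_id IS NULL""").fetchall()
print(orphans)  # [(11,)]
```

The cost of skipping enforcement is exactly this window: bad rows exist until the audit runs, and every consumer query must tolerate or filter them in the meantime.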
Difficulty: Hard
Frequency: Very High
The hardest part of a data modeling interview is not the design itself. It is defending your choices when the interviewer pushes back. Every design has trade-offs, and you need to articulate them clearly.
You denormalized a dimension to speed up a critical dashboard query. The interviewer asks: what happens when the source data changes? How do you keep the denormalized copy in sync?
Your schema uses a single wide fact table with 200 columns. The interviewer suggests splitting it into multiple fact tables. Argue for and against each approach.
A stakeholder wants real-time data in the warehouse. Your current design uses daily batch loads. What schema changes would you make to support streaming ingestion without breaking existing queries?
You chose a star schema, but a new requirement needs a query that joins 5 dimension tables. The query is slow. What are your options? Which one do you recommend and why?
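For the sync question above, one possible answer (among several, such as rebuilding the copy in the nightly load) is a database trigger that propagates source changes to the denormalized column. A sketch with illustrative names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT)")
# category is denormalized onto the fact table to skip a join.
cur.execute("""CREATE TABLE fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER,
    category   TEXT)""")

cur.execute("INSERT INTO dim_product VALUES (1, 'books')")
cur.execute("INSERT INTO fact_sales VALUES (100, 1, 'books')")

# Sync strategy: trigger propagates source changes to the copy.
cur.execute("""CREATE TRIGGER sync_category
    AFTER UPDATE OF category ON dim_product
    BEGIN
        UPDATE fact_sales SET category = NEW.category
        WHERE product_id = NEW.product_id;
    END""")

cur.execute("UPDATE dim_product SET category = 'media' WHERE product_id = 1")
row = cur.execute("SELECT category FROM fact_sales WHERE sale_id = 100").fetchone()
print(row)  # ('media',)
```

This is the trade-off to name out loud: the trigger keeps the copy consistent, but it adds write-path cost and a hidden dependency that the next engineer has to discover.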
Name the trade-off explicitly. "I chose to denormalize the product category onto the fact table. This costs storage and creates update complexity, but it eliminates a JOIN on every dashboard query, which matters at our query volume."
Reference the query patterns. Every design decision should tie back to how the data will be queried. "Analysts will filter by date range and group by region, so I put both on the fact table as foreign keys to keep those queries single-scan."
Acknowledge what you are giving up. Interviewers respect honesty. "This design does not handle the case where a product belongs to multiple categories. If that requirement appears later, I would add a bridge table. But for now, the simpler design serves the stated requirements."
Have a fallback ready. When the interviewer pushes back, do not defend your original design to the death. Say: "If the requirement changes to X, I would restructure this table by Y. Here is what that migration looks like."
Build schemas, define relationships, and validate your designs. The only platform with hands-on data modeling practice.