If you're new to data modeling, start here. You'll find the word "model" means three different things depending on who's talking. A conceptual model is a sketch for a product manager. A logical model is the entity contract your engineering team argues over. A physical model is what actually runs on the database. You'll need all three eventually. This page walks you through them in the order that'll make sense to you.
Data modeling makes up about 18% of the interview corpus, but it punches above that weight: candidates who can talk cleanly about the three levels stand out. It's one of the best areas to study because the ideas are stable, the vocabulary is small, and practice pays off fast.
Source: DataDriven analysis of 1,042 verified data engineering interview rounds.
You'll hear the word "model" used loosely, and that's fine until it costs you an interview answer. The three levels below are the ones hiring managers expect you to separate cleanly. It's a small vocabulary. Once you can tell a product manager's sketch apart from a DBA's DDL, you'll sound a lot more senior than candidates who treat them all as the same thing.
A conceptual data model defines the high-level entities and relationships in a system without any implementation detail. It answers 'what entities exist and how do they relate?' An e-commerce system has Customers, Orders, Products, and Payments. A Customer places an Order. An Order contains Products. A Payment settles an Order. No data types, no primary keys, no table names. Just entities and relationships.
Audience: Business stakeholders, product managers, and architects use conceptual models to align on scope before anyone writes DDL.
Entities: Customer, Order, Product, Payment
Relationships:
Customer --places--> Order
Order --contains--> Product (many-to-many via OrderItem)
Payment --settles--> Order

Interview note: Interviewers sometimes hand you a business scenario and ask you to 'draw the data model.' Start here. Sketch entities as boxes and relationships as lines. This shows structured thinking before diving into table definitions.
A logical data model adds structure to the conceptual model. It defines attributes (columns), primary keys, foreign keys, and data types at an abstract level. It is database-agnostic: no indexes, no partitions, no engine-specific syntax. The logical model for an Order might include order_id (PK), customer_id (FK), order_date, and total_amount. It specifies that order_id is an integer and order_date is a date, but it does not specify whether the table is partitioned or what index strategy to use.
Audience: Data architects and senior engineers use logical models to communicate schema design without committing to a specific database engine.
Order (logical)
order_id INTEGER PK
customer_id INTEGER FK -> Customer
order_date DATE
total_amount DECIMAL(10,2)
status VARCHAR(20)

Interview note: When an interviewer asks you to 'design a schema,' a logical data model is usually the expected deliverable. Define entities, attributes, keys, and relationships. Mention data types but skip physical details unless asked.
A physical data model is the actual implementation in a specific database engine. It includes CREATE TABLE statements, index definitions, partitioning strategies, storage formats, and engine-specific features. The physical model for an Order in Snowflake looks different from the same Order in PostgreSQL. Snowflake uses clustering keys instead of indexes. PostgreSQL uses B-tree indexes and table partitioning. The logical model is the same; the physical model adapts to the engine.
Audience: Data engineers build physical models. You write the DDL, choose the partitioning strategy, define the indexes, and optimize for the query workload.
-- Physical model in Snowflake
CREATE TABLE fact_orders (
order_id NUMBER(38,0),
customer_key NUMBER(38,0),
order_date DATE,
total_amount NUMBER(10,2)
)
CLUSTER BY (order_date);
-- Clustered by date for time-range query performance

Interview note: Physical model questions come up in senior interviews. The interviewer wants to hear about partitioning (fact tables by date), clustering (Snowflake cluster keys on high-cardinality columns), and storage formats (Parquet vs Delta vs Iceberg). Show that you think about how the model performs, not just what it contains.
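The same logical Order model lands differently in PostgreSQL, which uses declarative range partitioning and B-tree indexes rather than clustering keys. A minimal sketch for contrast (table and column names mirror the Snowflake example above; the partition bounds and index name are illustrative assumptions):

```sql
-- Hypothetical physical model of the same fact table in PostgreSQL
CREATE TABLE fact_orders (
    order_id     BIGINT,
    customer_key BIGINT,
    order_date   DATE NOT NULL,
    total_amount NUMERIC(10,2)
) PARTITION BY RANGE (order_date);

-- One partition per year; bounds here are illustrative
CREATE TABLE fact_orders_2024 PARTITION OF fact_orders
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- B-tree index where Snowflake would use a clustering key
CREATE INDEX idx_fact_orders_customer ON fact_orders_2024 (customer_key);
```

Same logical model, same grain; only the physical choices change with the engine.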
Four reasons data models are not optional for data engineers. Skipping the model leads to ad-hoc schemas, broken pipelines, and difficult interviews.
A data model is a shared language between engineers, analysts, and business stakeholders. When everyone agrees that a 'customer' entity has these attributes and relates to 'orders' this way, there is less ambiguity in requirements, reports, and pipeline logic. Without a shared model, the marketing team's definition of 'customer' might differ from the finance team's, and your pipeline produces numbers that nobody trusts.
Data models are the blueprint for your database schemas. Whether you are building a normalized OLTP schema, a star schema for analytics, or a wide table for a feature store, the data model defines the structure. Starting from a model prevents ad-hoc table creation that leads to duplicated data, inconsistent naming, and queries that require 15 joins because nobody planned the relationships.
Your data model defines the contract between upstream sources and downstream consumers. When an analyst builds a dashboard on top of fact_orders, they depend on the model having specific columns, data types, and grain. When the source system changes, the pipeline needs to map the change back to the model. Without an explicit model, every pipeline change risks breaking downstream consumers in ways nobody anticipated.
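One lightweight way to guard that contract is to inspect the live schema before promoting a change. A hedged sketch using the standard information_schema catalog (available in PostgreSQL and Snowflake; the table name is taken from the examples on this page):

```sql
-- Hypothetical contract check: list fact_orders' actual columns and types
-- so a pipeline change can be diffed against what consumers expect
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'fact_orders'
ORDER BY ordinal_position;
```

Comparing this output against the documented model turns "did we break the dashboard?" into a query instead of a guess.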
Data modeling questions appear in roughly 40% of data engineering interview loops. The interviewer hands you a business scenario and expects a data model. Candidates who start with entities and relationships, then move to attributes and keys, then discuss physical trade-offs demonstrate the structured thinking that hiring teams look for. Jumping straight to column names without a model signals inexperience.
Five patterns you will encounter in data engineering work and interviews. Each is a physical data model pattern optimized for a specific workload.
A central fact table surrounded by denormalized dimension tables. The most common pattern for analytics warehouses. Optimized for read-heavy workloads with few joins.
Star Schema Guide →

Similar to a star schema but with normalized dimension tables. Dimensions are split into sub-tables linked by foreign keys. Uses less storage but requires more joins.

Snowflake Schema Guide →

A modeling methodology for enterprise data warehouses. Uses Hubs (business keys), Links (relationships), and Satellites (descriptive attributes). Designed for auditability and parallel loading.

Data Vault Guide →

A single denormalized table with many columns. Common in data lakes and feature stores where query simplicity and scan performance matter more than storage efficiency. Often materialized from a star schema for specific use cases.

Dimensional Modeling Guide →

Tables decomposed to eliminate redundancy. Each non-key attribute depends on the key, the whole key, and nothing but the key. The standard for OLTP systems. Data engineers extract from 3NF sources and transform into star schemas or wide tables.

Relational Data Model Guide →

A five-step process that works for interviews and real projects. Start abstract, add detail progressively, and validate against the actual workload.
List the core things the business cares about: customers, orders, products, events, employees. These become your entities. Do not start with columns or tables. Start with what the business tracks.
How do entities connect? A customer places orders (one-to-many). An order contains products (many-to-many through a junction table). A product belongs to a category (many-to-one). Cardinality matters: one-to-one, one-to-many, and many-to-many each have different physical implementations.
For each entity, list the attributes. Identify the primary key (natural or surrogate) and foreign keys. Define data types at the logical level. This is the logical data model. It should be database-agnostic at this point.
Based on the workload (OLTP vs OLAP), choose a physical pattern. OLTP: normalized 3NF. OLAP: star schema or wide table. Data lake: Parquet with partition keys. The logical model stays the same; the physical implementation adapts to the engine and query patterns.
Write the DDL, load sample data, run representative queries, and validate that the model supports the required analytics. Check query performance. Verify that joins produce correct results. Ensure the grain is correct by comparing row counts against expectations.
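As a compact illustration of steps 3 through 5, here is a sketch in generic SQL: a logical Order/OrderItem pair (the junction table that resolves the many-to-many from step 2), plus a grain check of the kind step 5 calls for. All table and column names are assumptions for illustration:

```sql
-- Step 3: logical model written as engine-neutral DDL (names illustrative)
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers (customer_id),
    order_date  DATE NOT NULL
);

-- Junction table from step 2: resolves the Order <-> Product many-to-many
CREATE TABLE order_items (
    order_id   INTEGER REFERENCES orders (order_id),
    product_id INTEGER REFERENCES products (product_id),
    quantity   INTEGER NOT NULL,
    PRIMARY KEY (order_id, product_id)
);

-- Step 5: verify the grain -- one row per (order, product).
-- Any rows returned mean the grain is broken.
SELECT order_id, product_id, COUNT(*) AS dupes
FROM order_items
GROUP BY order_id, product_id
HAVING COUNT(*) > 1;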
Pick a scenario on DataDriven, sketch the three levels for it, and compare your version to a working solution. You'll remember it longer than any article you read twice.
Start Practicing