Data Modeling

What Is a Data Model?

If you're new to data modeling, start here. You'll find the word "model" means three different things depending on who's talking. A conceptual model is a sketch for a product manager. A logical model is the entity contract your engineering team argues over. A physical model is what actually runs on the database. You'll need all three eventually. This page walks you through them in the order that'll make sense to you.

Data modeling is about 18% of the interview corpus, but candidates who can talk cleanly about the three levels get outsized credit for it. It's one of the best areas to study because the ideas are stable, the vocabulary is small, and practice pays off fast.

From a DataDriven analysis of 1,042 verified data engineering interview rounds:

  18%    of DE interviews cover modeling
  191    modeling questions analyzed
  1,418  interactive challenges
  49     onsite modeling rounds

Three Types of Data Models

You'll hear the word "model" used loosely, and that's fine until it costs you an interview answer. The three levels below are the ones hiring managers expect you to separate cleanly. It's a small vocabulary. Once you can tell a product manager's sketch apart from a DBA's DDL, you'll sound a lot more senior than candidates who treat them all as the same thing.

Conceptual Data Model

A conceptual data model defines the high-level entities and relationships in a system without any implementation detail. It answers 'what entities exist and how do they relate?' An e-commerce system has Customers, Orders, Products, and Payments. A Customer places an Order. An Order contains Products. A Payment settles an Order. No data types, no primary keys, no table names. Just entities and relationships.

Audience: Business stakeholders, product managers, and architects use conceptual models to align on scope before anyone writes DDL.

Entities: Customer, Order, Product, Payment
Relationships:
  Customer --places-->   Order
  Order    --contains--> Product (many-to-many via OrderItem)
  Payment  --settles-->  Order

Interview note: Interviewers sometimes hand you a business scenario and ask you to 'draw the data model.' Start here. Sketch entities as boxes and relationships as lines. This shows structured thinking before diving into table definitions.
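Because a conceptual model is nothing but entities and relationships, it can be captured in any notation, even a plain data structure. A minimal sketch in Python, using the e-commerce entities above (the representation is illustrative, not a standard, and the cardinality labels beyond the ones stated on this page are assumptions):

```python
# Conceptual model: entities and relationships only -- no types, no keys.
entities = {"Customer", "Order", "Product", "Payment"}

# (subject, verb, object, cardinality)
relationships = [
    ("Customer", "places",   "Order",   "one-to-many"),
    ("Order",    "contains", "Product", "many-to-many via OrderItem"),
    ("Payment",  "settles",  "Order",   "one-to-one (assumed)"),
]

def related_to(entity):
    """List every relationship an entity participates in."""
    return [r for r in relationships if entity in (r[0], r[2])]

for subj, verb, obj, card in related_to("Order"):
    print(f"{subj} --{verb}--> {obj} ({card})")
```

Being able to enumerate an entity's relationships like this is exactly the whiteboard exercise: boxes and lines first, tables later.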

Logical Data Model

A logical data model adds structure to the conceptual model. It defines attributes (columns), primary keys, foreign keys, and data types at an abstract level. It is database-agnostic: no indexes, no partitions, no engine-specific syntax. The logical model for an Order might include order_id (PK), customer_id (FK), order_date, and total_amount. It specifies that order_id is an integer and order_date is a date, but it does not specify whether the table is partitioned or what index strategy to use.

Audience: Data architects and senior engineers use logical models to communicate schema design without committing to a specific database engine.

Order (logical)
  order_id     INTEGER   PK
  customer_id  INTEGER   FK -> Customer
  order_date   DATE
  total_amount DECIMAL(10,2)
  status       VARCHAR(20)

Interview note: When an interviewer asks you to 'design a schema,' a logical data model is usually the expected deliverable. Define entities, attributes, keys, and relationships. Mention data types but skip physical details unless asked.
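Because a logical model is engine-agnostic, it can live outside any database, for example as plain data that tooling renders into DDL later. A sketch mirroring the Order model above (the dictionary layout and the `to_generic_ddl` helper are illustrations, not a standard format):

```python
# Logical model for Order: attributes, types, and keys -- no engine details.
order_logical = {
    "entity": "Order",
    "attributes": {
        "order_id":     "INTEGER",
        "customer_id":  "INTEGER",
        "order_date":   "DATE",
        "total_amount": "DECIMAL(10,2)",
        "status":       "VARCHAR(20)",
    },
    "primary_key": ["order_id"],
    "foreign_keys": {"customer_id": "Customer"},
}

def to_generic_ddl(model):
    """Render a logical model as generic DDL: no indexes, no partitions."""
    cols = ",\n  ".join(f"{n} {t}" for n, t in model["attributes"].items())
    pk = ", ".join(model["primary_key"])
    return f"CREATE TABLE {model['entity']} (\n  {cols},\n  PRIMARY KEY ({pk})\n);"

print(to_generic_ddl(order_logical))
```

The same structure could just as easily be rendered into Snowflake or PostgreSQL syntax, which is the whole point: one logical model, many physical targets.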

Physical Data Model

A physical data model is the actual implementation in a specific database engine. It includes CREATE TABLE statements, index definitions, partitioning strategies, storage formats, and engine-specific features. The physical model for an Order in Snowflake looks different from the same Order in PostgreSQL. Snowflake uses clustering keys instead of indexes. PostgreSQL uses B-tree indexes and table partitioning. The logical model is the same; the physical model adapts to the engine.

Audience: Data engineers build physical models. You write the DDL, choose the partitioning strategy, define the indexes, and optimize for the query workload.

-- Physical model in Snowflake
CREATE TABLE fact_orders (
  order_id     NUMBER(38,0),
  customer_key NUMBER(38,0),
  order_date   DATE,
  total_amount NUMBER(10,2)
)
CLUSTER BY (order_date);
-- Clustered by date for time-range query performance

Interview note: Physical model questions come up in senior interviews. The interviewer wants to hear about partitioning (fact tables by date), clustering (Snowflake cluster keys on high-cardinality columns), and storage formats (Parquet vs Delta vs Iceberg). Show that you think about how the model performs, not just what it contains.
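Engine specifics aside, the common thread is laying the data out for the query workload. A small stand-in using SQLite (which has neither clustering keys nor native partitioning; an index on the date column plays the analogous role here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_orders (
        order_id     INTEGER,
        customer_key INTEGER,
        order_date   TEXT,
        total_amount REAL
    )
""")
# Physical choice: index the date column so time-range scans stay cheap --
# analogous to CLUSTER BY (order_date) in Snowflake or date partitioning
# in PostgreSQL.
conn.execute("CREATE INDEX idx_orders_date ON fact_orders (order_date)")

plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT SUM(total_amount) FROM fact_orders
    WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31'
""").fetchall()
print(plan)  # the plan should show a search using idx_orders_date
```

The logical model never changed; only the access-path decision did. That distinction is what senior interviewers are probing for.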

Why Data Models Matter

Four reasons data models are not optional for data engineers. Skipping the model leads to ad-hoc schemas, broken pipelines, and difficult interviews.

Communication

A data model is a shared language between engineers, analysts, and business stakeholders. When everyone agrees that a 'customer' entity has these attributes and relates to 'orders' this way, there is less ambiguity in requirements, reports, and pipeline logic. Without a shared model, the marketing team's definition of 'customer' might differ from the finance team's, and your pipeline produces numbers that nobody trusts.

Schema Design

Data models are the blueprint for your database schemas. Whether you are building a normalized OLTP schema, a star schema for analytics, or a wide table for a feature store, the data model defines the structure. Starting from a model prevents ad-hoc table creation that leads to duplicated data, inconsistent naming, and queries that require 15 joins because nobody planned the relationships.

Pipeline Contracts

Your data model defines the contract between upstream sources and downstream consumers. When an analyst builds a dashboard on top of fact_orders, they depend on the model having specific columns, data types, and grain. When the source system changes, the pipeline needs to map the change back to the model. Without an explicit model, every pipeline change risks breaking downstream consumers in ways nobody anticipated.
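One lightweight way to make that contract explicit is a schema check that runs before downstream jobs. A sketch using SQLite's `PRAGMA table_info` (the expected column list is this page's fact_orders example, not a real production contract):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_orders (
        order_id     INTEGER,
        customer_key INTEGER,
        order_date   TEXT,
        total_amount REAL
    )
""")

# The contract downstream consumers depend on: column names and types.
EXPECTED = {
    "order_id": "INTEGER",
    "customer_key": "INTEGER",
    "order_date": "TEXT",
    "total_amount": "REAL",
}

def check_contract(conn, table, expected):
    """Report drift between the physical table and the agreed model."""
    actual = {row[1]: row[2] for row in conn.execute(f"PRAGMA table_info({table})")}
    missing = expected.keys() - actual.keys()
    wrong = {c for c in expected.keys() & actual.keys() if actual[c] != expected[c]}
    return missing, wrong

missing, wrong = check_contract(conn, "fact_orders", EXPECTED)
print("missing:", missing, "type drift:", wrong)
```

Failing this check at pipeline start is far cheaper than discovering the drift in a broken dashboard.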

Interview Performance

Data modeling questions appear in roughly 40% of data engineering interview loops. The interviewer hands you a business scenario and expects a data model. Candidates who start with entities and relationships, then move to attributes and keys, then discuss physical trade-offs demonstrate the structured thinking that hiring teams look for. Jumping straight to column names without a model signals inexperience.

Common Data Model Patterns

Five patterns you will encounter in data engineering work and interviews. Each is a physical data model pattern optimized for a specific workload.

Star Schema

A central fact table surrounded by denormalized dimension tables. The most common pattern for analytics warehouses. Optimized for read-heavy workloads with few joins.

Star Schema Guide

Snowflake Schema

Similar to a star schema but with normalized dimension tables. Dimensions are split into sub-tables linked by foreign keys. Uses less storage but requires more joins.

Snowflake Schema Guide

Data Vault

A modeling methodology for enterprise data warehouses. Uses Hubs (business keys), Links (relationships), and Satellites (descriptive attributes). Designed for auditability and parallel loading.

Data Vault Guide

Wide / Flat Table

A single denormalized table with many columns. Common in data lakes and feature stores where query simplicity and scan performance matter more than storage efficiency. Often materialized from a star schema for specific use cases.

Dimensional Modeling Guide

Normalized (3NF)

Tables decomposed to eliminate redundancy. Each non-key attribute depends on the key, the whole key, and nothing but the key. The standard for OLTP systems. Data engineers extract from 3NF sources and transform into star schemas or wide tables.

Relational Data Model Guide
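A minimal star schema, the first pattern above, can be stood up and queried in a few lines. A sketch in SQLite (table and column names invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension: one row per product, denormalized descriptive attributes.
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    -- Fact: one row per sale, with a foreign key out to the dimension.
    CREATE TABLE fact_sales (product_key INTEGER, order_date TEXT, amount REAL);

    INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware');
    INSERT INTO fact_sales VALUES (1, '2024-01-05', 10.0),
                                  (2, '2024-01-06', 20.0),
                                  (1, '2024-01-07', 5.0);
""")

# Typical star-schema query: join fact to dimension, aggregate by an attribute.
rows = conn.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product d ON d.product_key = f.product_key
    GROUP BY d.category
""").fetchall()
print(rows)
```

One join, one aggregate: that read-path simplicity is why the star schema dominates analytics warehouses.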

How to Build a Data Model

A five-step process that works for interviews and real projects. Start abstract, add detail progressively, and validate against the actual workload.

1. Identify Business Entities

List the core things the business cares about: customers, orders, products, events, employees. These become your entities. Do not start with columns or tables. Start with what the business tracks.

2. Define Relationships

How do entities connect? A customer places orders (one-to-many). An order contains products (many-to-many through a junction table). A product belongs to a category (many-to-one). Cardinality matters: one-to-one, one-to-many, and many-to-many each have different physical implementations.

3. Add Attributes and Keys

For each entity, list the attributes. Identify the primary key (natural or surrogate) and foreign keys. Define data types at the logical level. This is the logical data model. It should be database-agnostic at this point.

4. Choose a Physical Pattern

Based on the workload (OLTP vs OLAP), choose a physical pattern. OLTP: normalized 3NF. OLAP: star schema or wide table. Data lake: Parquet with partition keys. The logical model stays the same; the physical implementation adapts to the engine and query patterns.

5. Implement and Validate

Write the DDL, load sample data, run representative queries, and validate that the model supports the required analytics. Check query performance. Verify that joins produce correct results. Ensure the grain is correct by comparing row counts against expectations.
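The grain check in the last step can be as simple as comparing total rows to distinct keys at the declared grain. A sketch in SQLite (one row per order per product is the assumed grain here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_orders (order_id INTEGER, product_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO fact_orders VALUES (?, ?, ?)",
    [(100, 1, 10.0), (100, 2, 5.0), (101, 1, 7.5)],
)

# Grain check: if the table is truly one row per (order_id, product_id),
# total rows must equal the count of distinct key combinations.
total, distinct = conn.execute("""
    SELECT COUNT(*), COUNT(DISTINCT order_id || '-' || product_id)
    FROM fact_orders
""").fetchone()
print(total, distinct)
assert total == distinct, "duplicate rows at the declared grain"
```

If the two counts diverge, either the grain statement is wrong or the load introduced duplicates; both are worth catching before anyone builds a dashboard on top.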

Frequently Asked Questions

What is a data model?

A data model is a structured representation of how data is organized, stored, and related within a system. It defines the entities (things like customers, orders, products), their attributes (columns like name, date, amount), and the relationships between them (a customer places orders). Data models exist at three levels: conceptual (high-level entities and relationships), logical (attributes, keys, and data types without implementation details), and physical (actual database tables, indexes, and partitions).

What are the three types of data models?

The three types are conceptual, logical, and physical. A conceptual model shows entities and relationships without detail (used for business alignment). A logical model adds attributes, keys, and data types without specifying a database engine (used for design). A physical model is the actual implementation with CREATE TABLE statements, indexes, and engine-specific features (used for building). Each level adds more detail while preserving the structure of the level above.

Why do data engineers need data models?

Data engineers build the schemas that analysts query, the pipelines that move data, and the transformations that shape it. Without a data model, you get ad-hoc table creation, inconsistent naming, duplicated data, and pipelines that break when source schemas change. A data model is the blueprint that keeps everything coherent. It also appears in 40% of DE interview loops, so understanding modeling methodology is directly tied to getting hired.

What is the difference between a data model and a schema?

A data model is the design. A schema is the implementation. The data model defines entities, relationships, attributes, and keys at a conceptual or logical level. The schema is the physical realization: CREATE TABLE statements, indexes, constraints, and partitions in a specific database. You can have one logical data model that produces different physical schemas in PostgreSQL, Snowflake, and BigQuery. The model is the blueprint; the schema is the building.

You'll Get There Faster By Drawing Than By Reading

Pick a scenario on DataDriven, sketch the three levels for it, and compare your version to a working solution. You'll remember it longer than any article you read twice.

Start Practicing