Relational Data Model

The relational model is the mathematical foundation under every SQL database. Codd's 1970 paper introduced relations, tuples, and attributes as a way to store and query data without navigating pointers. PostgreSQL, Snowflake, and BigQuery are all implementations of that framework. Understanding the theory tells you why SQL behaves the way it does.

What Is the Relational Data Model?

In June 1970, an IBM researcher named Edgar F. Codd published 'A Relational Model of Data for Large Shared Data Banks' in Communications of the ACM. IBM's existing IMS hierarchical database was the state of the art, and Codd was quietly arguing that it was a dead end. The paper proposed that data should be stored in relations (tables) with a formal algebra, and that users should ask for data by what they want, not how to navigate to it.

More than 50 years later, every major database (PostgreSQL, MySQL, SQL Server, Oracle, Snowflake, BigQuery, Redshift) is built on Codd's framework. SQL is the practical implementation of relational algebra. Understanding the relational model means understanding why SQL behaves the way it does.

Prepare for the interview
01 / Open invite
02min.

Know Data Modeling the way the interviewer who asks it knows it.

a Data Modeling query, the same shape a screen would give you.
The diff against expected. Where ties broke. What you missed.
sandbox
1fact_orders
2 order_id bigint PK
3 customer_sk bigint FK
4 order_date date SCD2
5
Execute your solution0.4s avg.
SpotifyInterview question
Solve a Data Modeling problem

Core Terminology

Relation (= Table)

A relation is a set of tuples that share the same attributes. In practice, this is a table. The mathematical foundation matters because it guarantees that a relation has no duplicate tuples (rows) and no ordering. SQL relaxes both constraints: tables can have duplicate rows and results have no guaranteed order unless you specify ORDER BY. But the theoretical guarantee of uniqueness is why primary keys exist. Every properly designed relation has a primary key that enforces the no-duplicates rule.

Tuple (= Row)

A tuple is a single record within a relation. Each tuple contains one value for each attribute in the relation. In a customers relation with attributes (customer_id, name, email), a tuple might be (1, 'Alice Chen', 'alice@example.com'). Tuples are unordered within a relation. There is no 'first row' or 'last row' in the relational model. SQL engines store rows in physical order for performance, but the logical model treats all tuples as equal members of a set.

Attribute (= Column)

An attribute defines a named property with a specific domain (data type). The domain constrains what values the attribute can hold. An age attribute with domain INTEGER cannot hold the string 'thirty-five.' Attributes are unordered within a tuple. Saying 'the third column' has no meaning in the relational model, though SQL engines assign ordinal positions for convenience. Each attribute must have a unique name within its relation.

Domain (= Data type + constraints)

A domain is the set of allowable values for an attribute. INTEGER, VARCHAR(255), and BOOLEAN are domains. But domains go beyond simple types. An 'email_address' domain might be VARCHAR(255) with a CHECK constraint enforcing a valid format. A 'positive_amount' domain might be DECIMAL(10,2) with a CHECK that value > 0. Domains enforce data quality at the model level, not the application level. This distinction is critical in data engineering where multiple pipelines write to the same tables.

Primary Key (= Unique row identifier)

A primary key is an attribute (or set of attributes) that uniquely identifies each tuple in a relation. It must be unique and non-NULL. The relational model requires every relation to have a primary key because it enforces the fundamental no-duplicates rule. In practice, primary keys are either natural keys (business-meaningful values like email or ISBN) or surrogate keys (system-generated integers or UUIDs with no business meaning). Data warehouses typically use surrogate keys for dimensions and composite keys for facts.

Foreign Key (= Reference to another table)

A foreign key is an attribute in one relation that references the primary key of another relation. It creates a link between the two relations. The orders relation has a customer_id foreign key that references customers(customer_id). The database enforces referential integrity: you cannot insert an order with a customer_id that does not exist in the customers table. Foreign keys are the mechanism that makes the relational model relational. Without them, tables are just independent data stores.

Codd's Rules (Selected)

Rule 0: Foundation Rule

A relational database must manage its stored data using only its relational capabilities. No bypassing the relational engine to manipulate data directly in files. This rule ensures the database enforces all constraints, not the application.

Rule 1: Information Rule

All information in the database must be represented as values in cells within tables. Metadata, schema definitions, and configuration are all stored in tables (system catalogs). There is no hidden data.

Rule 2: Guaranteed Access Rule

Every value must be accessible by specifying the table name, column name, and primary key value. No pointer chasing, no file offsets, no hidden navigation. This is why SELECT with a WHERE clause on the primary key always works.

Rule 3: Systematic Treatment of NULLs

NULLs represent missing or inapplicable data and must be handled uniformly. NULL is not zero, not an empty string, and not false. It is the absence of a value. Three-valued logic (TRUE, FALSE, UNKNOWN) governs comparisons involving NULL.

Rule 5: Complete Data Sublanguage

The database must support at least one language for data definition (CREATE TABLE), data manipulation (INSERT, UPDATE, DELETE, SELECT), integrity constraints (PRIMARY KEY, FOREIGN KEY, CHECK), and transaction management (BEGIN, COMMIT, ROLLBACK). SQL fulfills this rule.

Rule 6: View Updating Rule

All views that are theoretically updatable must be updatable through the system. A view on a single table with no aggregation should support INSERT, UPDATE, and DELETE. In practice, most databases limit updatable views to simple cases.

Normalization Forms

1NF (First Normal Form)

Rule: Each column contains atomic (indivisible) values. No repeating groups. Example: A customer table with a 'phone_numbers' column containing '555-1234, 555-5678' violates 1NF. Fix: create a separate customer_phones table with one row per phone number. Interview note: 1NF is the baseline. If a table violates 1NF, it is not even a proper relation. Interviewers rarely ask about 1NF directly but might present a denormalized dataset and ask you to identify the violations.

2NF (Second Normal Form)

Rule: Must be in 1NF. Every non-key attribute depends on the entire primary key, not just part of it. Example: An order_items table with a composite key (order_id, product_id) that includes customer_name violates 2NF. customer_name depends on order_id alone, not on the full composite key. Fix: move customer_name to the orders table. Interview note: 2NF violations only occur with composite primary keys. If your table has a single-column primary key, it automatically satisfies 2NF. Interviewers test this by giving you a table with a composite key and asking what is wrong.

3NF (Third Normal Form)

Rule: Must be in 2NF. No transitive dependencies. Non-key attributes must depend on the key, the whole key, and nothing but the key. Example: An employees table with (employee_id, department_id, department_name) violates 3NF. department_name depends on department_id, not directly on employee_id. Fix: move department_name to a departments table. Interview note: 3NF is the gold standard for OLTP databases. The classic interview question is: 'Normalize this table to 3NF.' Walk through each form step by step. State the dependency chain that violates the rule, then show the fix.

BCNF (Boyce-Codd Normal Form)

Rule: Must be in 3NF. For every functional dependency X -> Y, X must be a superkey. Example: A course_schedule table with (student, course, instructor) where each instructor teaches only one course. The dependency instructor -> course means instructor determines course, but instructor is not a superkey. Fix: decompose into (student, instructor) and (instructor, course). Interview note: BCNF comes up in senior-level interviews. The key difference from 3NF: 3NF allows non-prime attributes to depend on candidate keys other than the primary key. BCNF does not. Most tables that satisfy 3NF also satisfy BCNF.

OLTP vs OLAP: How the Relational Model Shifts

AspectOLTP (Source)OLAP (Warehouse)
Normalization3NF or BCNFDenormalized (star/snowflake schema)
Constraint enforcementStrict (PK, FK, CHECK)Declarative, rarely enforced
Primary useTransactional writesAnalytical reads
Join countMany small joinsFew large joins
Row countMillions per tableBillions in fact tables

4 Relational Model Interview Questions

What is the relational data model and why does it matter for data engineering?

The relational model is a mathematical framework where data is organized into relations (tables) consisting of tuples (rows) and attributes (columns). It matters because it provides formal rules for data integrity, consistency, and query correctness. SQL is the practical implementation of relational algebra. For data engineering specifically, the relational model underpins schema design for both OLTP source systems and OLAP warehouses. Start with the formal definition (relations, tuples, attributes), then connect to practice: primary keys enforce uniqueness, foreign keys enforce referential integrity, and normalization eliminates redundancy.

Normalize this table to 3NF: orders(order_id, customer_id, customer_name, customer_city, product_id, product_name, quantity, unit_price).

Decompose into three tables. customers(customer_id, customer_name, customer_city). products(product_id, product_name, unit_price). orders(order_id, customer_id, product_id, quantity). customer_name and customer_city depend on customer_id, not order_id (transitive dependency removed). product_name depends on product_id, not order_id (same). quantity stays in orders because it describes the relationship between an order and a product at that grain.

When is denormalization appropriate? How do you decide?

Frame it as a trade-off. Normalization optimizes for write correctness. Denormalization optimizes for read performance. In a data warehouse, you denormalize because the cost of redundant storage is trivial compared to the cost of complex multi-join queries running hundreds of times per day. Star schemas are intentionally denormalized and this is the accepted standard. OLTP systems benefit from normalization (fewer anomalies, smaller writes). OLAP systems benefit from denormalization (fewer joins, faster reads). The right answer depends on the workload.

Explain referential integrity and what happens when it breaks in a data pipeline.

Referential integrity means every foreign key value must have a matching primary key in the referenced table. In OLTP, the database enforces this. In data pipelines, it often breaks: a fact table loads before its dimension table, or a dimension row is deleted while facts still reference it. Solutions: load dimensions before facts (dependency ordering), use a default 'unknown' dimension row (surrogate key 0 or -1), or enforce integrity checks as a post-load validation step. Orphan rows in fact tables are the most common consequence and produce NULL joins that corrupt aggregations.

Frequently Asked Questions

What is the relational data model?+
The relational data model is a framework for organizing data into relations (tables), where each relation consists of tuples (rows) and attributes (columns). It was formalized by Edgar F. Codd in 1970 at IBM. Data is stored in structured tables with defined data types, primary keys for uniqueness, and foreign keys for relationships between tables. The model provides mathematical guarantees about data integrity and forms the foundation for SQL and every major relational database (PostgreSQL, MySQL, SQL Server, Oracle).
What is the difference between the relational model and a relational database?+
The relational model is the theoretical framework: a set of mathematical rules about how data should be organized. A relational database is a software system that implements those rules. PostgreSQL, MySQL, and SQL Server are relational databases. They implement the relational model but also add practical features not in the original model: indexes, stored procedures, triggers, and materialized views. The model is the blueprint; the database is the building.
Why does normalization matter for data engineers?+
Normalization eliminates data redundancy and update anomalies in OLTP source systems. For data engineers, understanding normalization matters for two reasons. First, you need to understand the source systems you extract from. Normalized OLTP schemas have many tables with complex joins. Second, you need to know when to denormalize for analytics. Data warehouses intentionally denormalize (star schemas, wide tables) to optimize read performance. Knowing both sides lets you design efficient pipelines and schemas.
How does the relational model apply to modern data warehouses?+
Modern data warehouses (Snowflake, BigQuery, Redshift) are relational databases. They store data in tables with columns, support SQL, and allow primary and foreign key declarations. However, they typically do not enforce foreign key constraints at write time for performance reasons. The relational model still applies at the design level: you define schemas with clear relationships, even if enforcement is handled by your pipeline logic and data quality tests rather than database constraints.
02 / Why practice

The Idea That Outlasted Every System It Replaced

  1. 01

    Active recall beats re-reading by 50%

    Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom

  2. 02

    76% of hiring managers reject on the coding task, not the resume

    From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice

  3. 03

    Five problem shapes cover 80% of data engineer loops

    Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition

Related Guides