Relational Data Model
The relational model is the mathematical foundation under every SQL database. Codd's 1970 paper introduced relations, tuples, and attributes as a way to store and query data without navigating pointers. PostgreSQL, Snowflake, and BigQuery are all implementations of that framework. Understanding the theory tells you why SQL behaves the way it does.
What Is the Relational Data Model?
In June 1970, an IBM researcher named Edgar F. Codd published 'A Relational Model of Data for Large Shared Data Banks' in Communications of the ACM. IBM's existing IMS hierarchical database was the state of the art, and Codd was quietly arguing that it was a dead end. The paper proposed that data should be stored in relations (tables) with a formal algebra, and that users should ask for data by what they want, not how to navigate to it.
More than 50 years later, every major database (PostgreSQL, MySQL, SQL Server, Oracle, Snowflake, BigQuery, Redshift) is built on Codd's framework. SQL is the practical implementation of relational algebra. Understanding the relational model means understanding why SQL behaves the way it does.
Know Data Modeling the way the interviewer who asks it knows it.
Core Terminology
Relation (= Table)
A relation is a set of tuples that share the same attributes. In practice, this is a table. The mathematical foundation matters because it guarantees that a relation has no duplicate tuples (rows) and no ordering. SQL relaxes both constraints: tables can have duplicate rows and results have no guaranteed order unless you specify ORDER BY. But the theoretical guarantee of uniqueness is why primary keys exist. Every properly designed relation has a primary key that enforces the no-duplicates rule.
Tuple (= Row)
A tuple is a single record within a relation. Each tuple contains one value for each attribute in the relation. In a customers relation with attributes (customer_id, name, email), a tuple might be (1, 'Alice Chen', 'alice@example.com'). Tuples are unordered within a relation. There is no 'first row' or 'last row' in the relational model. SQL engines store rows in physical order for performance, but the logical model treats all tuples as equal members of a set.
Attribute (= Column)
An attribute defines a named property with a specific domain (data type). The domain constrains what values the attribute can hold. An age attribute with domain INTEGER cannot hold the string 'thirty-five.' Attributes are unordered within a tuple. Saying 'the third column' has no meaning in the relational model, though SQL engines assign ordinal positions for convenience. Each attribute must have a unique name within its relation.
Domain (= Data type + constraints)
A domain is the set of allowable values for an attribute. INTEGER, VARCHAR(255), and BOOLEAN are domains. But domains go beyond simple types. An 'email_address' domain might be VARCHAR(255) with a CHECK constraint enforcing a valid format. A 'positive_amount' domain might be DECIMAL(10,2) with a CHECK that value > 0. Domains enforce data quality at the model level, not the application level. This distinction is critical in data engineering where multiple pipelines write to the same tables.
Primary Key (= Unique row identifier)
A primary key is an attribute (or set of attributes) that uniquely identifies each tuple in a relation. It must be unique and non-NULL. The relational model requires every relation to have a primary key because it enforces the fundamental no-duplicates rule. In practice, primary keys are either natural keys (business-meaningful values like email or ISBN) or surrogate keys (system-generated integers or UUIDs with no business meaning). Data warehouses typically use surrogate keys for dimensions and composite keys for facts.
Foreign Key (= Reference to another table)
A foreign key is an attribute in one relation that references the primary key of another relation. It creates a link between the two relations. The orders relation has a customer_id foreign key that references customers(customer_id). The database enforces referential integrity: you cannot insert an order with a customer_id that does not exist in the customers table. Foreign keys are the mechanism that makes the relational model relational. Without them, tables are just independent data stores.
Codd's Rules (Selected)
Rule 0: Foundation Rule
A relational database must manage its stored data using only its relational capabilities. No bypassing the relational engine to manipulate data directly in files. This rule ensures the database enforces all constraints, not the application.
Rule 1: Information Rule
All information in the database must be represented as values in cells within tables. Metadata, schema definitions, and configuration are all stored in tables (system catalogs). There is no hidden data.
Rule 2: Guaranteed Access Rule
Every value must be accessible by specifying the table name, column name, and primary key value. No pointer chasing, no file offsets, no hidden navigation. This is why SELECT with a WHERE clause on the primary key always works.
Rule 3: Systematic Treatment of NULLs
NULLs represent missing or inapplicable data and must be handled uniformly. NULL is not zero, not an empty string, and not false. It is the absence of a value. Three-valued logic (TRUE, FALSE, UNKNOWN) governs comparisons involving NULL.
Rule 5: Complete Data Sublanguage
The database must support at least one language for data definition (CREATE TABLE), data manipulation (INSERT, UPDATE, DELETE, SELECT), integrity constraints (PRIMARY KEY, FOREIGN KEY, CHECK), and transaction management (BEGIN, COMMIT, ROLLBACK). SQL fulfills this rule.
Rule 6: View Updating Rule
All views that are theoretically updatable must be updatable through the system. A view on a single table with no aggregation should support INSERT, UPDATE, and DELETE. In practice, most databases limit updatable views to simple cases.
Normalization Forms
1NF (First Normal Form)
Rule: Each column contains atomic (indivisible) values. No repeating groups. Example: A customer table with a 'phone_numbers' column containing '555-1234, 555-5678' violates 1NF. Fix: create a separate customer_phones table with one row per phone number. Interview note: 1NF is the baseline. If a table violates 1NF, it is not even a proper relation. Interviewers rarely ask about 1NF directly but might present a denormalized dataset and ask you to identify the violations.
2NF (Second Normal Form)
Rule: Must be in 1NF. Every non-key attribute depends on the entire primary key, not just part of it. Example: An order_items table with a composite key (order_id, product_id) that includes customer_name violates 2NF. customer_name depends on order_id alone, not on the full composite key. Fix: move customer_name to the orders table. Interview note: 2NF violations only occur with composite primary keys. If your table has a single-column primary key, it automatically satisfies 2NF. Interviewers test this by giving you a table with a composite key and asking what is wrong.
3NF (Third Normal Form)
Rule: Must be in 2NF. No transitive dependencies. Non-key attributes must depend on the key, the whole key, and nothing but the key. Example: An employees table with (employee_id, department_id, department_name) violates 3NF. department_name depends on department_id, not directly on employee_id. Fix: move department_name to a departments table. Interview note: 3NF is the gold standard for OLTP databases. The classic interview question is: 'Normalize this table to 3NF.' Walk through each form step by step. State the dependency chain that violates the rule, then show the fix.
BCNF (Boyce-Codd Normal Form)
Rule: Must be in 3NF. For every functional dependency X -> Y, X must be a superkey. Example: A course_schedule table with (student, course, instructor) where each instructor teaches only one course. The dependency instructor -> course means instructor determines course, but instructor is not a superkey. Fix: decompose into (student, instructor) and (instructor, course). Interview note: BCNF comes up in senior-level interviews. The key difference from 3NF: 3NF allows non-prime attributes to depend on candidate keys other than the primary key. BCNF does not. Most tables that satisfy 3NF also satisfy BCNF.
OLTP vs OLAP: How the Relational Model Shifts
| Aspect | OLTP (Source) | OLAP (Warehouse) |
|---|---|---|
| Normalization | 3NF or BCNF | Denormalized (star/snowflake schema) |
| Constraint enforcement | Strict (PK, FK, CHECK) | Declarative, rarely enforced |
| Primary use | Transactional writes | Analytical reads |
| Join count | Many small joins | Few large joins |
| Row count | Millions per table | Billions in fact tables |
4 Relational Model Interview Questions
What is the relational data model and why does it matter for data engineering?
The relational model is a mathematical framework where data is organized into relations (tables) consisting of tuples (rows) and attributes (columns). It matters because it provides formal rules for data integrity, consistency, and query correctness. SQL is the practical implementation of relational algebra. For data engineering specifically, the relational model underpins schema design for both OLTP source systems and OLAP warehouses. Start with the formal definition (relations, tuples, attributes), then connect to practice: primary keys enforce uniqueness, foreign keys enforce referential integrity, and normalization eliminates redundancy.
Normalize this table to 3NF: orders(order_id, customer_id, customer_name, customer_city, product_id, product_name, quantity, unit_price).
Decompose into three tables. customers(customer_id, customer_name, customer_city). products(product_id, product_name, unit_price). orders(order_id, customer_id, product_id, quantity). customer_name and customer_city depend on customer_id, not order_id (transitive dependency removed). product_name depends on product_id, not order_id (same). quantity stays in orders because it describes the relationship between an order and a product at that grain.
When is denormalization appropriate? How do you decide?
Frame it as a trade-off. Normalization optimizes for write correctness. Denormalization optimizes for read performance. In a data warehouse, you denormalize because the cost of redundant storage is trivial compared to the cost of complex multi-join queries running hundreds of times per day. Star schemas are intentionally denormalized and this is the accepted standard. OLTP systems benefit from normalization (fewer anomalies, smaller writes). OLAP systems benefit from denormalization (fewer joins, faster reads). The right answer depends on the workload.
Explain referential integrity and what happens when it breaks in a data pipeline.
Referential integrity means every foreign key value must have a matching primary key in the referenced table. In OLTP, the database enforces this. In data pipelines, it often breaks: a fact table loads before its dimension table, or a dimension row is deleted while facts still reference it. Solutions: load dimensions before facts (dependency ordering), use a default 'unknown' dimension row (surrogate key 0 or -1), or enforce integrity checks as a post-load validation step. Orphan rows in fact tables are the most common consequence and produce NULL joins that corrupt aggregations.
Frequently Asked Questions
What is the relational data model?+
What is the difference between the relational model and a relational database?+
Why does normalization matter for data engineers?+
How does the relational model apply to modern data warehouses?+
The Idea That Outlasted Every System It Replaced
- 01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
- 02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
- 03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition