Data Modeling

Relational Data Model

In June 1970, an IBM researcher named Edgar F. Codd published "A Relational Model of Data for Large Shared Data Banks" in Communications of the ACM. IBM's existing IMS hierarchical database was the state of the art, and Codd was quietly arguing that it was a dead end. The paper proposed that data should be stored in relations (tables) governed by a formal algebra, and that users should ask for data by describing what they want, not how to fetch it. It took a decade for the idea to reach production systems. More than fifty years later, every SQL query you write descends from that single paper.

The relational model outlasted the companies that resisted it. IBM started the System R research project in 1974. Oracle shipped commercially in 1979. By 1985 Codd had to publish his famous rules explaining what still didn't count as relational. Knowing the theory behind the thing makes the interview answers faster.

1970: Codd's Paper Published
41%: DE Interviews Are SQL
429: SQL Questions Analyzed
854: SQL Challenges Built

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

Core Terminology

The relational model uses formal terminology that maps directly to the informal terms you already know. Understanding the formal terms is not academic posturing. Interviewers at senior levels use them, and they appear in database documentation, research papers, and system design discussions.

Relation

= Table

A relation is a set of tuples that share the same attributes. In practice, this is a table. The mathematical foundation matters because it guarantees that a relation has no duplicate tuples (rows) and no ordering. SQL relaxes both constraints: tables can have duplicate rows and results have no guaranteed order unless you specify ORDER BY. But the theoretical guarantee of uniqueness is why primary keys exist. Every properly designed relation has a primary key that enforces the no-duplicates rule.
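A minimal sketch of the no-duplicates guarantee, using SQLite through Python's stdlib sqlite3 module (the table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Alice Chen')")
try:
    # A second tuple with the same key violates the no-duplicates rule
    conn.execute("INSERT INTO customers VALUES (1, 'Bob Lee')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
```

The engine rejects the second insert with an integrity error, which is the primary key doing exactly the job the relational model assigns it.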

Tuple

= Row

A tuple is a single record within a relation. Each tuple contains one value for each attribute in the relation. In a customers relation with attributes (customer_id, name, email), a tuple might be (1, 'Alice Chen', 'alice@example.com'). Tuples are unordered within a relation. There is no 'first row' or 'last row' in the relational model. SQL engines store rows in physical order for performance, but the logical model treats all tuples as equal members of a set.

Attribute

= Column

An attribute defines a named property with a specific domain (data type). The domain constrains what values the attribute can hold. An age attribute with domain INTEGER cannot hold the string 'thirty-five.' Attributes are unordered within a tuple. Saying 'the third column' has no meaning in the relational model, though SQL engines assign ordinal positions for convenience. Each attribute must have a unique name within its relation.

Domain

= Data type + constraints

A domain is the set of allowable values for an attribute. INTEGER, VARCHAR(255), and BOOLEAN are domains. But domains go beyond simple types. An 'email_address' domain might be VARCHAR(255) with a CHECK constraint enforcing a valid format. A 'positive_amount' domain might be DECIMAL(10,2) with a CHECK that value > 0. Domains enforce data quality at the model level, not the application level. This distinction is critical in data engineering where multiple pipelines write to the same tables.
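The 'positive_amount' domain described above can be approximated with a CHECK constraint. A sketch in SQLite via Python's sqlite3 (table name and values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A 'positive_amount' domain approximated as a type plus a CHECK constraint
conn.execute("""
    CREATE TABLE payments (
        payment_id INTEGER PRIMARY KEY,
        amount     DECIMAL(10,2) CHECK (amount > 0)
    )
""")
conn.execute("INSERT INTO payments VALUES (1, 19.99)")
try:
    conn.execute("INSERT INTO payments VALUES (2, -5.00)")  # outside the domain
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The negative amount never reaches the table, regardless of which pipeline or application issued the write.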

Primary Key

= Unique row identifier

A primary key is an attribute (or set of attributes) that uniquely identifies each tuple in a relation. It must be unique and non-NULL. The relational model requires every relation to have a primary key because it enforces the fundamental no-duplicates rule. In practice, primary keys are either natural keys (business-meaningful values like email or ISBN) or surrogate keys (system-generated integers or UUIDs with no business meaning). Data warehouses typically use surrogate keys for dimensions and composite keys for facts.

Foreign Key

= Reference to another table

A foreign key is an attribute in one relation that references the primary key of another relation. It creates a link between the two relations. The orders relation has a customer_id foreign key that references customers(customer_id). The database enforces referential integrity: you cannot insert an order with a customer_id that does not exist in the customers table. Foreign keys are the mechanism that makes the relational model relational. Without them, tables are just independent data stores.
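Referential integrity enforcement can be sketched in SQLite, which requires opting in with a PRAGMA (schema names mirror the example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id)
    )
""")
conn.execute("INSERT INTO customers VALUES (1)")
conn.execute("INSERT INTO orders VALUES (100, 1)")  # valid reference
try:
    conn.execute("INSERT INTO orders VALUES (101, 999)")  # no such customer
    orphan_allowed = True
except sqlite3.IntegrityError:
    orphan_allowed = False
```

The insert referencing a nonexistent customer fails, which is the database, not the application, guaranteeing the link is real.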

Codd's Rules (Selected)

Codd published his "12 rules" (actually 13, numbered 0 through 12) in 1985 as a protest document. Vendors were slapping "relational" on every database that supported SQL, and Codd wanted a checklist that separated real relational systems from marketing. Oracle and IBM DB2 failed several of the rules when he published them. They still do. But a handful of the rules actually matter for interviews and for understanding why SQL behaves the way it does.

You do not need to memorize all 13. The ones below are the ones interviewers actually reference. Being able to name 3 or 4 of Codd's rules and explain why they matter signals genuine understanding of the relational model.

Rule 0: Foundation Rule

A relational database must manage its stored data using only its relational capabilities. No bypassing the relational engine to manipulate data directly in files. This rule ensures the database enforces all constraints, not the application.

Rule 1: Information Rule

All information in the database must be represented as values in cells within tables. Metadata, schema definitions, and configuration are all stored in tables (system catalogs). There is no hidden data.

Rule 2: Guaranteed Access Rule

Every value must be accessible by specifying the table name, column name, and primary key value. No pointer chasing, no file offsets, no hidden navigation. This is why SELECT with a WHERE clause on the primary key always works.

Rule 3: Systematic Treatment of NULLs

NULLs represent missing or inapplicable data and must be handled uniformly. NULL is not zero, not an empty string, and not false. It is the absence of a value. Three-valued logic (TRUE, FALSE, UNKNOWN) governs comparisons involving NULL.
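Three-valued logic is easy to demonstrate. A sketch using SQLite, where UNKNOWN surfaces as NULL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# NULL = NULL evaluates to UNKNOWN, which SQLite returns as NULL
eq = conn.execute("SELECT NULL = NULL").fetchone()[0]
# IS NULL is the correct test for a missing value
is_null = conn.execute("SELECT NULL IS NULL").fetchone()[0]
# UNKNOWN in a WHERE clause filters the row out, same as FALSE
rows = conn.execute("SELECT 1 WHERE NULL = NULL").fetchall()
```

This is why `WHERE col = NULL` silently matches nothing and `WHERE col IS NULL` is required instead.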

Rule 5: Comprehensive Data Sublanguage

The database must support at least one language for data definition (CREATE TABLE), data manipulation (INSERT, UPDATE, DELETE, SELECT), integrity constraints (PRIMARY KEY, FOREIGN KEY, CHECK), and transaction management (BEGIN, COMMIT, ROLLBACK). SQL fulfills this rule.

Rule 6: View Updating Rule

All views that are theoretically updatable must be updatable through the system. A view on a single table with no aggregation should support INSERT, UPDATE, and DELETE. In practice, most databases limit updatable views to simple cases.

Normalization Forms

Normalization is the process of organizing a relational schema to reduce redundancy and prevent update anomalies. Each normal form builds on the previous one, adding stricter rules about how attributes relate to keys.

In OLTP systems, 3NF is the standard target. In OLAP systems (data warehouses), intentional denormalization is the norm because read performance outweighs write anomaly concerns. Interviewers expect you to know both sides of this trade-off.

1NF (First Normal Form)

Rule: Each column contains atomic (indivisible) values. No repeating groups.

A customer table with a 'phone_numbers' column containing '555-1234, 555-5678' violates 1NF. Fix: create a separate customer_phones table with one row per phone number.

Interview note: 1NF is the baseline. If a table violates 1NF, it is not even a proper relation. Interviewers rarely ask about 1NF directly but might present a denormalized dataset and ask you to identify the violations.
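The 1NF fix described above can be sketched as a decomposition in SQLite (schema names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# 1NF fix: one row per phone number instead of a comma-separated column
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE customer_phones (
        customer_id  INTEGER REFERENCES customers(customer_id),
        phone_number TEXT,
        PRIMARY KEY (customer_id, phone_number)
    )
""")
conn.execute("INSERT INTO customers VALUES (1, 'Alice Chen')")
conn.executemany(
    "INSERT INTO customer_phones VALUES (?, ?)",
    [(1, '555-1234'), (1, '555-5678')],
)
phones = [r[0] for r in conn.execute(
    "SELECT phone_number FROM customer_phones "
    "WHERE customer_id = 1 ORDER BY phone_number"
)]
```

Each phone number is now an atomic value in its own row, so filtering or joining on a single number needs no string parsing.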

2NF (Second Normal Form)

Rule: Must be in 1NF. Every non-key attribute depends on the entire primary key, not just part of it.

An order_items table with a composite key (order_id, product_id) that includes customer_name violates 2NF. customer_name depends on order_id alone, not on the full composite key. Fix: move customer_name to the orders table.

Interview note: 2NF violations only occur with composite primary keys. If your table has a single-column primary key, it automatically satisfies 2NF. Interviewers test this by giving you a table with a composite key and asking what is wrong.
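A sketch of the schema after the 2NF fix, in SQLite (column choices beyond the text's example are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# After the 2NF fix: customer_name lives in orders, keyed by order_id alone
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_name TEXT)")
conn.execute("""
    CREATE TABLE order_items (
        order_id   INTEGER REFERENCES orders(order_id),
        product_id INTEGER,
        quantity   INTEGER,
        PRIMARY KEY (order_id, product_id)
    )
""")
conn.execute("INSERT INTO orders VALUES (1, 'Alice Chen')")
conn.executemany("INSERT INTO order_items VALUES (?, ?, ?)",
                 [(1, 10, 2), (1, 11, 1)])
# customer_name is now stored once per order, not once per item
row = conn.execute("""
    SELECT o.customer_name, COUNT(*)
    FROM orders o
    JOIN order_items i ON i.order_id = o.order_id
    GROUP BY o.order_id
""").fetchone()
```

The customer's name appears once even though the order has two items, which is exactly the redundancy 2NF removes.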

3NF (Third Normal Form)

Rule: Must be in 2NF. No transitive dependencies. Non-key attributes must depend on the key, the whole key, and nothing but the key.

An employees table with (employee_id, department_id, department_name) violates 3NF. department_name depends on department_id, not directly on employee_id. Fix: move department_name to a departments table.

Interview note: 3NF is the gold standard for OLTP databases. The classic interview question is: 'Normalize this table to 3NF.' Walk through each form step by step. State the dependency chain that violates the rule, then show the fix.
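The 3NF fix above can be sketched the same way. Moving department_name into its own relation turns a rename into a single-row update (names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# 3NF fix: department_name moves into its own relation
conn.execute("""
    CREATE TABLE departments (
        department_id   INTEGER PRIMARY KEY,
        department_name TEXT
    )
""")
conn.execute("""
    CREATE TABLE employees (
        employee_id   INTEGER PRIMARY KEY,
        department_id INTEGER REFERENCES departments(department_id)
    )
""")
conn.execute("INSERT INTO departments VALUES (10, 'Analytics')")
conn.executemany("INSERT INTO employees VALUES (?, ?)", [(1, 10), (2, 10)])
# Renaming the department touches one row, not one row per employee
conn.execute("UPDATE departments SET department_name = 'Data Platform' "
             "WHERE department_id = 10")
names = conn.execute("""
    SELECT DISTINCT d.department_name
    FROM employees e JOIN departments d USING (department_id)
""").fetchall()
```

In the unnormalized version, the same rename would have to hit every employee row, and a missed row produces an update anomaly.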

BCNF (Boyce-Codd Normal Form)

Rule: Must be in 3NF. For every nontrivial functional dependency X -> Y, X must be a superkey.

Consider a course_schedule table with attributes (student, course, instructor) where each instructor teaches only one course. The dependency instructor -> course means instructor determines course, yet instructor is not a superkey. Fix: decompose into (student, instructor) and (instructor, course).

Interview note: BCNF comes up in senior-level interviews. The key difference from 3NF: 3NF tolerates a dependency whose determinant is not a superkey as long as the dependent attribute is prime (part of some candidate key). BCNF does not. Most tables that satisfy 3NF also satisfy BCNF.
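The BCNF decomposition can be sketched by making instructor a key in its own relation, so the instructor -> course dependency is enforced by the schema (names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# BCNF fix: instructor -> course is enforced by making instructor a key
conn.execute("""
    CREATE TABLE instructor_courses (
        instructor TEXT PRIMARY KEY,  -- each instructor teaches one course
        course     TEXT
    )
""")
conn.execute("""
    CREATE TABLE enrollments (
        student    TEXT,
        instructor TEXT REFERENCES instructor_courses(instructor),
        PRIMARY KEY (student, instructor)
    )
""")
conn.execute("INSERT INTO instructor_courses VALUES ('Knuth', 'Algorithms')")
try:
    # A second course for the same instructor now violates the key
    conn.execute("INSERT INTO instructor_courses VALUES ('Knuth', 'Compilers')")
    contradiction_allowed = True
except sqlite3.IntegrityError:
    contradiction_allowed = False
```

Before the decomposition, nothing stopped two rows from recording the same instructor with two different courses.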

The Relational Model in Data Engineering

Data engineers work with the relational model every day, even when they do not think about it explicitly. Source systems (PostgreSQL, MySQL, SQL Server) are relational databases. Target systems (Snowflake, BigQuery, Redshift) are also relational databases. The schemas you extract from and load into follow relational principles.

Where it gets interesting is the gap between theory and practice. Source OLTP systems enforce constraints strictly: primary keys, foreign keys, NOT NULL, CHECK constraints. Data warehouses relax enforcement for performance. Snowflake lets you declare primary and foreign keys but does not enforce them. BigQuery has no foreign key enforcement at all. This means your pipeline code must handle integrity that the database will not.

Schema design in warehouses follows the relational model at the conceptual level but deviates at the physical level. Star schemas are intentionally denormalized. Wide tables combine attributes from multiple relations into one. These are pragmatic trade-offs, not violations of the model. The relational model tells you what is correct. The physical implementation tells you what is fast. Knowing both is what makes a strong data engineer.

OLTP vs OLAP: How the Relational Model Shifts

Aspect                 | OLTP (Source)          | OLAP (Warehouse)
Normalization          | 3NF or BCNF            | Denormalized (star/snowflake schema)
Constraint enforcement | Strict (PK, FK, CHECK) | Declarative, rarely enforced
Primary use            | Transactional writes   | Analytical reads
Join count             | Many small joins       | Few large joins
Row count              | Millions per table     | Billions in fact tables

4 Relational Model Interview Questions

These questions test whether you understand the relational model beyond a surface-level definition. Each includes what the interviewer is really evaluating and how to structure a strong answer.

Q1: What is the relational data model and why does it matter for data engineering?

What they test:

Foundational knowledge. The interviewer wants to hear that the relational model is a mathematical framework where data is organized into relations (tables) consisting of tuples (rows) and attributes (columns). It matters because it provides formal rules for data integrity, consistency, and query correctness. Mention that SQL is the practical implementation of relational algebra. For data engineering specifically, the relational model underpins schema design for both OLTP source systems and OLAP warehouses.

Approach:

Start with the formal definition (relations, tuples, attributes). Then connect it to practice: primary keys enforce uniqueness, foreign keys enforce referential integrity, and normalization eliminates redundancy. Close with why DEs care: you design schemas, write migrations, and build pipelines that depend on these guarantees.

Q2: Normalize this table to 3NF: orders(order_id, customer_id, customer_name, customer_city, product_id, product_name, quantity, unit_price).

What they test:

Step-by-step normalization. The interviewer wants to see you identify 1NF (already satisfied if values are atomic), 2NF (partial dependencies on composite keys), and 3NF (transitive dependencies). customer_name and customer_city depend on customer_id, not order_id. product_name depends on product_id, not order_id.

Approach:

Decompose into three tables. customers(customer_id, customer_name, customer_city). products(product_id, product_name, unit_price). orders(order_id, customer_id, product_id, quantity). Explain each step: why customer_name moved, why product_name moved, and why quantity stays in orders.
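The decomposition described above can be sketched as DDL. One assumption beyond the question's text: with one row per product per order, orders takes a composite key (order_id, product_id):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# 3NF decomposition of the original wide orders table
conn.executescript("""
    CREATE TABLE customers (
        customer_id   INTEGER PRIMARY KEY,
        customer_name TEXT,
        customer_city TEXT
    );
    CREATE TABLE products (
        product_id   INTEGER PRIMARY KEY,
        product_name TEXT,
        unit_price   DECIMAL(10,2)
    );
    CREATE TABLE orders (
        order_id    INTEGER,
        customer_id INTEGER REFERENCES customers(customer_id),
        product_id  INTEGER REFERENCES products(product_id),
        quantity    INTEGER,
        PRIMARY KEY (order_id, product_id)
    );
""")
tables = sorted(r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"))
```

quantity stays in orders because it depends on the full key (which product, in which order), not on customer_id or product_id alone.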

Q3: When is denormalization appropriate? How do you decide?

What they test:

Pragmatism. The interviewer does not want a candidate who says 'always normalize' or 'always denormalize.' The correct answer depends on the workload. OLTP systems benefit from normalization (fewer anomalies, smaller writes). OLAP systems benefit from denormalization (fewer joins, faster reads). Data warehouses intentionally denormalize dimensions in star schemas for query performance.

Approach:

Frame it as a trade-off. Normalization optimizes for write correctness. Denormalization optimizes for read performance. In a data warehouse, you denormalize because the cost of redundant storage is trivial compared to the cost of complex multi-join queries running hundreds of times per day. State that star schemas are intentionally denormalized and this is the accepted standard.

Q4: Explain referential integrity and what happens when it breaks in a data pipeline.

What they test:

Real-world experience. Referential integrity means every foreign key value must have a matching primary key in the referenced table. In OLTP, the database enforces this. In data pipelines, it often breaks: a fact table loads before its dimension table, or a dimension row is deleted while facts still reference it. The interviewer wants to hear about orphan rows, how to detect them, and how to prevent them.

Approach:

Define referential integrity. Then describe the pipeline scenario: fact table loads at 2 AM, dimension table loads at 3 AM. Between 2 and 3 AM, fact rows reference dimension keys that do not exist yet. Solutions: load dimensions before facts (dependency ordering), use a default 'unknown' dimension row (surrogate key 0 or -1), or enforce integrity checks as a post-load validation step.
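The post-load validation step can be sketched as an anti-join that finds orphan fact rows (table and column names are illustrative warehouse conventions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE fact_orders (order_id INTEGER, customer_key INTEGER)")
# Fact loaded before the dimension: customer_key 2 has no dimension row yet
conn.executemany("INSERT INTO fact_orders VALUES (?, ?)", [(100, 1), (101, 2)])
conn.execute("INSERT INTO dim_customer VALUES (1)")
# Post-load validation: LEFT JOIN + IS NULL surfaces orphan fact rows
orphans = conn.execute("""
    SELECT f.order_id
    FROM fact_orders f
    LEFT JOIN dim_customer d ON d.customer_key = f.customer_key
    WHERE d.customer_key IS NULL
""").fetchall()
```

A nonzero orphan count fails the pipeline run (or routes the rows to a default 'unknown' dimension key), catching the integrity break the warehouse itself will not enforce.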

Frequently Asked Questions

What is the relational data model?
The relational data model is a framework for organizing data into relations (tables), where each relation consists of tuples (rows) and attributes (columns). It was formalized by Edgar F. Codd in 1970 at IBM. Data is stored in structured tables with defined data types, primary keys for uniqueness, and foreign keys for relationships between tables. The model provides mathematical guarantees about data integrity and forms the foundation for SQL and every major relational database (PostgreSQL, MySQL, SQL Server, Oracle).
What is the difference between the relational model and a relational database?
The relational model is the theoretical framework: a set of mathematical rules about how data should be organized. A relational database is a software system that implements those rules. PostgreSQL, MySQL, and SQL Server are relational databases. They implement the relational model but also add practical features not in the original model: indexes, stored procedures, triggers, and materialized views. The model is the blueprint; the database is the building.
Why does normalization matter for data engineers?
Normalization eliminates data redundancy and update anomalies in OLTP source systems. For data engineers, understanding normalization matters for two reasons. First, you need to understand the source systems you extract from. Normalized OLTP schemas have many tables with complex joins. Second, you need to know when to denormalize for analytics. Data warehouses intentionally denormalize (star schemas, wide tables) to optimize read performance. Knowing both sides lets you design efficient pipelines and schemas.
How does the relational model apply to modern data warehouses?
Modern data warehouses (Snowflake, BigQuery, Redshift) are relational databases. They store data in tables with columns, support SQL, and allow primary and foreign key declarations. However, they typically do not enforce foreign key constraints at write time for performance reasons. The relational model still applies at the design level: you define schemas with clear relationships, even if enforcement is handled by your pipeline logic and data quality tests rather than database constraints.

The Idea That Outlasted Every System It Replaced

Codd's 1970 paper is still the standard. Practice designing schemas against the theory the way the people who built Postgres, Snowflake, and BigQuery did.
