In June 1970, an IBM researcher named Edgar F. Codd published "A Relational Model of Data for Large Shared Data Banks" in Communications of the ACM. IBM's existing IMS hierarchical database was the state of the art, and Codd was quietly arguing that it was a dead end. The paper proposed that data should be stored in relations (tables) with a formal algebra, and that users should specify what data they want, not how to fetch it. It took a decade for the idea to reach production systems. More than half a century later, every SQL query you write descends from that single paper.
The relational model outlasted the companies that resisted it. IBM began building the System R research prototype in 1974. Oracle shipped the first commercial SQL database in 1979. By 1985, Codd had to publish 13 rules (numbered 0 through 12) explaining what did and did not count as relational. Knowing the theory behind the thing makes the interview answers faster.
Source: DataDriven analysis of 1,042 verified data engineering interview rounds.
The relational model uses formal terminology that maps directly to the informal terms you already know. Understanding the formal terms is not academic posturing. Interviewers at senior levels use them, and they appear in database documentation, research papers, and system design discussions.
A relation is a set of tuples that share the same attributes. In practice, this is a table. The mathematical foundation matters because it guarantees that a relation has no duplicate tuples (rows) and no ordering. SQL relaxes both constraints: tables can have duplicate rows and results have no guaranteed order unless you specify ORDER BY. But the theoretical guarantee of uniqueness is why primary keys exist. Every properly designed relation has a primary key that enforces the no-duplicates rule.
A tuple is a single record within a relation. Each tuple contains one value for each attribute in the relation. In a customers relation with attributes (customer_id, name, email), a tuple might be (1, 'Alice Chen', 'alice@example.com'). Tuples are unordered within a relation. There is no 'first row' or 'last row' in the relational model. SQL engines store rows in physical order for performance, but the logical model treats all tuples as equal members of a set.
An attribute defines a named property with a specific domain (data type). The domain constrains what values the attribute can hold. An age attribute with domain INTEGER cannot hold the string 'thirty-five.' Attributes are unordered within a tuple. Saying 'the third column' has no meaning in the relational model, though SQL engines assign ordinal positions for convenience. Each attribute must have a unique name within its relation.
A domain is the set of allowable values for an attribute. INTEGER, VARCHAR(255), and BOOLEAN are domains. But domains go beyond simple types. An 'email_address' domain might be VARCHAR(255) with a CHECK constraint enforcing a valid format. A 'positive_amount' domain might be DECIMAL(10,2) with a CHECK that value > 0. Domains enforce data quality at the model level, not the application level. This distinction is critical in data engineering where multiple pipelines write to the same tables.
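A minimal sketch of a domain-style constraint, using SQLite through Python's sqlite3 module (the payments table and positive_amount domain are illustrative, echoing the CHECK example above):

```python
import sqlite3

# In-memory database; the CHECK clause approximates a 'positive_amount' domain.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE payments (
        payment_id INTEGER PRIMARY KEY,
        amount     NUMERIC NOT NULL CHECK (amount > 0)  -- domain: positive_amount
    )
""")
conn.execute("INSERT INTO payments (amount) VALUES (19.99)")  # allowed

try:
    # Violates the domain: the database, not the application, rejects it.
    conn.execute("INSERT INTO payments (amount) VALUES (-5)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The point is that the rejection happens in the database engine itself, so every pipeline writing to the table gets the same guarantee.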
A primary key is an attribute (or set of attributes) that uniquely identifies each tuple in a relation. It must be unique and non-NULL. The relational model requires every relation to have a primary key because it enforces the fundamental no-duplicates rule. In practice, primary keys are either natural keys (business-meaningful values like email or ISBN) or surrogate keys (system-generated integers or UUIDs with no business meaning). Data warehouses typically use surrogate keys for dimensions and composite keys for facts.
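The no-duplicates rule can be seen directly in any SQL engine; here is a small SQLite sketch (customers table is illustrative), with a surrogate primary key and a natural candidate key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,  -- surrogate key
        email       TEXT NOT NULL UNIQUE  -- natural candidate key
    )
""")
conn.execute("INSERT INTO customers VALUES (1, 'alice@example.com')")

try:
    # A second tuple with the same primary key violates uniqueness.
    conn.execute("INSERT INTO customers VALUES (1, 'bob@example.com')")
except sqlite3.IntegrityError:
    print("duplicate primary key rejected")
```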
A foreign key is an attribute in one relation that references the primary key of another relation. It creates a link between the two relations. The orders relation has a customer_id foreign key that references customers(customer_id). The database enforces referential integrity: you cannot insert an order with a customer_id that does not exist in the customers table. Foreign keys are the mechanism that makes the relational model relational. Without them, tables are just independent data stores.
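Referential integrity enforcement can be demonstrated with the orders/customers pair from the text, again sketched in SQLite (note that SQLite requires the foreign_keys pragma per connection):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite disables FK enforcement by default
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id)
    )
""")
conn.execute("INSERT INTO customers VALUES (1, 'Alice Chen')")
conn.execute("INSERT INTO orders VALUES (100, 1)")        # valid reference

try:
    # No customer 999 exists, so the insert is rejected.
    conn.execute("INSERT INTO orders VALUES (101, 999)")
except sqlite3.IntegrityError:
    print("orphan order rejected")
```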
Codd wrote the 13 rules (numbered 0 through 12) in 1985 as a protest document. Vendors were slapping "relational" on every database that supported SQL, and Codd wanted a checklist that separated real relational systems from marketing. Oracle and IBM DB2 failed several of the rules when he published them. They still do. But a handful of the rules actually matter for interviews and for understanding why SQL behaves the way it does.
You do not need to memorize all 13. The ones below are the ones interviewers actually reference. Being able to name 3 or 4 of Codd's rules and explain why they matter signals genuine understanding of the relational model.
A relational database must manage its stored data using only its relational capabilities. No bypassing the relational engine to manipulate data directly in files. This rule ensures the database enforces all constraints, not the application.
All information in the database must be represented as values in cells within tables. Metadata, schema definitions, and configuration are all stored in tables (system catalogs). There is no hidden data.
Every value must be accessible by specifying the table name, column name, and primary key value. No pointer chasing, no file offsets, no hidden navigation. This is why SELECT with a WHERE clause on the primary key always works.
NULLs represent missing or inapplicable data and must be handled uniformly. NULL is not zero, not an empty string, and not false. It is the absence of a value. Three-valued logic (TRUE, FALSE, UNKNOWN) governs comparisons involving NULL.
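Three-valued logic is easy to verify in any SQL engine; this SQLite sketch shows why NULL = NULL filters a row out while IS NULL matches it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# NULL = NULL evaluates to UNKNOWN, not TRUE, so WHERE filters the row out.
eq = conn.execute(
    "SELECT COUNT(*) FROM (SELECT 1) WHERE NULL = NULL"
).fetchone()[0]

# IS NULL is the correct test for a missing value.
is_null = conn.execute("SELECT NULL IS NULL").fetchone()[0]

print(eq)       # 0
print(is_null)  # 1
```

This is the root cause of classic bugs like `WHERE status != 'shipped'` silently dropping rows where status is NULL.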
The database must support at least one language for data definition (CREATE TABLE), data manipulation (INSERT, UPDATE, DELETE, SELECT), integrity constraints (PRIMARY KEY, FOREIGN KEY, CHECK), and transaction management (BEGIN, COMMIT, ROLLBACK). SQL fulfills this rule.
All views that are theoretically updatable must be updatable through the system. A view on a single table with no aggregation should support INSERT, UPDATE, and DELETE. In practice, most databases limit updatable views to simple cases.
Normalization is the process of organizing a relational schema to reduce redundancy and prevent update anomalies. Each normal form builds on the previous one, adding stricter rules about how attributes relate to keys.
In OLTP systems, 3NF is the standard target. In OLAP systems (data warehouses), intentional denormalization is the norm because read performance outweighs write anomaly concerns. Interviewers expect you to know both sides of this trade-off.
Rule: Each column contains atomic (indivisible) values. No repeating groups.
A customer table with a 'phone_numbers' column containing '555-1234, 555-5678' violates 1NF. Fix: create a separate customer_phones table with one row per phone number.
Interview note: 1NF is the baseline. If a table violates 1NF, it is not even a proper relation. Interviewers rarely ask about 1NF directly but might present a denormalized dataset and ask you to identify the violations.
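The 1NF fix described above can be sketched like this in SQLite (table names follow the example in the text):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
# One row per phone number instead of a comma-separated column.
conn.execute("""
    CREATE TABLE customer_phones (
        customer_id INTEGER NOT NULL,
        phone       TEXT NOT NULL,
        PRIMARY KEY (customer_id, phone)
    )
""")
conn.execute("INSERT INTO customers VALUES (1, 'Alice Chen')")

# The denormalized value '555-1234, 555-5678' becomes two atomic rows.
for phone in "555-1234, 555-5678".split(", "):
    conn.execute("INSERT INTO customer_phones VALUES (1, ?)", (phone,))

rows = conn.execute("SELECT phone FROM customer_phones ORDER BY phone").fetchall()
print(rows)  # [('555-1234',), ('555-5678',)]
```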
Rule: Must be in 1NF. Every non-key attribute depends on the entire primary key, not just part of it.
An order_items table with a composite key (order_id, product_id) that includes customer_name violates 2NF. customer_name depends on order_id alone, not on the full composite key. Fix: move customer_name to the orders table.
Interview note: 2NF violations only occur with composite primary keys. If your table has a single-column primary key, it automatically satisfies 2NF. Interviewers test this by giving you a table with a composite key and asking what is wrong.
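The 2NF fix looks like this in SQLite: customer_name moves to orders, and order_items keeps only attributes that depend on the full composite key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id      INTEGER PRIMARY KEY,
        customer_name TEXT NOT NULL       -- depends on order_id alone
    )
""")
conn.execute("""
    CREATE TABLE order_items (
        order_id   INTEGER NOT NULL REFERENCES orders(order_id),
        product_id INTEGER NOT NULL,
        quantity   INTEGER NOT NULL,      -- depends on (order_id, product_id)
        PRIMARY KEY (order_id, product_id)
    )
""")
conn.execute("INSERT INTO orders VALUES (1, 'Alice Chen')")
conn.execute("INSERT INTO order_items VALUES (1, 42, 3)")
```

A join on order_id recovers the original wide shape when a query needs it.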
Rule: Must be in 2NF. No transitive dependencies. Non-key attributes must depend on the key, the whole key, and nothing but the key.
An employees table with (employee_id, department_id, department_name) violates 3NF. department_name depends on department_id, not directly on employee_id. Fix: move department_name to a departments table.
Interview note: 3NF is the gold standard for OLTP databases. The classic interview question is: 'Normalize this table to 3NF.' Walk through each form step by step. State the dependency chain that violates the rule, then show the fix.
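The 3NF decomposition from the employees example, sketched in SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# department_name depends on department_id (a transitive dependency
# via employee_id), so it moves to its own relation.
conn.execute("""
    CREATE TABLE departments (
        department_id   INTEGER PRIMARY KEY,
        department_name TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE employees (
        employee_id   INTEGER PRIMARY KEY,
        department_id INTEGER REFERENCES departments(department_id)
    )
""")
conn.execute("INSERT INTO departments VALUES (10, 'Data Engineering')")
conn.execute("INSERT INTO employees VALUES (7, 10)")

row = conn.execute("""
    SELECT e.employee_id, d.department_name
    FROM employees e JOIN departments d USING (department_id)
""").fetchone()
print(row)  # (7, 'Data Engineering')
```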
Rule: Must be in 3NF. For every non-trivial functional dependency X -> Y, X must be a superkey.
Consider a course_schedule table with (student, course, instructor) where each instructor teaches only one course. The dependency instructor -> course means instructor determines course, but instructor is not a superkey. Fix: decompose into (student, instructor) and (instructor, course).
Interview note: BCNF comes up in senior-level interviews. The key difference from 3NF: 3NF tolerates a dependency whose determinant is not a superkey as long as the dependent attribute is prime (part of some candidate key). BCNF does not. Most tables that satisfy 3NF also satisfy BCNF.
Data engineers work with the relational model every day, even when they do not think about it explicitly. Source systems (PostgreSQL, MySQL, SQL Server) are relational databases. Target systems (Snowflake, BigQuery, Redshift) are also relational databases. The schemas you extract from and load into follow relational principles.
Where it gets interesting is the gap between theory and practice. Source OLTP systems enforce constraints strictly: primary keys, foreign keys, NOT NULL, CHECK constraints. Data warehouses relax enforcement for performance. Snowflake lets you declare primary and foreign keys but does not enforce them. BigQuery has no foreign key enforcement at all. This means your pipeline code must handle integrity that the database will not.
Schema design in warehouses follows the relational model at the conceptual level but deviates at the physical level. Star schemas are intentionally denormalized. Wide tables combine attributes from multiple relations into one. These are pragmatic trade-offs, not violations of the model. The relational model tells you what is correct. The physical implementation tells you what is fast. Knowing both is what makes a strong data engineer.
| Aspect | OLTP (Source) | OLAP (Warehouse) |
|---|---|---|
| Normalization | 3NF or BCNF | Denormalized (star/snowflake schema) |
| Constraint enforcement | Strict (PK, FK, CHECK) | Declarative, rarely enforced |
| Primary use | Transactional writes | Analytical reads |
| Join count | Many small joins | Few large joins |
| Row count | Millions per table | Billions in fact tables |
These questions test whether you understand the relational model beyond a surface-level definition. Each includes what the interviewer is really evaluating and how to structure a strong answer.
What they test:
Foundational knowledge. The interviewer wants to hear that the relational model is a mathematical framework where data is organized into relations (tables) consisting of tuples (rows) and attributes (columns). It matters because it provides formal rules for data integrity, consistency, and query correctness. Mention that SQL is the practical implementation of relational algebra. For data engineering specifically, the relational model underpins schema design for both OLTP source systems and OLAP warehouses.
Approach:
Start with the formal definition (relations, tuples, attributes). Then connect it to practice: primary keys enforce uniqueness, foreign keys enforce referential integrity, and normalization eliminates redundancy. Close with why DEs care: you design schemas, write migrations, and build pipelines that depend on these guarantees.
What they test:
Step-by-step normalization. The interviewer wants to see you identify 1NF (already satisfied if values are atomic), 2NF (partial dependencies on composite keys), and 3NF (transitive dependencies). customer_name and customer_city depend on customer_id, not order_id. product_name depends on product_id, not order_id.
Approach:
Decompose into three tables. customers(customer_id, customer_name, customer_city). products(product_id, product_name, unit_price). orders(order_id, customer_id, product_id, quantity). Explain each step: why customer_name moved, why product_name moved, and why quantity stays in orders.
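The three-table decomposition above can be written as DDL; here is a SQLite sketch (column types are assumptions, since the question as quoted does not specify them):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id   INTEGER PRIMARY KEY,
        customer_name TEXT NOT NULL,
        customer_city TEXT
    );
    CREATE TABLE products (
        product_id   INTEGER PRIMARY KEY,
        product_name TEXT NOT NULL,
        unit_price   NUMERIC NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        product_id  INTEGER NOT NULL REFERENCES products(product_id),
        quantity    INTEGER NOT NULL   -- depends on the order itself, so it stays
    );
""")
```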
What they test:
Pragmatism. The interviewer does not want a candidate who says 'always normalize' or 'always denormalize.' The correct answer depends on the workload. OLTP systems benefit from normalization (fewer anomalies, smaller writes). OLAP systems benefit from denormalization (fewer joins, faster reads). Data warehouses intentionally denormalize dimensions in star schemas for query performance.
Approach:
Frame it as a trade-off. Normalization optimizes for write correctness. Denormalization optimizes for read performance. In a data warehouse, you denormalize because the cost of redundant storage is trivial compared to the cost of complex multi-join queries running hundreds of times per day. State that star schemas are intentionally denormalized and this is the accepted standard.
What they test:
Real-world experience. Referential integrity means every foreign key value must have a matching primary key in the referenced table. In OLTP, the database enforces this. In data pipelines, it often breaks: a fact table loads before its dimension table, or a dimension row is deleted while facts still reference it. The interviewer wants to hear about orphan rows, how to detect them, and how to prevent them.
Approach:
Define referential integrity. Then describe the pipeline scenario: fact table loads at 2 AM, dimension table loads at 3 AM. Between 2 and 3 AM, fact rows reference dimension keys that do not exist yet. Solutions: load dimensions before facts (dependency ordering), use a default 'unknown' dimension row (surrogate key 0 or -1), or enforce integrity checks as a post-load validation step.
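The detect-and-repair steps above can be sketched as a post-load validation in SQLite (dim_customer, fact_orders, and the surrogate key -1 follow the conventions described in the text):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE fact_orders (order_id INTEGER, customer_key INTEGER, amount NUMERIC)")

conn.execute("INSERT INTO dim_customer VALUES (-1, 'Unknown')")   # default row
conn.execute("INSERT INTO dim_customer VALUES (1, 'Alice Chen')")
conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)",
                 [(100, 1, 25.0), (101, 7, 40.0)])  # key 7 has no dimension row

# Detect: anti-join finds fact rows whose key has no matching dimension row.
orphans = conn.execute("""
    SELECT f.order_id
    FROM fact_orders f
    LEFT JOIN dim_customer d ON f.customer_key = d.customer_key
    WHERE d.customer_key IS NULL
""").fetchall()
print(orphans)  # [(101,)]

# Repair: point orphans at the 'unknown' dimension row (surrogate key -1).
conn.execute("""
    UPDATE fact_orders SET customer_key = -1
    WHERE customer_key NOT IN (SELECT customer_key FROM dim_customer)
""")
```

The same anti-join, run after every load, is the "post-load validation step" mentioned above: a nonzero orphan count fails the pipeline instead of silently dropping rows from inner joins downstream.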
Codd's 1970 paper is still the standard. Practice designing schemas against the theory the way the people who built Postgres, Snowflake, and BigQuery did.
Start Practicing