Data Quality and Contracts: Intermediate

Schema, distribution, and referential integrity turn cheap checks into a real suite

Category: Pipeline Architecture
Difficulty: intermediate
Duration: 32 minutes
Challenges: 0 hands-on challenges

Topics covered: Schema Validation Basics, Distributional Checks, Referential Integrity in DW, Test vs Prod: Same Checks, Quality Suite for Events

Lesson Sections

Schema Validation Basics (concepts: paSchemaValidation, paSchemaDrift)
Schema violations are the second-most-common failure mode after row-count anomalies, and they are the easiest to express as a check. A schema check asserts that every row in a table conforms to a declared shape: column names exist, types match, nullability matches, and values fall in declared ranges. A row that fails a schema check is, by definition, a row the pipeline should not have produced. Schema validation is also where the conversation with the producer starts, because the schema is the producer's commitment to t
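The four assertions named above (column exists, type matches, nullability matches, value in range) can be sketched in a few lines of plain Python. This is a minimal illustration, not a specific library's API: `ColumnSpec` and `validate_row` are hypothetical names, and reporting undeclared columns as drift is an assumption added for the example.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class ColumnSpec:
    """Declared shape of one column (hypothetical helper, not a library type)."""
    dtype: type
    nullable: bool = False
    min_value: Optional[float] = None
    max_value: Optional[float] = None


def validate_row(row: dict[str, Any], schema: dict[str, ColumnSpec]) -> list[str]:
    """Return a list of violations; an empty list means the row conforms."""
    violations = []
    for name, spec in schema.items():
        if name not in row:
            violations.append(f"{name}: column missing")
            continue
        value = row[name]
        if value is None:
            if not spec.nullable:
                violations.append(f"{name}: null in non-nullable column")
            continue
        # bool is a subclass of int in Python, so reject it explicitly
        # unless the column is declared as bool.
        wrong_type = not isinstance(value, spec.dtype) or (
            isinstance(value, bool) and spec.dtype is not bool
        )
        if wrong_type:
            violations.append(
                f"{name}: expected {spec.dtype.__name__}, got {type(value).__name__}"
            )
            continue
        if spec.min_value is not None and value < spec.min_value:
            violations.append(f"{name}: {value} below minimum {spec.min_value}")
        if spec.max_value is not None and value > spec.max_value:
            violations.append(f"{name}: {value} above maximum {spec.max_value}")
    # Assumption: columns outside the declared schema are reported as drift.
    for name in sorted(row.keys() - schema.keys()):
        violations.append(f"{name}: unexpected column")
    return violations


# Example schema; the column names are illustrative only.
schema = {
    "user_id": ColumnSpec(int),
    "amount": ColumnSpec(float, min_value=0.0),
    "note": ColumnSpec(str, nullable=True),
}
print(validate_row({"user_id": "1", "amount": -2.0}, schema))
```

Returning every violation instead of stopping at the first one keeps the report useful when it is handed back to the producer.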
Distributional Checks (concepts: paDistributionalCheck, paStatisticalQuality)
Schema and row-level checks miss a class of failures where every individual row is structurally valid and the population has shifted. A column whose mean used to be 42.3 and is now 67.8 may signal a real change in the world, or a producer-side bug, or an upstream filter regression. None of the rows are individually wrong. The distribution is wrong. Distributional checks compare summary statistics of the current run against a historical baseline and fire when the comparison crosses a threshold. T
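A minimal sketch of such a comparison, assuming the baseline is a list of per-run means from earlier loads and the threshold is a z-score. The function name `mean_shift_check`, the default of 3.0, and the sample numbers are illustrative assumptions; a real suite might track quantiles or null rates instead.

```python
from statistics import mean, stdev


def mean_shift_check(baseline_means, current_values, max_z=3.0):
    """Compare the current run's mean against historical per-run means.

    baseline_means: means from previous runs (at least two values).
    current_values: raw column values from the current run.
    max_z: hypothetical threshold, in baseline standard deviations.
    """
    base_mu = mean(baseline_means)
    base_sigma = stdev(baseline_means)
    current_mu = mean(current_values)
    if base_sigma == 0:
        # A perfectly flat baseline: any movement at all is a shift.
        z = 0.0 if current_mu == base_mu else float("inf")
    else:
        z = abs(current_mu - base_mu) / base_sigma
    return {
        "baseline_mean": base_mu,
        "current_mean": current_mu,
        "z": z,
        "fired": z > max_z,
    }


history = [42.1, 42.5, 42.3, 42.0, 42.6]  # made-up per-run means
result = mean_shift_check(history, [67.0, 68.6])
print(result["fired"], round(result["current_mean"], 1))
```

The check only says that the population moved. Whether the cause is a real change in the world, a producer bug, or a filter regression still needs a human or a follow-up query.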
Referential Integrity in DW (concepts: paReferentialIntegrity, paOrphanKeys)
Operational databases enforce referential integrity through foreign key constraints. A row in the orders table cannot reference a customer_id that does not exist in the customers table because the database refuses to write it. Analytical pipelines do not get this protection for free. Warehouses like Snowflake and BigQuery either do not enforce foreign keys at all or treat them as informational hints. The pipeline becomes responsible for enforcing the integrity that the operational database used
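When the warehouse does not enforce the constraint, the pipeline can look for orphan keys itself. The sketch below works on in-memory rows so it runs anywhere; in a warehouse the same idea is usually written as an anti-join (a LEFT JOIN from child to parent, keeping rows where the parent key is NULL). The function name and sample tables are hypothetical.

```python
def find_orphan_keys(child_rows, parent_rows, child_key, parent_key):
    """Return the sorted distinct child keys that have no matching parent row.

    Null child keys are skipped, mirroring SQL foreign keys, where a NULL
    reference does not violate the constraint.
    """
    parent_keys = {row[parent_key] for row in parent_rows}
    orphans = {
        row[child_key]
        for row in child_rows
        if row[child_key] is not None and row[child_key] not in parent_keys
    }
    return sorted(orphans)


# Illustrative data only.
customers = [{"customer_id": 10}, {"customer_id": 11}]
orders = [
    {"order_id": 1, "customer_id": 10},
    {"order_id": 2, "customer_id": 99},    # orphan: no such customer
    {"order_id": 3, "customer_id": None},  # null reference, not an orphan
]
print(find_orphan_keys(orders, customers, "customer_id", "customer_id"))  # [99]
```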
Test vs Prod: Same Checks (concepts: paEnvironmentalThresholds, paQualityTesting)
A common authoring mistake is to write quality checks that pass cleanly in the test environment and then fail repeatedly in production for reasons that have nothing to do with quality. The cause is almost always thresholds. Test data has different volumes, different distributions, and different time windows than production data. The same check can fire on test for trivial reasons and fail to fire on production for real reasons, because the bounds were tuned in the wrong environment. The discipli
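One way to read the section title is that the check logic stays identical across environments while only the bounds change. The sketch below assumes that reading. The `THRESHOLDS` table, its numbers, and the function name are all hypothetical and would be tuned per environment from that environment's own data.

```python
# Hypothetical per-environment bounds; the check code below never changes.
THRESHOLDS = {
    "test": {"min_rows": 100, "max_null_rate": 0.10},
    "prod": {"min_rows": 1_000_000, "max_null_rate": 0.01},
}


def volume_and_null_check(row_count, null_count, env):
    """Run the same two checks in any environment, with that environment's bounds."""
    limits = THRESHOLDS[env]
    failures = []
    if row_count < limits["min_rows"]:
        failures.append(f"volume: {row_count} rows < {limits['min_rows']}")
    # Treat an empty load as fully null so it cannot pass by accident.
    null_rate = null_count / row_count if row_count else 1.0
    if null_rate > limits["max_null_rate"]:
        failures.append(f"nulls: rate {null_rate:.3f} > {limits['max_null_rate']}")
    return failures


# The same small load is acceptable in test and clearly broken in prod.
print(volume_and_null_check(500, 10, "test"))
print(volume_and_null_check(500, 10, "prod"))
```

Keeping the thresholds in one table makes it obvious which bound belongs to which environment, and a missing environment fails loudly with a `KeyError` instead of silently borrowing another environment's numbers.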
Quality Suite for Events (concepts: paQualitySuite, paFivePillars)
The exercise puts the lesson into a single concrete deliverable. The target is a customer events table at a SaaS company that loads roughly two million events per day across four event types. The table feeds a dashboard, a churn model, and a billing report. The deliverable is a complete quality suite covering all five quality pillars: freshness, volume, distribution, schema, and lineage hint. The suite stays small enough to ship but covers the failure modes that have actually shown up in mature
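A compact skeleton of such a suite is sketched below, with one check per pillar. Everything specific in it is an assumption made for illustration: the column names, the four event type names and their expected shares, the numeric bounds, and the reading of "lineage hint" as a recorded upstream job identifier.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class BatchStats:
    """Summary of one daily load (hypothetical structure)."""
    loaded_at: datetime
    row_count: int
    type_counts: dict
    columns: set
    source_job_id: Optional[str]


# All values below are illustrative assumptions, not figures from the lesson.
EXPECTED_COLUMNS = {"event_id", "customer_id", "event_type", "event_ts"}
EXPECTED_SHARE = {"login": 0.50, "page_view": 0.30, "purchase": 0.15, "cancel": 0.05}
MAX_AGE = timedelta(hours=26)
VOLUME_BOUNDS = (1_500_000, 2_500_000)  # around roughly two million events per day
SHARE_TOLERANCE = 0.05


def run_suite(stats: BatchStats, now: datetime) -> dict:
    """Return pillar name -> True (pass) or False (fail)."""
    total = max(stats.row_count, 1)  # avoid dividing by zero on an empty load
    shares_ok = all(
        abs(stats.type_counts.get(event_type, 0) / total - expected) <= SHARE_TOLERANCE
        for event_type, expected in EXPECTED_SHARE.items()
    )
    return {
        "freshness": now - stats.loaded_at <= MAX_AGE,
        "volume": VOLUME_BOUNDS[0] <= stats.row_count <= VOLUME_BOUNDS[1],
        "distribution": shares_ok,
        "schema": EXPECTED_COLUMNS <= stats.columns,  # every expected column present
        "lineage": bool(stats.source_job_id),
    }


now = datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)
healthy = BatchStats(
    loaded_at=now - timedelta(hours=3),
    row_count=2_000_000,
    type_counts={"login": 1_000_000, "page_view": 600_000,
                 "purchase": 300_000, "cancel": 100_000},
    columns=set(EXPECTED_COLUMNS),
    source_job_id="job-123",
)
print(run_suite(healthy, now))
```

Returning one boolean per pillar keeps the suite small and lets each consumer (dashboard, churn model, billing report) decide which failures should block it.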