Data Quality and Contracts: Intermediate

An ad-tech company at scale ran a daily revenue pipeline that fed both finance reporting and a real-time bidding model. One Friday the row count check passed, the null check passed, the uniqueness check passed, and the freshness check passed. Every gate was green. Revenue was nonetheless reported eleven percent lower than the actual amount billed by the ad servers. The cause was a distribution shift: a single advertiser had switched its bid currency from USD to EUR, and the pipeline silently treated the EUR values as USD. Every row was structurally valid. Every aggregate was wrong. The four cheap checks from the beginner tier had no answer for this failure mode. The fix required a second class of checks that look at the shape of the data, not the count of it. This lesson is about that second class: schema validation, distributional checks, referential integrity, and the threshold work that makes them robust.

Schema Validation Basics

Daily Life
Interviews

Apply schema validation across required fields, types, nullability, and ranges using a tool the team can adopt incrementally.

Schema validation is the second-most-common failure mode after row count, and it is the easiest to express. A schema check asserts that every row in a table conforms to a declared shape: column names exist, types match, nullability matches, and values fall in declared ranges. A row that fails a schema check is, by definition, a row the pipeline should not have produced. Schema validation is also where the conversation with the producer starts, because the schema is the producer's commitment to the consumer. The conversation is uncomfortable when it happens for the first time after a production incident; it is much easier when it happens at integration time, before any consumer has built anything against the data. The discipline of writing the schema check first surfaces every ambiguity in the producer's mental model. Whether amount is in cents or dollars, whether timestamps are wall clock or UTC, whether NULL means missing or zero, whether the primary key is event_id or (event_id, partition); every one of these comes up while authoring the schema check, and each one has caused a production incident at multiple companies.

What a Schema Check Asserts

AssertionWhat It CatchesCommon Failure Mode
Column existsProducer dropped or renamed a columnDDL change upstream propagated without coordination
Type matches declared typeProducer started sending strings where numbers were expectedJSON parsing differences; locale changes; quote-stripping bugs
Nullability respectedRequired column started arriving NULLProducer made a field optional; client library bug
Value in declared rangeProducer sent unexpected enum values or out-of-range numbersNew event types; client clock drift; currency mix-ups
Cardinality boundsCategorical column suddenly has too many or too few distinct valuesID generation bug; deduplication regression

The Three Common Tools

Great Expectationsdbt testsSoda
Great Expectations
Python-first, expectation-suite model
Declarative expectation suites in JSON or YAML, run against pandas, Spark, or SQL backends. Strong on documentation and data profiling. Common in Python-heavy teams.
dbt tests
SQL-first, model-adjacent
Tests live next to the dbt model they validate, written as SQL or as generic test macros (not_null, unique, accepted_values, relationships). Lightweight to adopt when dbt is already in use.
Soda
Declarative checks in YAML, multi-source
Checks written in SodaCL, a YAML-based DSL. Targets warehouses, streaming sources, and lakehouses. Strong on alerting and observability integrations.
1# Great Expectations : customer_id IS NOT NULL AND IS a string expectations : - expectation_type : expect_column_to_exist kwargs : column : customer_id - expectation_type : expect_column_values_to_not_be_null kwargs : column : customer_id - expectation_type : expect_column_values_to_be_of_type kwargs : column : customer_id type_ : STRING
1# dbt : same assertions, written next to the model version : 2 models : - name : fct_orders columns : - name : customer_id tests : - not_null - dbt_expectations.expect_column_values_to_be_of_type : column_type : VARCHAR
1# Soda : SodaCL checks for fct_orders : - missing_count(customer_id) = 0 - SCHEMA : fail : WHEN wrong column type : customer_id : VARCHAR
Choose dbt Tests When
  • The team already runs dbt for transforms
  • Tests should live next to the SQL model
  • The check vocabulary fits not_null, unique, accepted_values, relationships
  • Low overhead is a top priority
Choose Great Expectations or Soda When
  • Validation runs against non-warehouse sources (S3 files, Kafka, APIs)
  • Distributional and statistical assertions are needed beyond simple SQL
  • A standalone catalog of expectations and profiling reports is valuable
  • Multiple data stores need a single quality framework

Range Checks: Where Schema Becomes Semantics

Range checks are the bridge between schema and semantics. A column declared as a positive integer between zero and one million is more constrained than a column declared as just an integer. Range checks catch real bugs: a price column that should always be positive that suddenly goes negative because of a refund-handling change; a status column that should always be one of seven enum values that suddenly contains an eighth. The cost is keeping the range declaration in sync with the producer's actual constraints. The cost is real but bounded. Range declarations live next to the schema declaration in the same file. When the producer changes the legal range, the change shows up in the same PR that introduces the new behavior, and the consumer is informed at review time rather than at incident time. The discipline rewards the producer with fewer surprise consumer pages and rewards the consumer with a guarantee that the range constraints are accurate.
1SELECT
2 COUNT(*) FILTER (
3 WHERE order_amount_cents < 0
4 ) AS negative_amounts,
5 COUNT(*) FILTER (
6 WHERE order_amount_cents > 10000000
7 ) AS over_100k_amounts,
8 COUNT(*) FILTER (
9 WHERE order_status NOT IN (
10 'placed',
11 'paid',
12 'shipped',
13 'delivered',
14 'cancelled',
15 'refunded',
16 'returned'
17 )
18 ) AS invalid_statuses,
19 COUNT(*) AS total_rows
20FROM fct_orders
21WHERE order_date = CURRENT_DATE

Schema Drift: The Producer Changed and Did Not Tell

Schema drift is the slow-motion form of schema failure. The producer adds a new optional column; downstream pipelines parse the old shape and ignore the new column. The producer renames a column; downstream pipelines now read NULL where they used to read data. The producer changes a type from integer to string; SQL casts silently produce zero. Each individual change is small. Each change accumulates. Schema validation catches drift at the moment it crosses a layer boundary, before downstream consumers parse the new shape against the old expectation. The accumulated cost of unchecked drift is the most common reason older pipelines become unmaintainable. After a few years of small undocumented changes, the consumer-facing tables no longer have a schema anyone can describe, the queries against them rely on undocumented type coercion behavior, and any change to the producer carries unknown blast radius. Schema validation prevents the accumulation by making each drift event visible at the moment it happens.
Concrete schema drift incidents in production:
  • A producer renamed customer_id to customer_uuid; six pipelines silently lost their join key
  • A producer changed timestamp from epoch milliseconds to ISO-8601 strings; downstream filters returned zero rows
  • A producer added a new event_type that nobody had whitelisted; analytics counts diverged by 8 percent
  • A producer changed amount from cents to dollars; revenue inflated 100x for one day

Schema drift is the producer-side analog of silent data corruption. Schema validation catches drift at the boundary, before the rest of the pipeline parses against the old expectation.

Do
  • Declare schemas explicitly and version them in the same repo as the pipeline code
  • Run schema validation at the raw layer immediately after extraction
  • Pick one tool and use it consistently across pipelines
Don't
  • Treat schema as documentation; treat it as an executable contract
  • Mix three quality tools across the same pipeline; the cognitive cost compounds
  • Skip range checks because the type already passes; ranges encode the real constraints
TIP
When integrating a new producer, ask for the schema and the value ranges in writing before the integration ships. The conversation forced by the question often surfaces undocumented constraints that would otherwise leak into production as drift.

Distributional Checks

Daily Life
Interviews

Detect distributional shifts using mean, stddev, quantiles, and category-mix comparisons against a historical baseline.

Schema and row-level checks miss a class of failures where every individual row is structurally valid and the population has shifted. A column whose mean used to be 42.3 and is now 67.8 may signal a real change in the world, or a producer-side bug, or an upstream filter regression. None of the rows are individually wrong. The distribution is wrong. Distributional checks compare summary statistics of the current run against a historical baseline and fire when the comparison crosses a threshold. The class of failures these checks catch is exactly the class that escapes every other check. Row counts are normal; nulls are absent; uniqueness is preserved; schema is intact; freshness is on time. The only signal that anything is amiss is that the distribution of values has shifted in a way that the prior data did not predict. Distributional checks turn that signal into an alert.

Statistics That Earn Their Keep

StatisticWhat It TracksWhat a Shift Often Means
MeanAverage value of a numeric columnProducer changed units (USD to EUR, cents to dollars), or a filter changed scope
Standard deviationSpread of a numeric columnNew high-value outliers; a bug that is producing extreme values; sample-size change
Quantiles (p50, p95, p99)Distribution shape, especially the tailTail spike often catches what mean misses; e.g., a fraud burst is in p99, not in mean
CardinalityNumber of distinct values in a categorical columnNew values appeared; existing values disappeared; ID generation regression
Category mixProportion of each category in a categorical columnOne category is suddenly dominant or absent; channel attribution bug

A Real Distribution Shift

Consider an e-commerce checkout pipeline where order_amount_usd is normally distributed around a mean of 84 dollars with a standard deviation near 32 dollars. One Wednesday the mean drops to 71 dollars. No row is invalid; every value is plausible. The cause turns out to be a promo code that started applying twice for one customer segment because of a coupon-stacking bug. Without a distributional check, the only signal is the dashboard line moving down, and the cause hides in the noise of normal day-to-day variation. With a distributional check, the alert fires the same day. The same kind of incident has appeared at every large e-commerce retailer at least once. The variations differ in mechanism: a tax engine bug, a currency conversion regression, a fulfillment cost miscoded as revenue, a coupon stacking error, an A/B test exposure miscount. The shape is identical: structurally valid rows whose distribution does not match history.
1WITH today AS (
2 SELECT
3 AVG(order_amount_usd) AS mean_today,
4 STDDEV(order_amount_usd) AS stddev_today,
5 APPROX_PERCENTILE(
6 order_amount_usd,
7 0.95
8 ) AS p95_today
9 FROM fct_orders
10 WHERE order_date = CURRENT_DATE
11),
12baseline AS (
13 SELECT
14 AVG(order_amount_usd) AS mean_baseline,
15 STDDEV(order_amount_usd) AS stddev_baseline
16 FROM fct_orders
17 WHERE order_date BETWEEN CURRENT_DATE - 28
18 AND CURRENT_DATE - 1
19)
20
21SELECT
22 ROUND(today.mean_today, 2) AS mean_today,
23 ROUND(
24 baseline.mean_baseline,
25 2
26 ) AS mean_baseline,
27 ROUND(
28 ABS(
29 today.mean_today - baseline.mean_baseline
30 ) / baseline.stddev_baseline,
31 2
32 ) AS z_shift
33FROM today
34CROSS JOIN baseline

Picking a Threshold

The threshold for a distributional check is not arbitrary. A reasonable starting point is to express the shift in standard deviations of the historical baseline (a z-score). A shift of more than two or three baseline standard deviations is worth investigating. The exact threshold depends on the column. A column that varies a lot day-to-day will tolerate a larger z-shift before raising an alert; a column that is normally rock-steady will warrant a tighter threshold. The point is that the threshold is computed from history, not invented. The historical window matters too. A 7-day baseline reacts quickly to recent changes, which is sometimes a feature and sometimes a bug; a 30-day baseline smooths over individual outliers but takes longer to detect persistent drift; a 90-day baseline is stable but unresponsive. The choice depends on how frequently the underlying process actually changes. Fast-moving columns (campaign exposure, inventory levels) prefer shorter windows; slow-moving columns (lifetime value, customer segment shares) prefer longer ones.
Z-Shift MagnitudeRecommended ActionReasoning
Less than 2Pass; treat as normal variationMost days will fall inside this band by definition
2 to 3Warn; log and notify but do not blockPlausible normal variation but worth a human glance
3 to 5Block on critical metrics; warn on othersRare under normal variation; likely a real change
Greater than 5Block; almost certainly a bug or upstream changeFalls outside historical experience; running on this data is risky

Cardinality and Category Mix

Some of the highest-leverage distributional checks are on categorical columns. A country_code column that has historically had values for forty-seven countries and suddenly has values for forty-eight signals a new market or a producer bug. A device_type column whose mobile share has been steady at 62 percent and is now 81 percent signals a desktop tracking failure. These shifts rarely move the row count or the means by enough to trigger numeric checks, but they encode real changes in the underlying data. Cardinality and category mix are also among the easiest distributional checks to interpret in incident response. A new value appearing in event_type names a specific producer change; a category share shift names which segment of users is behaving differently. The interpretability is part of the leverage: the alert points at the cause rather than at a symptom.
1SELECT
2 device_type,
3 COUNT(*) AS rows_today,
4 ROUND(
5 100 * COUNT(*) / SUM(COUNT(*)) OVER (
6 ),
7 1
8 ) AS pct_today
9FROM fct_orders
10WHERE order_date = CURRENT_DATE
11GROUP BY device_type
12ORDER BY rows_today DESC

What Distributional Checks Do Not Catch

Distributional checks are not the right tool for low-volume rare events. A pipeline that processes a million events per day will have stable means and quantiles. A pipeline that processes fifty events per day will have means and quantiles that bounce around for reasons that have nothing to do with quality. The smaller the population, the noisier the statistics, the wider the threshold needed, the less the check tells. Distributional checks earn their keep on tables with daily volumes large enough that day-to-day statistical variation is small. The threshold of usefulness is fuzzy. Tables with thousands of rows per day usually benefit; tables with hundreds usually do not. A common pattern is to compute weekly or monthly aggregates for low-volume tables and run distributional checks at that grain instead, accepting later detection in exchange for a meaningful baseline.
Distributional Checks Earn Their Keep When
  • Daily row count is large (tens of thousands or more)
  • The columns being tracked have stable behavior over time
  • Cost of silent shifts is high (revenue, model features, regulated metrics)
  • Historical data is available to compute baselines
Distributional Checks Are A Poor Fit When
  • Daily volumes are small; statistics are dominated by noise
  • The data is intentionally bursty (campaign events, batch arrivals)
  • Baselines do not yet exist; the table is too new to model
  • The cost of false positives outweighs the cost of missed shifts
TIP
Implement distributional checks as a second pass after the cheap checks. The cheap checks catch the high-leverage failures; distributional checks catch the next class. Building both before either is in production is over-engineering.
alert
A row-valid column can still produce a wrong aggregate when its distribution shifts.
check
Z-scores against a historical baseline are the simplest defensible threshold.
query
Cardinality and category mix often catch what means and stddevs miss.

Referential Integrity in DW

Daily Life
Interviews

Detect orphan keys with LEFT JOIN NULL filters and choose between block, quarantine, placeholder, and defer strategies.

Operational databases enforce referential integrity through foreign key constraints. A row in the orders table cannot reference a customer_id that does not exist in the customers table because the database refuses to write it. Analytical pipelines do not get this protection for free. Warehouses like Snowflake and BigQuery either do not enforce foreign keys at all or treat them as informational hints. The pipeline becomes responsible for enforcing the integrity that the operational database used to enforce automatically. The cost of skipping this responsibility is orphan keys: rows in a fact table that reference dimension keys nobody has seen. The cost is not always visible. INNER JOINs silently drop orphan rows, which makes downstream metrics quietly understate. LEFT JOINs preserve the rows but produce NULLs in the dimension columns, which makes downstream filters behave unpredictably. The same orphan can produce different wrong answers in two different consumer queries depending on the join shape, which is exactly the failure pattern that erodes trust the fastest.

What An Orphan Key Looks Like

TableColumnValueStatus
fct_orderscustomer_idc_8421Present
dim_customercustomer_idc_8421Missing (orphan)
fct_orderscustomer_idc_8422Present
dim_customercustomer_idc_8422Present
The first row in fct_orders references a customer_id that does not exist in dim_customer. A LEFT JOIN from fct to dim will produce NULL for the customer columns. An INNER JOIN will silently drop the row. Either way, the consumer sees something other than what the source data implied.

How Orphans Get Created

CauseMechanismFrequency
Race condition during ingestionOrder arrives before the customer is replicated; no recoveryCommon in CDC pipelines without transactional ordering
Soft-deleted dimension rows filtered outDim load excludes deleted_at IS NOT NULL; fact still references themCommon in analytics filtering on is_active
Dimension table truncated and rebuiltMid-rebuild window where dim is empty but fact has rows referencing itPipeline design bug; fixable with proper transaction or swap
Producer started using IDs the dim does not yet knowSchema or scope change without dim updateCommon at integration boundaries with new partners
1WITH orphans AS (
2 SELECT
3 f.customer_id
4 FROM fct_orders AS f
5 LEFT JOIN dim_customer AS d USING (customer_id)
6 WHERE d.customer_id IS NULL
7)
8
9SELECT
10 COUNT(DISTINCT customer_id) AS orphan_keys,
11 COUNT(*) AS orphan_rows
12FROM orphans

The pattern is LEFT JOIN with a NULL filter on the dimension's primary key. Any row that survives is an orphan. The check is cheap, runs in seconds at warehouse scale, and produces an exact count.

BlockQuarantinePlaceholderDefer
Block
Halt the publish on any orphan
Strongest authority. Fits critical fact tables where downstream joins must succeed and partial publication misleads consumers.
Quarantine
Side-table the orphan rows
Move orphans to a quarantine table; publish only matched rows. Requires a review process; otherwise quarantines accumulate.
Placeholder
Auto-create an Unknown dim entry
Inserts a sentinel dim row so joins always succeed. Best when matching is best-effort and downstream tolerates Unknown values.
Defer
Wait for the dimension to catch up
Retries the join after a delay. Fits eventually-consistent producers where order arrival can race the dim load.
StrategyMechanismWhen To Use
Block the publishTreat the orphan count as a blocking quality check failureCritical fact tables where joins must succeed
Quarantine the orphan rowsMove orphan rows to a side table; publish only matched rowsTables where partial publication is acceptable and quarantines are reviewed
Insert a placeholder dim rowAuto-create a 'Unknown' dimension entry for unseen keysWhen matching is best-effort and joins must always succeed
Defer publicationWait until the dimension catches up; retry the joinWhen the producer ordering is known to be eventually consistent
The inverse problem is dimension rows that no fact references. These are usually less harmful than orphans because they do not break joins; they only inflate the dimension. A dimension that grows without bound while the fact stays the same shape signals a slow leak: rows added that nobody uses, possibly because of a producer side-write that never made it into the fact load. The check is the mirror of the orphan check, and it is worth running periodically rather than continuously.

Multi-Hop Integrity

Real data models have multi-hop relationships: an order references a customer, the customer references a sales region, the sales region references a country. Integrity at each hop is independent. A pipeline can have zero orphans on customer_id and still have orphans on region_id if the customer dimension references regions the region dimension does not contain. The check has to walk every foreign key explicitly. Treating integrity as a one-step check at the immediate parent misses the chain. The same property surfaces in slowly changing dimensions, where the dimension is versioned and a fact may reference an older version of the dimension that has since been replaced. The orphan check has to know which version of the dimension is canonical for the fact's effective date, which is more involved than a simple LEFT JOIN. Most teams handle this by treating the SCD as multiple dimension tables, one per version, and running the orphan check against the version that was current at the fact's timestamp.
Five questions to ask before declaring a fact table referentially clean:
  • Does every key in fct join to its immediate dim
  • Does every dim key in turn join to its parent dim, where applicable
  • Are soft-deleted dim rows excluded; if so, do existing facts still reference them
  • Is the dimension's primary key actually unique (cross-check against uniqueness gates)
  • Has the dim caught up to the fact's most recent rows (timing question, not structural)
Do
  • Run referential integrity checks at the curated layer where joins happen
  • Walk every foreign key in the data model, not just the immediate parent
  • Decide quarantine vs block per fact table based on consumer tolerance
Don't
  • Assume the warehouse enforces FK constraints; most do not
  • Silently INNER JOIN; the dropped rows hide the integrity failure
  • Treat orphan rows as noise; they encode a real upstream problem
TIP
When designing a fact table, list every dimension it joins to and write the orphan check for each before the table publishes. Adding the check after the first orphan incident is twice the work.

Most warehouse foreign key declarations are informational. Trust the constraint when SQL Server or Postgres enforces it; do not trust it in Snowflake, BigQuery, or Redshift.

Source
source
schema + null check
ingest check
Transform
transform
row-count check
output check
Storage
warehouse

Quality checks live at boundaries: validate on the way in (schema, nulls) and on the way out (row counts, totals). A failing gate stops bad data before it reaches the warehouse.

Test vs Prod: Same Checks

Daily Life
Interviews

Maintain the same quality suite across environments while tuning thresholds to match the data each environment carries.

A common authoring mistake is to write quality checks that pass cleanly in the test environment and then fail repeatedly in production for reasons that have nothing to do with quality. The cause is almost always thresholds. Test data has different volumes, different distributions, and different time windows than production data. The same check can fire on test for trivial reasons and fail to fire on production for real reasons, because the bounds were tuned in the wrong environment. The discipline is to keep the assertions identical and the thresholds environment-aware. The opposite mistake is also common: a check that passes in production because the threshold was set to whatever the production data happened to look like the day the check was written. Such a check assents to whatever the producer is currently emitting, which means it cannot detect drift away from that state. Both failure modes have the same root cause, which is treating thresholds as something to be discovered rather than something to be designed.

What Differs Between Environments

PropertyTest EnvironmentProduction Environment
Row volumeTens to thousands per day; sometimes a frozen snapshotThousands to millions per day; live traffic
Row distributionSynthetic or sampled; often unrepresentative tailsReal distributions including heavy tails and rare events
Time patternTests run on demand or on a fixed scheduleContinuous or daily; calendar effects matter
Schema completenessOften missing optional columns; partial fixturesFull schema with every column populated
Failure tolerancePipeline can break and rerun freelyPipeline failure has consumer impact

Same Assertion, Two Thresholds

A row count check that asserts 'today must be within fifty percent and two hundred percent of the seven-day baseline' makes sense in production where the baseline is a meaningful average. In test, the baseline might be the same fixture re-loaded every day, and the threshold is meaningless. The fix is not to disable the check in test. The fix is to keep the assertion identical and to load test thresholds from a different file or environment variable. The check's logic stays the same; the bounds change.
1# config / quality / thresholds.production.yaml fct_orders : row_count : min_ratio_to_baseline : 0.50 max_ratio_to_baseline : 2.00 null_rate_pct : customer_id : 0.5 order_amount_usd : 0.1 distributional : order_amount_usd_z_block : 5.0 order_amount_usd_z_warn : 3.0 # config / quality / thresholds.test.yaml fct_orders : row_count : min_ratio_to_baseline : 0.10 max_ratio_to_baseline : 10.00 null_rate_pct : customer_id : 5.0 order_amount_usd : 5.0 distributional : order_amount_usd_z_block : 999.0 order_amount_usd_z_warn : 999.0

Why Tests Need Looser Bounds, Not Different Checks

Looser bounds in test reflect that the test environment cannot meaningfully assert the same things production can. A test fixture with twelve rows cannot have a meaningful null rate baseline. A test fixture loaded once per quarter cannot have a meaningful freshness check. The assertion that 'customer_id is not null' is the same in both environments; the threshold for what counts as a violation is loosened in test because the test data is not statistically meaningful. Disabling the check entirely loses coverage; tightening the threshold creates flakes; loosening the threshold preserves the signal that the check still runs. The principle generalizes: test environments verify that the check executes and that the pipeline plumbing connects the check to the gate. Production environments verify that the data conforms. Both verifications are necessary and address different failure modes. A check that has never executed in test is unlikely to execute correctly in production the first time; a check tuned to test thresholds will not catch production failures.
Wrong Way: Different Checks Per Environment
  • Test runs five checks; production runs eleven
  • Test does not catch what production does, even structurally
  • A check passes in test and is undefined in production
  • Adding a new check requires updating two suites
Right Way: Same Checks, Different Thresholds
  • Both environments run the same eleven checks
  • Test verifies the check executes; production verifies the data
  • A check that breaks in production also breaks in test
  • Thresholds are config; the suite is code

What Test Data Should Cover

Test CaseWhat It VerifiesHow It Is Loaded
Happy pathPipeline runs end-to-end on representative dataSampled or synthetic rows in expected proportions
Schema-violation rowSchema check actually fires when the producer breaks the contractA fixture row with a NULL where NOT NULL is declared
Distributional shiftDistributional check fires when the data shiftsA fixture day with values shifted by 5+ stddev
Orphan keyReferential integrity check actually catches an orphanA fact row whose dim entry is intentionally missing
Empty inputPipeline behaves correctly when no rows arriveAn empty source

The Staging Environment In Between

Staging environments sit between test and production. They run on a copy of production data, often anonymized or sampled. Staging thresholds are usually closer to production thresholds than to test thresholds because the data is statistically representative. Quality checks running in staging are a meaningful preview of how they will behave in production. A check that flaps in staging will flap in production; a check that is silent in staging will be silent in production. Staging is the right environment to tune thresholds against real distributions before flipping the suite on for production.
1# Quality runner picks thresholds based on environment
2import os
3import yaml
4
5env = os.environ.get('PIPELINE_ENV', 'test')
6thresholds_path = f'config/quality/thresholds.{env}.yaml'
7with open(thresholds_path) as f:
8 thresholds = yaml.safe_load(f)
9
10for table, checks in thresholds.items():
11 run_quality_suite(table, checks)
Things test data should never do:
  • Pretend the production volume by replaying real traffic onto the test environment
  • Skip checks because the test fixture is too small to make them meaningful
  • Use random IDs that violate the producer's actual ID generation rules
  • Be loaded once and never refreshed; stale fixtures train stale assumptions
TIP
Treat the threshold file as a piece of code reviewed in a PR. Threshold changes are interesting and worth a second pair of eyes. Magic numbers without commit history are how quality suites silently degrade.
check
Same checks across environments; thresholds vary because the data is statistically different.
alert
Disabling a check in test means production runs a check the test environment never exercised.

Quality Suite for Events

Daily Life
Interviews

Author a quality suite for a high-volume events table covering all five pillars with appropriate authority and layer placement.

The exercise puts the lesson into a single concrete deliverable. The target is a customer events table at a SaaS company that loads roughly two million events per day across four event types. The table feeds a dashboard, a churn model, and a billing report. The deliverable is a complete quality suite covering all five quality pillars: freshness, volume, distribution, schema, and lineage hint. The suite stays small enough to ship but covers the failure modes that have actually shown up in mature production systems. Two million events per day is roughly twenty-three per second, which is large enough that statistical checks are meaningful and small enough that the warehouse cost of running the checks is negligible. The shape of the table is representative of the most common high-volume analytical surface: an append-mostly fact with a handful of categorical and numeric columns, joined to dimensions for context. Most production quality suites at companies of similar scale resemble the one assembled below.

The Table

1CREATE TABLE curated.fct_customer_events(event_id VARCHAR(64) NOT NULL, customer_id VARCHAR(64) NOT NULL, event_type VARCHAR(32) NOT NULL, event_timestamp TIMESTAMP NOT NULL, product_id VARCHAR(64), amount_usd NUMERIC(12, 2), ingest_date DATE NOT NULL, CONSTRAINT pk_fct_customer_events PRIMARY KEY(event_id)) ;

Pillar 1: Freshness

The pipeline lands events hourly. The freshness SLA is two hours: MAX(event_timestamp) must be within two hours of CURRENT_TIMESTAMP at the moment the gate runs. The check fires as a blocking gate at the raw layer because stale data here cascades into stale dashboards, model features, and billing entries.
1/* Freshness gate (blocking) */
2SELECT
3 MAX(event_timestamp) AS latest_event,
4 CURRENT_TIMESTAMP AS now_ts,
5 EXTRACT(EPOCH FROM (
6 CURRENT_TIMESTAMP - MAX(
7 event_timestamp
8 )
9 )) / 60 AS minutes_behind,
10 CASE
11 WHEN CURRENT_TIMESTAMP - MAX(
12 event_timestamp
13 ) <= INTERVAL '2 hours' THEN 'pass'
14 ELSE 'fail'
15 END AS gate_result
16FROM curated.fct_customer_events
Pillar 2 (volume) is row count over a window using the day-of-week baseline from the beginner tier; it blocks on underage and warns on overage. Pillar 3 (distribution) covers amount_usd (z-shift greater than three blocks the publish) and event_type category mix (proportions must stay within five percentage points of the trailing four-week baseline).
1/* Distributional gate: amount_usd mean shift in stddev units */
2WITH today AS (
3 SELECT
4 AVG(amount_usd) AS mean_today
5 FROM curated.fct_customer_events
6 WHERE ingest_date = CURRENT_DATE
7),
8baseline AS (
9 SELECT
10 AVG(amount_usd) AS mean_baseline,
11 STDDEV(amount_usd) AS sd_baseline
12 FROM curated.fct_customer_events
13 WHERE ingest_date BETWEEN CURRENT_DATE - 28
14 AND CURRENT_DATE - 1
15)
16
17SELECT
18 ROUND(
19 ABS(
20 today.mean_today - baseline.mean_baseline
21 ) / baseline.sd_baseline,
22 2
23 ) AS z_shift,
24 CASE
25 WHEN ABS(
26 today.mean_today - baseline.mean_baseline
27 ) / baseline.sd_baseline >= 5 THEN 'block'
28 WHEN ABS(
29 today.mean_today - baseline.mean_baseline
30 ) / baseline.sd_baseline >= 3 THEN 'warn'
31 ELSE 'pass'
32 END AS gate_result
33FROM today
34CROSS JOIN baseline
Pillar 4 (schema) asserts the declared column shape: columns exist, types match, NOT NULL columns have zero nulls, and event_type contains only the four declared values. Schema checks run at the raw layer so failures are caught before the curated transform parses against the wrong shape.
1# SCHEMA CHECK expressed IN dbt tests version : 2 models : - name : fct_customer_events columns : - name : event_id tests : - not_null - UNIQUE - name : customer_id tests : not_null - name : event_type tests : - not_null - accepted_values : VALUES : [ 'signup', 'login', 'purchase', 'churn' ] - name : event_timestamp tests : not_null - name : amount_usd tests : - dbt_utils.expression_is_true : expression : 'amount_usd IS NULL OR amount_usd >= 0'
Pillar 5 (lineage hint) at this tier is not full lineage tracking (advanced tier addresses observability lineage in depth). The hint asserts that every event in curated can be traced to a row in raw with the same event_id. A curated row with no matching raw row signals a transform bug that synthesized data; a raw row with no matching curated row signals a row that the transform dropped without explanation. The check is a pair of LEFT JOINs. The pair is asymmetric in importance. Synthesized rows (curated without raw) are almost always a bug and should block the publish. Dropped rows (raw without curated) are sometimes intentional (the transform filters out test accounts, fraud, or invalid events), so the check warns rather than blocks; a sudden jump in dropped rows is worth a human glance even when the dropping itself is by design.
1/* Lineage hint: every curated row has a raw counterpart, and vice versa */
2WITH curated_only AS (
3 SELECT
4 c.event_id
5 FROM curated.fct_customer_events AS c
6 LEFT JOIN raw.events AS r USING (event_id)
7 WHERE r.event_id IS NULL
8),
9raw_only AS (
10 SELECT
11 r.event_id
12 FROM raw.events AS r
13 LEFT JOIN curated.fct_customer_events AS c USING (event_id)
14 WHERE c.event_id IS NULL
15 AND r.ingest_date = CURRENT_DATE
16)
17
18SELECT
19 (
20 SELECT
21 COUNT(*)
22 FROM curated_only
23 ) AS synthesized_rows,
24 (
25 SELECT
26 COUNT(*)
27 FROM raw_only
28 ) AS dropped_rows

The Whole Suite, Wired Together

PillarLayerAuthorityWhat It Asserts
FreshnessRawBlockMAX(event_timestamp) within 2 hours of now
VolumeRaw and curatedBlock on underage; warn on overageRow count within 50 to 200 percent of day-of-week baseline
DistributionCuratedBlock at z >= 5; warn at z >= 3amount_usd mean shift; event_type mix
SchemaRawBlockRequired columns, types, accepted values, expressions
Lineage hintCuratedBlock on synthesized; warn on droppedEvery curated row has a raw counterpart and vice versa
Failure modes the five-pillar suite catches and misses:
  • Catches: dropped Kafka partition (freshness, volume); cents-to-dollars amount change (distribution)
  • Catches: new event_type without notice (schema); synthesized or dropped rows (lineage)
  • Misses: correlated drift in two columns simultaneously (advanced statistical methods)
  • Misses: correctness against an external source of truth (needs reconciliation passes)
  • Misses: business-logic errors that produce structurally-clean output (tax bug with right shape)
Do
  • Cover all five pillars before declaring a quality suite complete
  • Match each pillar's authority (block or warn) to the consumer impact of failure
  • Run schema and freshness at the raw layer; volume, distribution, lineage at curated
Don't
  • Skip lineage at the intermediate tier; it is the cheapest way to catch transform synthesis
  • Combine pillars into a single check; debugging is easier when each pillar fires alone
  • Reuse the same threshold for amount_usd and amount_cents; units matter
TIP
Build the suite in pillar order: freshness, volume, schema, distribution, lineage. Each pillar catches a different class of failure; investing in the next pillar is justified by the value of the failures it catches that the prior pillars miss.
PUTTING IT ALL TOGETHER

> A SaaS data platform team owns fct_customer_events, the table that powers leadership dashboards, the churn model, and the monthly billing report. The team has the four cheap checks from the beginner tier in place. After two near-misses involving structurally-valid rows that produced wrong aggregates, the lead asks the team to expand the suite so the next near-miss is caught instead of celebrated.

Add schema validation that asserts required fields, types, nullability, and accepted values. Pick one tool (dbt tests, Great Expectations, or Soda) and use it consistently. The schema is the producer's commitment to the consumer made executable.
Add distributional checks for the columns where structurally-valid shifts have caused incidents. Use z-scores against a historical baseline; warn at z=3, block at z=5. Distributional checks earn their keep on high-volume tables and are noise on low-volume ones.
Add referential integrity checks at the curated layer. The warehouse does not enforce foreign keys; the pipeline must. LEFT JOIN with NULL filter is the canonical orphan check; quarantine, block, placeholder, and defer are the four valid responses.
Maintain the same checks across environments and parameterize the thresholds. The check vocabulary is code; the thresholds are configuration. This connects to the layered architecture introduced in Lesson 1: each layer carries its own quality gates rather than a single end-of-pipeline check.
Cover all five pillars (freshness, volume, distribution, schema, lineage hint) before declaring the suite complete. The suite is the producer's evidence to the consumer that the data has been continuously proved correct.
KEY TAKEAWAYS
Schema validation is the executable contract: required fields, types, nullability, ranges, accepted values. Pick one tool and use it consistently.
Distributional checks catch what row-level checks miss: z-scores against a historical baseline detect shifts that no individual row violates.
Warehouses do not enforce foreign keys: the pipeline must. LEFT JOIN NULL filter detects orphans; the response is one of block, quarantine, placeholder, or defer.
Same checks across environments, different thresholds: test data is statistically different from production; the assertion stays identical, the bounds vary.
A complete suite covers all five pillars: freshness, volume, distribution, schema, and at least a lineage hint. Anything less leaves a known class of failures unwatched.

Schema, distribution, and referential integrity turn cheap checks into a real suite

Category
Pipeline Architecture
Difficulty
intermediate
Duration
32 minutes
Challenges
0 hands-on challenges

Topics covered: Schema Validation Basics, Distributional Checks, Referential Integrity in DW, Test vs Prod: Same Checks, Quality Suite for Events

Lesson Sections

  1. Schema Validation Basics (concepts: paDataQuality)

    Schema validation is the second-most-common failure mode after row count, and it is the easiest to express. A schema check asserts that every row in a table conforms to a declared shape: column names exist, types match, nullability matches, and values fall in declared ranges. A row that fails a schema check is, by definition, a row the pipeline should not have produced. Schema validation is also where the conversation with the producer starts, because the schema is the producer's commitment to t

  2. Distributional Checks (concepts: paDataQuality)

    Schema and row-level checks miss a class of failures where every individual row is structurally valid and the population has shifted. A column whose mean used to be 42.3 and is now 67.8 may signal a real change in the world, or a producer-side bug, or an upstream filter regression. None of the rows are individually wrong. The distribution is wrong. Distributional checks compare summary statistics of the current run against a historical baseline and fire when the comparison crosses a threshold. T

  3. Referential Integrity in DW (concepts: paDataQuality)

    Operational databases enforce referential integrity through foreign key constraints. A row in the orders table cannot reference a customer_id that does not exist in the customers table because the database refuses to write it. Analytical pipelines do not get this protection for free. Warehouses like Snowflake and BigQuery either do not enforce foreign keys at all or treat them as informational hints. The pipeline becomes responsible for enforcing the integrity that the operational database used

  4. Test vs Prod: Same Checks (concepts: paDataQuality)

    A common authoring mistake is to write quality checks that pass cleanly in the test environment and then fail repeatedly in production for reasons that have nothing to do with quality. The cause is almost always thresholds. Test data has different volumes, different distributions, and different time windows than production data. The same check can fire on test for trivial reasons and fail to fire on production for real reasons, because the bounds were tuned in the wrong environment. The discipli

  5. Quality Suite for Events (concepts: paDataQuality)

    The exercise puts the lesson into a single concrete deliverable. The target is a customer events table at a SaaS company that loads roughly two million events per day across four event types. The table feeds a dashboard, a churn model, and a billing report. The deliverable is a complete quality suite covering all five quality pillars: freshness, volume, distribution, schema, and lineage hint. The suite stays small enough to ship but covers the failure modes that have actually shown up in mature