An ad-tech company at scale ran a daily revenue pipeline that fed both finance reporting and a real-time bidding model. One Friday the row count check passed, the null check passed, the uniqueness check passed, and the freshness check passed. Every gate was green. Revenue was nonetheless reported eleven percent lower than the actual amount billed by the ad servers. The cause was a distribution shift: a single advertiser had switched its bid currency from USD to EUR, and the pipeline silently treated the EUR values as USD. Every row was structurally valid. Every aggregate was wrong. The four cheap checks from the beginner tier had no answer for this failure mode. The fix required a second class of checks that look at the shape of the data, not the count of it. This lesson is about that second class: schema validation, distributional checks, referential integrity, and the threshold work that makes them robust.
Schema Validation Basics
Daily Life
Interviews
Apply schema validation across required fields, types, nullability, and ranges using a tool the team can adopt incrementally.
Schema validation is the second-most-common failure mode after row count, and it is the easiest to express. A schema check asserts that every row in a table conforms to a declared shape: column names exist, types match, nullability matches, and values fall in declared ranges. A row that fails a schema check is, by definition, a row the pipeline should not have produced. Schema validation is also where the conversation with the producer starts, because the schema is the producer's commitment to the consumer. The conversation is uncomfortable when it happens for the first time after a production incident; it is much easier when it happens at integration time, before any consumer has built anything against the data. The discipline of writing the schema check first surfaces every ambiguity in the producer's mental model. Whether amount is in cents or dollars, whether timestamps are wall clock or UTC, whether NULL means missing or zero, whether the primary key is event_id or (event_id, partition); every one of these comes up while authoring the schema check, and each one has caused a production incident at multiple companies.
What a Schema Check Asserts
Assertion
What It Catches
Common Failure Mode
Column exists
Producer dropped or renamed a column
DDL change upstream propagated without coordination
Type matches declared type
Producer started sending strings where numbers were expected
Producer made a field optional; client library bug
Value in declared range
Producer sent unexpected enum values or out-of-range numbers
New event types; client clock drift; currency mix-ups
Cardinality bounds
Categorical column suddenly has too many or too few distinct values
ID generation bug; deduplication regression
The Three Common Tools
Great Expectationsdbt testsSoda
Great Expectations
Python-first, expectation-suite model
Declarative expectation suites in JSON or YAML, run against pandas, Spark, or SQL backends. Strong on documentation and data profiling. Common in Python-heavy teams.
dbt tests
SQL-first, model-adjacent
Tests live next to the dbt model they validate, written as SQL or as generic test macros (not_null, unique, accepted_values, relationships). Lightweight to adopt when dbt is already in use.
Soda
Declarative checks in YAML, multi-source
Checks written in SodaCL, a YAML-based DSL. Targets warehouses, streaming sources, and lakehouses. Strong on alerting and observability integrations.
The check vocabulary fits not_null, unique, accepted_values, relationships
Low overhead is a top priority
•Choose Great Expectations or Soda When
Validation runs against non-warehouse sources (S3 files, Kafka, APIs)
Distributional and statistical assertions are needed beyond simple SQL
A standalone catalog of expectations and profiling reports is valuable
Multiple data stores need a single quality framework
Range Checks: Where Schema Becomes Semantics
Range checks are the bridge between schema and semantics. A column declared as a positive integer between zero and one million is more constrained than a column declared as just an integer. Range checks catch real bugs: a price column that should always be positive that suddenly goes negative because of a refund-handling change; a status column that should always be one of seven enum values that suddenly contains an eighth. The cost is keeping the range declaration in sync with the producer's actual constraints. The cost is real but bounded. Range declarations live next to the schema declaration in the same file. When the producer changes the legal range, the change shows up in the same PR that introduces the new behavior, and the consumer is informed at review time rather than at incident time. The discipline rewards the producer with fewer surprise consumer pages and rewards the consumer with a guarantee that the range constraints are accurate.
1
SELECT
2
COUNT(*)FILTER(
3
WHEREorder_amount_cents<0
4
)ASnegative_amounts,
5
COUNT(*)FILTER(
6
WHEREorder_amount_cents>10000000
7
)ASover_100k_amounts,
8
COUNT(*)FILTER(
9
WHEREorder_statusNOTIN(
10
'placed',
11
'paid',
12
'shipped',
13
'delivered',
14
'cancelled',
15
'refunded',
16
'returned'
17
)
18
)ASinvalid_statuses,
19
COUNT(*)AStotal_rows
20
FROMfct_orders
21
WHEREorder_date=CURRENT_DATE
Schema Drift: The Producer Changed and Did Not Tell
Schema drift is the slow-motion form of schema failure. The producer adds a new optional column; downstream pipelines parse the old shape and ignore the new column. The producer renames a column; downstream pipelines now read NULL where they used to read data. The producer changes a type from integer to string; SQL casts silently produce zero. Each individual change is small. Each change accumulates. Schema validation catches drift at the moment it crosses a layer boundary, before downstream consumers parse the new shape against the old expectation. The accumulated cost of unchecked drift is the most common reason older pipelines become unmaintainable. After a few years of small undocumented changes, the consumer-facing tables no longer have a schema anyone can describe, the queries against them rely on undocumented type coercion behavior, and any change to the producer carries unknown blast radius. Schema validation prevents the accumulation by making each drift event visible at the moment it happens.
Concrete schema drift incidents in production:
▸A producer renamed customer_id to customer_uuid; six pipelines silently lost their join key
▸A producer changed timestamp from epoch milliseconds to ISO-8601 strings; downstream filters returned zero rows
▸A producer added a new event_type that nobody had whitelisted; analytics counts diverged by 8 percent
▸A producer changed amount from cents to dollars; revenue inflated 100x for one day
Schema drift is the producer-side analog of silent data corruption. Schema validation catches drift at the boundary, before the rest of the pipeline parses against the old expectation.
✓Do
Declare schemas explicitly and version them in the same repo as the pipeline code
Run schema validation at the raw layer immediately after extraction
Pick one tool and use it consistently across pipelines
✗Don't
Treat schema as documentation; treat it as an executable contract
Mix three quality tools across the same pipeline; the cognitive cost compounds
Skip range checks because the type already passes; ranges encode the real constraints
TIP
When integrating a new producer, ask for the schema and the value ranges in writing before the integration ships. The conversation forced by the question often surfaces undocumented constraints that would otherwise leak into production as drift.
Distributional Checks
Daily Life
Interviews
Detect distributional shifts using mean, stddev, quantiles, and category-mix comparisons against a historical baseline.
Schema and row-level checks miss a class of failures where every individual row is structurally valid and the population has shifted. A column whose mean used to be 42.3 and is now 67.8 may signal a real change in the world, or a producer-side bug, or an upstream filter regression. None of the rows are individually wrong. The distribution is wrong. Distributional checks compare summary statistics of the current run against a historical baseline and fire when the comparison crosses a threshold. The class of failures these checks catch is exactly the class that escapes every other check. Row counts are normal; nulls are absent; uniqueness is preserved; schema is intact; freshness is on time. The only signal that anything is amiss is that the distribution of values has shifted in a way that the prior data did not predict. Distributional checks turn that signal into an alert.
Statistics That Earn Their Keep
Statistic
What It Tracks
What a Shift Often Means
Mean
Average value of a numeric column
Producer changed units (USD to EUR, cents to dollars), or a filter changed scope
Standard deviation
Spread of a numeric column
New high-value outliers; a bug that is producing extreme values; sample-size change
Quantiles (p50, p95, p99)
Distribution shape, especially the tail
Tail spike often catches what mean misses; e.g., a fraud burst is in p99, not in mean
Cardinality
Number of distinct values in a categorical column
New values appeared; existing values disappeared; ID generation regression
Category mix
Proportion of each category in a categorical column
One category is suddenly dominant or absent; channel attribution bug
A Real Distribution Shift
Consider an e-commerce checkout pipeline where order_amount_usd is normally distributed around a mean of 84 dollars with a standard deviation near 32 dollars. One Wednesday the mean drops to 71 dollars. No row is invalid; every value is plausible. The cause turns out to be a promo code that started applying twice for one customer segment because of a coupon-stacking bug. Without a distributional check, the only signal is the dashboard line moving down, and the cause hides in the noise of normal day-to-day variation. With a distributional check, the alert fires the same day. The same kind of incident has appeared at every large e-commerce retailer at least once. The variations differ in mechanism: a tax engine bug, a currency conversion regression, a fulfillment cost miscoded as revenue, a coupon stacking error, an A/B test exposure miscount. The shape is identical: structurally valid rows whose distribution does not match history.
1
WITHtodayAS(
2
SELECT
3
AVG(order_amount_usd)ASmean_today,
4
STDDEV(order_amount_usd)ASstddev_today,
5
APPROX_PERCENTILE(
6
order_amount_usd,
7
0.95
8
)ASp95_today
9
FROMfct_orders
10
WHEREorder_date=CURRENT_DATE
11
),
12
baselineAS(
13
SELECT
14
AVG(order_amount_usd)ASmean_baseline,
15
STDDEV(order_amount_usd)ASstddev_baseline
16
FROMfct_orders
17
WHEREorder_dateBETWEENCURRENT_DATE-28
18
ANDCURRENT_DATE-1
19
)
20
21
SELECT
22
ROUND(today.mean_today,2)ASmean_today,
23
ROUND(
24
baseline.mean_baseline,
25
2
26
)ASmean_baseline,
27
ROUND(
28
ABS(
29
today.mean_today-baseline.mean_baseline
30
)/baseline.stddev_baseline,
31
2
32
)ASz_shift
33
FROMtoday
34
CROSSJOINbaseline
Picking a Threshold
The threshold for a distributional check is not arbitrary. A reasonable starting point is to express the shift in standard deviations of the historical baseline (a z-score). A shift of more than two or three baseline standard deviations is worth investigating. The exact threshold depends on the column. A column that varies a lot day-to-day will tolerate a larger z-shift before raising an alert; a column that is normally rock-steady will warrant a tighter threshold. The point is that the threshold is computed from history, not invented. The historical window matters too. A 7-day baseline reacts quickly to recent changes, which is sometimes a feature and sometimes a bug; a 30-day baseline smooths over individual outliers but takes longer to detect persistent drift; a 90-day baseline is stable but unresponsive. The choice depends on how frequently the underlying process actually changes. Fast-moving columns (campaign exposure, inventory levels) prefer shorter windows; slow-moving columns (lifetime value, customer segment shares) prefer longer ones.
Z-Shift Magnitude
Recommended Action
Reasoning
Less than 2
Pass; treat as normal variation
Most days will fall inside this band by definition
2 to 3
Warn; log and notify but do not block
Plausible normal variation but worth a human glance
3 to 5
Block on critical metrics; warn on others
Rare under normal variation; likely a real change
Greater than 5
Block; almost certainly a bug or upstream change
Falls outside historical experience; running on this data is risky
Cardinality and Category Mix
Some of the highest-leverage distributional checks are on categorical columns. A country_code column that has historically had values for forty-seven countries and suddenly has values for forty-eight signals a new market or a producer bug. A device_type column whose mobile share has been steady at 62 percent and is now 81 percent signals a desktop tracking failure. These shifts rarely move the row count or the means by enough to trigger numeric checks, but they encode real changes in the underlying data. Cardinality and category mix are also among the easiest distributional checks to interpret in incident response. A new value appearing in event_type names a specific producer change; a category share shift names which segment of users is behaving differently. The interpretability is part of the leverage: the alert points at the cause rather than at a symptom.
1
SELECT
2
device_type,
3
COUNT(*)ASrows_today,
4
ROUND(
5
100*COUNT(*)/SUM(COUNT(*))OVER(
6
),
7
1
8
)ASpct_today
9
FROMfct_orders
10
WHEREorder_date=CURRENT_DATE
11
GROUPBYdevice_type
12
ORDERBYrows_todayDESC
What Distributional Checks Do Not Catch
Distributional checks are not the right tool for low-volume rare events. A pipeline that processes a million events per day will have stable means and quantiles. A pipeline that processes fifty events per day will have means and quantiles that bounce around for reasons that have nothing to do with quality. The smaller the population, the noisier the statistics, the wider the threshold needed, the less the check tells. Distributional checks earn their keep on tables with daily volumes large enough that day-to-day statistical variation is small. The threshold of usefulness is fuzzy. Tables with thousands of rows per day usually benefit; tables with hundreds usually do not. A common pattern is to compute weekly or monthly aggregates for low-volume tables and run distributional checks at that grain instead, accepting later detection in exchange for a meaningful baseline.
✓Distributional Checks Earn Their Keep When
Daily row count is large (tens of thousands or more)
The columns being tracked have stable behavior over time
Cost of silent shifts is high (revenue, model features, regulated metrics)
Historical data is available to compute baselines
•Distributional Checks Are A Poor Fit When
Daily volumes are small; statistics are dominated by noise
The data is intentionally bursty (campaign events, batch arrivals)
Baselines do not yet exist; the table is too new to model
The cost of false positives outweighs the cost of missed shifts
TIP
Implement distributional checks as a second pass after the cheap checks. The cheap checks catch the high-leverage failures; distributional checks catch the next class. Building both before either is in production is over-engineering.
A row-valid column can still produce a wrong aggregate when its distribution shifts.
Z-scores against a historical baseline are the simplest defensible threshold.
Cardinality and category mix often catch what means and stddevs miss.
Referential Integrity in DW
Daily Life
Interviews
Detect orphan keys with LEFT JOIN NULL filters and choose between block, quarantine, placeholder, and defer strategies.
Operational databases enforce referential integrity through foreign key constraints. A row in the orders table cannot reference a customer_id that does not exist in the customers table because the database refuses to write it. Analytical pipelines do not get this protection for free. Warehouses like Snowflake and BigQuery either do not enforce foreign keys at all or treat them as informational hints. The pipeline becomes responsible for enforcing the integrity that the operational database used to enforce automatically. The cost of skipping this responsibility is orphan keys: rows in a fact table that reference dimension keys nobody has seen. The cost is not always visible. INNER JOINs silently drop orphan rows, which makes downstream metrics quietly understate. LEFT JOINs preserve the rows but produce NULLs in the dimension columns, which makes downstream filters behave unpredictably. The same orphan can produce different wrong answers in two different consumer queries depending on the join shape, which is exactly the failure pattern that erodes trust the fastest.
What An Orphan Key Looks Like
Table
Column
Value
Status
fct_orders
customer_id
c_8421
Present
dim_customer
customer_id
c_8421
Missing (orphan)
fct_orders
customer_id
c_8422
Present
dim_customer
customer_id
c_8422
Present
The first row in fct_orders references a customer_id that does not exist in dim_customer. A LEFT JOIN from fct to dim will produce NULL for the customer columns. An INNER JOIN will silently drop the row. Either way, the consumer sees something other than what the source data implied.
How Orphans Get Created
Cause
Mechanism
Frequency
Race condition during ingestion
Order arrives before the customer is replicated; no recovery
Common in CDC pipelines without transactional ordering
Soft-deleted dimension rows filtered out
Dim load excludes deleted_at IS NOT NULL; fact still references them
Common in analytics filtering on is_active
Dimension table truncated and rebuilt
Mid-rebuild window where dim is empty but fact has rows referencing it
Pipeline design bug; fixable with proper transaction or swap
Producer started using IDs the dim does not yet know
Schema or scope change without dim update
Common at integration boundaries with new partners
1
WITHorphansAS(
2
SELECT
3
f.customer_id
4
FROMfct_ordersASf
5
LEFTJOINdim_customerASdUSING(customer_id)
6
WHEREd.customer_idISNULL
7
)
8
9
SELECT
10
COUNT(DISTINCTcustomer_id)ASorphan_keys,
11
COUNT(*)ASorphan_rows
12
FROMorphans
The pattern is LEFT JOIN with a NULL filter on the dimension's primary key. Any row that survives is an orphan. The check is cheap, runs in seconds at warehouse scale, and produces an exact count.
BlockQuarantinePlaceholderDefer
Block
Halt the publish on any orphan
Strongest authority. Fits critical fact tables where downstream joins must succeed and partial publication misleads consumers.
Quarantine
Side-table the orphan rows
Move orphans to a quarantine table; publish only matched rows. Requires a review process; otherwise quarantines accumulate.
Placeholder
Auto-create an Unknown dim entry
Inserts a sentinel dim row so joins always succeed. Best when matching is best-effort and downstream tolerates Unknown values.
Defer
Wait for the dimension to catch up
Retries the join after a delay. Fits eventually-consistent producers where order arrival can race the dim load.
Strategy
Mechanism
When To Use
Block the publish
Treat the orphan count as a blocking quality check failure
Critical fact tables where joins must succeed
Quarantine the orphan rows
Move orphan rows to a side table; publish only matched rows
Tables where partial publication is acceptable and quarantines are reviewed
Insert a placeholder dim row
Auto-create a 'Unknown' dimension entry for unseen keys
When matching is best-effort and joins must always succeed
Defer publication
Wait until the dimension catches up; retry the join
When the producer ordering is known to be eventually consistent
The inverse problem is dimension rows that no fact references. These are usually less harmful than orphans because they do not break joins; they only inflate the dimension. A dimension that grows without bound while the fact stays the same shape signals a slow leak: rows added that nobody uses, possibly because of a producer side-write that never made it into the fact load. The check is the mirror of the orphan check, and it is worth running periodically rather than continuously.
Multi-Hop Integrity
Real data models have multi-hop relationships: an order references a customer, the customer references a sales region, the sales region references a country. Integrity at each hop is independent. A pipeline can have zero orphans on customer_id and still have orphans on region_id if the customer dimension references regions the region dimension does not contain. The check has to walk every foreign key explicitly. Treating integrity as a one-step check at the immediate parent misses the chain. The same property surfaces in slowly changing dimensions, where the dimension is versioned and a fact may reference an older version of the dimension that has since been replaced. The orphan check has to know which version of the dimension is canonical for the fact's effective date, which is more involved than a simple LEFT JOIN. Most teams handle this by treating the SCD as multiple dimension tables, one per version, and running the orphan check against the version that was current at the fact's timestamp.
Five questions to ask before declaring a fact table referentially clean:
▸Does every key in fct join to its immediate dim
▸Does every dim key in turn join to its parent dim, where applicable
▸Are soft-deleted dim rows excluded; if so, do existing facts still reference them
▸Is the dimension's primary key actually unique (cross-check against uniqueness gates)
▸Has the dim caught up to the fact's most recent rows (timing question, not structural)
✓Do
Run referential integrity checks at the curated layer where joins happen
Walk every foreign key in the data model, not just the immediate parent
Decide quarantine vs block per fact table based on consumer tolerance
✗Don't
Assume the warehouse enforces FK constraints; most do not
Silently INNER JOIN; the dropped rows hide the integrity failure
Treat orphan rows as noise; they encode a real upstream problem
TIP
When designing a fact table, list every dimension it joins to and write the orphan check for each before the table publishes. Adding the check after the first orphan incident is twice the work.
Most warehouse foreign key declarations are informational. Trust the constraint when SQL Server or Postgres enforces it; do not trust it in Snowflake, BigQuery, or Redshift.
Source
source
schema + null check
ingest check
Transform
transform
row-count check
output check
Storage
warehouse
Quality checks live at boundaries: validate on the way in (schema, nulls) and on the way out (row counts, totals). A failing gate stops bad data before it reaches the warehouse.
Test vs Prod: Same Checks
Daily Life
Interviews
Maintain the same quality suite across environments while tuning thresholds to match the data each environment carries.
A common authoring mistake is to write quality checks that pass cleanly in the test environment and then fail repeatedly in production for reasons that have nothing to do with quality. The cause is almost always thresholds. Test data has different volumes, different distributions, and different time windows than production data. The same check can fire on test for trivial reasons and fail to fire on production for real reasons, because the bounds were tuned in the wrong environment. The discipline is to keep the assertions identical and the thresholds environment-aware. The opposite mistake is also common: a check that passes in production because the threshold was set to whatever the production data happened to look like the day the check was written. Such a check assents to whatever the producer is currently emitting, which means it cannot detect drift away from that state. Both failure modes have the same root cause, which is treating thresholds as something to be discovered rather than something to be designed.
What Differs Between Environments
Property
Test Environment
Production Environment
Row volume
Tens to thousands per day; sometimes a frozen snapshot
Thousands to millions per day; live traffic
Row distribution
Synthetic or sampled; often unrepresentative tails
Real distributions including heavy tails and rare events
Time pattern
Tests run on demand or on a fixed schedule
Continuous or daily; calendar effects matter
Schema completeness
Often missing optional columns; partial fixtures
Full schema with every column populated
Failure tolerance
Pipeline can break and rerun freely
Pipeline failure has consumer impact
Same Assertion, Two Thresholds
A row count check that asserts 'today must be within fifty percent and two hundred percent of the seven-day baseline' makes sense in production where the baseline is a meaningful average. In test, the baseline might be the same fixture re-loaded every day, and the threshold is meaningless. The fix is not to disable the check in test. The fix is to keep the assertion identical and to load test thresholds from a different file or environment variable. The check's logic stays the same; the bounds change.
Why Tests Need Looser Bounds, Not Different Checks
Looser bounds in test reflect that the test environment cannot meaningfully assert the same things production can. A test fixture with twelve rows cannot have a meaningful null rate baseline. A test fixture loaded once per quarter cannot have a meaningful freshness check. The assertion that 'customer_id is not null' is the same in both environments; the threshold for what counts as a violation is loosened in test because the test data is not statistically meaningful. Disabling the check entirely loses coverage; tightening the threshold creates flakes; loosening the threshold preserves the signal that the check still runs. The principle generalizes: test environments verify that the check executes and that the pipeline plumbing connects the check to the gate. Production environments verify that the data conforms. Both verifications are necessary and address different failure modes. A check that has never executed in test is unlikely to execute correctly in production the first time; a check tuned to test thresholds will not catch production failures.
•Wrong Way: Different Checks Per Environment
Test runs five checks; production runs eleven
Test does not catch what production does, even structurally
A check passes in test and is undefined in production
Adding a new check requires updating two suites
✓Right Way: Same Checks, Different Thresholds
Both environments run the same eleven checks
Test verifies the check executes; production verifies the data
A check that breaks in production also breaks in test
Thresholds are config; the suite is code
What Test Data Should Cover
Test Case
What It Verifies
How It Is Loaded
Happy path
Pipeline runs end-to-end on representative data
Sampled or synthetic rows in expected proportions
Schema-violation row
Schema check actually fires when the producer breaks the contract
A fixture row with a NULL where NOT NULL is declared
Distributional shift
Distributional check fires when the data shifts
A fixture day with values shifted by 5+ stddev
Orphan key
Referential integrity check actually catches an orphan
A fact row whose dim entry is intentionally missing
Empty input
Pipeline behaves correctly when no rows arrive
An empty source
The Staging Environment In Between
Staging environments sit between test and production. They run on a copy of production data, often anonymized or sampled. Staging thresholds are usually closer to production thresholds than to test thresholds because the data is statistically representative. Quality checks running in staging are a meaningful preview of how they will behave in production. A check that flaps in staging will flap in production; a check that is silent in staging will be silent in production. Staging is the right environment to tune thresholds against real distributions before flipping the suite on for production.
1
# Quality runner picks thresholds based on environment
▸Pretend the production volume by replaying real traffic onto the test environment
▸Skip checks because the test fixture is too small to make them meaningful
▸Use random IDs that violate the producer's actual ID generation rules
▸Be loaded once and never refreshed; stale fixtures train stale assumptions
TIP
Treat the threshold file as a piece of code reviewed in a PR. Threshold changes are interesting and worth a second pair of eyes. Magic numbers without commit history are how quality suites silently degrade.
Same checks across environments; thresholds vary because the data is statistically different.
Disabling a check in test means production runs a check the test environment never exercised.
Quality Suite for Events
Daily Life
Interviews
Author a quality suite for a high-volume events table covering all five pillars with appropriate authority and layer placement.
The exercise puts the lesson into a single concrete deliverable. The target is a customer events table at a SaaS company that loads roughly two million events per day across four event types. The table feeds a dashboard, a churn model, and a billing report. The deliverable is a complete quality suite covering all five quality pillars: freshness, volume, distribution, schema, and lineage hint. The suite stays small enough to ship but covers the failure modes that have actually shown up in mature production systems. Two million events per day is roughly twenty-three per second, which is large enough that statistical checks are meaningful and small enough that the warehouse cost of running the checks is negligible. The shape of the table is representative of the most common high-volume analytical surface: an append-mostly fact with a handful of categorical and numeric columns, joined to dimensions for context. Most production quality suites at companies of similar scale resemble the one assembled below.
The pipeline lands events hourly. The freshness SLA is two hours: MAX(event_timestamp) must be within two hours of CURRENT_TIMESTAMP at the moment the gate runs. The check fires as a blocking gate at the raw layer because stale data here cascades into stale dashboards, model features, and billing entries.
1
/* Freshness gate (blocking) */
2
SELECT
3
MAX(event_timestamp)ASlatest_event,
4
CURRENT_TIMESTAMPASnow_ts,
5
EXTRACT(EPOCHFROM(
6
CURRENT_TIMESTAMP-MAX(
7
event_timestamp
8
)
9
))/60ASminutes_behind,
10
CASE
11
WHENCURRENT_TIMESTAMP-MAX(
12
event_timestamp
13
)<=INTERVAL'2 hours'THEN'pass'
14
ELSE'fail'
15
ENDASgate_result
16
FROMcurated.fct_customer_events
Pillar 2 (volume) is row count over a window using the day-of-week baseline from the beginner tier; it blocks on underage and warns on overage. Pillar 3 (distribution) covers amount_usd (z-shift greater than three blocks the publish) and event_type category mix (proportions must stay within five percentage points of the trailing four-week baseline).
1
/* Distributional gate: amount_usd mean shift in stddev units */
2
WITHtodayAS(
3
SELECT
4
AVG(amount_usd)ASmean_today
5
FROMcurated.fct_customer_events
6
WHEREingest_date=CURRENT_DATE
7
),
8
baselineAS(
9
SELECT
10
AVG(amount_usd)ASmean_baseline,
11
STDDEV(amount_usd)ASsd_baseline
12
FROMcurated.fct_customer_events
13
WHEREingest_dateBETWEENCURRENT_DATE-28
14
ANDCURRENT_DATE-1
15
)
16
17
SELECT
18
ROUND(
19
ABS(
20
today.mean_today-baseline.mean_baseline
21
)/baseline.sd_baseline,
22
2
23
)ASz_shift,
24
CASE
25
WHENABS(
26
today.mean_today-baseline.mean_baseline
27
)/baseline.sd_baseline>=5THEN'block'
28
WHENABS(
29
today.mean_today-baseline.mean_baseline
30
)/baseline.sd_baseline>=3THEN'warn'
31
ELSE'pass'
32
ENDASgate_result
33
FROMtoday
34
CROSSJOINbaseline
Pillar 4 (schema) asserts the declared column shape: columns exist, types match, NOT NULL columns have zero nulls, and event_type contains only the four declared values. Schema checks run at the raw layer so failures are caught before the curated transform parses against the wrong shape.
1
#SCHEMACHECKexpressedINdbttestsversion:2models:-name:fct_customer_eventscolumns:-name:event_idtests:-not_null-UNIQUE-name:customer_idtests:not_null-name:event_typetests:-not_null-accepted_values:VALUES:['signup','login','purchase','churn']-name:event_timestamptests:not_null-name:amount_usdtests:-dbt_utils.expression_is_true:expression:'amount_usd IS NULL OR amount_usd >= 0'
Pillar 5 (lineage hint) at this tier is not full lineage tracking (advanced tier addresses observability lineage in depth). The hint asserts that every event in curated can be traced to a row in raw with the same event_id. A curated row with no matching raw row signals a transform bug that synthesized data; a raw row with no matching curated row signals a row that the transform dropped without explanation. The check is a pair of LEFT JOINs. The pair is asymmetric in importance. Synthesized rows (curated without raw) are almost always a bug and should block the publish. Dropped rows (raw without curated) are sometimes intentional (the transform filters out test accounts, fraud, or invalid events), so the check warns rather than blocks; a sudden jump in dropped rows is worth a human glance even when the dropping itself is by design.
1
/* Lineage hint: every curated row has a raw counterpart, and vice versa */
▸Catches: new event_type without notice (schema); synthesized or dropped rows (lineage)
▸Misses: correlated drift in two columns simultaneously (advanced statistical methods)
▸Misses: correctness against an external source of truth (needs reconciliation passes)
▸Misses: business-logic errors that produce structurally-clean output (tax bug with right shape)
✓Do
Cover all five pillars before declaring a quality suite complete
Match each pillar's authority (block or warn) to the consumer impact of failure
Run schema and freshness at the raw layer; volume, distribution, lineage at curated
✗Don't
Skip lineage at the intermediate tier; it is the cheapest way to catch transform synthesis
Combine pillars into a single check; debugging is easier when each pillar fires alone
Reuse the same threshold for amount_usd and amount_cents; units matter
TIP
Build the suite in pillar order: freshness, volume, schema, distribution, lineage. Each pillar catches a different class of failure; investing in the next pillar is justified by the value of the failures it catches that the prior pillars miss.
❯❯❯PUTTING IT ALL TOGETHER
> A SaaS data platform team owns fct_customer_events, the table that powers leadership dashboards, the churn model, and the monthly billing report. The team has the four cheap checks from the beginner tier in place. After two near-misses involving structurally-valid rows that produced wrong aggregates, the lead asks the team to expand the suite so the next near-miss is caught instead of celebrated.
Add schema validation that asserts required fields, types, nullability, and accepted values. Pick one tool (dbt tests, Great Expectations, or Soda) and use it consistently. The schema is the producer's commitment to the consumer made executable.
Add distributional checks for the columns where structurally-valid shifts have caused incidents. Use z-scores against a historical baseline; warn at z=3, block at z=5. Distributional checks earn their keep on high-volume tables and are noise on low-volume ones.
Add referential integrity checks at the curated layer. The warehouse does not enforce foreign keys; the pipeline must. LEFT JOIN with NULL filter is the canonical orphan check; quarantine, block, placeholder, and defer are the four valid responses.
Maintain the same checks across environments and parameterize the thresholds. The check vocabulary is code; the thresholds are configuration. This connects to the layered architecture introduced in Lesson 1: each layer carries its own quality gates rather than a single end-of-pipeline check.
Cover all five pillars (freshness, volume, distribution, schema, lineage hint) before declaring the suite complete. The suite is the producer's evidence to the consumer that the data has been continuously proved correct.
KEY TAKEAWAYS
Schema validation is the executable contract: required fields, types, nullability, ranges, accepted values. Pick one tool and use it consistently.
Distributional checks catch what row-level checks miss: z-scores against a historical baseline detect shifts that no individual row violates.
Warehouses do not enforce foreign keys: the pipeline must. LEFT JOIN NULL filter detects orphans; the response is one of block, quarantine, placeholder, or defer.
Same checks across environments, different thresholds: test data is statistically different from production; the assertion stays identical, the bounds vary.
A complete suite covers all five pillars: freshness, volume, distribution, schema, and at least a lineage hint. Anything less leaves a known class of failures unwatched.
Schema, distribution, and referential integrity turn cheap checks into a real suite
Category
Pipeline Architecture
Difficulty
intermediate
Duration
32 minutes
Challenges
0 hands-on challenges
Topics covered: Schema Validation Basics, Distributional Checks, Referential Integrity in DW, Test vs Prod: Same Checks, Quality Suite for Events
Schema validation is the second-most-common failure mode after row count, and it is the easiest to express. A schema check asserts that every row in a table conforms to a declared shape: column names exist, types match, nullability matches, and values fall in declared ranges. A row that fails a schema check is, by definition, a row the pipeline should not have produced. Schema validation is also where the conversation with the producer starts, because the schema is the producer's commitment to t
Schema and row-level checks miss a class of failures where every individual row is structurally valid and the population has shifted. A column whose mean used to be 42.3 and is now 67.8 may signal a real change in the world, or a producer-side bug, or an upstream filter regression. None of the rows are individually wrong. The distribution is wrong. Distributional checks compare summary statistics of the current run against a historical baseline and fire when the comparison crosses a threshold. T
Operational databases enforce referential integrity through foreign key constraints. A row in the orders table cannot reference a customer_id that does not exist in the customers table because the database refuses to write it. Analytical pipelines do not get this protection for free. Warehouses like Snowflake and BigQuery either do not enforce foreign keys at all or treat them as informational hints. The pipeline becomes responsible for enforcing the integrity that the operational database used
A common authoring mistake is to write quality checks that pass cleanly in the test environment and then fail repeatedly in production for reasons that have nothing to do with quality. The cause is almost always thresholds. Test data has different volumes, different distributions, and different time windows than production data. The same check can fire on test for trivial reasons and fail to fire on production for real reasons, because the bounds were tuned in the wrong environment. The discipli
The exercise puts the lesson into a single concrete deliverable. The target is a customer events table at a SaaS company that loads roughly two million events per day across four event types. The table feeds a dashboard, a churn model, and a billing report. The deliverable is a complete quality suite covering all five quality pillars: freshness, volume, distribution, schema, and lineage hint. The suite stays small enough to ship but covers the failure modes that have actually shown up in mature