Data Quality and Contracts: Advanced

A public-company data platform team had every quality check the intermediate tier prescribes. Schema validation, distributional checks, referential integrity, freshness gates, the whole suite. The pages still came in volume. On-call rotations were burning out. Half the alerts were technically correct: a column did shift, a row count did dip, a freshness gate did miss its threshold. None of those alerts represented an actual production problem; they represented threshold drift, calendar effects, and producer behaviors that nobody had told the consumer about. The other half of the alerts were silent failures the suite missed because the assertions had never been written. The team had quality engineering. It did not have quality discipline. The path from one to the other has three components: contracts that name what is committed, observability that names what is happening, and threshold tuning that names what is normal. This lesson covers all three, plus the rollout problem of getting there from a legacy pipeline that does none of them.

What you will be able to do

Apply data contracts as producer-consumer commitments enforced in CI rather than discovered in production

Distinguish the five pillars of data observability and apply them as a diagnostic framework

Roll out contracts on a legacy pipeline without breaking the consumers that depend on its current behavior

Data Contracts and CI Enforcement

Author a data contract with schema, guarantees, and evolution policy, and enforce it in producer CI plus pipeline gates.

Lesson 1's advanced tier introduced the pipeline-as-product framing: a pipeline has a contract that names producer, consumer, schema, freshness SLA, quality SLA, backfill policy, and deprecation policy. This section turns that framing into a working mechanism. A data contract is the executable form of the commitment. The producer commits to a shape and a set of guarantees; the consumer relies on them; the contract is checked in CI on every change so violations cannot ship. Without enforcement, contracts are documentation. With enforcement, they are the layer that prevents most of the silent failures the previous lessons spent so much effort detecting.

The shift from documentation to enforcement is the central move. Documentation rots. CI does not. A contract that is a YAML file in a wiki page is fiction within six months because the producer evolves and the wiki does not. A contract that is a YAML file in the producer's repo, validated by a CI step, evolves at the speed the producer evolves, because a producer change that contradicts it cannot merge. The enforcement step is the one that changes the failure rate; the document was always achievable.

What Makes A Contract A Contract

Property	Implicit Agreement	Data Contract
Form	Tribal knowledge; Slack threads; an outdated wiki page	A versioned file checked into the producer's repo
Enforcement	Discovered after a consumer breaks	Checked in CI on every producer change
Visibility	Whoever has been around long enough to know	Discoverable by every consumer through a registry or catalog
Versioning	Implicit; whatever was true the last time someone looked	Semver-style; breaking changes require a new major version
Failure mode	Production incident	Failed CI run; producer fixes before merge

Anatomy Of A Contract

	# A producer-side contract for the customer events stream
	contract:
	  name: customer_events
	  version: 2.3.0
	  producer:
	    team: platform-events
	    repo: github.com/example/platform-events
	    on_call: '#platform-events-oncall'
	  schema:
	    primary_key: event_id
	    fields:
	      - { name: event_id, type: string, nullable: false }
	      - { name: customer_id, type: string, nullable: false }
	      - { name: event_type, type: string, nullable: false, accepted_values: [signup, login, purchase, churn] }
	      - { name: event_timestamp, type: timestamp, nullable: false }
	      - { name: amount_usd, type: decimal, nullable: true, range: { min: 0, max: 1000000 } }
	  guarantees:
	    freshness: '<= 5 minutes p99'
	    volume:
	      daily_min: 1000000
	      daily_max: 5000000
	    uniqueness: event_id
	    delivery: at_least_once
	  consumers:
	    - team: analytics-engineering
	      use_case: leadership KPI dashboard
	    - team: ml-platform
	      use_case: churn prediction features
	    - team: finance
	      use_case: monthly billing aggregation
	  evolution_policy:
	    additive_changes: minor_version
	    breaking_changes: major_version_with_90d_notice
	    field_deletions: forbidden_in_minor

Where The Contract Is Enforced

Stage	What Is Checked	Failure Means
Producer pre-commit	Producer code emits events that match the contract schema	Pre-commit hook rejects the change
Producer CI	Schema, accepted values, ranges, primary key uniqueness	PR cannot merge; producer fixes the violation
Pipeline ingestion gate	Real events conform to the contract at the boundary	Quality gate halts; producer is paged
Consumer CI	Consumer code reads the contract version it depends on	Consumer build fails; consumer pins or upgrades
Registry validation	Contract version is registered and discoverable	Deploy is blocked until registration succeeds

Producer and consumer bind through explicit version pinning. The producer publishes a contract version; the consumer pins to a version it has tested against. A consumer that depends on customer_events 2.3.0 declares that dependency in code and builds against it. When the producer publishes 2.4.0 (additive minor), the consumer's build still passes because 2.4.0 is backward-compatible with 2.3.0. When the producer wants to publish 3.0.0 (breaking), the contract evolution policy requires a 90-day notice and a major version bump; consumers receive notification and have time to migrate. This is the same problem that semantic versioning (semver) solved for software libraries; data contracts apply the solution to data.

	# Consumer code declares its contract dependency explicitly.
	# `stream` and `process_event` are supplied by the consumer's own application.
	from contracts import customer_events

	# Pin to a specific version; build fails if version is removed or unavailable
	schema = customer_events.load(version='2.3.0')

	for event in stream.read(schema=schema):
	    # Type-checked against the pinned schema
	    process_event(event)
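The compatibility rule behind the pin can be sketched in a few lines. This is a minimal illustration, assuming plain `MAJOR.MINOR.PATCH` version strings; `parse_version` and `is_compatible` are hypothetical helper names, not part of any contract library.

```python
# Hypothetical helpers illustrating the semver rule; not a real contract API.
def parse_version(version: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_compatible(pinned: str, published: str) -> bool:
    """Return True if a consumer pinned to `pinned` can safely read `published`.

    Same major version, and a published version at or above the pin,
    means only additive or patch changes have happened since the pin.
    """
    pinned_v = parse_version(pinned)
    published_v = parse_version(published)
    return published_v[0] == pinned_v[0] and published_v >= pinned_v


print(is_compatible("2.3.0", "2.4.0"))  # additive minor bump
print(is_compatible("2.3.0", "3.0.0"))  # breaking major bump
```

The first call reports the versions as compatible and the second does not, which mirrors the 2.4.0 and 3.0.0 cases described above.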

What CI Enforcement Looks Like

	# .github/workflows/contract-check.yml on the producer repo
	name: contract-check
	on: pull_request
	jobs:
	  validate-contract:
	    runs-on: ubuntu-latest
	    steps:
	      - uses: actions/checkout@v4
	      - name: Validate contract syntax
	        run: contract-cli validate contracts/customer_events.yaml
	      - name: Detect breaking changes vs main
	        run: contract-cli diff origin/main contracts/customer_events.yaml
	      - name: Generate test fixtures from contract
	        run: contract-cli fixtures contracts/customer_events.yaml > /tmp/fixtures.json
	      - name: Run producer code against generated fixtures
	        run: pytest tests/contract_compliance/
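What a breaking-change detection step might do internally can be sketched as a field-level diff between two contract versions. The sketch below is an assumption about the logic, not the behavior of any particular CLI; `diff_fields` is a hypothetical name, and treating a field that becomes nullable as breaking is a policy choice made here for illustration.

```python
# Hypothetical field-level diff between two contract versions.
def diff_fields(old: dict[str, dict], new: dict[str, dict]) -> dict[str, list[str]]:
    """Classify changes as breaking or additive from the consumer's point of view."""
    changes: dict[str, list[str]] = {"breaking": [], "additive": []}
    for name, old_spec in old.items():
        if name not in new:
            changes["breaking"].append(f"removed field: {name}")
            continue
        new_spec = new[name]
        if new_spec["type"] != old_spec["type"]:
            changes["breaking"].append(f"type changed: {name}")
        # Assumption: consumers relied on non-null values, so loosening is breaking.
        if new_spec["nullable"] and not old_spec["nullable"]:
            changes["breaking"].append(f"became nullable: {name}")
    for name in new:
        if name not in old:
            changes["additive"].append(f"added field: {name}")
    return changes


main_fields = {
    "event_id": {"type": "string", "nullable": False},
    "amount_usd": {"type": "decimal", "nullable": True},
}
branch_fields = {
    "event_id": {"type": "string", "nullable": False},
    "amount_usd": {"type": "string", "nullable": True},
    "channel": {"type": "string", "nullable": True},
}
result = diff_fields(main_fields, branch_fields)
print(result)
```

A CI step built on this idea would fail the pull request whenever the breaking list is non-empty and the contract's major version has not been bumped.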

When Contracts Pay For Themselves

•Pre-Contract World

Producer changes a column type; consumers break in production
Schema rolls forward without consumer awareness
Producers and consumers debate root cause during the incident
Adding a new consumer means archeology to find the actual schema

✓Post-Contract World

Breaking change is rejected at producer CI; never reaches consumers
Schema evolution follows semver; consumers see only versions they pin to
Root cause is named by the contract version; debate is short
New consumers read the contract registry and integrate without archeology

What Contracts Cannot Do Alone

Contracts catch shape violations and explicitly-named guarantee violations. They do not catch business-logic correctness on conforming data. A contract that says amount_usd is a decimal in the range zero to one million does not catch an amount that is the wrong number for the underlying transaction. The quality suite from the intermediate tier remains necessary. Contracts and quality suites are complements: contracts prevent shape failures, quality suites detect distribution-and-content failures. A serious data platform runs both.
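The limit is easy to see in a small sketch of a shape check. The validator below hard-codes the fields, accepted values, and range from the customer_events contract shown earlier; `contract_violations` is a hypothetical helper, and the example event is invented for illustration.

```python
from decimal import Decimal

# Values copied from the customer_events contract; the function name is hypothetical.
ACCEPTED_EVENT_TYPES = {"signup", "login", "purchase", "churn"}
REQUIRED_FIELDS = ("event_id", "customer_id", "event_type", "event_timestamp")


def contract_violations(event: dict) -> list[str]:
    """Return the shape violations for one event; an empty list means it conforms."""
    violations = []
    for field in REQUIRED_FIELDS:
        if event.get(field) is None:
            violations.append(f"{field} must not be null")
    event_type = event.get("event_type")
    if event_type is not None and event_type not in ACCEPTED_EVENT_TYPES:
        violations.append("event_type not in accepted values")
    amount = event.get("amount_usd")
    if amount is not None and not (Decimal("0") <= amount <= Decimal("1000000")):
        violations.append("amount_usd out of range")
    return violations


# Suppose the real charge was 49.99 but a producer bug recorded 4999.00.
wrong_but_conforming = {
    "event_id": "e-1",
    "customer_id": "c-9",
    "event_type": "purchase",
    "event_timestamp": "2026-01-05T10:00:00Z",
    "amount_usd": Decimal("4999.00"),
}
print(contract_violations(wrong_but_conforming))
```

The event passes every shape check even though the amount is wrong for the transaction, which is the gap the distribution checks in the quality suite are there to cover.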

Signs that a contract program is mature:

▸Producers declare contracts before consumers integrate, not after
▸Schema evolution follows a documented semver-like policy
▸CI rejects breaking changes; production gates rarely have to
▸Consumer code declares contract dependencies as explicit pins
▸A registry exists where any team can discover what contracts are available

✓Do

Version contracts with semver; additive minor, breaking major, fixed-bug patch
Enforce contracts in producer CI; production gates are the second line of defense
Publish contracts to a registry that consumers can discover programmatically

✗Don't

Treat contracts as documentation; without enforcement they accumulate as fiction
Allow breaking changes in minor versions; consumers will silently break
Couple the contract format to one tool; contracts outlast any tool decision

TIP

Start contracts at the highest-leverage producer-consumer boundary, not at every boundary at once. The first contract teaches the team how to author and enforce; the second through tenth follow the pattern with little marginal cost.

Five Pillars of Observability

Apply the five pillars of data observability as a diagnostic framework for incident response and as a coverage map for designing quality programs.

Barr Moses and the Monte Carlo Data team named the five pillars of data observability: freshness, distribution, volume, schema, and lineage. The naming has caught on widely enough that conversations about quality use it as shorthand. The pillars are useful because they are not a checklist; they are a diagnostic framework. When something is wrong with the data, the pillar that detected the symptom narrows the search for the cause. When designing a quality program, the pillars name the gaps that have to be filled before the program can be called observable rather than merely instrumented. The underlying idea is older than the pillar names. Software observability matured along the same lines: metrics, logs, and traces are a similar three-axis decomposition, where each axis answers a different diagnostic question. The data version has more axes because data has more independent dimensions of failure. Volume can be wrong without distribution being wrong; schema can shift without freshness slipping; lineage can be unknown even when every other pillar is green.

The Five Pillars

Freshness

Is the data current

Time between latest available record and now. Asks whether the pipeline is keeping up. The simplest pillar to measure and the first one consumers notice.

Distribution

Are values within the expected range and shape

Statistical properties of columns: mean, stddev, quantiles, cardinality, category mix. Catches shifts that no individual row violates.

Volume

Is the right amount of data arriving

Row counts compared to historical baselines. Catches dropped partitions, broken filters, exploded joins. Often the first signal of an upstream problem.

Schema

Is the shape of the data what was promised

Columns, types, nullability, accepted values, ranges. The producer-side commitment made executable.

Lineage

Where does this data come from and what reads it

Upstream and downstream relationships at column granularity. Turns 'a number changed' into 'a number changed because this transform changed'.

Pillars As A Diagnostic Framework

When a consumer reports a wrong number, a senior engineer walks the pillars in order and uses each one to either narrow or rule out a class of cause. Freshness rules out 'is the data even recent.' Volume rules out 'did the right amount of data arrive.' Schema rules out 'is the shape correct.' Distribution rules out 'are the values within their normal range.' Lineage answers 'what produced this column and what depends on it.' The walk takes minutes and replaces the unstructured 'try things until something works' debugging that consumes hours when each pillar has to be checked manually.

Symptom	First Pillar To Check	Why
Dashboard shows last week's data on Monday morning	Freshness	Most likely an ingestion stall; rules in or out the simplest cause
Revenue dropped 15 percent overnight	Volume	Sudden numeric drops are usually missing rows, not changed values
Average order amount is up but row count is steady	Distribution	Aggregate change without row count change implies value shift
Pipeline succeeded but downstream join produces nulls	Schema	Type or nullability mismatch is the typical cause of post-success join failures
Two consumers see different numbers from the same source	Lineage	Different consumers may read different downstream tables; lineage exposes the divergence

A pillar is a category. A check is a specific assertion within a pillar. The pillar 'distribution' contains many checks: mean shift, stddev shift, p99 shift, category mix shift, cardinality shift. The framework value of pillars is in coverage, not in implementation. A program that has a hundred checks within four pillars and zero checks in the fifth is a program with a known blind spot.
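Treating pillars as a coverage map can be sketched as a simple set difference over a check inventory. The inventory format and the name `uncovered_pillars` are assumptions made for this example.

```python
# The five pillars from this section; the check inventory below is illustrative.
PILLARS = ("freshness", "distribution", "volume", "schema", "lineage")


def uncovered_pillars(checks: list[dict]) -> list[str]:
    """Return the pillars that have no check at all, in canonical order."""
    covered = {check["pillar"] for check in checks}
    return [pillar for pillar in PILLARS if pillar not in covered]


inventory = [
    {"name": "fct_orders_updated_by_6am", "pillar": "freshness"},
    {"name": "fct_orders_row_count_vs_baseline", "pillar": "volume"},
    {"name": "amount_usd_mean_shift", "pillar": "distribution"},
    {"name": "amount_usd_p99_shift", "pillar": "distribution"},
    {"name": "event_type_accepted_values", "pillar": "schema"},
]
print(uncovered_pillars(inventory))
```

However many checks are added to the four covered pillars, the function keeps reporting lineage as the blind spot until at least one check exists there.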

Lineage: The Pillar Most Programs Skip

Most quality programs cover four pillars well and lineage poorly. The reason is cost. Lineage at table granularity is moderately expensive; lineage at column granularity is expensive; lineage that updates as transforms evolve is expensive to keep current. The payoff is that lineage transforms incident response. A column-level lineage system answers questions like 'what consumers depend on amount_usd in fct_orders' in seconds. Without lineage, the same question takes hours of grep-and-Slack archaeology. Tools like dbt, Dagster, and standalone catalogs (DataHub, OpenMetadata) cover this pillar; the discipline is in adopting and keeping them current.

	Column-level lineage example for fct_orders.amount_usd:

	raw.events.amount -> curated.fct_orders.amount_usd -+-> mart.daily_revenue.revenue_usd
	                                                    |
	                                                    +-> feature_store.user_features.lifetime_spend
	                                                    |
	                                                    +-> reverse_etl.salesforce.account_revenue

Reading the diagram backward from a consumer answers 'where does this number come from.' Reading it forward from a source column answers 'who is affected if this column changes.' Both questions show up in incident response and in change reviews. Lineage is the pillar that ties the other four together.
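Both readings are graph traversals. The sketch below encodes the example lineage as an adjacency map and walks it in each direction; the data structure and the function names `reachable` and `invert` are assumptions for illustration, not the API of any catalog tool.

```python
from collections import deque

# The example lineage above, as source column -> directly dependent columns.
EDGES = {
    "raw.events.amount": ["curated.fct_orders.amount_usd"],
    "curated.fct_orders.amount_usd": [
        "mart.daily_revenue.revenue_usd",
        "feature_store.user_features.lifetime_spend",
        "reverse_etl.salesforce.account_revenue",
    ],
}


def reachable(start: str, edges: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning every column reachable from `start`."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


def invert(edges: dict[str, list[str]]) -> dict[str, list[str]]:
    """Flip edge direction so the same walk answers 'where does this come from'."""
    inverted: dict[str, list[str]] = {}
    for source, targets in edges.items():
        for target in targets:
            inverted.setdefault(target, []).append(source)
    return inverted


# Forward: who is affected if the source column changes.
print(reachable("raw.events.amount", EDGES))
# Backward: where does the dashboard number come from.
print(reachable("mart.daily_revenue.revenue_usd", invert(EDGES)))
```

The forward walk is the one a change review needs; the backward walk is the one an incident responder starts with.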

When To Adopt Each Pillar

Maturity Stage	Pillars In Place	What The Team Can Answer
Stage 1: Cheap checks	Volume, freshness	Did the right amount of data arrive on time
Stage 2: Suite	Volume, freshness, schema, distribution	Is the data structurally and statistically as expected
Stage 3: Observable	All five pillars including lineage	What changed, why, and who is affected
Stage 4: Contract-enforced	All five pillars plus contracts in CI	Same as Stage 3, but most failures cannot ship

•Without Lineage

Incident response starts with 'who owns this column'
Change reviews miss downstream consumers
Deprecation requires manual canvassing
Two consumers compute conflicting numbers; cause is hidden

✓With Lineage

Incident response starts with the lineage graph; ownership is metadata
Change reviews automatically surface affected consumers
Deprecation walks the graph and notifies every downstream
Conflicting numbers are explained by divergent transforms in the graph

TIP

When a quality program is in place but the team still spends hours diagnosing incidents, the missing investment is almost always lineage. The other four pillars detect; lineage interprets.

The five pillars are a framework for coverage and a diagnostic walk for incidents.

Most programs underinvest in lineage; the cost shows up in incident response time.

Pillars are categories; specific checks live within them. Coverage means hitting all five categories, not running one check.

The pillars are descriptive, not prescriptive. A program with twenty schema checks and zero distribution checks is not 'four-fifths observable'; it is observable on schema and blind on distribution. Coverage is binary per pillar.

Quality SLAs vs Ops SLAs

Distinguish operational from quality SLAs and state both as separate commitments with separate measurements and improvements.

An SLA states a commitment. The pipeline-as-product framing from Lesson 1 introduced two SLAs as elements of the contract: the freshness SLA, which is the operational commitment, and the quality SLA. They are commonly conflated. They are different commitments to different things, with different consequences when they fail. A pipeline that meets its operational SLA can fail its quality SLA while every run reports green. A pipeline that meets its quality SLA can miss its operational SLA without affecting correctness. The producer who treats both as one number ends up over-promising on one and under-detecting failure of the other.

The conflation has visible consequences. Status pages that report a single uptime number describe the operational SLA exclusively, leaving consumers with no way to distinguish 'late but correct' from 'on time but wrong'. Incident reviews that do not separate the two end up with action items that improve one without addressing the other. The split SLA is more honest, and honesty in producer commitments is the foundation of the trust that makes data products usable.

The Two SLAs

Property	Operational SLA	Quality SLA
Question answered	Did the data arrive on time	Was the data correct
Typical statement	Pipeline finishes by 6am every day	Row count within 50 to 200 percent of baseline; null rate below 1 percent
Failure mode	Late or missing run	Wrong numbers in a successfully-completed run
Detected by	Orchestrator monitoring; missed schedule alerts	Quality gates inside the pipeline
Consumer impact	Dashboard or model shows yesterday's data	Dashboard or model shows wrong data

Why Conflating Them Hurts

An operational SLA of 'fresh by 6am' tells the consumer when to expect the data. A quality SLA of 'correct row counts by 6am' tells the consumer when to expect the data to be both fresh and right. The two are independent. A pipeline can meet 'fresh by 6am' with a 30 percent row count drop. A pipeline can have a flawless row count and miss the 6am deadline because the warehouse was slow. Consumers who hear 'the team has a 6am SLA' assume both meanings are guaranteed. Producers who state a 6am SLA often mean only operational. The conversation has to specify which one, or both.

Stating Both Explicitly

	# Both SLAs stated separately in the contract
	guarantees:
	  operational_sla:
	    statement: 'fct_orders updated by 06:00 Pacific each day'
	    measurement: 'orchestrator-reported finish time'
	    target: 99.0
	    window: 'rolling 30 days'
	  quality_sla:
	    statement: 'all five-pillar gates pass on the run that satisfies the operational SLA'
	    measurement: 'count of successful runs with all gates green / total runs'
	    target: 99.5
	    window: 'rolling 30 days'
	  combined_sla:
	    statement: 'fresh AND correct by 06:00 Pacific'
	    target: 98.5

What The Combined SLA Actually Costs

The combined SLA is the product of the two, assuming operational and quality failures occur independently. A 99.0 percent operational SLA and a 99.5 percent quality SLA produce a combined SLA of about 98.5 percent. Promising both individually at 99.0 percent and treating that as the joint guarantee is mathematically wrong; under the same assumption the joint number is about 98.0 percent. The cost compounds further when consumer behavior is sensitive to the combined number: a model that retrains daily on potentially-stale or potentially-wrong data needs to be designed to tolerate the combined error rate, not either component alone.

Operational	Quality	Combined (Independent Failures)
99.0%	99.0%	98.0%
99.5%	99.5%	99.0%
99.9%	99.9%	99.8%
99.99%	99.99%	99.98%
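The rows above can be reproduced with one multiplication. The sketch assumes operational and quality failures are independent; correlated failures would change the joint number. `combined_sla` is a hypothetical helper name.

```python
def combined_sla(operational_pct: float, quality_pct: float) -> float:
    """Probability (in percent) that a run is both on time and correct,
    assuming the two kinds of failure are independent."""
    return operational_pct * quality_pct / 100.0


# Reproduce the table rows, then the 99.0 / 99.5 pairing from the contract.
for target in (99.0, 99.5, 99.9, 99.99):
    print(f"{target}% x {target}% -> {combined_sla(target, target):.2f}%")

print(f"99.0% x 99.5% -> {combined_sla(99.0, 99.5):.3f}%")
```

Publishing the product rather than either component keeps the status page aligned with what a consumer actually experiences: a run counts only if it is both fresh and correct.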

Designing around each SLA requires different investments. The operational SLA is improved by orchestration investments: warm pools, retry policies, redundant scheduling, removing single points of failure in compute. The quality SLA is improved by suite investments: more pillars covered, tighter thresholds tuned against history, contracts that prevent shape failures from shipping. The two require different teams in some organizations and different budgets in most. Treating them as one budget produces under-investment in whichever one is currently considered solved.

Operational SLA

On-time delivery commitment

Pipeline finishes by a stated deadline. Measured by orchestrator finish time. Improved by warm pools, retry budgets, and redundant scheduling.

Quality SLA

Correctness commitment

Five-pillar gates pass on the run. Measured by green-gate run rate. Improved by suite coverage, tuned thresholds, and contracts in CI.

Combined SLA

Fresh AND correct

The product of the two. The honest number to publish on a status page; a joint promise above it is one the components cannot back.

•Operational SLA Improvements

Warm pools and pre-provisioned compute
Retry budgets with exponential backoff
Redundant orchestrator instances
Critical-path identification and short-circuiting

✓Quality SLA Improvements

Coverage of all five pillars at every layer boundary
Threshold tuning against historical data
Contracts in CI to prevent shape failures
Lineage to shorten incident response

A real-time fraud detection feature has a tight operational SLA: data fresh within seconds. The quality SLA is also tight, but a failure that produces no result is preferable to a failure that produces wrong results. Operational and quality both matter; correctness wins ties. A monthly close finance pipeline has a relaxed operational SLA but an unforgiving quality SLA: a wrong number in the close requires a refile and a regulatory disclosure. Knowing which dominates for a given consumer is part of the contract.

Concrete operational vs quality tradeoffs in production:

▸Stripe payment events: operational tight (seconds); quality must dominate (no double-charges)
▸Daily marketing dashboard: operational moderate (by 9am); quality moderate (refile on error)
▸Monthly finance close: operational loose (any time within close window); quality near-perfect
▸ML feature store: operational tight (model retraining schedule); quality tight (drift breaks predictions)

TIP

When a consumer reports an SLA breach, ask which SLA: operational or quality. The fix is different for each. Combining them in conversation produces fixes that miss the actual problem.

✓Do

State operational and quality SLAs as separate commitments in the contract
Compute the combined SLA explicitly; do not promise a joint number the components cannot back
Report all three (operational, quality, combined) on the producer status page

✗Don't

Treat 'the pipeline is up' as the only SLA; up-but-wrong is its own failure mode
Promise high quality SLA targets without a five-pillar suite to back them
Allow operational improvements to mask quality regression; budget for both

Operational and quality SLAs are different commitments; combining them under-detects one failure mode.

The combined SLA is the multiplication of the two; promise math that is achievable.

Figure: pipeline flow with quality gates. Source -> ingest check (schema + null check) -> Transform -> output check (row-count check) -> Storage (warehouse).

Quality checks live at boundaries: validate on the way in (schema, nulls) and on the way out (row counts, totals). A failing gate stops bad data before it reaches the warehouse.

Tuning Thresholds vs History

Tune quality thresholds against historical data and annotate known anomalies so the alarm rate matches the team's investigation capacity.

A quality system that fires too often gets ignored. The mechanism is simple. On-call engineers receive twenty pages a week. Three of them are real. The remaining seventeen train the engineer to acknowledge alerts without reading them carefully. The next real page lands in the same Slack channel as a false one and is missed. The pipeline that the team thought was protected is, in operational terms, unprotected, because the protection mechanism has been desensitized by its own noise. The fix is not to remove checks. The fix is to tune the thresholds against historical data so that the alarm rate is low enough that every alarm is read carefully. The same dynamic appears in security operations centers, in airline cockpits, and in hospital telemetry alarms, and in every domain the conclusion is identical: an alert system has a finite signal-to-noise budget, and exceeding the budget destroys the system's value. Quality engineering has not historically thought of itself as alarm-system design, but it is.

Alert Fatigue Is A Quality Failure

Symptom	Underlying Cause	Consequence
On-call ignores quality pages	Most pages are not actionable; threshold is too tight	Real failures missed; consumer trust degrades
Quality dashboard shows constant red	Visualizing every check as critical	The dashboard becomes wallpaper; nobody looks at it
Page rate higher than incident rate	False positives outnumber real signal	Engineers escalate to suppress checks; coverage shrinks
Engineers create silent alert filters	The system has not been tuned; humans are filtering instead	Filtering becomes tribal knowledge; new on-call doesn't have it

Threshold tuning runs the proposed assertion against historical data and counts the alerts that would have fired. A threshold that would have fired three hundred times on the last ninety days of data is not a threshold; it is a constant. A threshold that would have fired zero times is not a threshold; it is a non-check. A useful threshold fires somewhere between two and ten times in a ninety-day window, and each firing corresponds to either a real incident or a known anomaly that can be classified as such. The tuning is empirical, not theoretical.

	-- Backtest: on which of the last 90 days would a |z| >= 3 check have fired?
	WITH daily_stats AS (
	    SELECT
	        order_date,
	        AVG(amount_usd) AS daily_mean
	    FROM fct_orders
	    WHERE order_date BETWEEN CURRENT_DATE - 90 AND CURRENT_DATE - 1
	    GROUP BY order_date
	),
	rolling AS (
	    -- Baseline is the 28 days before each date, excluding the date itself
	    SELECT
	        order_date,
	        daily_mean,
	        AVG(daily_mean) OVER (
	            ORDER BY order_date
	            ROWS BETWEEN 28 PRECEDING AND 1 PRECEDING
	        ) AS rolling_mean,
	        STDDEV(daily_mean) OVER (
	            ORDER BY order_date
	            ROWS BETWEEN 28 PRECEDING AND 1 PRECEDING
	        ) AS rolling_sd
	    FROM daily_stats
	),
	scored AS (
	    -- NULLIF avoids a division-by-zero error when the baseline has no spread
	    SELECT
	        order_date,
	        daily_mean,
	        (daily_mean - rolling_mean) / NULLIF(rolling_sd, 0) AS z
	    FROM rolling
	)
	SELECT
	    order_date,
	    daily_mean,
	    ROUND(z, 2) AS z_score
	FROM scored
	WHERE ABS(z) >= 3
	ORDER BY order_date

The query produces the dates on which a z >= 3 threshold would have fired. The team reviews each date with an analyst: was something real happening, or was the threshold too tight. Tuning continues until the firing rate matches the rate at which the team can credibly investigate every firing without ignoring any of them.
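The same backtest can be sketched outside the warehouse to compare candidate thresholds side by side. The code below is a minimal illustration on synthetic data; `backtest_firings`, the 28-day window, and the series itself are assumptions, not production values.

```python
from statistics import mean, stdev


def backtest_firings(daily_values: list[float], window: int = 28,
                     z_threshold: float = 3.0) -> list[int]:
    """Return the indexes of days on which |z| >= z_threshold would have fired.

    Each day is scored against the `window` days before it, excluding itself.
    """
    firings = []
    for i in range(window, len(daily_values)):
        baseline = daily_values[i - window:i]
        sd = stdev(baseline)
        if sd == 0:
            continue  # no spread in the baseline; z is undefined
        z = (daily_values[i] - mean(baseline)) / sd
        if abs(z) >= z_threshold:
            firings.append(i)
    return firings


# Synthetic 90-day history: a mild weekly pattern plus one genuine spike on day 60.
history = [100 + (day % 7) for day in range(90)]
history[60] = 200

for threshold in (1.0, 2.0, 3.0):
    fired = backtest_firings(history, z_threshold=threshold)
    print(f"z >= {threshold}: {len(fired)} firings")
```

On this series the loosest threshold fires on ordinary weekly variation, while the strictest fires only on the spike. Counting firings per candidate threshold is what lets the team pick a rate it can actually investigate.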

What 'Tuned' Looks Like

Quality SLA Target	Implied Page Rate	Threshold Tightness
99.9% (one bad day per thousand)	About one page every three years per check	Loose; only large shifts fire
99.5% (one bad day per two hundred)	About two pages per year per check	Moderate; large and persistent shifts fire
99.0% (one bad day per hundred)	About one page per quarter per check	Tighter; small persistent shifts fire
95.0% (one bad day per twenty)	About one to two pages per month per check	Tight; check is approaching alert fatigue

A practical pattern uses two thresholds per check. A warning threshold fires more often, into a channel where humans review during business hours. A blocking threshold fires rarely, into the on-call rotation. The warning catches plausible-but-suspicious shifts; the blocker catches definite-and-actionable failures. The warning channel is allowed to be noisy because it does not interrupt anyone outside business hours; the blocking channel is held to a strict signal-to-noise ratio because every page interrupts an engineer.

•Warning Channel

Reviewed during business hours
Tolerates noise; signal extracted by humans during review
Z-score thresholds in the 2 to 3 range
Includes plausible day-of-week and seasonal anomalies

•Blocking Channel

Pages on-call regardless of time of day
Strict signal-to-noise ratio; tuned to fire rarely
Z-score thresholds in the 4 to 6 range
Excludes anomalies that historical review classified as benign
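The two-tier routing can be sketched as a single function. The default cutoffs below are picked from the ranges listed above purely for illustration; `route_alert` and the channel labels are hypothetical names.

```python
def route_alert(z_score: float, warn_at: float = 2.0, block_at: float = 4.0) -> str:
    """Map a z-score to a destination: no alert, the warning channel, or a page."""
    magnitude = abs(z_score)
    if magnitude >= block_at:
        return "page_on_call"
    if magnitude >= warn_at:
        return "warning_channel"
    return "no_alert"


for z in (0.8, -2.6, 5.1):
    print(z, "->", route_alert(z))
```

Keeping both cutoffs as parameters of one function makes it harder for the warning and blocking tiers to drift apart when thresholds are re-tuned.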

Many quality false alarms are calendar effects: holidays, end-of-month spikes, marketing campaigns, product launches. A pure z-score against a trailing window catches these as alerts, even though they are predictable. The fix is to enrich the baseline with calendar awareness: same-day-of-week comparisons, holiday flags, campaign annotations. The baseline becomes a model rather than a sliding average. The investment is worth it for high-volume tables where calendar effects produce most of the noise. For low-volume tables, the same investment is over-engineering.

	WITH baseline AS (
	    -- Per-weekday mean and spread of daily row counts over the trailing 90 days
	    SELECT
	        EXTRACT(DOW FROM order_date) AS day_of_week,
	        AVG(daily_count) AS dow_mean,
	        STDDEV(daily_count) AS dow_sd
	    FROM fct_orders_daily
	    WHERE order_date BETWEEN CURRENT_DATE - 90 AND CURRENT_DATE - 1
	    GROUP BY EXTRACT(DOW FROM order_date)
	),
	today AS (
	    -- Computed once instead of repeating the subquery
	    SELECT COUNT(*) AS today_count
	    FROM fct_orders
	    WHERE order_date = CURRENT_DATE
	)
	SELECT
	    baseline.day_of_week AS dow_today,
	    today.today_count,
	    ROUND(baseline.dow_mean, 0) AS dow_mean,
	    -- NULLIF avoids a division-by-zero error when the weekday has no spread
	    ROUND(
	        (today.today_count - baseline.dow_mean) / NULLIF(baseline.dow_sd, 0),
	        2
	    ) AS z_score
	FROM baseline
	CROSS JOIN today
	WHERE baseline.day_of_week = EXTRACT(DOW FROM CURRENT_DATE)
Some anomalies are real and not bugs. Black Friday produces a row count spike that the threshold should know about. A planned product launch produces a feature distribution shift that the threshold should know about. Annotating these in advance prevents the threshold from firing on them. The annotation lives next to the threshold definition and is reviewed in the same PR cycle. Without annotations, the team adds suppression rules ad-hoc during the incident, and those rules outlive the event they were created for.

	# Calendar annotations consulted by the threshold engine
	known_anomalies:
	  - date: '2026-11-27'
	    table: fct_orders
	    metric: row_count
	    expected_z_shift: '+8 to +15'
	    reason: 'Black Friday'
	  - date_range: '2026-12-20 to 2026-12-26'
	    table: fct_orders
	    metric: amount_usd_mean
	    expected_z_shift: '+2 to +5'
	    reason: 'Holiday gifting; higher AOV'
	  - date_range: '2026-04-15 to 2026-04-22'
	    table: fct_customer_events
	    metric: event_type_signup_pct
	    expected_z_shift: '+3 to +6'
	    reason: 'Spring marketing launch'
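How a threshold engine might consult these annotations can be sketched as a lookup before alerting. The in-memory structure below is an assumed parsed form of two of the entries above, and `matching_annotation` is a hypothetical name; a real engine would load the file from version control.

```python
from datetime import date

# Assumed parsed form of two annotations; the z ranges are stored as numeric tuples.
KNOWN_ANOMALIES = [
    {"start": date(2026, 11, 27), "end": date(2026, 11, 27), "table": "fct_orders",
     "metric": "row_count", "expected_z": (8.0, 15.0), "reason": "Black Friday"},
    {"start": date(2026, 12, 20), "end": date(2026, 12, 26), "table": "fct_orders",
     "metric": "amount_usd_mean", "expected_z": (2.0, 5.0),
     "reason": "Holiday gifting; higher AOV"},
]


def matching_annotation(day, table, metric, z_score, annotations=KNOWN_ANOMALIES):
    """Return the annotation that explains this observation, or None.

    A shift outside the annotated range is not explained and should still alert.
    """
    for annotation in annotations:
        low, high = annotation["expected_z"]
        if (annotation["table"] == table
                and annotation["metric"] == metric
                and annotation["start"] <= day <= annotation["end"]
                and low <= z_score <= high):
            return annotation
    return None


explained = matching_annotation(date(2026, 11, 27), "fct_orders", "row_count", 11.2)
print(explained["reason"] if explained else "alert")
```

Bounding the expected shift matters: a Black Friday spike far beyond the annotated range is still surprising and still reaches a human.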

Symptoms that thresholds need re-tuning:

▸Engineers create Slack mute rules for specific quality alerts
▸On-call routinely acknowledges pages with 'expected; ignoring'
▸The same check fires on the same day of the week every week
▸New on-call rotations report being overwhelmed by quality alerts
▸A real incident is missed because the page was indistinguishable from noise

✓Do

Tune every threshold against at least 90 days of historical data before turning it on
Use two-tier checks: warn loosely and block strictly
Annotate known anomalies in version control next to the thresholds

✗Don't

Treat alert volume as a measure of quality coverage; the right measure is incidents caught
Tighten thresholds reactively after an incident; tighten as a deliberate review
Allow ad-hoc suppression rules to outlive their original cause

TIP

The cost of a false alarm is the next real alarm that gets ignored. Treat threshold tuning as part of building the check, not as a follow-up task.

Contracts on a Legacy Pipeline

Roll out data contracts on a legacy pipeline by documenting the current state, inventorying consumers, observing before enforcing, and migrating behind versioned changes.

Greenfield contracts are easy. Contracts on a legacy pipeline that has run for four years and has dozens of unknown consumers are hard. The mistake most teams make is treating the rollout as a one-shot migration: write the contract, declare the producer compliant, declare consumers responsible for catching up. That approach produces breakage and erodes the credibility of the contract program. The disciplined rollout treats existing consumer behavior as the starting contract, evolves toward the desired contract over time, and follows the same versioning discipline that semantically versioned software libraries use.

The exercise below walks through the rollout for a customer events stream that is the lifeblood of three downstream consumers and an unknown number of analyst queries. The legacy character of the pipeline is what makes the rollout interesting. A greenfield rollout has the luxury of starting from a clean schema and an enumerable consumer set. A legacy rollout has neither. The discipline below is what compensates for the missing luxury, and the principles transfer to any production system that has been running long enough that the original authors have moved on and the original assumptions have drifted.

The Starting State

What the team inherits:

▸Producer is a four-year-old Kafka topic with no formal schema; events are JSON
▸Three named consumers: leadership dashboard, churn model, billing aggregator
▸Unknown number of analyst queries reading from the curated layer
▸No quality gates; one wrong-number incident every six to eight weeks
▸Schema has drifted: producers have added and renamed fields without coordination

Step 1 is to document the current contract. The first contract is not the contract the team wants; it is the contract the producer is currently delivering, including the drift. The team writes it by inspecting recent production data: every field that has appeared in the last 30 days, every type variation seen, every value range observed. The result is messy and accurate. The contract version is 1.0.0, and it captures reality, not aspiration.

	# Contract version 1.0.0: documents what IS, NOT what should be
	contract:
	  name: customer_events
	  version: 1.0.0
	  status: documented_legacy
	  schema:
	    primary_key: event_id
	    fields:
	      - { name: event_id, type: string, nullable: false }
	      - { name: customer_id, type: string, nullable: true, observed_null_rate: 0.008 }
	      - { name: event_type, type: string, nullable: true, observed_values: [signup, login, purchase, churn, page_view, click, error] }
	      - { name: ts, type: 'union[timestamp, string]', nullable: false, notes: 'producers send both ISO-8601 strings and epoch ms' }
	      - { name: amount, type: 'union[decimal, string]', nullable: true, notes: 'string for older clients; decimal for newer' }
	  guarantees:
	    freshness: 'best effort, typically <= 30 minutes'
	    volume: 'no enforced bound; observed 1.5 to 4M per day'
	  consumers_known:
	    - leadership_dashboard
	    - churn_model
	    - billing_aggregator
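
The inspection behind a documented-legacy contract can be sketched as a small profiling pass over recent raw events. The Python below is an illustration under assumptions: events arrive as one JSON object per line, a missing field is counted as a null, and the function name `profile_events` and the sample events are hypothetical. A real profile would cover the full 30-day window and also record value ranges.

```python
import json
from collections import defaultdict

def profile_events(raw_lines):
    """Summarize fields, observed Python-side JSON types, and null rates."""
    total = 0
    types_seen = defaultdict(set)
    null_counts = defaultdict(int)
    present_counts = defaultdict(int)
    for line in raw_lines:
        event = json.loads(line)
        total += 1
        for field, value in event.items():
            present_counts[field] += 1
            if value is None:
                null_counts[field] += 1
            else:
                types_seen[field].add(type(value).__name__)
    profile = {}
    for field in present_counts:
        # Assumption: a field absent from an event counts as null for that event.
        nulls = null_counts[field] + (total - present_counts[field])
        profile[field] = {
            "types": sorted(types_seen[field]),
            "nullable": nulls > 0,
            "observed_null_rate": nulls / total,
        }
    return profile

# Hypothetical sample showing the drift described above: mixed ts and amount types.
sample = [
    '{"event_id": "e1", "customer_id": "c1", "ts": "2024-05-01T00:00:00Z", "amount": "9.99"}',
    '{"event_id": "e2", "customer_id": null, "ts": 1714521600000, "amount": 12.5}',
]
profile = profile_events(sample)
for field, summary in profile.items():
    print(field, summary)
```

Fields that report more than one observed type are the candidates for union types in the 1.0.0 contract.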

Step 2 is to inventory every consumer. The three named consumers are easy. The unknown analyst queries are not. Lineage tooling helps: parsing recent query logs against the table catches most analyst queries; manual outreach catches the rest. The output is a list of every reader, ranked by frequency, with an owner per consumer.

Step 3 adds quality gates in observation-only mode. The gates run, log results, and do not block. After two to four weeks of observation, the team has data on which gates would have fired and tunes against it. Once tuned, the gates are ready to enforce; the switch to enforcing happens in Step 6, after consumers have migrated. Skipping the observation phase is the most common rollout mistake.

	# Gates start in observation mode
	quality_gates:
	  - name: row_count_within_baseline
	    layer: raw
	    mode: observe  # later: 'block'
	    threshold: 0.50_to_2.00_of_dow_baseline
	  - name: customer_id_null_rate
	    layer: raw
	    mode: observe
	    threshold: 0.01
	  - name: amount_distribution_z_shift
	    layer: curated
	    mode: observe
	    block_threshold: 5.0
	    warn_threshold: 3.0
	  - name: event_type_accepted_values
	    layer: raw
	    mode: observe
	    threshold: enum_present_in_recent_history
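
The difference between observing and enforcing can be sketched as a single flag on each gate. The Python below is a minimal illustration, not the runner any particular tool provides: the `Gate` class, `run_gates` function, and the batch statistics are hypothetical, and each check is assumed to be a function that returns True when the batch passes.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Gate:
    name: str
    mode: str                      # "observe" or "block"
    check: Callable[[dict], bool]  # returns True when the batch passes

def run_gates(gates, batch_stats):
    """Run and log every gate; only failing gates in 'block' mode stop the pipeline."""
    log, blocked = [], False
    for gate in gates:
        passed = gate.check(batch_stats)
        log.append({"gate": gate.name, "mode": gate.mode, "passed": passed})
        if not passed and gate.mode == "block":
            blocked = True
    return log, blocked

gates = [
    Gate("customer_id_null_rate", "observe",
         lambda s: s["customer_id_null_rate"] <= 0.01),
    Gate("row_count_within_baseline", "observe",
         lambda s: 0.50 <= s["row_count"] / s["dow_baseline"] <= 2.00),
]

# Hypothetical batch: the null-rate gate fails, but observation mode only logs it.
stats = {"customer_id_null_rate": 0.03, "row_count": 2_000_000, "dow_baseline": 2_500_000}
log, blocked = run_gates(gates, stats)
print(log, blocked)
```

The log collected during the observation weeks is the data the team tunes against; flipping `mode` to "block" changes the consequence of a failure without changing the check.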

Step 4 plans the cleanups as versioned changes. Legacy problems (union types on ts and amount, unwhitelisted event_types, ambiguous nulls) are each a versioned change. Renaming ts to event_timestamp with strict ISO-8601 is a breaking change; it ships in version 2.0.0 with a 90-day notice. Whitelisting accepted_values for event_type is backward-compatible if the producer commits to not emitting new values without a contract update; it ships as minor 1.x. Each change has a migration plan naming affected consumers and their actions. Cleanups happen one at a time, not as a single rewrite.

Change	Version	Type	Consumer Action
Whitelist accepted_values for event_type	1.1.0	Backward-compatible (additive guarantee)	None; consumers can adopt the tighter contract or stay on 1.0
Tighten customer_id null rate to <= 1 percent	1.2.0	Backward-compatible (tighter guarantee)	None; consumers benefit
Rename ts to event_timestamp; strict ISO-8601	2.0.0	Breaking	Pin to 1.x or migrate within 90 days
Rename amount to amount_usd; strict decimal	2.0.0	Breaking	Pin to 1.x or migrate within 90 days
Drop legacy event_types (page_view, click, error)	3.0.0	Breaking	Pin to 2.x or migrate within 180 days
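
The version assignments in the table follow a rule that can be sketched in code. The Python below is an illustration under stated assumptions: a schema is a mapping of field name to type string, removing or retyping a field is treated as breaking, and additions or tightened guarantees are treated as minor, matching how this lesson classifies them. The function `next_version` is hypothetical, and a rename shows up as one removal plus one addition.

```python
def next_version(current, old_fields, new_fields, guarantees_tightened=False):
    """Return the next contract version for a proposed schema change."""
    major, minor, patch = (int(part) for part in current.split("."))
    removed = old_fields.keys() - new_fields.keys()
    added = new_fields.keys() - old_fields.keys()
    retyped = {
        name for name in old_fields.keys() & new_fields.keys()
        if old_fields[name] != new_fields[name]
    }
    if removed or retyped:
        return f"{major + 1}.0.0"          # breaking: consumers must migrate
    if added or guarantees_tightened:
        return f"{major}.{minor + 1}.0"    # backward-compatible change
    return f"{major}.{minor}.{patch + 1}"  # no contract-visible change

v1 = {"event_id": "string", "ts": "union[timestamp, string]"}
v2 = {"event_id": "string", "event_timestamp": "timestamp"}

print(next_version("1.0.0", v1, v1, guarantees_tightened=True))
print(next_version("1.2.0", v1, v2))
```

Running a check like this in producer CI makes the version bump a computed consequence of the change instead of a judgment call made under deadline.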

Step 5 migrates consumers behind the versions. Each consumer migrates to the next major version on its own schedule within the announced notice period. The producer maintains both old and new versions during the transition window; dual-publishing is more expensive than single-publishing, and that maintenance burden falls on the producer. The benefit is that no consumer breaks in production; each migration is a deliberate code change with tests, not a surprise in an incident channel. Some consumers migrate in a week; others take the full notice period. Both are acceptable; a unilateral cutover is not. The producer tracks migration progress with a simple consumer-version query against the contract registry.

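	-- Distinct consumer teams pinned to each contract version, seen in the last 7 days.
	-- CURRENT_DATE - 7 is PostgreSQL-style date arithmetic; adjust for other dialects.
	SELECT
	    contract_version,
	    COUNT(DISTINCT consumer_team) AS pinned_consumers,
	    MIN(last_seen) AS earliest_pin,
	    MAX(last_seen) AS latest_pin
	FROM contract_registry.pins
	WHERE contract_name = 'customer_events'
	  AND last_seen >= CURRENT_DATE - 7
	GROUP BY contract_version
	ORDER BY contract_version;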

Figure: migration timeline. The producer publishes customer_events 1.x, 2.x, and 3.x across a timeline marked pre-announce, Day 0, Day 30, Day 60, Day 90, Day 180, and Day 270. Consumers A and B pin 1.0; Consumer C pins 2.0.
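
The decision to stop publishing an old major version can be sketched from the same pin data the registry query reads. The Python below is an illustration under assumptions: pins are (consumer_team, contract_version, last_seen) tuples, and the retirement rule (notice period elapsed and no recent pin on the old major version) is one reasonable reading of "no unilateral cutover", not a rule stated by any tool. The function `can_retire` and the sample dates are hypothetical.

```python
from datetime import date, timedelta

def can_retire(pins, major, today, announced_on, notice_days=90, window_days=7):
    """True when the notice period has elapsed and no recent pin targets `major`."""
    if today < announced_on + timedelta(days=notice_days):
        return False
    cutoff = today - timedelta(days=window_days)
    return not any(
        int(version.split(".")[0]) == major and last_seen >= cutoff
        for _team, version, last_seen in pins
    )

# Hypothetical registry contents: one consumer is still pinned to 1.x.
pins = [
    ("leadership_dashboard", "1.2.0", date(2024, 6, 28)),
    ("churn_model", "2.0.0", date(2024, 6, 30)),
    ("billing_aggregator", "2.0.0", date(2024, 6, 29)),
]
print(can_retire(pins, 1, today=date(2024, 6, 30), announced_on=date(2024, 3, 1)))
```

A consumer that is still pinned after the notice period blocks retirement in this sketch, which turns the remaining migration into a conversation with a named owner.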

Step 6 locks in the new state. Once consumers are migrated and the legacy version is deprecated, the producer flips the gates from observation-only to enforcing on the new contract. CI now rejects producer changes that violate the contract. Uncoordinated schema drift can no longer ship unnoticed; incidents that used to come every six to eight weeks become rare. The rollout is complete when the producer can commit to the contract as a CI-enforced guarantee, not documentation alone.

•Before The Rollout

Wrong-number incident every six to eight weeks
Three named consumers; unknown analyst surface
Schema drifts continuously; producers add fields without notice
Quality is best-effort; no SLA possible
Rolling back a producer change requires consumer-by-consumer triage

✓After The Rollout

Wrong-number incidents are rare; most are caught in CI
Every consumer is registered; lineage covers analyst queries
Schema evolves through versioned changes with notice
Operational and quality SLAs are stated and tracked
Producer can roll back any version because all consumers are pinned

The rollout principles, in order:

▸Document the legacy contract as it actually is, not as the team wishes
▸Inventory every consumer; surprise consumers cause migration failures
▸Add gates in observation-only mode and tune against the observed data
▸Plan cleanups as versioned changes; backward-compatible as minor, breaking as major
▸Migrate consumers behind versions on their own schedule within the notice window
▸Flip gates to enforcing only after the migration is complete

TIP

A legacy rollout that takes nine months and breaks zero consumers is a successful rollout. A legacy rollout that takes three months and breaks four consumers undoes its own credibility. Speed is not the goal; trust is.

✓Do

Treat the first contract as documentation of reality, not aspiration
Use semver explicitly; minor for additive, major for breaking, with notice
Maintain old and new versions in parallel during the migration window

✗Don't

Cut over consumers unilaterally; the rollout is a negotiation, not a mandate
Skip observation mode; thresholds need real data before they can be enforced
Treat the rollout as one project; treat each version bump as its own change

Contracts and quality become a system only when the producer can roll back any version because every consumer is pinned. The concise statement of the rollout principle: a legacy rollout earns its trust by treating consumers as parties to a negotiation structured by versions and notice periods, not as inventory to be migrated. The trust accumulates over the months in which no consumer is broken by surprise.

❯❯❯PUTTING IT ALL TOGETHER

> A staff data engineer joins a public-company data platform team. The platform has every quality check the intermediate tier prescribes. On-call rotations are burning out from alert fatigue. Half of incidents are caught by gates and half are still discovered by consumers asking 'why does this number look wrong'. The engineer is asked to turn the quality program into something the team trusts and the consumers can rely on.

Adopt data contracts as producer-side commitments enforced in CI. The contract names schema, freshness, volume, uniqueness, and evolution policy. Producer CI rejects breaking changes; pipeline gates are the second line of defense. (Builds on Lesson 1's pipeline-as-product framing.)

Cover all five pillars (freshness, distribution, volume, schema, lineage) and treat them as a diagnostic framework, not a checklist. Lineage is the pillar most programs underinvest in; the cost of skipping it is incident response time.

State operational and quality SLAs separately. The combined SLA is the product of the two. Designing for one without the other under-detects the other failure mode. (Connects to Lesson 4's orchestration SLAs and Lesson 6's failure handling.)

Tune every threshold against historical data. Use two-tier checks: warn loosely, block strictly. Annotate known anomalies (holidays, launches) so the threshold engine is calendar-aware. The cost of a false alarm is the next real alarm that gets ignored.

Roll out contracts on the legacy pipeline by documenting reality first, inventorying every consumer, observing before enforcing, and migrating consumers behind versioned changes with explicit notice periods. A nine-month rollout that breaks zero consumers is the goal. (Builds on the layered architecture and decoupling concepts from Lesson 1, the four cheap checks from this lesson's beginner tier, and the five pillars from this lesson's intermediate tier.)

Combine all of the above into a quality discipline rather than a quality engineering effort. Engineering catches failures; discipline prevents them. The shift is the difference between a team that manages incidents and a team that prevents them.

KEY TAKEAWAYS

Contracts make quality enforceable: producer commits, consumer relies, CI rejects violations before they ship. Pipeline gates are the second line of defense, not the first.

The five pillars are coverage, not a checklist: freshness, distribution, volume, schema, lineage. A program with four pillars and zero lineage has a known blind spot.

Operational and quality SLAs are separate commitments: the combined SLA is their product. Promise what the math makes achievable, not a number the two SLAs cannot jointly deliver.

Threshold tuning is part of authoring a check: use historical data, two-tier warn-and-block, and calendar annotations. The cost of false alarms is the next real alarm that gets ignored.

Legacy rollouts are negotiations, not migrations: document reality, inventory consumers, observe before enforcing, version every change with notice. Speed is not the goal; trust is.

Contracts make quality enforceable; observability makes it diagnosable; tuning makes it trusted

Category: Pipeline Architecture
Difficulty: advanced
Duration: 38 minutes
Challenges: 0 hands-on challenges

Topics covered: Data Contracts and CI Enforcement, Five Pillars of Observability, Quality SLAs vs Ops SLAs, Tuning Thresholds vs History, Contracts on a Legacy Pipeline
