Data Quality and Contracts: Advanced

A public-company data platform team had every quality check the intermediate tier prescribes. Schema validation, distributional checks, referential integrity, freshness gates, the whole suite. The pages still came in volume. On-call rotations were burning out. Half the alerts were technically correct: a column did shift, a row count did dip, a freshness gate did miss its threshold. None of those alerts represented an actual production problem; they represented threshold drift, calendar effects, and producer behaviors that nobody had told the consumer about. The other half of the alerts were silent failures the suite missed because the assertions had never been written. The team had quality engineering. It did not have quality discipline. The path from one to the other has three components: contracts that name what is committed, observability that names what is happening, and threshold tuning that names what is normal. This lesson covers all three, plus the rollout problem of getting there from a legacy pipeline that does none of them.

Data Contracts and CI Enforcement

Author a data contract with schema, guarantees, and evolution policy, and enforce it in producer CI plus pipeline gates.

Lesson 1's advanced tier introduced the pipeline-as-product framing: a pipeline has a contract that names producer, consumer, schema, freshness SLA, quality SLA, backfill policy, and deprecation policy. This section turns that framing into a working mechanism. A data contract is the executable form of the commitment. The producer commits to a shape and a set of guarantees; the consumer relies on them; the contract is checked in CI on every change so violations cannot ship. Without enforcement, contracts are documentation. With enforcement, they are the layer that prevents most of the silent failures the previous lessons spent so much effort detecting.

The shift from documentation to enforcement is the central move. Documentation rots. CI does not. A contract that is a YAML file in a wiki page is fiction within six months because the producer evolves and the wiki does not. A contract that is a YAML file in the producer's repo, validated by a CI step, evolves at the speed the producer evolves, because a change that disagrees with it cannot merge. Teams that have built contract programs at scale consistently report that the enforcement step is the one that changes the failure rate; the document was always achievable.

What Makes A Contract A Contract

| Property | Implicit Agreement | Data Contract |
| --- | --- | --- |
| Form | Tribal knowledge; Slack threads; an outdated wiki page | A versioned file checked into the producer's repo |
| Enforcement | Discovered after a consumer breaks | Checked in CI on every producer change |
| Visibility | Whoever has been around long enough to know | Discoverable by every consumer through a registry or catalog |
| Versioning | Implicit; whatever was true the last time someone looked | Semver-style; breaking changes require a new major version |
| Failure mode | Production incident | Failed CI run; producer fixes before merge |

Anatomy Of A Contract

# A producer-side contract for the customer events stream
contract:
  name: customer_events
  version: 2.3.0
  producer:
    team: platform-events
    repo: github.com/example/platform-events
    on_call: '#platform-events-oncall'
  schema:
    primary_key: event_id
    fields:
      - { name: event_id, type: string, nullable: false }
      - { name: customer_id, type: string, nullable: false }
      - { name: event_type, type: string, nullable: false, accepted_values: [signup, login, purchase, churn] }
      - { name: event_timestamp, type: timestamp, nullable: false }
      - { name: amount_usd, type: decimal, nullable: true, range: { min: 0, max: 1000000 } }
  guarantees:
    freshness: '<= 5 minutes p99'
    volume:
      daily_min: 1000000
      daily_max: 5000000
    uniqueness: event_id
    delivery: at_least_once
  consumers:
    - team: analytics-engineering
      use_case: leadership KPI dashboard
    - team: ml-platform
      use_case: churn prediction features
    - team: finance
      use_case: monthly billing aggregation
  evolution_policy:
    additive_changes: minor_version
    breaking_changes: major_version_with_90d_notice
    field_deletions: forbidden_in_minor
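To make the field rules above concrete, here is a minimal sketch of how one event could be checked against them. The `FIELDS` structure and the `violations` helper are hypothetical names invented for illustration; a real program would load the rules from the contract file rather than hard-code them.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical in-memory form of the contract's field list (assumption:
# a real gate would parse this from the YAML contract).
FIELDS = [
    {"name": "event_id", "type": str, "nullable": False},
    {"name": "customer_id", "type": str, "nullable": False},
    {"name": "event_type", "type": str, "nullable": False,
     "accepted_values": {"signup", "login", "purchase", "churn"}},
    {"name": "event_timestamp", "type": datetime, "nullable": False},
    {"name": "amount_usd", "type": Decimal, "nullable": True,
     "range": (Decimal(0), Decimal(1_000_000))},
]


def violations(event: dict) -> list[str]:
    """Return human-readable contract violations for one event."""
    problems = []
    for field in FIELDS:
        name = field["name"]
        value = event.get(name)
        if value is None:
            if not field["nullable"]:
                problems.append(f"{name}: null or missing but not nullable")
            continue
        if not isinstance(value, field["type"]):
            problems.append(f"{name}: expected {field['type'].__name__}")
            continue
        if "accepted_values" in field and value not in field["accepted_values"]:
            problems.append(f"{name}: {value!r} not in accepted values")
        if "range" in field:
            low, high = field["range"]
            if not low <= value <= high:
                problems.append(f"{name}: {value} outside [{low}, {high}]")
    return problems


good = {
    "event_id": "e-1",
    "customer_id": "c-9",
    "event_type": "purchase",
    "event_timestamp": datetime(2026, 1, 5, 12, 0),
    "amount_usd": Decimal("19.99"),
}
bad = dict(good, event_type="refund", amount_usd=Decimal("-5"))
```

Here `violations(good)` returns an empty list, while `violations(bad)` reports the unaccepted `event_type` and the out-of-range `amount_usd`.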

Where The Contract Is Enforced

| Stage | What Is Checked | Failure Means |
| --- | --- | --- |
| Producer pre-commit | Producer code emits events that match the contract schema | Pre-commit hook rejects the change |
| Producer CI | Schema, accepted values, ranges, primary key uniqueness | PR cannot merge; producer fixes the violation |
| Pipeline ingestion gate | Real events conform to the contract at the boundary | Quality gate halts; producer is paged |
| Consumer CI | Consumer code reads the contract version it depends on | Consumer build fails; consumer pins or upgrades |
| Registry validation | Contract version is registered and discoverable | Deploy is blocked until registration succeeds |

Producer and consumer bind through explicit version pinning. The producer publishes a contract version; the consumer pins to a version it has tested against. A consumer that depends on customer_events 2.3.0 declares that dependency in code and builds against it. When the producer publishes 2.4.0 (additive minor), the consumer's build still passes because 2.4.0 is backward-compatible with 2.3.0. When the producer wants to publish 3.0.0 (breaking), the contract evolution policy requires a 90-day notice and a major version bump; consumers receive notification and have time to migrate. This is the same problem that semantic versioning solved for software libraries; data contracts apply the same solution to data.
# Consumer code declares its contract dependency explicitly.
# `contracts`, `stream`, and `process_event` are illustrative names defined elsewhere.
from contracts import customer_events

# Pin to a specific version; build fails if version is removed or unavailable
schema = customer_events.load(version='2.3.0')

for event in stream.read(schema=schema):
    # Type-checked against the pinned schema
    process_event(event)
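The compatibility rule behind pinning can be sketched in a few lines. This is an illustration of the semver rule described above, not the API of any contract library; `parse` and `satisfies_pin` are hypothetical helper names.

```python
def parse(version: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def satisfies_pin(pinned: str, published: str) -> bool:
    """True if `published` is backward-compatible with `pinned`.

    Compatible means: same major version, and not older than the pin.
    """
    pin, pub = parse(pinned), parse(published)
    return pub[0] == pin[0] and pub[1:] >= pin[1:]
```

With a pin of `2.3.0`, a published `2.4.0` satisfies the pin, while `3.0.0` (breaking) and `2.2.9` (older than what the consumer tested) do not.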

What CI Enforcement Looks Like

# .github/workflows/contract-check.yml on the producer repo
name: contract-check
on: pull_request
jobs:
  validate-contract:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history, so origin/main is available to the diff step
      - name: Validate contract syntax
        run: contract-cli validate contracts/customer_events.yaml
      - name: Detect breaking changes vs main
        run: contract-cli diff origin/main contracts/customer_events.yaml
      - name: Generate test fixtures from contract
        run: contract-cli fixtures contracts/customer_events.yaml > /tmp/fixtures.json
      - name: Run producer code against generated fixtures
        run: pytest tests/contract_compliance/
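The breaking-change step is the heart of the workflow. The sketch below shows one plausible way such a diff could classify a schema change; it is not the behavior of any specific CLI. The function name `classify_change` and the exact rules (removal, type change, or loosened nullability count as breaking) are assumptions chosen to match the evolution policy above.

```python
def classify_change(old: dict, new: dict) -> str:
    """Classify a schema change as 'major', 'minor', or 'none'.

    Both arguments map field name -> {"type": str, "nullable": bool}.
    """
    for name, old_spec in old.items():
        new_spec = new.get(name)
        if new_spec is None:
            return "major"  # field removed or renamed
        if new_spec["type"] != old_spec["type"]:
            return "major"  # type changed under existing consumers
        if new_spec["nullable"] and not old_spec["nullable"]:
            return "major"  # a non-null guarantee was withdrawn
    if new != old:
        return "minor"  # only additions or tightened guarantees remain
    return "none"


v1 = {
    "event_id": {"type": "string", "nullable": False},
    "amount_usd": {"type": "decimal", "nullable": True},
}
added = {**v1, "currency": {"type": "string", "nullable": True}}
retyped = {**v1, "amount_usd": {"type": "string", "nullable": True}}
```

A CI step built on this idea would fail the pull request when the classification is `major` but the contract's version bump is only minor.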

When Contracts Pay For Themselves

Pre-Contract World
  • Producer changes a column type; consumers break in production
  • Schema rolls forward without consumer awareness
  • Producers and consumers debate root cause during the incident
  • Adding a new consumer means archeology to find the actual schema
Post-Contract World
  • Breaking change is rejected at producer CI; never reaches consumers
  • Schema evolution follows semver; consumers see only versions they pin to
  • Root cause is named by the contract version; debate is short
  • New consumers read the contract registry and integrate without archeology

What Contracts Cannot Do Alone

Contracts catch shape violations and explicitly-named guarantee violations. They do not catch business-logic correctness on conforming data. A contract that says amount_usd is a decimal in the range zero to one million does not catch an amount that is the wrong number for the underlying transaction. The quality suite from the intermediate tier remains necessary. Contracts and quality suites are complements: contracts prevent shape failures, quality suites detect distribution-and-content failures. A serious data platform runs both.
Signs that a contract program is mature:
  • Producers declare contracts before consumers integrate, not after
  • Schema evolution follows a documented semver-like policy
  • CI rejects breaking changes; production gates rarely have to
  • Consumer code declares contract dependencies as explicit pins
  • A registry exists where any team can discover what contracts are available
Do
  • Version contracts with semver; additive minor, breaking major, bug-fix patch
  • Enforce contracts in producer CI; production gates are the second line of defense
  • Publish contracts to a registry that consumers can discover programmatically
Don't
  • Treat contracts as documentation; without enforcement they accumulate as fiction
  • Allow breaking changes in minor versions; consumers will silently break
  • Couple the contract format to one tool; contracts outlast any tool decision
TIP
Start contracts at the highest-leverage producer-consumer boundary, not at every boundary at once. The first contract teaches the team how to author and enforce; the second through tenth follow the pattern with little marginal cost.

Five Pillars of Observability

Apply the five pillars of data observability as a diagnostic framework for incident response and as a coverage map for designing quality programs.

Barr Moses and the Monte Carlo Data team named the five pillars of data observability: freshness, distribution, volume, schema, and lineage. The naming has caught on widely enough that conversations about quality use it as shorthand. The pillars are useful because they are not a checklist; they are a diagnostic framework. When something is wrong with the data, the pillar that detected the symptom narrows the search for the cause. When designing a quality program, the pillars name the gaps that have to be filled before the program can be called observable rather than merely instrumented.

The underlying idea predates the pillars. Software observability matured along the same lines: metrics, logs, and traces are a similar three-axis decomposition, where each axis answers a different diagnostic question. The data version has more axes because data has more independent dimensions of failure. Volume can be wrong without distribution being wrong; schema can shift without freshness slipping; lineage can be unknown even when every other pillar is green.

The Five Pillars

Freshness
Is the data current
Time between latest available record and now. Asks whether the pipeline is keeping up. The simplest pillar to measure and the first one consumers notice.
Distribution
Are values within the expected range and shape
Statistical properties of columns: mean, stddev, quantiles, cardinality, category mix. Catches shifts that no individual row violates.
Volume
Is the right amount of data arriving
Row counts compared to historical baselines. Catches dropped partitions, broken filters, exploded joins. Often the first signal of an upstream problem.
Schema
Is the shape of the data what was promised
Columns, types, nullability, accepted values, ranges. The producer-side commitment made executable.
Lineage
Where does this data come from and what reads it
Upstream and downstream relationships at column granularity. Turns 'a number changed' into 'a number changed because this transform changed'.

Pillars As A Diagnostic Framework

When a consumer reports a wrong number, a senior engineer walks the pillars in order and uses each one to either narrow or rule out a class of cause. Freshness rules out 'is the data even recent.' Volume rules out 'did the right amount of data arrive.' Schema rules out 'is the shape correct.' Distribution rules out 'are the values within their normal range.' Lineage answers 'what produced this column and what depends on it.' The walk takes minutes and replaces the unstructured 'try things until something works' debugging that consumes hours when each pillar has to be checked manually.

| Symptom | First Pillar To Check | Why |
| --- | --- | --- |
| Dashboard shows last week's data on Monday morning | Freshness | Most likely an ingestion stall; rules in or out the simplest cause |
| Revenue dropped 15 percent overnight | Volume | Sudden numeric drops are usually missing rows, not changed values |
| Average order amount is up but row count is steady | Distribution | Aggregate change without row count change implies value shift |
| Pipeline succeeded but downstream join produces nulls | Schema | Type or nullability mismatch is the typical cause of post-success join failures |
| Two consumers see different numbers from the same source | Lineage | Different consumers may read different downstream tables; lineage exposes the divergence |

A pillar is a category. A check is a specific assertion within a pillar. The pillar 'distribution' contains many checks: mean shift, stddev shift, p99 shift, category mix shift, cardinality shift. The framework value of pillars is in coverage, not in implementation. A program that has a hundred checks within four pillars and zero checks in the fifth is a program with a known blind spot.
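The coverage idea can be expressed as a small audit over a check inventory. The sketch below assumes a hypothetical inventory of `(pillar, check_name)` pairs; the check names are invented for illustration.

```python
PILLARS = ("freshness", "distribution", "volume", "schema", "lineage")


def blind_spots(checks: list[tuple[str, str]]) -> list[str]:
    """Return the pillars that have no checks at all, in canonical order."""
    covered = {pillar for pillar, _ in checks}
    unknown = covered - set(PILLARS)
    if unknown:
        raise ValueError(f"unknown pillars: {sorted(unknown)}")
    return [pillar for pillar in PILLARS if pillar not in covered]


# Hypothetical inventory: many checks, but nothing on lineage.
inventory = [
    ("freshness", "fct_orders_updated_within_1h"),
    ("volume", "fct_orders_row_count_vs_baseline"),
    ("schema", "fct_orders_amount_usd_is_decimal"),
    ("distribution", "fct_orders_amount_mean_shift"),
    ("distribution", "fct_orders_event_type_mix"),
]
```

For this inventory, `blind_spots(inventory)` returns `["lineage"]`: the count of checks is irrelevant, only whether each pillar has any.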

Lineage: The Pillar Most Programs Skip

Most quality programs cover four pillars well and lineage poorly. The reason is cost. Lineage at table granularity is moderately expensive; lineage at column granularity is expensive; lineage that updates as transforms evolve is expensive to keep current. The payoff is that lineage transforms incident response. A column-level lineage system answers questions like 'what consumers depend on amount_usd in fct_orders' in seconds. Without lineage, the same question takes hours of grep-and-Slack archaeology. Tools like dbt, Dagster, and standalone catalogs (DataHub, OpenMetadata) cover this pillar; the discipline is in adopting and keeping them current.
Column-level lineage example for fct_orders.amount_usd:

raw.events.amount -> curated.fct_orders.amount_usd -+-> mart.daily_revenue.revenue_usd
                                                    |
                                                    +-> feature_store.user_features.lifetime_spend
                                                    |
                                                    +-> reverse_etl.salesforce.account_revenue
Reading the diagram backward from a consumer answers 'where does this number come from.' Reading it forward from a source column answers 'who is affected if this column changes.' Both questions show up in incident response and in change reviews. Lineage is the pillar that ties the other four together.
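Both readings of the diagram are graph traversals. The following is a minimal sketch that stores the lineage edges from the example as an adjacency map and walks them breadth-first; the `EDGES` structure and the helper names are assumptions, since real lineage tools expose their own query interfaces.

```python
from collections import deque

# Source column -> columns derived directly from it (from the example above).
EDGES = {
    "raw.events.amount": ["curated.fct_orders.amount_usd"],
    "curated.fct_orders.amount_usd": [
        "mart.daily_revenue.revenue_usd",
        "feature_store.user_features.lifetime_spend",
        "reverse_etl.salesforce.account_revenue",
    ],
}


def reachable(start: str, edges: dict) -> list[str]:
    """Breadth-first list of every column reachable from `start`."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def invert(edges: dict) -> dict:
    """Flip edge direction so the same walk answers 'where does this come from'."""
    reversed_edges = {}
    for source, targets in edges.items():
        for target in targets:
            reversed_edges.setdefault(target, []).append(source)
    return reversed_edges


affected = reachable("raw.events.amount", EDGES)                      # forward
origins = reachable("mart.daily_revenue.revenue_usd", invert(EDGES))  # backward
```

The forward walk lists the four columns affected by a change to `raw.events.amount`; the backward walk traces `revenue_usd` to its two upstream columns.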

When To Adopt Each Pillar

| Maturity Stage | Pillars In Place | What The Team Can Answer |
| --- | --- | --- |
| Stage 1: Cheap checks | Volume, freshness | Did the right amount of data arrive on time |
| Stage 2: Suite | Volume, freshness, schema, distribution | Is the data structurally and statistically as expected |
| Stage 3: Observable | All five pillars including lineage | What changed, why, and who is affected |
| Stage 4: Contract-enforced | All five pillars plus contracts in CI | Same as Stage 3, but most failures cannot ship |

Without Lineage
  • Incident response starts with 'who owns this column'
  • Change reviews miss downstream consumers
  • Deprecation requires manual canvassing
  • Two consumers compute conflicting numbers; cause is hidden
With Lineage
  • Incident response starts with the lineage graph; ownership is metadata
  • Change reviews automatically surface affected consumers
  • Deprecation walks the graph and notifies every downstream
  • Conflicting numbers are explained by divergent transforms in the graph
TIP
When a quality program is in place but the team still spends hours diagnosing incidents, the missing investment is almost always lineage. The other four pillars detect; lineage interprets.
The five pillars are a framework for coverage and a diagnostic walk for incidents.
Most programs underinvest in lineage; the cost shows up in incident response time.
Pillars are categories; specific checks live within them. Coverage means hitting all five categories, not running one check.

The pillars are descriptive, not prescriptive. A program with twenty schema checks and zero distribution checks is not 'four-fifths observable'; it is observable on schema and blind on distribution. Coverage is binary per pillar.

Quality SLAs vs Ops SLAs

Distinguish operational from quality SLAs and state both as separate commitments with separate measurements and improvements.

An SLA states a commitment. The pipeline-as-product framing from Lesson 1 introduced two SLAs as elements of the contract: a freshness SLA, called the operational SLA here, and a quality SLA. They are commonly conflated. They are different commitments to different things, with different consequences when they fail. A pipeline that meets its operational SLA can fail its quality SLA while every operational indicator stays green. A pipeline that meets its quality SLA can miss its operational SLA without affecting correctness. The producer who treats both as one number ends up over-promising on one and under-detecting failure of the other.

The conflation has visible consequences. Status pages that report a single uptime number describe operational SLA exclusively, leaving consumers with no way to distinguish 'late but correct' from 'on time but wrong'. Incident reviews that do not separate the two end up with action items that improve one without addressing the other. The split SLA is more honest, and honesty in producer commitments is the foundation of the trust that makes data products usable.

The Two SLAs

| Property | Operational SLA | Quality SLA |
| --- | --- | --- |
| Question answered | Did the data arrive on time | Was the data correct |
| Typical statement | Pipeline finishes by 6am every day | Row count within 50 to 200 percent of baseline; null rate below 1 percent |
| Failure mode | Late or missing run | Wrong numbers in a successfully-completed run |
| Detected by | Orchestrator monitoring; missed schedule alerts | Quality gates inside the pipeline |
| Consumer impact | Dashboard or model shows yesterday's data | Dashboard or model shows wrong data |

Why Conflating Them Hurts

An operational SLA of 'fresh by 6am' tells the consumer when to expect the data. A quality SLA of 'correct row counts by 6am' tells the consumer when to expect the data to be both fresh and right. The two are independent. A pipeline can meet 'fresh by 6am' with a 30 percent row count drop. A pipeline can have a flawless row count and miss the 6am deadline because the warehouse was slow. Consumers who hear 'the team has a 6am SLA' assume both meanings are guaranteed. Producers who state a 6am SLA often mean only operational. The conversation has to specify which one, or both.

Stating Both Explicitly

# Both SLAs stated separately in the contract
guarantees:
  operational_sla:
    statement: 'fct_orders updated by 06:00 Pacific each day'
    measurement: 'orchestrator-reported finish time'
    target: 99.0
    window: 'rolling 30 days'
  quality_sla:
    statement: 'all five-pillar gates pass on the run that satisfies the operational SLA'
    measurement: 'count of successful runs with all gates green / total runs'
    target: 99.5
    window: 'rolling 30 days'
  combined_sla:
    statement: 'fresh AND correct by 06:00 Pacific'
    target: 98.5
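Measuring the three numbers separately is straightforward once each run records both outcomes. The sketch below assumes a hypothetical `Run` record with one flag per SLA; in practice the flags would come from the orchestrator and the quality gates respectively.

```python
from dataclasses import dataclass


@dataclass
class Run:
    on_time: bool      # finished by the operational deadline
    gates_green: bool  # every quality gate passed


def attainment(runs: list[Run]) -> dict[str, float]:
    """Fraction of runs meeting each SLA, measured independently and jointly."""
    total = len(runs)
    return {
        "operational": sum(r.on_time for r in runs) / total,
        "quality": sum(r.gates_green for r in runs) / total,
        "combined": sum(r.on_time and r.gates_green for r in runs) / total,
    }


# Illustrative window: 98 clean runs, one late-but-correct, one on-time-but-wrong.
window = [Run(True, True)] * 98 + [Run(False, True), Run(True, False)]
report = attainment(window)
```

In this window each component SLA sits at 99 percent while the combined figure is 98 percent, which is the distinction a single uptime number hides.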

What The Combined SLA Actually Costs

The combined SLA is the product of the two. A 99.0 percent operational SLA and a 99.5 percent quality SLA produce a combined SLA of about 98.5 percent when the two fail independently. Promising both individually at 99.0 percent and treating 99.0 percent as the joint guarantee is mathematically wrong; the product is closer to 98.0 percent. The cost compounds further when consumer behavior is sensitive to the combined number: a model that retrains daily on potentially-stale or potentially-wrong data needs to be designed to tolerate the combined error rate, not either component alone.

| Operational | Quality | Combined (Product) |
| --- | --- | --- |
| 99.0% | 99.0% | 98.0% |
| 99.5% | 99.5% | 99.0% |
| 99.9% | 99.9% | 99.8% |
| 99.99% | 99.99% | 99.98% |

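The table rows follow from one multiplication. As a quick check, the sketch below treats the two SLAs as independent probabilities, which is an assumption; `combined_sla` is a hypothetical helper name.

```python
def combined_sla(operational_pct: float, quality_pct: float) -> float:
    """Joint percentage of runs that are both on time and correct.

    Assumes operational and quality failures occur independently.
    """
    return operational_pct * quality_pct / 100


rows = [(99.0, 99.0), (99.5, 99.5), (99.9, 99.9), (99.99, 99.99)]
table = [(op, q, round(combined_sla(op, q), 2)) for op, q in rows]
```

The same function gives about 98.5 for the 99.0 and 99.5 pairing used in the contract example.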
Designing around each SLA requires different investments. The operational SLA is improved by orchestration investments: warm pools, retry policies, redundant scheduling, removing single points of failure in compute. The quality SLA is improved by suite investments: more pillars covered, tighter thresholds tuned against history, contracts that prevent shape failures from shipping. The two require different teams in some organizations and different budgets in most. Treating them as one budget produces under-investment in whichever one is currently considered solved.
Operational SLA
On-time delivery commitment
Pipeline finishes by a stated deadline. Measured by orchestrator finish time. Improved by warm pools, retry budgets, and redundant scheduling.
Quality SLA
Correctness commitment
Five-pillar gates pass on the run. Measured by green-gate run rate. Improved by suite coverage, tuned thresholds, and contracts in CI.
Combined SLA
Fresh AND correct
The product of the two. The honest number to publish on a status page; a joint promise above it is one the components cannot back.
Operational SLA Improvements
  • Warm pools and pre-provisioned compute
  • Retry budgets with exponential backoff
  • Redundant orchestrator instances
  • Critical-path identification and short-circuiting
Quality SLA Improvements
  • Coverage of all five pillars at every layer boundary
  • Threshold tuning against historical data
  • Contracts in CI to prevent shape failures
  • Lineage to shorten incident response
A real-time fraud detection feature has a tight operational SLA: data fresh within seconds. The quality SLA is also tight, but a failure that produces no result is preferable to a failure that produces wrong results. Operational and quality both matter; correctness wins ties. A monthly close finance pipeline has a relaxed operational SLA but an unforgiving quality SLA: a wrong number in the close requires a refile and a regulatory disclosure. Knowing which dominates for a given consumer is part of the contract.
Concrete operational vs quality tradeoffs in production:
  • Stripe payment events: operational tight (seconds); quality must dominate (no double-charges)
  • Daily marketing dashboard: operational moderate (by 9am); quality moderate (refile on error)
  • Monthly finance close: operational loose (any time within close window); quality near-perfect
  • ML feature store: operational tight (model retraining schedule); quality tight (drift breaks predictions)
TIP
When a consumer reports an SLA breach, ask which SLA: operational or quality. The fix is different for each. Combining them in conversation produces fixes that miss the actual problem.
Do
  • State operational and quality SLAs as separate commitments in the contract
  • Compute the combined SLA explicitly; do not promise a joint number the components cannot support
  • Report all three (operational, quality, combined) on the producer status page
Don't
  • Treat 'the pipeline is up' as the only SLA; up-but-wrong is its own failure mode
  • Promise high quality SLA targets without a five-pillar suite to back them
  • Allow operational improvements to mask quality regression; budget for both
Operational and quality SLAs are different commitments; combining them under-detects one failure mode.
The combined SLA is the product of the two; promise only numbers the arithmetic supports.

Tuning Thresholds vs History

Tune quality thresholds against historical data and annotate known anomalies so the alarm rate matches the team's investigation capacity.

A quality system that fires too often gets ignored. The mechanism is simple. On-call engineers receive twenty pages a week. Three of them are real. The remaining seventeen train the engineer to acknowledge alerts without reading them carefully. The next real page lands in the same Slack channel as a false one and is missed. The pipeline that the team thought was protected is, in operational terms, unprotected, because the protection mechanism has been desensitized by its own noise. The fix is not to remove checks. The fix is to tune the thresholds against historical data so that the alarm rate is low enough that every alarm is read carefully. The same dynamic appears in security operations centers, in airline cockpits, and in hospital telemetry alarms, and in every domain the conclusion is identical: an alert system has a finite signal-to-noise budget, and exceeding the budget destroys the system's value. Quality engineering has not historically thought of itself as alarm-system design, but it is.

Alert Fatigue Is A Quality Failure

| Symptom | Underlying Cause | Consequence |
| --- | --- | --- |
| On-call ignores quality pages | Most pages are not actionable; threshold is too tight | Real failures missed; consumer trust degrades |
| Quality dashboard shows constant red | Visualizing every check as critical | The dashboard becomes wallpaper; nobody looks at it |
| Page rate higher than incident rate | False positives outnumber real signal | Engineers escalate to suppress checks; coverage shrinks |
| Engineers create silent alert filters | The system has not been tuned; humans are filtering instead | Filtering becomes tribal knowledge; new on-call doesn't have it |

Threshold tuning runs the proposed assertion against historical data and counts the alerts that would have fired. A threshold that would have fired three hundred times on the last ninety days of data is not a threshold; it is a constant. A threshold that would have fired zero times is not a threshold; it is a non-check. A useful threshold fires somewhere between two and ten times in a ninety-day window, and each firing corresponds to either a real incident or a known anomaly that can be classified as such. The tuning is empirical, not theoretical.
-- Backtest: dates in the last 90 days on which a |z| >= 3 threshold would have fired
WITH daily_stats AS (
    SELECT
        order_date,
        AVG(amount_usd) AS daily_mean
    FROM fct_orders
    WHERE order_date BETWEEN CURRENT_DATE - 90 AND CURRENT_DATE - 1
    GROUP BY order_date
),
rolling AS (
    SELECT
        order_date,
        daily_mean,
        -- Trailing 28-day baseline, excluding the day being scored
        AVG(daily_mean) OVER (
            ORDER BY order_date
            ROWS BETWEEN 28 PRECEDING AND 1 PRECEDING
        ) AS rolling_mean,
        STDDEV(daily_mean) OVER (
            ORDER BY order_date
            ROWS BETWEEN 28 PRECEDING AND 1 PRECEDING
        ) AS rolling_sd
    FROM daily_stats
)
SELECT
    order_date,
    daily_mean,
    -- NULLIF avoids division by zero when the baseline has no variance
    ROUND((daily_mean - rolling_mean) / NULLIF(rolling_sd, 0), 2) AS z_score
FROM rolling
WHERE ABS((daily_mean - rolling_mean) / NULLIF(rolling_sd, 0)) >= 3
ORDER BY order_date
The query produces the dates on which a |z| >= 3 threshold would have fired. The team reviews each date with an analyst: was something real happening, or was the threshold too tight? Tuning continues until the firing rate matches the rate at which the team can credibly investigate every firing without ignoring any of them.
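The same backtest can be run in memory to compare candidate thresholds side by side. This is a minimal sketch using only the standard library; `backtest` is a hypothetical helper, and unlike the SQL it only scores days that have a full trailing window.

```python
from statistics import mean, stdev


def backtest(daily_values: list[float], z_threshold: float, window: int = 28) -> list[int]:
    """Return the indexes of days on which |z| >= z_threshold would have fired.

    Each day is scored against the `window` days before it, excluding itself.
    """
    fired = []
    for i in range(window, len(daily_values)):
        history = daily_values[i - window:i]
        sd = stdev(history)
        if sd == 0:
            continue  # no variance in the baseline; z is undefined
        z = (daily_values[i] - mean(history)) / sd
        if abs(z) >= z_threshold:
            fired.append(i)
    return fired


def firing_counts(daily_values: list[float], thresholds: list[float]) -> dict[float, int]:
    """How many times each candidate threshold would have fired."""
    return {t: len(backtest(daily_values, t)) for t in thresholds}
```

Calling `firing_counts(series, [2.0, 3.0, 4.0])` on ninety days of history gives the alert count each threshold would have produced, which is the number to compare against the team's investigation capacity.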

What 'Tuned' Looks Like

| Quality SLA Target | Implied Page Rate | Threshold Tightness |
| --- | --- | --- |
| 99.9% (one bad day per 1,000) | About one page every three years per check | Loose; only large shifts fire |
| 99.5% (one bad day per 200) | About two pages per year per check | Moderate; large and persistent shifts fire |
| 99.0% (one bad day per 100) | About one page per quarter per check | Tighter; small persistent shifts fire |
| 95.0% (one bad day per twenty) | One to two pages per month per check | Tight; check is approaching alert fatigue |

A practical pattern uses two thresholds per check. A warning threshold fires more often, into a channel where humans review during business hours. A blocking threshold fires rarely, into the on-call rotation. The warning catches plausible-but-suspicious shifts; the blocker catches definite-and-actionable failures. The warning channel is allowed to be noisy because it does not interrupt anyone outside business hours; the blocking channel is held to a strict signal-to-noise ratio because every page interrupts an engineer.
Warning Channel
  • Reviewed during business hours
  • Tolerates noise; signal extracted by humans during review
  • Z-score thresholds in the 2 to 3 range
  • Includes plausible day-of-week and seasonal anomalies
Blocking Channel
  • Pages on-call regardless of time of day
  • Strict signal-to-noise ratio; tuned to fire rarely
  • Z-score thresholds in the 4 to 6 range
  • Excludes anomalies that historical review classified as benign
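The two-tier pattern reduces to a routing decision on the magnitude of the z-score. A minimal sketch follows; the default cutoffs of 2.5 and 4.0 are assumptions picked from inside the ranges listed above, and the return labels are hypothetical.

```python
def route(z_score: float, warn_at: float = 2.5, block_at: float = 4.0) -> str:
    """Decide where an anomaly goes based on its z-score magnitude."""
    if warn_at >= block_at:
        raise ValueError("warning threshold must be below blocking threshold")
    magnitude = abs(z_score)
    if magnitude >= block_at:
        return "page_on_call"      # blocking tier: interrupts an engineer
    if magnitude >= warn_at:
        return "warning_channel"   # reviewed during business hours
    return "ok"
```

A z-score of -4.7 pages on-call, 3.1 goes to the warning channel, and 1.2 is ignored. Keeping both thresholds in one function makes it harder for the tiers to drift apart during tuning.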
Many quality false alarms are calendar effects: holidays, end-of-month spikes, marketing campaigns, product launches. A pure z-score against a trailing window catches these as alerts, even though they are predictable. The fix is to enrich the baseline with calendar awareness: same-day-of-week comparisons, holiday flags, campaign annotations. The baseline becomes a model rather than a sliding average. The investment is worth it for high-volume tables where calendar effects produce most of the noise. For low-volume tables, the same investment is over-engineering.
-- Score today's row count against the same day of week over the last 90 days
WITH baseline AS (
    SELECT
        EXTRACT(DOW FROM order_date) AS day_of_week,
        AVG(daily_count) AS dow_mean,
        STDDEV(daily_count) AS dow_sd
    FROM fct_orders_daily
    WHERE order_date BETWEEN CURRENT_DATE - 90 AND CURRENT_DATE - 1
    GROUP BY EXTRACT(DOW FROM order_date)
),
today AS (
    -- Computed once instead of repeating the subquery
    SELECT COUNT(*) AS today_count
    FROM fct_orders
    WHERE order_date = CURRENT_DATE
)
SELECT
    baseline.day_of_week AS dow_today,
    today.today_count,
    ROUND(baseline.dow_mean, 0) AS dow_mean,
    -- NULLIF avoids division by zero when the baseline has no variance
    ROUND((today.today_count - baseline.dow_mean) / NULLIF(baseline.dow_sd, 0), 2) AS z_score
FROM baseline
CROSS JOIN today
WHERE baseline.day_of_week = EXTRACT(DOW FROM CURRENT_DATE)
Some anomalies are real and not bugs. Black Friday produces a row count spike that the threshold should know about. A planned product launch produces a feature distribution shift that the threshold should know about. Annotating these in advance prevents the threshold from firing on them. The annotation lives next to the threshold definition and is reviewed in the same PR cycle. Without annotations, the team adds suppression rules ad-hoc during the incident, and those rules outlive the event they were created for.
# Calendar annotations consulted by the threshold engine
known_anomalies:
  - date: '2026-11-27'
    table: fct_orders
    metric: row_count
    expected_z_shift: '+8 to +15'
    reason: 'Black Friday'
  - date_range: '2026-12-20 to 2026-12-26'
    table: fct_orders
    metric: amount_usd_mean
    expected_z_shift: '+2 to +5'
    reason: 'Holiday gifting; higher AOV'
  - date_range: '2026-04-15 to 2026-04-22'
    table: fct_customer_events
    metric: event_type_signup_pct
    expected_z_shift: '+3 to +6'
    reason: 'Spring marketing launch'
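How a threshold engine might consult these annotations can be sketched as a lookup that suppresses an alert only when the date, table, metric, and observed shift all match. The parsed structure and the `expected_reason` helper are assumptions for illustration; the two entries mirror the first two annotations above.

```python
from datetime import date

# Hypothetical parsed form of the annotation file.
KNOWN_ANOMALIES = [
    {"start": date(2026, 11, 27), "end": date(2026, 11, 27),
     "table": "fct_orders", "metric": "row_count",
     "z_range": (8.0, 15.0), "reason": "Black Friday"},
    {"start": date(2026, 12, 20), "end": date(2026, 12, 26),
     "table": "fct_orders", "metric": "amount_usd_mean",
     "z_range": (2.0, 5.0), "reason": "Holiday gifting; higher AOV"},
]


def expected_reason(day: date, table: str, metric: str, z: float):
    """Return the annotated reason if this shift was anticipated, else None."""
    for item in KNOWN_ANOMALIES:
        low, high = item["z_range"]
        if (item["table"] == table and item["metric"] == metric
                and item["start"] <= day <= item["end"]
                and low <= z <= high):
            return item["reason"]
    return None
```

Matching on the expected range matters: a Black Friday row count with a z-score of 10 is suppressed, but a z-score of 20 or a negative shift on the same day still fires, because the annotation describes what was expected rather than muting the day outright.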
Symptoms that thresholds need re-tuning:
  • Engineers create Slack mute rules for specific quality alerts
  • On-call routinely acknowledges pages with 'expected; ignoring'
  • The same check fires on the same day of the week every week
  • New on-call rotations report being overwhelmed by quality alerts
  • A real incident is missed because the page was indistinguishable from noise
Do
  • Tune every threshold against at least 90 days of historical data before turning it on
  • Use two-tier checks: warn loosely and block strictly
  • Annotate known anomalies in version control next to the thresholds
Don't
  • Treat alert volume as a measure of quality coverage; the right measure is incidents caught
  • Tighten thresholds reactively after an incident; tighten as a deliberate review
  • Allow ad-hoc suppression rules to outlive their original cause
TIP
The cost of a false alarm is the next real alarm that gets ignored. Treat threshold tuning as part of building the check, not as a follow-up task.

Contracts on a Legacy Pipeline

Roll out data contracts on a legacy pipeline by documenting the current state, inventorying consumers, observing before enforcing, and migrating behind versioned changes.

Greenfield contracts are easy. Contracts on a legacy pipeline that has run for four years and has dozens of unknown consumers are hard. The mistake most teams make is treating the rollout as a one-shot migration: write the contract, declare the producer compliant, declare consumers responsible for catching up. The mistake produces breakage and erodes the credibility of the contract program. The disciplined rollout treats existing consumer behavior as the starting contract, evolves toward the desired contract over time, and uses the same shape that semver software libraries have used for decades. The exercise below walks through the rollout for a customer events stream that is the lifeblood of three downstream consumers and an unknown number of analyst queries. The legacy character of the pipeline is what makes the rollout interesting. A greenfield rollout has the luxury of starting from a clean schema and an enumerable consumer set. A legacy rollout has neither. The discipline below is what compensates for the missing luxury, and the principles transfer to any production system that has been running long enough that the original authors have moved on and the original assumptions have drifted.

The Starting State

What the team inherits:
  • Producer is a four-year-old Kafka topic with no formal schema; events are JSON
  • Three named consumers: leadership dashboard, churn model, billing aggregator
  • Unknown number of analyst queries reading from the curated layer
  • No quality gates; one wrong-number incident every six to eight weeks
  • Schema has drifted: producers have added and renamed fields without coordination
Step 1 is to document the current contract. The first contract is not the contract the team wants; it is the contract the producer is currently delivering, including the drift. The team writes it by inspecting recent production data: every field that has appeared in the last 30 days, every type variation seen, every value range observed. The result is messy and accurate. The contract version is 1.0.0, and it captures reality, not aspiration.
    # Contract version 1.0.0: documents what is, not what should be
    contract:
      name: customer_events
      version: 1.0.0
      status: documented_legacy
      schema:
        primary_key: event_id
        fields:
          - { name: event_id, type: string, nullable: false }
          - { name: customer_id, type: string, nullable: true, observed_null_rate: 0.008 }
          - { name: event_type, type: string, nullable: true, observed_values: [signup, login, purchase, churn, page_view, click, error] }
          - { name: ts, type: 'union[timestamp, string]', nullable: false, notes: 'producers send both ISO-8601 strings and epoch ms' }
          - { name: amount, type: 'union[decimal, string]', nullable: true, notes: 'string for older clients; decimal for newer' }
      guarantees:
        freshness: 'best effort, typically <= 30 minutes'
        volume: 'no enforced bound; observed 1.5 to 4M per day'
      consumers_known:
        - leadership_dashboard
        - churn_model
        - billing_aggregator
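
Step 1's inspection of recent production data can be approximated with a small profiler. The sketch below assumes the events have already been parsed from JSON into Python dicts; the function name `profile_events` and the output keys are hypothetical, chosen to echo the contract fields above.

```python
def profile_events(events):
    """Summarize observed fields, types, and null rates from sample events.

    `events` is a list of dicts parsed from JSON. A missing key and an
    explicit None are both counted as null.
    """
    total = len(events)
    # Every field name that appeared in any sampled event.
    fields = {key for event in events for key in event}
    profile = {}
    for name in sorted(fields):
        types = set()
        nulls = 0
        for event in events:
            value = event.get(name)
            if value is None:
                nulls += 1
            else:
                types.add(type(value).__name__)
        profile[name] = {
            "types": sorted(types),          # more than one entry means a union type
            "nullable": nulls > 0,
            "observed_null_rate": nulls / total if total else 0.0,
        }
    return profile
```

A field whose `types` list has more than one entry is a candidate for a union type in the documented contract, which is how drift such as mixed string and numeric timestamps would surface.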
Step 2 is to inventory every consumer. The three named consumers are easy. The unknown analyst queries are not. Lineage tooling helps: parsing recent query logs for reads of the table catches most analyst queries; manual outreach catches the rest. The output is a list of every reader, ranked by frequency, with an owner per consumer.
Step 3 adds quality gates in observation-only mode. The gates run, log results, and do not block. After two to four weeks of observation, the team has data on which gates would have fired and tunes against it. Once tuned, gates are ready to switch from observing to enforcing; in this rollout the switch happens in Step 6, after consumers have migrated. Skipping the observation phase is the most common rollout mistake.
    # Gates start in observation mode
    quality_gates:
      - name: row_count_within_baseline
        layer: raw
        mode: observe  # later: 'block'
        threshold: 0.50_to_2.00_of_dow_baseline
      - name: customer_id_null_rate
        layer: raw
        mode: observe
        threshold: 0.01
      - name: amount_distribution_z_shift
        layer: curated
        mode: observe
        block_threshold: 5.0
        warn_threshold: 3.0
      - name: event_type_accepted_values
        layer: raw
        mode: observe
        threshold: enum_present_in_recent_history
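
The difference between observing and enforcing can be sketched as a single branch in the gate runner. This is a minimal illustration under assumptions: `run_gate`, `GateBlocked`, and the dict layout are hypothetical names, and the gate is simplified to a numeric upper bound such as a null rate.

```python
import logging

logger = logging.getLogger("quality_gates")

class GateBlocked(Exception):
    """Raised only when a failing gate is in 'block' mode."""

def run_gate(gate, observed_value):
    """Evaluate one gate; log every result, raise only in block mode.

    `gate` is a dict with 'name', 'mode' ('observe' or 'block'), and a
    numeric upper-bound 'threshold'. Returns True when the gate passes.
    """
    passed = observed_value <= gate["threshold"]
    # Every evaluation is logged so the observation period produces tuning data.
    logger.info(
        "gate=%s mode=%s value=%s threshold=%s passed=%s",
        gate["name"], gate["mode"], observed_value, gate["threshold"], passed,
    )
    if not passed and gate["mode"] == "block":
        raise GateBlocked(f"{gate['name']}: {observed_value} > {gate['threshold']}")
    return passed
```

Because the evaluation and logging path is identical in both modes, the log collected during observation shows exactly which runs would have been blocked once the mode is flipped.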
Step 4 plans the cleanups as versioned changes. Legacy problems (union types on ts and amount, unwhitelisted event_types, ambiguous nulls) are each a versioned change. Renaming ts to event_timestamp with strict ISO-8601 is a breaking change; it ships in version 2.0.0 with a 90-day notice. Whitelisting accepted_values for event_type is backward-compatible if the producer commits to not emitting new values without a contract update; it ships as minor 1.x. Each change has a migration plan naming affected consumers and their actions. Cleanups happen one at a time, not as a single rewrite.
| Change | Version | Type | Consumer Action |
|---|---|---|---|
| Whitelist accepted_values for event_type | 1.1.0 | Backward-compatible (additive guarantee) | None; consumers can adopt the tighter contract or stay on 1.0 |
| Tighten customer_id null rate to <= 1 percent | 1.2.0 | Backward-compatible (tighter guarantee) | None; consumers benefit |
| Rename ts to event_timestamp; strict ISO-8601 | 2.0.0 | Breaking | Pin to 1.x or migrate within 90 days |
| Rename amount to amount_usd; strict decimal | 2.0.0 | Breaking | Pin to 1.x or migrate within 90 days |
| Drop legacy event_types (page_view, click, error) | 3.0.0 | Breaking | Pin to 2.x or migrate within 180 days |
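
The version column follows ordinary semantic-versioning arithmetic, which a short helper makes explicit. The function name `bump_version` and the change-type labels are assumptions for illustration only.

```python
def bump_version(version, change_type):
    """Return the next contract version for a change.

    change_type: 'breaking' bumps major; 'backward_compatible' bumps
    minor; 'fix' bumps patch. Labels are illustrative assumptions.
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if change_type == "breaking":
        return f"{major + 1}.0.0"
    if change_type == "backward_compatible":
        return f"{major}.{minor + 1}.0"
    if change_type == "fix":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change_type}")
```

Applied to the plan above, two backward-compatible changes move 1.0.0 to 1.1.0 and then 1.2.0, and the first breaking release moves 1.2.0 to 2.0.0. Both renames ship together in that single major release rather than consuming a major version each.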
Step 5 migrates consumers behind the versions. Each consumer migrates to the next major version on its own schedule within the announced notice period. The producer maintains both old and new versions during the transition window; dual-publishing is more expensive than single-publishing, and that maintenance burden falls on the producer. The benefit is that no consumer breaks in production; each migration is a deliberate code change with tests, not a surprise in an incident channel. Some consumers migrate in a week; others take the full notice period. Both are acceptable; a unilateral cutover is not. The producer tracks migration progress with a simple consumer-version query against the contract registry.
    -- Consumers seen in the last 7 days, grouped by the contract version they pin
    SELECT
        contract_version,
        COUNT(DISTINCT consumer_team) AS pinned_consumers,
        MIN(last_seen) AS earliest_pin,
        MAX(last_seen) AS latest_pin
    FROM contract_registry.pins
    WHERE contract_name = 'customer_events'
      AND last_seen >= CURRENT_DATE - 7
    GROUP BY contract_version
    ORDER BY contract_version
[Figure: migration timeline. The producer publishes customer_events 1.x, 2.x, and 3.x; the time axis runs pre-announce, Day 0, Day 30, Day 60, Day 90, Day 180, Day 270. Consumer A pins 1.0, Consumer B pins 1.0, Consumer C pins 2.0.]
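
Before the producer stops publishing an old major version, it needs to know who is still pinned to it. The sketch below assumes the registry data has been loaded into a dict of consumer team to pinned version; `consumers_blocking_retirement` is a hypothetical helper name.

```python
def consumers_blocking_retirement(pins, retiring_major):
    """List consumer teams still pinned to the major version being retired.

    `pins` maps consumer team -> pinned contract version string.
    An empty result means no reader on record still depends on that major.
    """
    blocking = []
    for team, version in pins.items():
        # Only the major component decides whether a pin blocks retirement.
        major = int(version.split(".")[0])
        if major == retiring_major:
            blocking.append(team)
    return sorted(blocking)
```

With Consumers A and B pinned to 1.0 and Consumer C pinned to 2.0, retiring major version 1 is blocked by A and B. The producer keeps dual-publishing until that list is empty or the announced notice period has ended.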
Step 6 locks in the new state. Once consumers are migrated and the legacy version is deprecated, the producer flips the gates from observation-only to enforcing on the new contract. CI now rejects producer changes that violate the contract. Uncoordinated schema drift can no longer merge; incidents that used to come every six to eight weeks become rare. The rollout is complete when the producer can commit to the contract as a CI-enforced guarantee, not documentation alone.
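
A CI step that rejects violating changes can be as simple as diffing the proposed schema against the contract. This is a minimal sketch under assumptions: `find_breaking_changes` and the field-spec layout are hypothetical, and a real check would also cover guarantees such as accepted values.

```python
def find_breaking_changes(contract_fields, proposed_fields):
    """Compare a proposed schema against the contract's fields.

    Both arguments map field name -> {'type': str, 'nullable': bool}.
    Returns a list of violations; an empty list means compatible.
    Fields added by the producer are treated as additive and allowed.
    """
    violations = []
    for name, spec in contract_fields.items():
        proposed = proposed_fields.get(name)
        if proposed is None:
            violations.append(f"removed field: {name}")
            continue
        if proposed["type"] != spec["type"]:
            violations.append(
                f"type change on {name}: {spec['type']} -> {proposed['type']}"
            )
        if proposed["nullable"] and not spec["nullable"]:
            violations.append(f"{name} became nullable")
    return violations
```

In a pipeline, the CI job would exit non-zero whenever the returned list is non-empty and the change does not also carry a major version bump with its notice period.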
Before The Rollout
  • Wrong-number incident every six to eight weeks
  • Three named consumers; unknown analyst surface
  • Schema drifts continuously; producers add fields without notice
  • Quality is best-effort; no SLA possible
  • Rolling back a producer change requires consumer-by-consumer triage
After The Rollout
  • Wrong-number incidents are rare; most are caught in CI
  • Every consumer is registered; lineage covers analyst queries
  • Schema evolves through versioned changes with notice
  • Operational and quality SLAs are stated and tracked
  • Producer can roll back any version because all consumers are pinned
The rollout principles, in order:
  • Document the legacy contract as it actually is, not as the team wishes
  • Inventory every consumer; surprise consumers cause migration failures
  • Add gates in observation-only mode and tune against the observed data
  • Plan cleanups as versioned changes; backward-compatible as minor, breaking as major
  • Migrate consumers behind versions on their own schedule within the notice window
  • Flip gates to enforcing only after the migration is complete
TIP
A legacy rollout that takes nine months and breaks zero consumers is a successful rollout. A legacy rollout that takes three months and breaks four consumers undoes its own credibility. Speed is not the goal; trust is.
Do
  • Treat the first contract as a documentation of reality, not aspiration
  • Use semver explicitly; minor for additive, major for breaking, with notice
  • Maintain old and new versions in parallel during the migration window
Don't
  • Cut over consumers unilaterally; the rollout is a negotiation, not a mandate
  • Skip observation mode; thresholds need real data before they can be enforced
  • Treat the rollout as one project; treat each version bump as its own change

Contracts and quality become a system only when the producer can roll back any version because every consumer is pinned. The rollout principle, stated concisely: a legacy rollout earns its trust by treating consumers as parties to a negotiation structured by versions and notice periods, not as inventory to be migrated. The trust accumulates over the months in which no consumer is broken by surprise.

PUTTING IT ALL TOGETHER

> A staff data engineer joins a public-company data platform team. The platform has every quality check the intermediate tier prescribes. On-call rotations are burning out from alert fatigue. Half of incidents are caught by gates and half are still discovered by consumers asking 'why does this number look wrong'. The engineer is asked to turn the quality program into something the team trusts and the consumers can rely on.

  • Adopt data contracts as producer-side commitments enforced in CI. The contract names schema, freshness, volume, uniqueness, and evolution policy. Producer CI rejects breaking changes; pipeline gates are the second line of defense. (Builds on Lesson 1's pipeline-as-product framing.)
  • Cover all five pillars (freshness, distribution, volume, schema, lineage) and treat them as a diagnostic framework, not a checklist. Lineage is the pillar most programs underinvest in; the cost of skipping it is incident response time.
  • State operational and quality SLAs separately. The combined SLA is the product of the two. Designing for one without the other under-detects the other failure mode. (Connects to Lesson 4's orchestration SLAs and Lesson 6's failure handling.)
  • Tune every threshold against historical data. Use two-tier checks: warn loosely, block strictly. Annotate known anomalies (holidays, launches) so the threshold engine is calendar-aware. The cost of a false alarm is the next real alarm that gets ignored.
  • Roll out contracts on the legacy pipeline by documenting reality first, inventorying every consumer, observing before enforcing, and migrating consumers behind versioned changes with explicit notice periods. A nine-month rollout that breaks zero consumers is the goal. (Builds on the layered architecture and decoupling concepts from Lesson 1, the four cheap checks from this lesson's beginner tier, and the five pillars from this lesson's intermediate tier.)
  • Combine all of the above into a quality discipline rather than a quality engineering effort. Engineering catches failures; discipline prevents them. The shift is the difference between a team that manages incidents and a team that prevents them.
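
The SLA multiplication is easy to check with arithmetic. The sketch below assumes operational and quality failures are independent, which is what multiplying the two implies; the 99 percent targets are illustrative numbers, not figures from the lesson.

```python
def combined_sla(operational, quality):
    """Probability that both commitments hold, assuming independent failures."""
    return operational * quality

# Illustrative targets (assumptions): 99% on time, 99% correct.
print(f"{combined_sla(0.99, 0.99):.4f}")  # prints 0.9801
```

Two 99 percent commitments combine to roughly 98 percent, so the combined figure a producer can promise is always lower than either individual target.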
KEY TAKEAWAYS
  • Contracts make quality enforceable: producer commits, consumer relies, CI rejects violations before they ship. Pipeline gates are the second line of defense, not the first.
  • The five pillars are coverage, not a checklist: freshness, distribution, volume, schema, lineage. A program with four pillars and zero lineage has a known blind spot.
  • Operational and quality SLAs are separate commitments: the combined SLA is their product. Promise only what that math makes achievable.
  • Threshold tuning is part of authoring a check: use historical data, two-tier warn-and-block, and calendar annotations. The cost of false alarms is the next real alarm that gets ignored.
  • Legacy rollouts are negotiations, not migrations: document reality, inventory consumers, observe before enforcing, version every change with notice. Speed is not the goal; trust is.

Contracts make quality enforceable; observability makes it diagnosable; tuning makes it trusted

Category: Pipeline Architecture
Difficulty: advanced
Duration: 38 minutes
Challenges: 0 hands-on challenges

Topics covered: Data Contracts and CI Enforcement, Five Pillars of Observability, Quality SLAs vs Ops SLAs, Tuning Thresholds vs History, Contracts on a Legacy Pipeline
