Data Quality and Contracts: Beginner

A logistics company at Series C scale ran a nightly pipeline that aggregated package events into a delivery performance dashboard. One Tuesday the dashboard showed an on-time rate of 99.7 percent across every region. The number was wrong. An upstream service had silently dropped a Kafka partition for six hours, and only the late events that did make it through were marked late. Every check the pipeline had passed: extract finished, transform finished, load finished, the table was updated. Operations leadership made a routing decision based on the dashboard before anyone noticed the gap. The incident postmortem named the failure not as a pipeline outage but as a data quality outage. The pipeline ran. The data was wrong. The two are different events, and a pipeline that does not separate them is a pipeline that lies to its consumers in green.

Pipeline Ran vs Data Is Good

Daily Life
Interviews

Recognize the gap between operational success and semantic success, and name the failure mode that hides in that gap.

Pipelines have two distinct success criteria. One criterion is operational: did the code execute, did the writes commit, did the orchestrator mark the run green. The other criterion is semantic: does the data the pipeline produced actually describe the world correctly. Operational success is necessary but not sufficient for semantic success. The most expensive production incidents in mature data organizations are the ones where operational success and semantic failure coexist, because nobody is alerted until a human notices a number that looks wrong. Internal postmortems at Netflix, Uber, and Stripe across the last decade share the same shape: the pipeline ran, the gates were green, the dashboard updated, and the underlying data was wrong for hours or days before anyone investigated. The shared diagnosis in those postmortems is not a code bug; it is a missing signal. The pipeline had no way to say 'the data is wrong' even when it was.

Two Definitions of Success

CriterionWhat It AsksWho Notices When It Fails
Operational successDid the job exit zero, did the orchestrator mark it greenOn-call alert fires within minutes
Semantic successDoes the output table describe what the world actually containsA human asks 'why does this number look strange' hours or days later
Combined successBoth criteria pass; the pipeline ran and produced trustworthy dataNobody notices because nothing is wrong

The gap between these two criteria has a name: silent data corruption. Silent because nothing in the pipeline raises an alarm. Corruption because the data is wrong. The dashboard updates, the model retrains, the report ships, and the world moves on while the data underneath says something untrue.

Concrete Failure Modes That Look Green

What Happened UpstreamWhat the Pipeline DidWhat the Output Looked Like
A Kafka partition dropped events for six hoursRead every event that was actually present, succeeded normallyA row count 30 percent below normal that nobody noticed
A producer started sending NULL for a column that used to be requiredLoaded the column as NULL; no schema check existedDownstream joins silently lost rows; revenue dashboard dropped 12 percent
A dedupe step changed its key by accidentRan successfully against the new key; output had duplicate ordersOrder count doubled; finance reconciled three days later
Source database failover lost the last 15 minutes of writesPulled what was there, no awareness of the gapCustomer activity dashboard missed a real spike during a marketing push
The two questions every pipeline must answer separately:
  • Did the run finish without raising an exception
  • Does the output table describe a world that actually existed
  • Without the second question, the answer to the first is misleading
  • Quality checks are how the second question gets asked in code

Where Trust Comes From

Consumer trust in a data product comes from two things: the data has been correct in the past, and the producer can prove that the current run is also correct. The proof is the quality check. Without a quality check, the only proof is time, and time is the most expensive proof: a quarter of correct dashboards followed by one wrong number is not enough to keep trust intact. Once a finance team finds a wrong number, every prior number gets re-inspected. The retrospective inspection consumes more engineering hours than the original quality checks would have. The pattern repeats across teams. A single bad number on a CFO's desk often kicks off a multi-week audit of every related number, an audit that produces nothing wrong but consumes the data engineering team for the duration. The cheapest way to prevent that audit is to have evidence on hand when the question is asked.
Pipeline With No Quality Checks
  • Operational success means 'green'; semantic failure goes unnoticed
  • Consumers discover problems by spotting strange numbers in dashboards
  • Trust is fragile; one wrong number triggers retrospective audits
  • Failure root cause takes hours to find because nothing was checked
Pipeline With Basic Quality Checks
  • Green requires both operational success and quality checks passing
  • Failures route to alerts within the same run as the failure
  • Trust accumulates because correctness has been continuously proved
  • Failure root cause is narrowed by which check fired

What a Quality Check Actually Is

A quality check is a small piece of code that asserts something about a dataset and fails noisily if the assertion does not hold. The check runs as part of the pipeline, in the same DAG as the work it is checking. Its only job is to compare what was produced against what is plausible. A quality check that runs in a separate dashboard that nobody watches is not a quality check; it is a chart. The defining property is that the check has the authority to stop the pipeline or to fire an alert that a human will see. The same code that produces the check also defines the response. A check whose response is unspecified ends up as background noise; a check whose response is concrete (page on-call, halt the publish, file a ticket) is a real protective control. Authoring the response in the same PR as the check is the discipline that turns instrumentation into protection.
1# A quality check in its smallest possible form
2def assert_row_count_in_range(table, min_rows, max_rows):
3 actual = count_rows(table)
4 if actual < min_rows or actual > max_rows:
5 raise QualityCheckFailed(
6 f'{table} produced {actual} rows; expected {min_rows} to {max_rows}'
7 )
8 return actual
9
10# Wired into the pipeline immediately after the transform
11transform_orders()
12assert_row_count_in_range('mart.daily_orders', 50000, 200000)
13publish_marts()

Quality checks are not optional infrastructure for a serious pipeline. Pipelines without them are demos that happened to make it to production.

alert
Operational success and semantic success are independent; a green run can produce wrong data.
check
Silent data corruption is the failure mode that costs the most because nobody notices it in time.
query
A quality check is code that asserts a property of the data and has authority to halt or alert.
TIP
When reviewing a new pipeline, ask the author what would happen if the source silently dropped half its data tomorrow. If the answer involves a human noticing, the pipeline needs quality checks before it ships.

Four Cheap Quality Checks

Daily Life
Interviews

Apply the four cheap quality checks (row count, null rate, uniqueness, freshness) to a production table.

Quality engineering has a 90/10 rule. Roughly ninety percent of silent failures are caught by ten percent of the possible checks. The four cheap checks below cover that ninety percent. They run in seconds, they need only basic SQL, and they catch the most common production incidents. The point of starting with these four is that any of them is better than none, and arguments about more sophisticated checks are arguments about edge cases until the basics are in place. The four checks are also the easiest to explain to a non-technical stakeholder. A finance partner can understand 'the row count was outside its band'; the same partner cannot reasonably be asked to interpret a Kolmogorov-Smirnov statistic. Explainability is not a courtesy; it is what allows the alert to translate into a corrective action without a phone call.
Row countNull rateUniquenessFreshness
Row count
Did roughly the right amount of data arrive
An order table that produces 100,000 rows on a normal day and 12,000 rows today is almost certainly broken upstream. A simple range check catches dropped partitions, failed joins, and broken filters.
Null rate
Did required columns stay populated
A column that is normally 99.9 percent populated and is suddenly 80 percent NULL signals a producer change. The check is a single percentage threshold per column.
Uniqueness
Did the primary key stay unique
Duplicates in a key column mean a deduplication step failed or a join exploded. The check is a count of distinct keys versus total rows.
Freshness
Is the most recent data recent enough
An events table whose latest timestamp is from six hours ago when it should be within fifteen minutes signals an ingestion problem. The check is a single MAX timestamp comparison.

What Each Check Catches

CheckCommon Cause When It FailsWhat Would Have Happened Without It
Row count below expectedSource dropped events; filter became too restrictive; join lost rowsDashboard shows artificially low numbers; decisions made on partial data
Row count above expectedDeduplication failed; join exploded into a Cartesian product; backfill ran twiceAggregates double-count; revenue numbers inflated
Null rate spikeProducer changed a required field to optional; upstream extract bugJoins on the column drop rows silently; downstream metrics shift
Uniqueness violatedDedup step missed; primary key composition changedCounts double; downstream MERGE statements behave unpredictably
Freshness behind thresholdIngestion job stalled; source system outage; partition not closedConsumers see stale data and make decisions assuming it is current

Row Count: The Cheapest Check That Earns Its Keep

A row count check uses a recent baseline to decide what is reasonable. The simplest version compares the current run against the average of the last seven days, with a tolerance band. A run that produces fewer than fifty percent of the recent average or more than two hundred percent fails the check. The exact bounds are tuned later; the point is that the check exists and runs. A common refinement is to compare against a same-day-of-week baseline rather than a flat seven-day average, because most consumer-facing data has a weekly seasonality pattern (weekends are quieter, Tuesdays are busier, Friday afternoons drop off). The day-of-week baseline absorbs that pattern automatically and tightens the band on the days where the band can be tight.
1/* Row count check: today's value within 50% to 200% of the trailing 7-day average */
2WITH today AS (
3 SELECT
4 COUNT(*) AS row_count
5 FROM mart.daily_orders
6 WHERE order_date = CURRENT_DATE
7),
8baseline AS (
9 SELECT
10 AVG(row_count) AS expected
11 FROM mart.daily_orders_history
12 WHERE order_date BETWEEN CURRENT_DATE - 8
13 AND CURRENT_DATE - 1
14)
15
16SELECT
17 today.row_count,
18 baseline.expected,
19 today.row_count BETWEEN 0.5 * baseline.expected
20 AND 2 * baseline.expected AS within_range
21FROM today
22CROSS JOIN baseline

Null Rate: One Threshold Per Column

Null rate is calculated as the number of NULL values divided by the total row count, expressed as a percentage. A column whose null rate has historically been below one percent and is now eight percent is broken even if no individual row is technically invalid. The threshold is per column because different columns have different baseline null rates. A required field has a different threshold than an optional one. The check encodes which columns the consumer cares about and what the producer has historically delivered. The most important columns to check for null rate are the ones downstream consumers depend on for filters, joins, and aggregations. A NULL in customer_id breaks every join to dim_customer. A NULL in event_timestamp breaks every time-windowed query. A NULL in amount breaks every revenue aggregation. The consumer's query patterns name which columns matter; the null rate check protects those columns specifically.
1SELECT
2 COUNT(*) AS total_rows,
3 SUM(
4 CASE
5 WHEN customer_id IS NULL THEN 1
6 ELSE 0
7 END
8 ) AS null_customer_id,
9 ROUND(
10 100 * SUM(
11 CASE
12 WHEN customer_id IS NULL THEN 1
13 ELSE 0
14 END
15 ) / COUNT(*),
16 2
17 ) AS null_rate_pct
18FROM mart.daily_orders

Uniqueness: The Primary Key Has To Be Primary

A primary key is meant to be unique. If two rows share the same key, downstream joins, MERGEs, and aggregates are all wrong. The check compares COUNT(*) to COUNT(DISTINCT key). When the two diverge, duplicates exist. The most common cause of duplicate keys is a deduplication step that ran but did not catch every duplicate, or a join that produced more rows than expected because of an unexpected one-to-many relationship. A subtle variation of the same problem is a primary key that used to be a single column and is now a composite. A producer that started emitting events with a new partition dimension can keep the same event_id while distributing copies of an event across partitions, and the resulting duplicates are real and structurally invisible if the uniqueness check still tests only the original key. The fix is to update the uniqueness check whenever the producer changes the key composition.
1SELECT
2 COUNT(*) AS total_rows,
3 COUNT(DISTINCT order_id) AS unique_keys,
4 COUNT(*) - COUNT(DISTINCT
5 order_id
6 ) AS duplicate_count
7FROM mart.daily_orders

Freshness: The Last Timestamp

Freshness asks whether the most recent record is recent enough to match the consumer's expectation. The check is one expression: MAX of an event timestamp column compared to NOW. The threshold is the freshness SLA. A table that promises one-hour freshness fails the check when MAX(event_timestamp) is more than one hour behind the wall clock. Freshness checks catch ingestion stalls earlier than row count checks because a stalled ingestion produces zero new rows, which can look identical to a quiet hour without freshness as a separate signal. The freshness check also catches the case where the pipeline finishes successfully but processed no new data because the source's clock is wrong, which is otherwise invisible to row count and null rate checks.
Do
  • Add the four checks at the same time as the pipeline, not after the first incident
  • Keep thresholds simple and tunable; do not over-engineer at the start
  • Compute checks from production data, not from synthetic samples
Don't
  • Skip the cheap checks because more sophisticated ones are planned
  • Hardcode magic numbers without explaining the baseline they came from
  • Build dashboards of quality metrics that nobody is paid to watch
TIP
Implement all four checks in a single SQL view per table so that a single SELECT against the view summarizes the table's quality state. Reuse beats reinvention.

Quality Checks at Boundaries

Daily Life
Interviews

Place quality gates at every layer boundary so that failures stay scoped to the layer where they originated.

A common mistake in pipeline design is to place all quality checks at the end. The reasoning is that final checks protect the consumer-facing table, which is the part the world sees. The reasoning is incomplete. By the time a problem shows up at the end, several intermediate transforms have already run on bad data. The diagnostic cost climbs because the failure has to be traced back through every transform between the source and the gate. Checks at every layer boundary keep the failure scoped to the layer where it originated. The same principle applies to software testing: unit tests catch bugs at the function boundary, integration tests catch bugs at the module boundary, end-to-end tests catch bugs at the system boundary. Skipping unit tests because end-to-end tests exist makes the end-to-end failures harder to diagnose. Skipping layer-boundary quality gates because a serving-layer gate exists makes serving-layer failures harder to diagnose for the same reason.

The Layered Picture

1Source | v Raw layer < | v Curated layer < | v Serving layer < | v Consumer

Three gates, one at each boundary. Gate 1 catches ingestion failures. Gate 2 catches transform failures. Gate 3 catches serving failures. Each gate scopes the diagnostic effort to one layer.

Why End-Only Checks Fail

Where the Check SitsWhat It CatchesWhat It Fails To Catch
Only at the end (serving layer)Visible breakage of consumer tablesBad raw data that was already aggregated; root cause is buried
At every layer boundaryEach kind of failure at the layer where it originatedNothing in scope; the gates fire on the layer they protect
Only at the sourceExtraction problems; nothing about transform correctnessLogic bugs in transforms; key changes in joins

What Each Gate Asserts

GateLayer BoundaryTypical Assertions
Gate 1: IngestionSource -> RawRow count from source; primary key uniqueness; freshness of latest record
Gate 2: TransformRaw -> CuratedCurated row count proportional to raw; no orphan foreign keys; required columns populated
Gate 3: ServingCurated -> ServingAggregates within historical band; primary key still unique after grouping; SLA-relevant timestamps fresh

Concrete Example: The Same Failure Scoped Three Ways

An upstream order producer began emitting NULL for the customer_id column on Monday morning. With gates at every boundary, the failure surfaced as follows. Gate 1 caught a null rate spike on customer_id in the raw layer immediately after extraction. The pipeline halted at the raw layer. The curated and serving transforms never ran. The on-call engineer was paged, looked at the gate output, and within minutes had a one-line description of the problem: customer_id null rate jumped from 0.1 percent to 8.4 percent. With end-only checks, the same failure would have produced a 12 percent revenue drop on the dashboard several hours later, and the diagnosis would have started by reading the dashboard, then the curated table, then the raw table, then the source. The contrast is the entire reason for layered gates. Boundary gates produce a single, scoped, actionable alert. End-only gates produce an unscoped symptom that has to be unpacked. The difference between two minutes of diagnosis and two hours of diagnosis is exactly the difference between catching the failure where it happens and catching the failure where it shows up.
1# Pipeline pseudocode with gates at every layer boundary
2run_extract()
3run_quality_gate(layer='raw', table='raw.orders')
4
5run_transform_to_curated()
6run_quality_gate(layer='curated', table='curated.fct_orders')
7
8run_serving_aggregations()
9run_quality_gate(layer='serving', table='mart.daily_orders')
10
11notify_consumers()
Reasons gates belong at every boundary, not only at the end:
  • Each gate fires on the layer where the failure originated
  • Diagnostic cost is proportional to where the failure was caught
  • Downstream transforms do not run on broken data, saving compute and confusion
  • The failure root cause is named by the gate that fired

What "At The Boundary" Means

A gate at a boundary runs after the layer's writes have completed and before the next layer reads from it. The gate is a scheduled task in the orchestrator that depends on the layer's transforms. The gate's success or failure is what gates downstream work, not the orchestrator's view of whether the transform exited zero. If the transform succeeded operationally but the gate fails, the next layer does not start. The gate has the authority to stop the DAG.
Quality Check As A Dashboard
  • Runs after the pipeline; humans look at it later
  • Failure does not stop downstream work
  • Becomes background noise; staleness is normal
  • No on-call response when checks fail
Quality Check As A Gate
  • Runs in the DAG; downstream depends on it
  • Failure halts the next layer immediately
  • Failures are events; engineers respond when they fire
  • On-call response is part of the runbook

A quality check that does not have authority to stop the pipeline is information, not a control. Information is useful; controls are protective.

check
Place a quality gate at every layer boundary; failures stay scoped to where they originated.
alert
End-only checks bury root causes under successful intermediate transforms.
query
A gate is a DAG task that downstream depends on; it has authority to halt the pipeline.

Warn vs Block Authorities

Daily Life
Interviews

Choose between warning and blocking for each quality check based on the consequence of running with bad data.

Not every quality check should stop the pipeline. Some failures are catastrophic and demand a halt; others are advisory and demand a notification. Treating every check as a blocker creates an over-protective pipeline that halts on minor anomalies and wakes engineers up at 3am for problems that could have waited. Treating every check as a warning creates a pipeline that ignores its own alarms. The classification is per-check, not per-pipeline, and the rule is simple: block when running is worse than not running, warn when running is better than not running but still imperfect. The classification has to be made at the time the check is authored, not later. A check that ships as 'temporarily a warning until the team gets used to it' tends to stay a warning forever, because nothing forces the conversation about promotion. The default authority should be the long-term authority.

The Decision Rule

Failure SeverityAuthorityWhat Happens
Output would mislead consumersBlockPipeline halts; downstream tables stay on the previous run; on-call is paged
Output is degraded but usableWarnPipeline completes; consumers receive a freshness or quality annotation; ticket is filed
Output is suspicious but plausibleWarnPipeline completes; humans review during business hours; no page

Examples By Authority

BlockWarn
Block
Halt the pipeline; previous data stays
Reserved for failures where running is worse than not running: uniqueness violations, schema breaks, large row count drops. Pages on-call. Targets one or two firings per pipeline per quarter.
Warn
Continue with notification
For failures where the data is degraded but usable. Posts to a Slack channel reviewed during business hours. Does not interrupt on-call. Tolerates more noise; humans extract signal during review.
Block (Halt the Pipeline)
  • Primary key uniqueness violated; downstream joins are now wrong
  • Required column null rate above 5 percent threshold
  • Row count below 50 percent of recent baseline
  • Schema mismatch on a column the consumer parses
Warn (Notify but Continue)
  • Row count 80 to 90 percent of recent average; plausibly a quiet day
  • Optional column null rate slightly above norm
  • Distribution shift on a non-load-bearing dimension
  • Freshness slightly behind SLA but within tolerance

What "Block" Actually Means In Practice

A blocking check is implemented as a DAG task that fails when the assertion fails. The orchestrator marks the task red and refuses to start downstream tasks that depend on it. The previous run's data stays in place; consumers see the last good output. The on-call rotation receives an alert that includes the check that fired and the assertion it failed. Blocking is the strongest available authority and reserves itself for cases where running is worse than not running. The implementation pattern matters because it determines what consumers see during a failure. If the gate halts the publish, consumers see yesterday's data, which is correct (just stale) and clearly labeled as such. If the gate is implemented downstream of the publish, consumers see today's bad data with a warning attached, which is worse: the data is wrong, and consumers who do not read the warning act on it. Halting the publish is almost always the safer default.
1# A blocking check raises and halts the DAG
2def block_if_uniqueness_violated(table, key_column):
3 duplicates = run_sql(
4 f'SELECT COUNT(*) - COUNT(DISTINCT {key_column}) AS d FROM {table}'
5 )
6 if duplicates > 0:
7 raise QualityCheckFailed(
8 f'{table}.{key_column} has {duplicates} duplicate rows; downstream halted'
9 )

What "Warn" Actually Means In Practice

A warning check writes the failure to a quality log table, files a ticket, posts to a Slack channel that humans actually read, and lets the pipeline continue. Consumers can see that today's run had a warning, which informs their interpretation of the data. The warning has to be loud enough to be noticed and quiet enough not to drown out blocking alerts. A warning that fires on every run becomes background noise; a warning that fires once a quarter gets investigated. Designating a separate Slack channel for warnings keeps the blocking-channel signal-to-noise ratio honest. Every page in the blocking channel should be actionable; warnings live elsewhere and are reviewed in batches during business hours. Mixing the two erodes the value of both: actionable pages get lost in advisory noise, and advisory signals get overlooked because nobody wants to disturb the blocking-channel signal-to-noise ratio.
1# A warning check logs and notifies, but continues
2def warn_if_row_count_low(table, baseline_avg, tolerance=0.20):
3 actual = count_rows(table)
4 expected_low = baseline_avg * (1 - tolerance)
5 if actual < expected_low:
6 log_quality_warning(
7 check='row_count',
8 table=table,
9 actual=actual,
10 expected_min=expected_low,
11 severity='warn'
12 )
13 post_to_slack('#data-quality', f'{table} low: {actual} vs {expected_low}')
14 return actual

The Cost of Getting It Wrong

A pipeline that blocks too often creates a real cost: every block is an on-call interruption, and every false block trains the on-call rotation to acknowledge alerts without reading them. A pipeline that warns too often creates a different cost: warnings get ignored, and a real warning lands in a sea of noise. The discipline is to keep blocks rare and meaningful, and to keep warnings informative rather than constant. Tuning thresholds against historical data is the way to keep both rates honest.
Block when at least one of these is true:
  • Downstream consumers will make wrong decisions if the data is published
  • The failure cannot be fixed by retry and requires human intervention
  • The output will need to be reverted if it is allowed to publish
  • The failure indicates a contract violation between producer and consumer
Warn when at least one of these is true:
  • The data is degraded but still usable for the most common consumer use cases
  • The anomaly is plausible (a holiday, a marketing event, a known upstream change)
  • The failure is informative for trend analysis but not actionable in the moment
  • Halting would cause more disruption than the bad data itself
TIP
Keep the block rate to a target of one or two blocking failures per pipeline per quarter. If a particular check blocks more often than that, either the threshold needs tuning or the check should be downgraded to a warning.
alert
Block when running is worse than not running; warn when running is imperfect but better than halting.
check
Each check carries its own authority; the choice is per-check, not per-pipeline.

First Quality Gate: Row Count

Daily Life
Interviews

Build a complete first quality gate as a SQL assertion wired into a DAG, halting downstream when the assertion fails.

Concepts become useful when applied. The exercise here builds a complete first quality gate: a SQL assertion that the row count for a daily order summary table falls within an expected range. The gate is implemented as a SQL query, the query is run by the orchestrator after the transform finishes, and the gate halts the DAG when the assertion fails. The result is a working quality gate in fewer than thirty lines of code. The exercise is deliberately small. Small gates ship; large gates linger in design review. The discipline of starting with the simplest possible assertion is the same discipline that pays off in software engineering: a working primitive is the foundation that everything more sophisticated builds on top of. Once the first gate is in production and the team has confidence in it, additional gates copy the pattern with different SQL. The cost of the second gate is a fraction of the cost of the first because the orchestration wiring, the alert routing, and the runbook structure already exist.

Step 1: Pick the Table and the Assertion

The table is mart.daily_orders, produced once per day at 2am. The expected row count varies by day of week (weekends are quieter than weekdays) but stays within a stable historical band. The assertion is that today's row count must fall between fifty percent and two hundred percent of the average row count for the same day of week over the trailing eight weeks. The bounds are wide enough to absorb genuine variation and tight enough to catch real failures.

Step 2: Write the SQL

1WITH today AS (
2 SELECT
3 order_date,
4 COUNT(*) AS row_count
5 FROM mart.daily_orders
6 WHERE order_date = CURRENT_DATE
7 GROUP BY order_date
8),
9baseline AS (
10 SELECT
11 EXTRACT(DOW FROM order_date) AS day_of_week,
12 AVG(row_count) AS expected_count,
13 STDDEV(row_count) AS expected_stddev
14 FROM mart.daily_orders_history
15 WHERE order_date BETWEEN CURRENT_DATE - 56
16AND CURRENT_DATE - 1
17 AND EXTRACT(DOW FROM order_date) = EXTRACT(DOW FROM CURRENT_DATE)
18 GROUP BY EXTRACT(DOW FROM order_date)
19)
20
21SELECT
22 today.row_count,
23 baseline.expected_count,
24 ROUND(
25 today.row_count / baseline.expected_count,
26 2
27 ) AS ratio,
28 CASE
29 WHEN today.row_count BETWEEN 0.5 * baseline.expected_count
30 AND 2 * baseline.expected_count THEN 'pass'
31 ELSE 'fail'
32 END AS gate_result
33FROM today
34CROSS JOIN baseline
The query produces one row with a gate_result of 'pass' or 'fail'. The orchestrator runs the query and reads the value. A 'pass' value lets downstream tasks proceed. A 'fail' value raises an exception that halts the DAG. The same shape works for any row-count assertion against any table.

Step 3: Wire the Gate Into the DAG

1# Airflow operator wiring (sketch)
2from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
3
4transform_orders = SnowflakeOperator(
5 task_id='transform_orders',
6 sql='sql/transforms/daily_orders.sql',
7)
8
9row_count_gate = SnowflakeOperator(
10 task_id='row_count_gate',
11 sql='sql/quality/row_count_gate.sql',
12 do_xcom_push=True,
13)
14
15from airflow.operators.python import PythonOperator
16
17def raise_if_fail(**ctx):
18 result = ctx['ti'].xcom_pull(task_ids='row_count_gate')
19 if result and result[0][3] != 'pass':
20 raise ValueError(f'Quality gate failed: {result}')
21
22enforce_gate = PythonOperator(
23 task_id='enforce_gate',
24 python_callable=raise_if_fail,
25 provide_context=True,
26)
27
28publish_to_consumers = SnowflakeOperator(
29 task_id='publish_to_consumers',
30 sql='sql/serving/publish.sql',
31)
32
33transform_orders >> row_count_gate >> enforce_gate >> publish_to_consumers

Step 4: What Happens When the Gate Fires

When the gate fails, enforce_gate raises a ValueError. Airflow marks the task red. Downstream publish_to_consumers does not start. Consumers continue to read yesterday's mart, which is the most recent good data. An on-call alert fires. The runbook directs the on-call engineer to inspect the gate's output: today's row count, the expected count, the ratio, and the day of week being compared. The engineer either fixes the upstream problem and re-runs, or determines that the data is correct and the threshold needs updating, in which case the threshold change is itself a code change with a PR review. The PR review for a threshold change is more important than it might appear. Loosening a threshold without review is how quality programs silently degrade over years; every loosening looks reasonable in the moment, and the cumulative effect is that the gates no longer catch real failures. Treating threshold changes as code keeps the loosening visible and historical.

Step 5: Iterate

IterationChangeReason
1Add the row count gate as writtenEstablishes the floor of quality protection
2Add a null rate gate on customer_id, order_amountCatches producer changes that load-bearing columns started arriving NULL
3Add a uniqueness gate on order_idCatches dedup failures and join explosions
4Add a freshness gate on event_timestampCatches ingestion stalls before they become row count problems
5Tune thresholds against three months of historical dataReduces false positives; raises confidence in real failures

The Whole Picture

1transform_orders | v row_count_gate(SQL : COUNT vs baseline) | v enforce_gate(Python : raise ON fail) | v(only IF pass) publish_to_consumers | v Looker dashboard reads mart.daily_orders
Five tasks, four arrows, one quality gate. The dashboard either shows fresh data because the gate passed, or the previous good data because the gate halted the publish. Either way, consumers are protected from silent corruption. The same shape extends to as many gates as the table needs.
Do
  • Express assertions in SQL where possible; the warehouse is already running the query
  • Keep the gate's output a single row with a clear pass or fail value
  • Wire the gate as a DAG task that downstream depends on
Don't
  • Compute the assertion in Python when SQL would do; doubles the surface area
  • Skip the historical baseline; magic numbers age badly
  • Let the gate share a task with the transform; failures must be distinguishable
TIP
When the first gate is in place, every later quality investment is incremental. The hard part is the first gate; the second through tenth are copies of the pattern with different SQL.
PUTTING IT ALL TOGETHER

> A startup data team has just shipped its first end-to-end pipeline: Postgres orders extracted to S3, transformed in Snowflake into mart.daily_orders, and read by a Looker dashboard for the leadership team. The CEO begins making weekly decisions from the dashboard. The data engineer is asked to make sure the dashboard is right and to make sure the team will be notified when it is not.

Operational success of the pipeline (the run finished green) does not imply semantic success (the data describes the world correctly). The first move is to add the four cheap checks: row count, null rate, uniqueness, and freshness.
Place the gates at every layer boundary: ingestion (raw), transform (curated), and serving (mart). Each gate scopes failures to where they originated, drawing on the layered shape introduced in Lesson 1.
Classify each check as block or warn. Block when running is worse than not running (uniqueness violated, null rate above threshold). Warn when running is imperfect but better than halting.
Build the first concrete gate as a SQL assertion against a historical baseline, wired into the DAG so the publish step depends on the gate. Iterate from there: more gates, tuned thresholds, eventually a quality suite the consumer can rely on.
KEY TAKEAWAYS
Operational success is not semantic success: a green pipeline can produce wrong data, and the gap between the two is where silent corruption lives.
Four cheap checks catch most failures: row count, null rate, uniqueness, freshness. Implement these first, before more sophisticated checks.
Quality gates belong at every layer boundary: checks at the end bury root causes; checks at every boundary scope failure to the layer where it originated.
Block when running is worse than not running: warn when running is imperfect but better than halting. Keep blocking failures rare and meaningful.
A first gate is a SQL assertion in a DAG task: compare against a historical baseline, return a single pass-or-fail value, halt downstream when it fails.

A pipeline that ran is not the same as a pipeline that produced correct data

Category
Pipeline Architecture
Difficulty
beginner
Duration
25 minutes
Challenges
0 hands-on challenges

Topics covered: Pipeline Ran vs Data Is Good, Four Cheap Quality Checks, Quality Checks at Boundaries, Warn vs Block Authorities, First Quality Gate: Row Count

Lesson Sections

  1. Pipeline Ran vs Data Is Good (concepts: paDataQuality, paSilentFailure)

    Pipelines have two distinct success criteria. One criterion is operational: did the code execute, did the writes commit, did the orchestrator mark the run green. The other criterion is semantic: does the data the pipeline produced actually describe the world correctly. Operational success is necessary but not sufficient for semantic success. The most expensive production incidents in mature data organizations are the ones where operational success and semantic failure coexist, because nobody is

  2. Four Cheap Quality Checks (concepts: paFourQualityChecks, paFreshnessCheck)

    Quality engineering has a 90/10 rule. Roughly ninety percent of silent failures are caught by ten percent of the possible checks. The four cheap checks below cover that ninety percent. They run in seconds, they need only basic SQL, and they catch the most common production incidents. The point of starting with these four is that any of them is better than none, and arguments about more sophisticated checks are arguments about edge cases until the basics are in place. The four checks are also the

  3. Quality Checks at Boundaries (concepts: paQualityGate, paLayerBoundaryChecks)

    A common mistake in pipeline design is to place all quality checks at the end. The reasoning is that final checks protect the consumer-facing table, which is the part the world sees. The reasoning is incomplete. By the time a problem shows up at the end, several intermediate transforms have already run on bad data. The diagnostic cost climbs because the failure has to be traced back through every transform between the source and the gate. Checks at every layer boundary keep the failure scoped to

  4. Warn vs Block Authorities (concepts: paWarnVsBlock, paAlertSeverity)

    Not every quality check should stop the pipeline. Some failures are catastrophic and demand a halt; others are advisory and demand a notification. Treating every check as a blocker creates an over-protective pipeline that halts on minor anomalies and wakes engineers up at 3am for problems that could have waited. Treating every check as a warning creates a pipeline that ignores its own alarms. The classification is per-check, not per-pipeline, and the rule is simple: block when running is worse t

  5. First Quality Gate: Row Count (concepts: paFirstQualityGate, paSqlAssertion)

    Concepts become useful when applied. The exercise here builds a complete first quality gate: a SQL assertion that the row count for a daily order summary table falls within an expected range. The gate is implemented as a SQL query, the query is run by the orchestrator after the transform finishes, and the gate halts the DAG when the assertion fails. The result is a working quality gate in fewer than thirty lines of code. The exercise is deliberately small. Small gates ship; large gates linger in