A logistics company at Series C scale ran a nightly pipeline that aggregated package events into a delivery performance dashboard. One Tuesday the dashboard showed an on-time rate of 99.7 percent across every region. The number was wrong. An upstream service had silently dropped a Kafka partition for six hours, and only the late events that did make it through were marked late. Every check the pipeline had passed: extract finished, transform finished, load finished, the table was updated. Operations leadership made a routing decision based on the dashboard before anyone noticed the gap. The incident postmortem named the failure not as a pipeline outage but as a data quality outage. The pipeline ran. The data was wrong. The two are different events, and a pipeline that does not separate them is a pipeline that lies to its consumers in green.
Pipeline Ran vs Data Is Good
Daily Life
Interviews
Recognize the gap between operational success and semantic success, and name the failure mode that hides in that gap.
Pipelines have two distinct success criteria. One criterion is operational: did the code execute, did the writes commit, did the orchestrator mark the run green. The other criterion is semantic: does the data the pipeline produced actually describe the world correctly. Operational success is necessary but not sufficient for semantic success. The most expensive production incidents in mature data organizations are the ones where operational success and semantic failure coexist, because nobody is alerted until a human notices a number that looks wrong. Internal postmortems at Netflix, Uber, and Stripe across the last decade share the same shape: the pipeline ran, the gates were green, the dashboard updated, and the underlying data was wrong for hours or days before anyone investigated. The shared diagnosis in those postmortems is not a code bug; it is a missing signal. The pipeline had no way to say 'the data is wrong' even when it was.
Two Definitions of Success
Criterion
What It Asks
Who Notices When It Fails
Operational success
Did the job exit zero, did the orchestrator mark it green
On-call alert fires within minutes
Semantic success
Does the output table describe what the world actually contains
A human asks 'why does this number look strange' hours or days later
Combined success
Both criteria pass; the pipeline ran and produced trustworthy data
Nobody notices because nothing is wrong
The gap between these two criteria has a name: silent data corruption. Silent because nothing in the pipeline raises an alarm. Corruption because the data is wrong. The dashboard updates, the model retrains, the report ships, and the world moves on while the data underneath says something untrue.
Concrete Failure Modes That Look Green
What Happened Upstream
What the Pipeline Did
What the Output Looked Like
A Kafka partition dropped events for six hours
Read every event that was actually present, succeeded normally
A row count 30 percent below normal that nobody noticed
A producer started sending NULL for a column that used to be required
Loaded the column as NULL; no schema check existed
Downstream joins silently lost rows; revenue dashboard dropped 12 percent
A dedupe step changed its key by accident
Ran successfully against the new key; output had duplicate orders
Order count doubled; finance reconciled three days later
Source database failover lost the last 15 minutes of writes
Pulled what was there, no awareness of the gap
Customer activity dashboard missed a real spike during a marketing push
The two questions every pipeline must answer separately:
▸Did the run finish without raising an exception
▸Does the output table describe a world that actually existed
▸Without the second question, the answer to the first is misleading
▸Quality checks are how the second question gets asked in code
Where Trust Comes From
Consumer trust in a data product comes from two things: the data has been correct in the past, and the producer can prove that the current run is also correct. The proof is the quality check. Without a quality check, the only proof is time, and time is the most expensive proof: a quarter of correct dashboards followed by one wrong number is not enough to keep trust intact. Once a finance team finds a wrong number, every prior number gets re-inspected. The retrospective inspection consumes more engineering hours than the original quality checks would have. The pattern repeats across teams. A single bad number on a CFO's desk often kicks off a multi-week audit of every related number, an audit that produces nothing wrong but consumes the data engineering team for the duration. The cheapest way to prevent that audit is to have evidence on hand when the question is asked.
•Pipeline With No Quality Checks
Operational success means 'green'; semantic failure goes unnoticed
Consumers discover problems by spotting strange numbers in dashboards
Trust is fragile; one wrong number triggers retrospective audits
Failure root cause takes hours to find because nothing was checked
✓Pipeline With Basic Quality Checks
Green requires both operational success and quality checks passing
Failures route to alerts within the same run as the failure
Trust accumulates because correctness has been continuously proved
Failure root cause is narrowed by which check fired
What a Quality Check Actually Is
A quality check is a small piece of code that asserts something about a dataset and fails noisily if the assertion does not hold. The check runs as part of the pipeline, in the same DAG as the work it is checking. Its only job is to compare what was produced against what is plausible. A quality check that runs in a separate dashboard that nobody watches is not a quality check; it is a chart. The defining property is that the check has the authority to stop the pipeline or to fire an alert that a human will see. The same code that produces the check also defines the response. A check whose response is unspecified ends up as background noise; a check whose response is concrete (page on-call, halt the publish, file a ticket) is a real protective control. Authoring the response in the same PR as the check is the discipline that turns instrumentation into protection.
Quality checks are not optional infrastructure for a serious pipeline. Pipelines without them are demos that happened to make it to production.
Operational success and semantic success are independent; a green run can produce wrong data.
Silent data corruption is the failure mode that costs the most because nobody notices it in time.
A quality check is code that asserts a property of the data and has authority to halt or alert.
TIP
When reviewing a new pipeline, ask the author what would happen if the source silently dropped half its data tomorrow. If the answer involves a human noticing, the pipeline needs quality checks before it ships.
Four Cheap Quality Checks
Daily Life
Interviews
Apply the four cheap quality checks (row count, null rate, uniqueness, freshness) to a production table.
Quality engineering has a 90/10 rule. Roughly ninety percent of silent failures are caught by ten percent of the possible checks. The four cheap checks below cover that ninety percent. They run in seconds, they need only basic SQL, and they catch the most common production incidents. The point of starting with these four is that any of them is better than none, and arguments about more sophisticated checks are arguments about edge cases until the basics are in place. The four checks are also the easiest to explain to a non-technical stakeholder. A finance partner can understand 'the row count was outside its band'; the same partner cannot reasonably be asked to interpret a Kolmogorov-Smirnov statistic. Explainability is not a courtesy; it is what allows the alert to translate into a corrective action without a phone call.
Row countNull rateUniquenessFreshness
Row count
Did roughly the right amount of data arrive
An order table that produces 100,000 rows on a normal day and 12,000 rows today is almost certainly broken upstream. A simple range check catches dropped partitions, failed joins, and broken filters.
Null rate
Did required columns stay populated
A column that is normally 99.9 percent populated and is suddenly 80 percent NULL signals a producer change. The check is a single percentage threshold per column.
Uniqueness
Did the primary key stay unique
Duplicates in a key column mean a deduplication step failed or a join exploded. The check is a count of distinct keys versus total rows.
Freshness
Is the most recent data recent enough
An events table whose latest timestamp is from six hours ago when it should be within fifteen minutes signals an ingestion problem. The check is a single MAX timestamp comparison.
What Each Check Catches
Check
Common Cause When It Fails
What Would Have Happened Without It
Row count below expected
Source dropped events; filter became too restrictive; join lost rows
Dashboard shows artificially low numbers; decisions made on partial data
Row count above expected
Deduplication failed; join exploded into a Cartesian product; backfill ran twice
Aggregates double-count; revenue numbers inflated
Null rate spike
Producer changed a required field to optional; upstream extract bug
Joins on the column drop rows silently; downstream metrics shift
Ingestion job stalled; source system outage; partition not closed
Consumers see stale data and make decisions assuming it is current
Row Count: The Cheapest Check That Earns Its Keep
A row count check uses a recent baseline to decide what is reasonable. The simplest version compares the current run against the average of the last seven days, with a tolerance band. A run that produces fewer than fifty percent of the recent average or more than two hundred percent fails the check. The exact bounds are tuned later; the point is that the check exists and runs. A common refinement is to compare against a same-day-of-week baseline rather than a flat seven-day average, because most consumer-facing data has a weekly seasonality pattern (weekends are quieter, Tuesdays are busier, Friday afternoons drop off). The day-of-week baseline absorbs that pattern automatically and tightens the band on the days where the band can be tight.
1
/* Row count check: today's value within 50% to 200% of the trailing 7-day average */
2
WITHtodayAS(
3
SELECT
4
COUNT(*)ASrow_count
5
FROMmart.daily_orders
6
WHEREorder_date=CURRENT_DATE
7
),
8
baselineAS(
9
SELECT
10
AVG(row_count)ASexpected
11
FROMmart.daily_orders_history
12
WHEREorder_dateBETWEENCURRENT_DATE-8
13
ANDCURRENT_DATE-1
14
)
15
16
SELECT
17
today.row_count,
18
baseline.expected,
19
today.row_countBETWEEN0.5*baseline.expected
20
AND2*baseline.expectedASwithin_range
21
FROMtoday
22
CROSSJOINbaseline
Null Rate: One Threshold Per Column
Null rate is calculated as the number of NULL values divided by the total row count, expressed as a percentage. A column whose null rate has historically been below one percent and is now eight percent is broken even if no individual row is technically invalid. The threshold is per column because different columns have different baseline null rates. A required field has a different threshold than an optional one. The check encodes which columns the consumer cares about and what the producer has historically delivered. The most important columns to check for null rate are the ones downstream consumers depend on for filters, joins, and aggregations. A NULL in customer_id breaks every join to dim_customer. A NULL in event_timestamp breaks every time-windowed query. A NULL in amount breaks every revenue aggregation. The consumer's query patterns name which columns matter; the null rate check protects those columns specifically.
1
SELECT
2
COUNT(*)AStotal_rows,
3
SUM(
4
CASE
5
WHENcustomer_idISNULLTHEN1
6
ELSE0
7
END
8
)ASnull_customer_id,
9
ROUND(
10
100*SUM(
11
CASE
12
WHENcustomer_idISNULLTHEN1
13
ELSE0
14
END
15
)/COUNT(*),
16
2
17
)ASnull_rate_pct
18
FROMmart.daily_orders
Uniqueness: The Primary Key Has To Be Primary
A primary key is meant to be unique. If two rows share the same key, downstream joins, MERGEs, and aggregates are all wrong. The check compares COUNT(*) to COUNT(DISTINCT key). When the two diverge, duplicates exist. The most common cause of duplicate keys is a deduplication step that ran but did not catch every duplicate, or a join that produced more rows than expected because of an unexpected one-to-many relationship. A subtle variation of the same problem is a primary key that used to be a single column and is now a composite. A producer that started emitting events with a new partition dimension can keep the same event_id while distributing copies of an event across partitions, and the resulting duplicates are real and structurally invisible if the uniqueness check still tests only the original key. The fix is to update the uniqueness check whenever the producer changes the key composition.
1
SELECT
2
COUNT(*)AStotal_rows,
3
COUNT(DISTINCTorder_id)ASunique_keys,
4
COUNT(*)-COUNT(DISTINCT
5
order_id
6
)ASduplicate_count
7
FROMmart.daily_orders
Freshness: The Last Timestamp
Freshness asks whether the most recent record is recent enough to match the consumer's expectation. The check is one expression: MAX of an event timestamp column compared to NOW. The threshold is the freshness SLA. A table that promises one-hour freshness fails the check when MAX(event_timestamp) is more than one hour behind the wall clock. Freshness checks catch ingestion stalls earlier than row count checks because a stalled ingestion produces zero new rows, which can look identical to a quiet hour without freshness as a separate signal. The freshness check also catches the case where the pipeline finishes successfully but processed no new data because the source's clock is wrong, which is otherwise invisible to row count and null rate checks.
✓Do
Add the four checks at the same time as the pipeline, not after the first incident
Keep thresholds simple and tunable; do not over-engineer at the start
Compute checks from production data, not from synthetic samples
✗Don't
Skip the cheap checks because more sophisticated ones are planned
Hardcode magic numbers without explaining the baseline they came from
Build dashboards of quality metrics that nobody is paid to watch
TIP
Implement all four checks in a single SQL view per table so that a single SELECT against the view summarizes the table's quality state. Reuse beats reinvention.
Quality Checks at Boundaries
Daily Life
Interviews
Place quality gates at every layer boundary so that failures stay scoped to the layer where they originated.
A common mistake in pipeline design is to place all quality checks at the end. The reasoning is that final checks protect the consumer-facing table, which is the part the world sees. The reasoning is incomplete. By the time a problem shows up at the end, several intermediate transforms have already run on bad data. The diagnostic cost climbs because the failure has to be traced back through every transform between the source and the gate. Checks at every layer boundary keep the failure scoped to the layer where it originated. The same principle applies to software testing: unit tests catch bugs at the function boundary, integration tests catch bugs at the module boundary, end-to-end tests catch bugs at the system boundary. Skipping unit tests because end-to-end tests exist makes the end-to-end failures harder to diagnose. Skipping layer-boundary quality gates because a serving-layer gate exists makes serving-layer failures harder to diagnose for the same reason.
Three gates, one at each boundary. Gate 1 catches ingestion failures. Gate 2 catches transform failures. Gate 3 catches serving failures. Each gate scopes the diagnostic effort to one layer.
Why End-Only Checks Fail
Where the Check Sits
What It Catches
What It Fails To Catch
Only at the end (serving layer)
Visible breakage of consumer tables
Bad raw data that was already aggregated; root cause is buried
At every layer boundary
Each kind of failure at the layer where it originated
Nothing in scope; the gates fire on the layer they protect
Only at the source
Extraction problems; nothing about transform correctness
Logic bugs in transforms; key changes in joins
What Each Gate Asserts
Gate
Layer Boundary
Typical Assertions
Gate 1: Ingestion
Source -> Raw
Row count from source; primary key uniqueness; freshness of latest record
Gate 2: Transform
Raw -> Curated
Curated row count proportional to raw; no orphan foreign keys; required columns populated
Gate 3: Serving
Curated -> Serving
Aggregates within historical band; primary key still unique after grouping; SLA-relevant timestamps fresh
Concrete Example: The Same Failure Scoped Three Ways
An upstream order producer began emitting NULL for the customer_id column on Monday morning. With gates at every boundary, the failure surfaced as follows. Gate 1 caught a null rate spike on customer_id in the raw layer immediately after extraction. The pipeline halted at the raw layer. The curated and serving transforms never ran. The on-call engineer was paged, looked at the gate output, and within minutes had a one-line description of the problem: customer_id null rate jumped from 0.1 percent to 8.4 percent. With end-only checks, the same failure would have produced a 12 percent revenue drop on the dashboard several hours later, and the diagnosis would have started by reading the dashboard, then the curated table, then the raw table, then the source. The contrast is the entire reason for layered gates. Boundary gates produce a single, scoped, actionable alert. End-only gates produce an unscoped symptom that has to be unpacked. The difference between two minutes of diagnosis and two hours of diagnosis is exactly the difference between catching the failure where it happens and catching the failure where it shows up.
1
# Pipeline pseudocode with gates at every layer boundary
Reasons gates belong at every boundary, not only at the end:
▸Each gate fires on the layer where the failure originated
▸Diagnostic cost is proportional to where the failure was caught
▸Downstream transforms do not run on broken data, saving compute and confusion
▸The failure root cause is named by the gate that fired
What "At The Boundary" Means
A gate at a boundary runs after the layer's writes have completed and before the next layer reads from it. The gate is a scheduled task in the orchestrator that depends on the layer's transforms. The gate's success or failure is what gates downstream work, not the orchestrator's view of whether the transform exited zero. If the transform succeeded operationally but the gate fails, the next layer does not start. The gate has the authority to stop the DAG.
•Quality Check As A Dashboard
Runs after the pipeline; humans look at it later
Failure does not stop downstream work
Becomes background noise; staleness is normal
No on-call response when checks fail
✓Quality Check As A Gate
Runs in the DAG; downstream depends on it
Failure halts the next layer immediately
Failures are events; engineers respond when they fire
On-call response is part of the runbook
A quality check that does not have authority to stop the pipeline is information, not a control. Information is useful; controls are protective.
Place a quality gate at every layer boundary; failures stay scoped to where they originated.
End-only checks bury root causes under successful intermediate transforms.
A gate is a DAG task that downstream depends on; it has authority to halt the pipeline.
Warn vs Block Authorities
Daily Life
Interviews
Choose between warning and blocking for each quality check based on the consequence of running with bad data.
Not every quality check should stop the pipeline. Some failures are catastrophic and demand a halt; others are advisory and demand a notification. Treating every check as a blocker creates an over-protective pipeline that halts on minor anomalies and wakes engineers up at 3am for problems that could have waited. Treating every check as a warning creates a pipeline that ignores its own alarms. The classification is per-check, not per-pipeline, and the rule is simple: block when running is worse than not running, warn when running is better than not running but still imperfect. The classification has to be made at the time the check is authored, not later. A check that ships as 'temporarily a warning until the team gets used to it' tends to stay a warning forever, because nothing forces the conversation about promotion. The default authority should be the long-term authority.
The Decision Rule
Failure Severity
Authority
What Happens
Output would mislead consumers
Block
Pipeline halts; downstream tables stay on the previous run; on-call is paged
Output is degraded but usable
Warn
Pipeline completes; consumers receive a freshness or quality annotation; ticket is filed
Output is suspicious but plausible
Warn
Pipeline completes; humans review during business hours; no page
Examples By Authority
BlockWarn
Block
Halt the pipeline; previous data stays
Reserved for failures where running is worse than not running: uniqueness violations, schema breaks, large row count drops. Pages on-call. Targets one or two firings per pipeline per quarter.
Warn
Continue with notification
For failures where the data is degraded but usable. Posts to a Slack channel reviewed during business hours. Does not interrupt on-call. Tolerates more noise; humans extract signal during review.
•Block (Halt the Pipeline)
Primary key uniqueness violated; downstream joins are now wrong
Row count 80 to 90 percent of recent average; plausibly a quiet day
Optional column null rate slightly above norm
Distribution shift on a non-load-bearing dimension
Freshness slightly behind SLA but within tolerance
What "Block" Actually Means In Practice
A blocking check is implemented as a DAG task that fails when the assertion fails. The orchestrator marks the task red and refuses to start downstream tasks that depend on it. The previous run's data stays in place; consumers see the last good output. The on-call rotation receives an alert that includes the check that fired and the assertion it failed. Blocking is the strongest available authority and reserves itself for cases where running is worse than not running. The implementation pattern matters because it determines what consumers see during a failure. If the gate halts the publish, consumers see yesterday's data, which is correct (just stale) and clearly labeled as such. If the gate is implemented downstream of the publish, consumers see today's bad data with a warning attached, which is worse: the data is wrong, and consumers who do not read the warning act on it. Halting the publish is almost always the safer default.
f'SELECT COUNT(*) - COUNT(DISTINCT {key_column}) AS d FROM {table}'
5
)
6
ifduplicates>0:
7
raiseQualityCheckFailed(
8
f'{table}.{key_column} has {duplicates} duplicate rows; downstream halted'
9
)
What "Warn" Actually Means In Practice
A warning check writes the failure to a quality log table, files a ticket, posts to a Slack channel that humans actually read, and lets the pipeline continue. Consumers can see that today's run had a warning, which informs their interpretation of the data. The warning has to be loud enough to be noticed and quiet enough not to drown out blocking alerts. A warning that fires on every run becomes background noise; a warning that fires once a quarter gets investigated. Designating a separate Slack channel for warnings keeps the blocking-channel signal-to-noise ratio honest. Every page in the blocking channel should be actionable; warnings live elsewhere and are reviewed in batches during business hours. Mixing the two erodes the value of both: actionable pages get lost in advisory noise, and advisory signals get overlooked because nobody wants to disturb the blocking-channel signal-to-noise ratio.
1
# A warning check logs and notifies, but continues
post_to_slack('#data-quality',f'{table} low: {actual} vs {expected_low}')
14
returnactual
The Cost of Getting It Wrong
A pipeline that blocks too often creates a real cost: every block is an on-call interruption, and every false block trains the on-call rotation to acknowledge alerts without reading them. A pipeline that warns too often creates a different cost: warnings get ignored, and a real warning lands in a sea of noise. The discipline is to keep blocks rare and meaningful, and to keep warnings informative rather than constant. Tuning thresholds against historical data is the way to keep both rates honest.
Block when at least one of these is true:
▸Downstream consumers will make wrong decisions if the data is published
▸The failure cannot be fixed by retry and requires human intervention
▸The output will need to be reverted if it is allowed to publish
▸The failure indicates a contract violation between producer and consumer
Warn when at least one of these is true:
▸The data is degraded but still usable for the most common consumer use cases
▸The anomaly is plausible (a holiday, a marketing event, a known upstream change)
▸The failure is informative for trend analysis but not actionable in the moment
▸Halting would cause more disruption than the bad data itself
TIP
Keep the block rate to a target of one or two blocking failures per pipeline per quarter. If a particular check blocks more often than that, either the threshold needs tuning or the check should be downgraded to a warning.
Block when running is worse than not running; warn when running is imperfect but better than halting.
Each check carries its own authority; the choice is per-check, not per-pipeline.
First Quality Gate: Row Count
Daily Life
Interviews
Build a complete first quality gate as a SQL assertion wired into a DAG, halting downstream when the assertion fails.
Concepts become useful when applied. The exercise here builds a complete first quality gate: a SQL assertion that the row count for a daily order summary table falls within an expected range. The gate is implemented as a SQL query, the query is run by the orchestrator after the transform finishes, and the gate halts the DAG when the assertion fails. The result is a working quality gate in fewer than thirty lines of code. The exercise is deliberately small. Small gates ship; large gates linger in design review. The discipline of starting with the simplest possible assertion is the same discipline that pays off in software engineering: a working primitive is the foundation that everything more sophisticated builds on top of. Once the first gate is in production and the team has confidence in it, additional gates copy the pattern with different SQL. The cost of the second gate is a fraction of the cost of the first because the orchestration wiring, the alert routing, and the runbook structure already exist.
Step 1: Pick the Table and the Assertion
The table is mart.daily_orders, produced once per day at 2am. The expected row count varies by day of week (weekends are quieter than weekdays) but stays within a stable historical band. The assertion is that today's row count must fall between fifty percent and two hundred percent of the average row count for the same day of week over the trailing eight weeks. The bounds are wide enough to absorb genuine variation and tight enough to catch real failures.
The query produces one row with a gate_result of 'pass' or 'fail'. The orchestrator runs the query and reads the value. A 'pass' value lets downstream tasks proceed. A 'fail' value raises an exception that halts the DAG. The same shape works for any row-count assertion against any table.
When the gate fails, enforce_gate raises a ValueError. Airflow marks the task red. Downstream publish_to_consumers does not start. Consumers continue to read yesterday's mart, which is the most recent good data. An on-call alert fires. The runbook directs the on-call engineer to inspect the gate's output: today's row count, the expected count, the ratio, and the day of week being compared. The engineer either fixes the upstream problem and re-runs, or determines that the data is correct and the threshold needs updating, in which case the threshold change is itself a code change with a PR review. The PR review for a threshold change is more important than it might appear. Loosening a threshold without review is how quality programs silently degrade over years; every loosening looks reasonable in the moment, and the cumulative effect is that the gates no longer catch real failures. Treating threshold changes as code keeps the loosening visible and historical.
Step 5: Iterate
Iteration
Change
Reason
1
Add the row count gate as written
Establishes the floor of quality protection
2
Add a null rate gate on customer_id, order_amount
Catches producer changes that load-bearing columns started arriving NULL
3
Add a uniqueness gate on order_id
Catches dedup failures and join explosions
4
Add a freshness gate on event_timestamp
Catches ingestion stalls before they become row count problems
5
Tune thresholds against three months of historical data
Reduces false positives; raises confidence in real failures
Five tasks, four arrows, one quality gate. The dashboard either shows fresh data because the gate passed, or the previous good data because the gate halted the publish. Either way, consumers are protected from silent corruption. The same shape extends to as many gates as the table needs.
✓Do
Express assertions in SQL where possible; the warehouse is already running the query
Keep the gate's output a single row with a clear pass or fail value
Wire the gate as a DAG task that downstream depends on
✗Don't
Compute the assertion in Python when SQL would do; doubles the surface area
Skip the historical baseline; magic numbers age badly
Let the gate share a task with the transform; failures must be distinguishable
TIP
When the first gate is in place, every later quality investment is incremental. The hard part is the first gate; the second through tenth are copies of the pattern with different SQL.
❯❯❯PUTTING IT ALL TOGETHER
> A startup data team has just shipped its first end-to-end pipeline: Postgres orders extracted to S3, transformed in Snowflake into mart.daily_orders, and read by a Looker dashboard for the leadership team. The CEO begins making weekly decisions from the dashboard. The data engineer is asked to make sure the dashboard is right and to make sure the team will be notified when it is not.
Operational success of the pipeline (the run finished green) does not imply semantic success (the data describes the world correctly). The first move is to add the four cheap checks: row count, null rate, uniqueness, and freshness.
Place the gates at every layer boundary: ingestion (raw), transform (curated), and serving (mart). Each gate scopes failures to where they originated, drawing on the layered shape introduced in Lesson 1.
Classify each check as block or warn. Block when running is worse than not running (uniqueness violated, null rate above threshold). Warn when running is imperfect but better than halting.
Build the first concrete gate as a SQL assertion against a historical baseline, wired into the DAG so the publish step depends on the gate. Iterate from there: more gates, tuned thresholds, eventually a quality suite the consumer can rely on.
KEY TAKEAWAYS
Operational success is not semantic success: a green pipeline can produce wrong data, and the gap between the two is where silent corruption lives.
Four cheap checks catch most failures: row count, null rate, uniqueness, freshness. Implement these first, before more sophisticated checks.
Quality gates belong at every layer boundary: checks at the end bury root causes; checks at every boundary scope failure to the layer where it originated.
Block when running is worse than not running: warn when running is imperfect but better than halting. Keep blocking failures rare and meaningful.
A first gate is a SQL assertion in a DAG task: compare against a historical baseline, return a single pass-or-fail value, halt downstream when it fails.
A pipeline that ran is not the same as a pipeline that produced correct data
Category
Pipeline Architecture
Difficulty
beginner
Duration
25 minutes
Challenges
0 hands-on challenges
Topics covered: Pipeline Ran vs Data Is Good, Four Cheap Quality Checks, Quality Checks at Boundaries, Warn vs Block Authorities, First Quality Gate: Row Count
Pipelines have two distinct success criteria. One criterion is operational: did the code execute, did the writes commit, did the orchestrator mark the run green. The other criterion is semantic: does the data the pipeline produced actually describe the world correctly. Operational success is necessary but not sufficient for semantic success. The most expensive production incidents in mature data organizations are the ones where operational success and semantic failure coexist, because nobody is
Quality engineering has a 90/10 rule. Roughly ninety percent of silent failures are caught by ten percent of the possible checks. The four cheap checks below cover that ninety percent. They run in seconds, they need only basic SQL, and they catch the most common production incidents. The point of starting with these four is that any of them is better than none, and arguments about more sophisticated checks are arguments about edge cases until the basics are in place. The four checks are also the
A common mistake in pipeline design is to place all quality checks at the end. The reasoning is that final checks protect the consumer-facing table, which is the part the world sees. The reasoning is incomplete. By the time a problem shows up at the end, several intermediate transforms have already run on bad data. The diagnostic cost climbs because the failure has to be traced back through every transform between the source and the gate. Checks at every layer boundary keep the failure scoped to
Not every quality check should stop the pipeline. Some failures are catastrophic and demand a halt; others are advisory and demand a notification. Treating every check as a blocker creates an over-protective pipeline that halts on minor anomalies and wakes engineers up at 3am for problems that could have waited. Treating every check as a warning creates a pipeline that ignores its own alarms. The classification is per-check, not per-pipeline, and the rule is simple: block when running is worse t
Concepts become useful when applied. The exercise here builds a complete first quality gate: a SQL assertion that the row count for a daily order summary table falls within an expected range. The gate is implemented as a SQL query, the query is run by the orchestrator after the transform finishes, and the gate halts the DAG when the assertion fails. The result is a working quality gate in fewer than thirty lines of code. The exercise is deliberately small. Small gates ship; large gates linger in