
Data Quality and Contracts: Beginner

A pipeline that ran is not the same as a pipeline that produced correct data


Category
Pipeline Architecture
Difficulty
beginner
Duration
25 minutes
Challenges
0 hands-on challenges

Topics covered: Pipeline Ran vs Data Is Good, Four Cheap Quality Checks, Quality Checks at Boundaries, Warn vs Block Authorities, First Quality Gate: Row Count

Lesson Sections

  1. Pipeline Ran vs Data Is Good (concepts: paDataQuality, paSilentFailure)

    Pipelines have two distinct success criteria. One criterion is operational: did the code execute, did the writes commit, did the orchestrator mark the run green. The other criterion is semantic: does the data the pipeline produced actually describe the world correctly. Operational success is necessary but not sufficient for semantic success. The most expensive production incidents in mature data organizations are the ones where operational success and semantic failure coexist, because nobody is looking for a problem while every run shows green.
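    The gap between the two criteria can be made concrete with a minimal sketch. All table names and row counts below are hypothetical; the point is only that the write commits (operational success) while the output is implausible (semantic failure).

```python
import sqlite3

# Hypothetical tables illustrating "the pipeline ran" vs "the data is good".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount REAL)")
# Suppose the upstream export silently shipped an empty file today.
conn.execute("CREATE TABLE daily_orders AS SELECT * FROM raw_orders")

# Operational success: the statement committed, the orchestrator sees green.
operational_ok = True

# Semantic success: does the output plausibly describe the world?
rows = conn.execute("SELECT COUNT(*) FROM daily_orders").fetchone()[0]
semantic_ok = rows > 0  # yesterday had thousands of orders; zero is not plausible

print(operational_ok, semantic_ok)  # True False: the expensive kind of incident
```

    Nothing in the first half of the script fails, which is exactly why this class of incident goes unnoticed without an explicit semantic check.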

  2. Four Cheap Quality Checks (concepts: paFourQualityChecks, paFreshnessCheck)

    Quality engineering has a 90/10 rule. Roughly ninety percent of silent failures are caught by ten percent of the possible checks. The four cheap checks below cover that ninety percent. They run in seconds, they need only basic SQL, and they catch the most common production incidents. The point of starting with these four is that any of them is better than none, and arguments about more sophisticated checks are arguments about edge cases until the basics are in place. The four checks are row count, null rate, freshness, and uniqueness.
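    A sketch of one version of each check, using row count, null rate, freshness, and uniqueness as the set. The table and column names (orders, order_id, customer_id, loaded_at) and the thresholds are hypothetical; each check is a single SQL query returning a boolean.

```python
import datetime
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, loaded_at TEXT)"
)
today = datetime.date.today().isoformat()
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, today), (2, 11, today), (3, 12, today)],
)

checks = {
    # 1. Row count: did roughly the expected volume arrive?
    "row_count": "SELECT COUNT(*) BETWEEN 1 AND 100000 FROM orders",
    # 2. Null rate: are required columns populated?
    "null_rate": "SELECT COUNT(*) = 0 FROM orders WHERE customer_id IS NULL",
    # 3. Freshness: does the newest row look recent? (ISO dates compare as strings)
    "freshness": f"SELECT MAX(loaded_at) >= '{today}' FROM orders",
    # 4. Uniqueness: is the primary key actually unique?
    "uniqueness": "SELECT COUNT(*) = COUNT(DISTINCT order_id) FROM orders",
}
results = {name: bool(conn.execute(sql).fetchone()[0]) for name, sql in checks.items()}
print(results)  # all True on this healthy sample
```

    Each query is cheap enough to run after every load, which is what makes this set a practical default rather than an aspiration.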

  3. Quality Checks at Boundaries (concepts: paQualityGate, paLayerBoundaryChecks)

    A common mistake in pipeline design is to place all quality checks at the end. The reasoning is that final checks protect the consumer-facing table, which is the part the world sees. The reasoning is incomplete. By the time a problem shows up at the end, several intermediate transforms have already run on bad data. The diagnostic cost climbs because the failure has to be traced back through every transform between the source and the gate. Checks at every layer boundary keep the failure scoped to the layer where it originated, so diagnosis starts at the transform that introduced the problem.
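    A sketch of the same gate invoked at each boundary rather than only at the end. The layer names (raw, staging) and the row-count bounds are hypothetical.

```python
import sqlite3


def check_row_count(conn, table, low, high):
    """Fail fast if a table's row count falls outside the expected range."""
    n = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    if not (low <= n <= high):
        # Failing here pins the problem to the transform that just ran,
        # instead of letting downstream transforms run on bad data.
        raise ValueError(f"{table}: row count {n} outside [{low}, {high}]")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (id INTEGER)")
conn.executemany("INSERT INTO raw_events VALUES (?)", [(i,) for i in range(500)])
check_row_count(conn, "raw_events", 100, 10000)  # boundary 1: raw -> staging

conn.execute("CREATE TABLE stg_events AS SELECT * FROM raw_events WHERE id >= 0")
check_row_count(conn, "stg_events", 100, 10000)  # boundary 2: staging -> mart
print("all boundaries passed")
```

    If the staging transform had dropped most of the rows, the second call would raise before any mart-layer transform touched the bad data.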

  4. Warn vs Block Authorities (concepts: paWarnVsBlock, paAlertSeverity)

    Not every quality check should stop the pipeline. Some failures are catastrophic and demand a halt; others are advisory and demand a notification. Treating every check as a blocker creates an over-protective pipeline that halts on minor anomalies and wakes engineers up at 3am for problems that could have waited. Treating every check as a warning creates a pipeline that ignores its own alarms. The classification is per-check, not per-pipeline, and the rule is simple: block when running is worse than not running, and warn otherwise.
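    A sketch of per-check severity, assuming two levels: "block" halts the run, "warn" notifies and lets it continue. The check names are hypothetical.

```python
import logging

logging.basicConfig(level=logging.WARNING)


def run_check(name, passed, severity):
    """Apply a per-check policy: halt on blocking failures, log on advisory ones."""
    if passed:
        return
    if severity == "block":
        # Running on this data is worse than not running: halt the DAG.
        raise RuntimeError(f"blocking check failed: {name}")
    # Advisory: the data is usable but suspect; someone should look, later.
    logging.warning("check failed (warn only): %s", name)


run_check("primary_key_unique", passed=True, severity="block")
run_check("null_rate_below_2pct", passed=False, severity="warn")  # logs, continues
print("pipeline continues")
```

    The severity lives with the check definition, not with the pipeline, which is what lets one pipeline carry both catastrophic and advisory checks without forcing a single policy on all of them.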

  5. First Quality Gate: Row Count (concepts: paFirstQualityGate, paSqlAssertion)

    Concepts become useful when applied. The exercise here builds a complete first quality gate: a SQL assertion that the row count for a daily order summary table falls within an expected range. The gate is implemented as a SQL query, the query is run by the orchestrator after the transform finishes, and the gate halts the DAG when the assertion fails. The result is a working quality gate in fewer than thirty lines of code. The exercise is deliberately small. Small gates ship; large gates linger in review.
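    One way such a gate can look as a standalone script. The table name, the stand-in row, and the fixed range are hypothetical; in practice the expected range would come from historical row counts, and the orchestrator would run this as the task after the transform.

```python
import sqlite3
import sys

# Hypothetical expected range; in practice derived from historical counts.
EXPECTED_MIN, EXPECTED_MAX = 1, 100_000


def row_count_gate(db_path):
    conn = sqlite3.connect(db_path)
    # Stand-in for the table the transform step would have produced.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_order_summary (order_day TEXT, orders INTEGER)"
    )
    conn.execute("INSERT INTO daily_order_summary VALUES ('2024-01-01', 42)")
    n = conn.execute("SELECT COUNT(*) FROM daily_order_summary").fetchone()[0]
    if not (EXPECTED_MIN <= n <= EXPECTED_MAX):
        # A non-zero exit makes the orchestrator mark this task failed,
        # which halts the downstream DAG.
        sys.exit(f"gate failed: {n} rows, expected [{EXPECTED_MIN}, {EXPECTED_MAX}]")
    print(f"gate passed: {n} rows")
    return n


row_count_gate(":memory:")
```

    The whole gate fits comfortably under thirty lines, which is the point: it is small enough to ship today and extend tomorrow.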