Data Quality for Data Engineers
A 3am page at a payments company once traced back to one upstream CSV where the vendor flipped the sign on refund amounts. Net revenue on the exec dashboard flipped positive. Nobody caught it for two days. The fix was seven characters in a CAST expression. The cleanup was six weeks of reconciling billing statements with finance. That is data quality. It is not a dashboard widget, it is the thing that keeps your name off the post-mortem.
Monitoring and Alerting
Tests catch known failure modes. Monitoring catches unknown ones. A good data quality setup combines explicit tests with statistical monitoring.
Freshness monitoring: Track the timestamp of the most recent row in each table. Alert when freshness exceeds the expected SLA. Tools like Monte Carlo, Bigeye, and Elementary do this automatically.
Volume monitoring: Track row counts over time. A sudden drop (from 1M daily rows to 100K) indicates a source problem. A sudden spike might indicate duplicates. Alert on deviations beyond 2 standard deviations from the rolling average.
Distribution monitoring: Track column value distributions (nulls, distinct counts, percentile values). A column that was 2% null yesterday and is 40% null today has a problem, even if no explicit test covers it.
Schema change detection: Monitor source schemas for added, removed, or altered columns. A schema change in a source API or database can break downstream pipelines silently. Detect it at ingestion and alert before the pipeline runs.
Know Data Quality the way the interviewer who asks it knows it.
Building Data Quality Into Pipelines
Every pipeline you inherit has a graveyard of quality bolt-ons added after something broke at 2am. Copy these patterns in from day one.
Idempotent pipelines: Make every pipeline step produce the same output regardless of how many times it runs. This prevents duplicates from retries and backfills. Use MERGE / upsert semantics instead of blind INSERT.
Schema enforcement at ingestion: Validate incoming data against a schema before writing it to the warehouse. Reject or quarantine records that do not conform. Schema registries (for Kafka/Avro) and contract tests (for APIs) catch validity issues at the boundary.
Data contracts: Define explicit agreements between data producers and consumers. A contract specifies schema, freshness SLA, allowed values, and quality thresholds. When a producer breaks the contract, the pipeline stops rather than propagating bad data.
Quarantine pattern: Route records that fail quality checks to a separate quarantine table instead of dropping them. This lets you investigate failures without losing data and without polluting production tables.
Circuit breaker: If quality checks fail beyond a threshold (e.g., more than 5% of rows fail validation), halt the pipeline instead of loading bad data. This prevents cascade failures where bad data flows into multiple downstream tables.
Two Hundred Million Redirects
Billions of clicks. One tiny code. Two very different clocks.
Pulled from debriefs where system design separated levels.
The Six Dimensions of Data Quality
These six dimensions form the standard framework for thinking about data quality. Each one is a different axis that can fail independently. A dataset can be complete but inaccurate. It can be timely but inconsistent.
1. Accuracy
Does the data correctly represent the real-world entity or event it describes? Accuracy failures are the hardest to detect because the data looks structurally valid. Cross-reference against a trusted source. Compare aggregates between your pipeline output and the source system. Flag records where values fall outside expected ranges.
2. Completeness
Are all expected records present? Are all required fields populated? Compare source row counts to target row counts after each pipeline run. Track null rates per column over time and alert on spikes. For partitioned tables, verify every expected partition exists.
3. Consistency
Do related datasets agree with each other? Consistency problems erode trust faster than any other dimension because users see two different numbers for the same thing. Run reconciliation queries between related tables. Track referential integrity: every foreign key should resolve to a valid primary key.
4. Timeliness
Is the data available when consumers need it? Timeliness is defined by the consumer, not the producer. Monitor table freshness: when was the last row inserted or updated? Track pipeline completion times and alert when SLA thresholds are breached.
5. Uniqueness
Are there unwanted duplicates? Duplicate records inflate metrics, break joins, and cause downstream consumers to double-count. Assert primary key uniqueness after every load. Build idempotent pipelines that produce the same result regardless of how many times they run.
6. Validity
Does the data conform to defined formats, types, and business rules? An email column that contains phone numbers is a validity failure. Validate column types and formats at ingestion. Check enum columns against allowed values. Use schema registries for event streams to enforce validity at the producer.
Great Expectations: Declare and Run Checks
# Great Expectations: define and run checks
import great_expectations as gx
context = gx.get_context()
validator = context.get_validator(
batch_request=batch_request,
expectation_suite_name="orders_suite"
)
# Completeness check
validator.expect_column_values_to_not_be_null("order_id")
# Validity check
validator.expect_column_values_to_be_in_set(
"status", ["pending", "shipped", "delivered", "cancelled"]
)
# Uniqueness check
validator.expect_column_values_to_be_unique("order_id")
# Accuracy check (range)
validator.expect_column_values_to_be_between(
"amount", min_value=0, max_value=100000
)
results = validator.validate()Great Expectations lets you define expectations as code, version them alongside your pipeline, and run them as part of CI/CD.
dbt Tests: Built-in and Custom
# schema.yml
models:
- name: orders
columns:
- name: order_id
tests:
- unique
- not_null
- name: status
tests:
- accepted_values:
values: ['pending', 'shipped', 'delivered', 'cancelled']
- name: customer_id
tests:
- relationships:
to: ref('customers')
field: iddbt tests run after models build, catching quality issues before downstream models consume the data.
Custom SQL Reconciliation Check
-- Reconciliation: source vs target row count
WITH source_count AS (
SELECT COUNT(*) AS cnt FROM raw.orders
WHERE loaded_at >= CURRENT_DATE
),
target_count AS (
SELECT COUNT(*) AS cnt FROM warehouse.orders
WHERE order_date >= CURRENT_DATE
)
SELECT
s.cnt AS source_rows,
t.cnt AS target_rows,
ABS(s.cnt - t.cnt) AS diff
FROM source_count s, target_count t
WHERE ABS(s.cnt - t.cnt) > s.cnt * 0.01;
-- Returns a row only if diff exceeds 1%Custom SQL queries that return rows on failure. Zero rows means the check passes. Chain these after pipeline completion via Airflow or dbt post-hooks.
Data Quality in Interviews
System Design: How would you ensure data correctness in this pipeline?
Structure your answer in layers: schema validation at ingestion (reject or quarantine invalid records), reconciliation after transformation (compare source and target row counts and aggregates), business rule assertions after loading (revenue non-negative, every order has a customer), and freshness monitoring with SLA alerting. Mention specific tools (Great Expectations, dbt tests, Monte Carlo) if you have used them.
Behavioral: Tell me about a time you dealt with a data quality issue.
Have a specific example ready: what broke, how you detected it (monitoring alert, stakeholder report, your own investigation), what you fixed, and what you put in place to prevent recurrence. The prevention step is what separates strong answers from mediocre ones. Interviewers want to know you learned from the incident.
Data Quality FAQ
What are the six dimensions of data quality?+
How do you test data quality in a pipeline?+
What is the difference between data quality and data observability?+
How does data quality come up in data engineering interviews?+
Practice Data Quality Scenarios
- 01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
- 02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
- 03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition