A 3am page at a payments company once traced back to one upstream CSV where the vendor flipped the sign on refund amounts. Net revenue on the exec dashboard flipped positive. Nobody caught it for two days. The fix was seven characters in a CAST expression. The cleanup was six weeks of reconciling billing statements with finance. That's data quality: not a dashboard widget, but the thing that keeps you off the post-mortem. This guide covers the six dimensions, the tests that would've caught that CSV at 2:59am, and the stories you'll tell in the behavioral round when someone asks about a time you caught a bad load.
[Chart: System Design Share, Staff-Level Rounds, Quality Dimensions, Library Challenges. Source: DataDriven analysis of 1,042 verified data engineering interview rounds.]
These six dimensions form the standard framework for thinking about data quality. Each one is a different axis that can fail independently. A dataset can be complete but inaccurate. It can be timely but inconsistent. Understanding each dimension helps you design targeted checks rather than vague “data looks wrong” investigations.
Does the data correctly represent the real-world entity or event it describes? An inaccurate record might have the wrong price, the wrong timestamp, or the wrong customer ID. Accuracy failures are the hardest to detect because the data looks structurally valid. It just does not match reality.
How to test: Cross-reference against a trusted source. Compare aggregates between your pipeline output and the source system. Flag records where values fall outside expected ranges (negative ages, future dates on historical records, prices that differ by more than 2x from the rolling median).
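The rolling-median range check can be sketched in plain Python. This is a minimal illustration, not a framework API; the function name and the 2x threshold mirror the example in the text.

```python
from statistics import median

def flag_accuracy_outliers(values, factor=2.0):
    """Flag indices where a value differs by more than `factor`x
    from the median of all values seen so far (a simple rolling median)."""
    flagged = []
    window = []
    for i, v in enumerate(values):
        if window:
            m = median(window)
            if v > m * factor or v < m / factor:
                flagged.append(i)
        window.append(v)
    return flagged

print(flag_accuracy_outliers([10, 11, 10, 50, 10]))  # [3]
```

In production you would compute the median over a bounded trailing window and run the check per product or per segment, but the shape of the logic is the same.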
Are all expected records present? Are all required fields populated? Completeness failures show up as missing rows (a source had 1M records, but only 950K loaded) or missing values (email is NULL for 30% of users when it should always be populated).
How to test: Compare source row counts to target row counts after each pipeline run. Track null rates per column over time and alert on spikes. For partitioned tables, verify every expected partition exists.
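A minimal sketch of both completeness checks in one function, assuming the loaded records are available as a list of dicts (names are illustrative):

```python
def completeness_report(source_count, target_count, rows, required_cols):
    """Compare source vs target row counts and compute per-column null rates.

    `rows` is the loaded records as a list of dicts; `required_cols` are
    columns that should always be populated.
    """
    report = {
        "missing_rows": source_count - target_count,
        "null_rates": {},
    }
    for col in required_cols:
        nulls = sum(1 for r in rows if r.get(col) is None)
        report["null_rates"][col] = nulls / len(rows) if rows else 0.0
    return report

rows = [{"email": "a@x.com"}, {"email": None}]
print(completeness_report(3, 2, rows, ["email"]))
# {'missing_rows': 1, 'null_rates': {'email': 0.5}}
```

Persisting these reports per run gives you the time series needed to alert on null-rate spikes rather than absolute values.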
Do related datasets agree with each other? If the orders table says a customer spent $500 this month and the billing table says $480, you have a consistency failure. Consistency problems erode trust faster than any other dimension because users see two different numbers for the same thing.
How to test: Run reconciliation queries between related tables. Compare fact-table aggregates against dimension-table totals. Track referential integrity: every foreign key should resolve to a valid primary key.
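As a toy version of the referential-integrity check, here is the standard LEFT JOIN pattern run against an in-memory SQLite database (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1), (2);
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, 99);  -- 99 is an orphan
""")

# Every foreign key should resolve; rows where the join finds no match
# are consistency failures.
orphans = conn.execute("""
    SELECT o.id FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.id
    WHERE c.id IS NULL
""").fetchall()
print(orphans)  # [(12,)] -- non-empty result means the check failed
```

The same query shape works in any warehouse; only the connection boilerplate changes.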
Is the data available when consumers need it? A daily dashboard that updates at 2 PM instead of 8 AM is a timeliness failure. A real-time fraud detection pipeline that is 10 minutes behind is a timeliness failure. Timeliness is defined by the consumer, not the producer.
How to test: Monitor table freshness: when was the last row inserted or updated? Track pipeline completion times and alert when SLA thresholds are breached. Compare the latest event timestamp in the target against the current time.
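The freshness comparison is a one-liner once you have the latest event timestamp; a hedged sketch (the function name and SLA parameter are illustrative):

```python
from datetime import datetime, timedelta, timezone

def freshness_breached(last_event_ts, sla_minutes):
    """True if the most recent event is older than the consumer's SLA."""
    age = datetime.now(timezone.utc) - last_event_ts
    return age > timedelta(minutes=sla_minutes)

stale = datetime.now(timezone.utc) - timedelta(minutes=30)
print(freshness_breached(stale, sla_minutes=10))  # True
print(freshness_breached(stale, sla_minutes=60))  # False
```

The important design point from the text is that `sla_minutes` belongs to the consumer: the same table can pass for a daily dashboard and fail for a fraud pipeline.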
Are there unwanted duplicates? Duplicate records inflate metrics, break joins, and cause downstream consumers to double-count. Uniqueness failures typically originate from non-idempotent pipelines, retried API calls, or missing deduplication logic.
How to test: Assert primary key uniqueness after every load. Check for duplicate composite keys. Compare COUNT(*) vs COUNT(DISTINCT pk) on target tables. Build idempotent pipelines that produce the same result regardless of how many times they run.
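The composite-key check reduces to counting key tuples; a minimal sketch (names are illustrative):

```python
from collections import Counter

def duplicate_keys(rows, key_cols):
    """Return composite keys that appear more than once, with their counts.
    Equivalent to comparing COUNT(*) vs COUNT(DISTINCT pk) per key."""
    counts = Counter(tuple(r[c] for c in key_cols) for r in rows)
    return {k: n for k, n in counts.items() if n > 1}

rows = [{"id": 1, "day": "a"}, {"id": 1, "day": "a"}, {"id": 2, "day": "b"}]
print(duplicate_keys(rows, ["id", "day"]))  # {(1, 'a'): 2}
```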
Does the data conform to defined formats, types, and business rules? An email column that contains phone numbers is a validity failure. A status column that contains values outside the allowed enum is a validity failure. Validity checks enforce the contract between producers and consumers.
How to test: Validate column types and formats at ingestion. Check enum columns against allowed values. Validate date formats, numeric ranges, and string patterns (regex for email, phone, UUID). Use schema registries for event streams to enforce validity at the producer.
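A per-record validity check can be sketched as a function returning the list of rule violations. The email regex here is deliberately simplified, and the enum values are the ones used elsewhere in this guide:

```python
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # simplified, not RFC-complete
ALLOWED_STATUS = {"pending", "shipped", "delivered", "cancelled"}

def validity_errors(row):
    """Return a list of validity violations for one record."""
    errors = []
    if not EMAIL_RE.match(row.get("email") or ""):
        errors.append("invalid email format")
    if row.get("status") not in ALLOWED_STATUS:
        errors.append("status outside allowed enum")
    return errors

print(validity_errors({"email": "a@b.com", "status": "shipped"}))   # []
print(validity_errors({"email": "555-1234", "status": "unknown"}))  # two errors
```

In a real pipeline these rules live in the schema registry or ingestion layer, so producers fail fast rather than consumers discovering bad values later.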
Data quality tests run at different points in the pipeline. Some catch problems at ingestion before bad data enters the warehouse. Others validate transformation output before downstream consumers see it. The best pipelines test at both points.
Great Expectations is a Python library for declaring data validation rules as “expectations.” You define what you expect (column X should never be null, column Y should be between 0 and 100), run the expectations against a dataset, and get a pass/fail report. Expectations are versioned alongside your pipeline code.
# Great Expectations: define and run checks
import great_expectations as gx

context = gx.get_context()
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="orders_suite"
)

# Completeness check
validator.expect_column_values_to_not_be_null("order_id")

# Validity check
validator.expect_column_values_to_be_in_set(
    "status", ["pending", "shipped", "delivered", "cancelled"]
)

# Uniqueness check
validator.expect_column_values_to_be_unique("order_id")

# Accuracy check (range)
validator.expect_column_values_to_be_between(
    "amount", min_value=0, max_value=100000
)

results = validator.validate()

dbt has built-in tests (unique, not_null, accepted_values, relationships) and supports custom tests written as SQL queries that return failing rows. Tests run after models build, catching quality issues before downstream models consume the data.
# schema.yml
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'shipped', 'delivered', 'cancelled']
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: id

Not every check fits a framework. Custom SQL queries that run after pipeline completion handle domain-specific rules. These queries return rows that violate the rule. Zero rows means the check passes.
-- Reconciliation: source vs target row count
WITH source_count AS (
    SELECT COUNT(*) AS cnt FROM raw.orders
    WHERE loaded_at >= CURRENT_DATE
),
target_count AS (
    SELECT COUNT(*) AS cnt FROM warehouse.orders
    WHERE order_date >= CURRENT_DATE
)
SELECT
    s.cnt AS source_rows,
    t.cnt AS target_rows,
    ABS(s.cnt - t.cnt) AS diff
FROM source_count s, target_count t
WHERE ABS(s.cnt - t.cnt) > s.cnt * 0.01;
-- Returns a row only if diff exceeds 1%

Interview note: When discussing data quality in a system design interview, mention specific check types: schema validation at ingestion, row count reconciliation after load, business rule assertions after transformation, and freshness monitoring for timeliness. This shows you think in layers, not just “we will add some tests.”
Tests catch known failure modes. Monitoring catches unknown ones. A good data quality setup combines explicit tests with statistical monitoring that detects anomalies without predefined rules.
Freshness monitoring: Track the timestamp of the most recent row in each table. Alert when freshness exceeds the expected SLA. Tools like Monte Carlo, Bigeye, and Elementary do this automatically.
Volume monitoring: Track row counts over time. A sudden drop (from 1M daily rows to 100K) indicates a source problem. A sudden spike might indicate duplicates. Alert on deviations beyond 2 standard deviations from the rolling average.
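The 2-standard-deviation rule can be sketched with the standard library (the history window and threshold are the article's example values, not universal defaults):

```python
from statistics import mean, stdev

def volume_anomaly(history, today, n_sigma=2.0):
    """True if today's row count deviates more than n_sigma standard
    deviations from the rolling average of recent daily counts."""
    mu, sigma = mean(history), stdev(history)
    return abs(today - mu) > n_sigma * sigma

history = [1_000_000, 1_010_000, 990_000, 1_005_000, 995_000]
print(volume_anomaly(history, 100_000))    # True  -- sudden drop
print(volume_anomaly(history, 1_002_000))  # False -- within normal range
```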
Distribution monitoring: Track column value distributions (nulls, distinct counts, percentile values). A column that was 2% null yesterday and is 40% null today has a problem, even if no explicit test covers it.
Schema change detection: Monitor source schemas for added, removed, or altered columns. A schema change in a source API or database can break downstream pipelines silently. Detect it at ingestion and alert before the pipeline runs.
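Comparing the observed source schema against the expected one is a set operation; a minimal sketch, assuming schemas are represented as column-to-type mappings:

```python
def schema_diff(expected, observed):
    """Classify schema drift between two {column: type} mappings."""
    added = set(observed) - set(expected)
    removed = set(expected) - set(observed)
    altered = {c for c in set(expected) & set(observed)
               if expected[c] != observed[c]}
    return {"added": added, "removed": removed, "altered": altered}

diff = schema_diff(
    {"id": "int", "amount": "float"},
    {"id": "int", "amount": "text", "note": "text"},
)
print(diff)  # {'added': {'note'}, 'removed': set(), 'altered': {'amount'}}
```

Running this at ingestion, before the pipeline proper, is what turns a silent downstream break into an upfront alert.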
Every pipeline you inherit has a graveyard of quality bolt-ons. Spark jobs with a regex check glued on after the fact. Airflow DAGs that run tests in a task nobody looks at. The patterns below are the ones engineers add after something breaks at 2am and they swear it won't happen again. Copy them in from day one.
Idempotent pipelines: Make every pipeline step produce the same output regardless of how many times it runs. This prevents duplicates from retries and backfills. Use MERGE / upsert semantics instead of blind INSERT.
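The MERGE/upsert semantics can be demonstrated with SQLite's ON CONFLICT clause (a stand-in for your warehouse's MERGE; table names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")

def upsert_order(order_id, amount):
    # ON CONFLICT DO UPDATE gives MERGE-like semantics: rerunning the
    # same load updates the existing row instead of inserting a duplicate.
    conn.execute(
        "INSERT INTO orders VALUES (?, ?) "
        "ON CONFLICT(order_id) DO UPDATE SET amount = excluded.amount",
        (order_id, amount),
    )

for _ in range(3):  # simulate three retries of the same load
    upsert_order(1, 42.0)

count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(count)  # 1 -- same result no matter how many times the load ran
```

A blind INSERT in the same loop would have produced three rows; the upsert makes the step safe to retry and backfill.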
Schema enforcement at ingestion: Validate incoming data against a schema before writing it to the warehouse. Reject or quarantine records that do not conform. Schema registries (for Kafka/Avro) and contract tests (for APIs) catch validity issues at the boundary.
Data contracts: Define explicit agreements between data producers and consumers. A contract specifies schema, freshness SLA, allowed values, and quality thresholds. When a producer breaks the contract, the pipeline stops rather than propagating bad data.
Quarantine pattern: Route records that fail quality checks to a separate quarantine table instead of dropping them. This lets you investigate failures without losing data and without polluting production tables.
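The quarantine routing itself is simple; the value is in keeping the failed records inspectable. A minimal sketch with an arbitrary per-record predicate (names are illustrative):

```python
def load_with_quarantine(records, is_valid):
    """Split records into a clean batch and a quarantine batch.
    Failed records are preserved for investigation, never dropped."""
    clean, quarantine = [], []
    for r in records:
        (clean if is_valid(r) else quarantine).append(r)
    return clean, quarantine

clean, quarantined = load_with_quarantine(
    [{"amount": 5}, {"amount": -1}],
    is_valid=lambda r: r["amount"] >= 0,
)
print(clean)        # [{'amount': 5}]
print(quarantined)  # [{'amount': -1}]
```

In a warehouse, `quarantined` would be written to a sibling table with the failure reason and load timestamp attached, so reprocessing after a fix is a single backfill.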
Circuit breaker: If quality checks fail beyond a threshold (e.g., more than 5% of rows fail validation), halt the pipeline instead of loading bad data. This prevents cascading quality failures in downstream tables.
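A circuit breaker is a threshold comparison that raises instead of loading; this sketch uses the 5% threshold from the example above (exception and function names are illustrative):

```python
class QualityCircuitBreakerError(Exception):
    """Raised to halt the pipeline when too many rows fail validation."""

def check_or_halt(total_rows, failed_rows, max_fail_rate=0.05):
    """Return the failure rate, or raise if it exceeds the threshold."""
    rate = failed_rows / total_rows if total_rows else 0.0
    if rate > max_fail_rate:
        raise QualityCircuitBreakerError(
            f"{rate:.1%} of rows failed validation (limit {max_fail_rate:.0%})"
        )
    return rate

print(check_or_halt(100, 3))   # 0.03 -- under threshold, pipeline proceeds
# check_or_halt(100, 10)       # raises QualityCircuitBreakerError
```

The design choice worth naming in an interview: a raised exception fails the orchestrator task, so downstream tables keep yesterday's correct data instead of today's bad load.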
Interviewers evaluate data quality understanding in two ways: directly (asking what data quality means and how you ensure it) and indirectly (embedding quality concerns in system design problems). Being ready for both gives you an advantage.
When designing a pipeline, proactively mention data quality. Say: “At the ingestion layer, I would validate the schema and reject malformed records. After transformation, I would run row count reconciliation and business rule checks. I would monitor freshness and alert if the pipeline misses its SLA.” Interviewers notice when you bring up quality without being asked.
Prepare a story about a data quality incident. Structure it as: what happened, how you discovered it, what the root cause was, how you fixed it, and what you put in place to prevent recurrence. The best answers show that you implemented automated checks that caught the problem class permanently, not just a one-time fix.
Every behavioral round will ask for a data quality war story. Practice the SQL reconciliation queries and system design patterns that give you one worth telling.