Data Quality for Data Engineers

A 3am page at a payments company once traced back to one upstream CSV where the vendor flipped the sign on refund amounts. Net revenue on the exec dashboard flipped positive. Nobody caught it for two days. The fix was seven characters in a CAST expression. The cleanup was six weeks of reconciling billing statements with finance. That is data quality. It is not a dashboard widget, it is the thing that keeps your name off the post-mortem.

System Design Share

17%

Staff-Level Rounds

Quality Dimensions

1,418

Library Challenges

The Six Dimensions of Data Quality

These six dimensions form the standard framework for thinking about data quality. Each one is a different axis that can fail independently. A dataset can be complete but inaccurate. It can be timely but inconsistent.

1. Accuracy

Does the data correctly represent the real-world entity or event it describes? Accuracy failures are the hardest to detect because the data looks structurally valid. Cross-reference against a trusted source. Compare aggregates between your pipeline output and the source system. Flag records where values fall outside expected ranges.

2. Completeness

Are all expected records present? Are all required fields populated? Compare source row counts to target row counts after each pipeline run. Track null rates per column over time and alert on spikes. For partitioned tables, verify every expected partition exists.

3. Consistency

Do related datasets agree with each other? Consistency problems erode trust faster than any other dimension because users see two different numbers for the same thing. Run reconciliation queries between related tables. Track referential integrity: every foreign key should resolve to a valid primary key.

4. Timeliness

Is the data available when consumers need it? Timeliness is defined by the consumer, not the producer. Monitor table freshness: when was the last row inserted or updated? Track pipeline completion times and alert when SLA thresholds are breached.

5. Uniqueness

Are there unwanted duplicates? Duplicate records inflate metrics, break joins, and cause downstream consumers to double-count. Assert primary key uniqueness after every load. Build idempotent pipelines that produce the same result regardless of how many times they run.

6. Validity

Does the data conform to defined formats, types, and business rules? An email column that contains phone numbers is a validity failure. Validate column types and formats at ingestion. Check enum columns against allowed values. Use schema registries for event streams to enforce validity at the producer.

Great Expectations: Declare and Run Checks

# Great Expectations: define and run checks
import great_expectations as gx

context = gx.get_context()
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="orders_suite"
)

# Completeness check
validator.expect_column_values_to_not_be_null("order_id")

# Validity check
validator.expect_column_values_to_be_in_set(
    "status", ["pending", "shipped", "delivered", "cancelled"]
)

# Uniqueness check
validator.expect_column_values_to_be_unique("order_id")

# Accuracy check (range)
validator.expect_column_values_to_be_between(
    "amount", min_value=0, max_value=100000
)

results = validator.validate()

Great Expectations lets you define expectations as code, version them alongside your pipeline, and run them as part of CI/CD.

dbt Tests: Built-in and Custom

# schema.yml
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'shipped', 'delivered', 'cancelled']
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: id

dbt tests run after models build, catching quality issues before downstream models consume the data.

Custom SQL Reconciliation Check

-- Reconciliation: source vs target row count
WITH source_count AS (
  SELECT COUNT(*) AS cnt FROM orders
),
target_count AS (
  SELECT COUNT(DISTINCT order_id) AS cnt FROM order_items
)
SELECT
  s.cnt AS source_rows,
  t.cnt AS target_rows,
  ABS(s.cnt - t.cnt) AS diff
FROM source_count s, target_count t
WHERE ABS(s.cnt - t.cnt) > s.cnt * 0.01;
-- Returns a row only if diff exceeds 1%

Custom SQL queries that return rows on failure. Zero rows means the check passes. Chain these after pipeline completion via Airflow or dbt post-hooks.

Monitoring and Alerting

Tests catch known failure modes. Monitoring catches unknown ones. A good data quality setup combines explicit tests with statistical monitoring.

Freshness monitoring: Track the timestamp of the most recent row in each table. Alert when freshness exceeds the expected SLA. Tools like Monte Carlo, Bigeye, and Elementary do this automatically.

Volume monitoring: Track row counts over time. A sudden drop (from 1M daily rows to 100K) indicates a source problem. A sudden spike might indicate duplicates. Alert on deviations beyond 2 standard deviations from the rolling average.

Distribution monitoring: Track column value distributions (nulls, distinct counts, percentile values). A column that was 2% null yesterday and is 40% null today has a problem, even if no explicit test covers it.

Schema change detection: Monitor source schemas for added, removed, or altered columns. A schema change in a source API or database can break downstream pipelines silently. Detect it at ingestion and alert before the pipeline runs.

Prepare for the interview

01 / Open invite

02min.

Know Data Quality the way the interviewer who asks it knows it.

a Data Quality query, the same shape a screen would give you.

The diff against expected. Where ties broke. What you missed.

sandbox

1source → bronze → silver → gold

2 ingest : CDC + Kafka

3 transform : dbt + Airflow

4 serve : Snowflake

Execute your solution0.4s avg.

IntuitInterview question

Solve a Data Quality problem

Building Data Quality Into Pipelines

Every pipeline you inherit has a graveyard of quality bolt-ons added after something broke at 2am. Copy these patterns in from day one.

Idempotent pipelines: Make every pipeline step produce the same output regardless of how many times it runs. This prevents duplicates from retries and backfills. Use MERGE / upsert semantics instead of blind INSERT.

Schema enforcement at ingestion: Validate incoming data against a schema before writing it to the warehouse. Reject or quarantine records that do not conform. Schema registries (for Kafka/Avro) and contract tests (for APIs) catch validity issues at the boundary.

Data contracts: Define explicit agreements between data producers and consumers. A contract specifies schema, freshness SLA, allowed values, and quality thresholds. When a producer breaks the contract, the pipeline stops rather than propagating bad data.

Quarantine pattern: Route records that fail quality checks to a separate quarantine table instead of dropping them. This lets you investigate failures without losing data and without polluting production tables.

Circuit breaker: If quality checks fail beyond a threshold (e.g., more than 5% of rows fail validation), halt the pipeline instead of loading bad data. This prevents cascade failures where bad data flows into multiple downstream tables.

Data Quality in Interviews

System Design: How would you ensure data correctness in this pipeline?

Structure your answer in layers: schema validation at ingestion (reject or quarantine invalid records), reconciliation after transformation (compare source and target row counts and aggregates), business rule assertions after loading (revenue non-negative, every order has a customer), and freshness monitoring with SLA alerting. Mention specific tools (Great Expectations, dbt tests, Monte Carlo) if you have used them.

Behavioral: Tell me about a time you dealt with a data quality issue.

Have a specific example ready: what broke, how you detected it (monitoring alert, stakeholder report, your own investigation), what you fixed, and what you put in place to prevent recurrence. The prevention step is what separates strong answers from mediocre ones. Interviewers want to know you learned from the incident.

Data Quality FAQ

What are the six dimensions of data quality?+

The six standard dimensions are accuracy (does the data reflect reality), completeness (are all expected records and fields present), consistency (do related datasets agree with each other), timeliness (is the data available when it is needed), uniqueness (are there no unwanted duplicates), and validity (does the data conform to defined formats, ranges, and business rules). Different frameworks use slightly different names, but these six cover the territory that data engineers are responsible for in production pipelines.

How do you test data quality in a pipeline?+

Three approaches are common. First, schema-level checks: validate column types, nullability, and primary key uniqueness before loading. Second, statistical checks: compare row counts, value distributions, and aggregates against historical baselines to catch anomalies. Third, business-rule checks: encode domain-specific rules (like 'revenue should never be negative' or 'every order must have a customer_id') as assertions. Tools like Great Expectations, dbt tests, Soda, and Monte Carlo automate these checks at different points in the pipeline.

What is the difference between data quality and data observability?+

Data quality is about defining and enforcing correctness rules on your data. Data observability is about monitoring your data systems for anomalies, freshness issues, and schema changes, often without predefined rules. Think of data quality as unit tests (you write specific assertions) and data observability as monitoring and alerting (the system detects unusual patterns automatically). In practice, mature data teams use both: explicit quality checks for known rules and observability tools for catching unknown issues.

How does data quality come up in data engineering interviews?+

Interviewers ask about data quality in system design rounds and behavioral rounds. In system design, they ask how you would ensure data correctness in a pipeline you are designing. They want to hear about validation at ingestion, idempotent transformations, reconciliation checks between source and target, and alerting on quality regressions. In behavioral rounds, they ask about a time you caught or fixed a data quality issue. Have a specific example ready: what broke, how you detected it, what you fixed, and what you put in place to prevent recurrence.

02 / Why practice

Practice Data Quality Scenarios

01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
03
System design is graded on the calls you defend out loud
Ingestion, batch vs streaming, the bronze/silver/gold layers, idempotency, backfill and replay. Sketching the pipeline and naming the failure modes is the signal, not the boxes

Start Practicing

Related Guides

Data Observability→

Monitoring without predefined rules: anomaly detection for data systems

Data Governance→

Access control, PII handling, and audit trails for data engineers

Pipeline Architecture→

End-to-end pipeline design patterns with failure handling