Data Engineering

Data Quality for Data Engineers

A 3am page at a payments company once traced back to one upstream CSV where the vendor flipped the sign on refund amounts. Net revenue on the exec dashboard swung positive. Nobody caught it for two days. The fix was seven characters in a CAST expression. The cleanup was six weeks of reconciling billing statements with finance. That's data quality: not a dashboard widget, but the thing that keeps you out of the post-mortem. This guide covers the six dimensions, the tests that would have caught that CSV at 2:59am, and the stories you'll tell in the behavioral round when someone asks about a time you caught a bad load.

Data quality by the numbers: 3% system design share, 17% of staff-level rounds, 6 quality dimensions, 1,418 library challenges.

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

The Six Dimensions of Data Quality

These six dimensions form the standard framework for thinking about data quality. Each one is a different axis that can fail independently. A dataset can be complete but inaccurate. It can be timely but inconsistent. Understanding each dimension helps you design targeted checks rather than vague “data looks wrong” investigations.

1. Accuracy

Does the data correctly represent the real-world entity or event it describes? An inaccurate record might have the wrong price, the wrong timestamp, or the wrong customer ID. Accuracy failures are the hardest to detect because the data looks structurally valid. It just does not match reality.

How to test: Cross-reference against a trusted source. Compare aggregates between your pipeline output and the source system. Flag records where values fall outside expected ranges (negative ages, future dates on historical records, prices that differ by more than 2x from the rolling median).
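The rolling-median rule can be sketched in a few lines of plain Python. This is a minimal illustration, not a production check; the function name, window size, and the 2x threshold from above are all just the heuristic, not a standard API:

```python
from statistics import median

def flag_price_outliers(prices, window=7):
    """Return indexes of prices that differ by more than 2x from the
    rolling median of the last `window` values (a simple accuracy heuristic)."""
    flagged = []
    for i, price in enumerate(prices):
        # Median of the trailing window, including the current value
        baseline = median(prices[max(0, i - window + 1): i + 1])
        if baseline > 0 and not (baseline / 2 <= price <= baseline * 2):
            flagged.append(i)
    return flagged
```

In a warehouse you would express the same rule as a SQL window function over the loaded partition; the Python version is just the logic.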

2. Completeness

Are all expected records present? Are all required fields populated? Completeness failures show up as missing rows (a source had 1M records, but only 950K loaded) or missing values (email is NULL for 30% of users when it should always be populated).

How to test: Compare source row counts to target row counts after each pipeline run. Track null rates per column over time and alert on spikes. For partitioned tables, verify every expected partition exists.
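Both completeness checks above (row-count reconciliation and per-column null-rate tracking) fit in a few lines. The function names and the 1% tolerance are illustrative assumptions:

```python
def null_rates(rows, columns):
    """Per-column null rate for a batch of dict-shaped records."""
    total = len(rows)
    return {
        col: sum(1 for r in rows if r.get(col) is None) / total
        for col in columns
    }

def completeness_check(source_count, target_count, tolerance=0.01):
    """Pass if the target row count is within `tolerance` of the source's."""
    return abs(source_count - target_count) <= source_count * tolerance
```

Track the null rates over time rather than asserting a fixed value; the signal is the spike, not the absolute number.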

3. Consistency

Do related datasets agree with each other? If the orders table says a customer spent $500 this month and the billing table says $480, you have a consistency failure. Consistency problems erode trust faster than any other dimension because users see two different numbers for the same thing.

How to test: Run reconciliation queries between related tables. Compare fact-table aggregates against dimension-table totals. Track referential integrity: every foreign key should resolve to a valid primary key.

4. Timeliness

Is the data available when consumers need it? A daily dashboard that updates at 2 PM instead of 8 AM is a timeliness failure. A real-time fraud detection pipeline that is 10 minutes behind is a timeliness failure. Timeliness is defined by the consumer, not the producer.

How to test: Monitor table freshness: when was the last row inserted or updated? Track pipeline completion times and alert when SLA thresholds are breached. Compare the latest event timestamp in the target against the current time.
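A freshness check reduces to a single comparison between the newest event timestamp and the clock. A minimal sketch; the function name and SLA value are assumptions:

```python
from datetime import datetime, timedelta, timezone

def freshness_breach(last_event_ts, sla, now=None):
    """True if the newest row in the target is older than the SLA allows."""
    now = now or datetime.now(timezone.utc)
    return now - last_event_ts > sla
```

The `last_event_ts` would come from a `MAX(event_time)` query against the target table; feed it a consumer-defined SLA, since timeliness is defined by the consumer.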

5. Uniqueness

Are there unwanted duplicates? Duplicate records inflate metrics, break joins, and cause downstream consumers to double-count. Uniqueness failures typically originate from non-idempotent pipelines, retried API calls, or missing deduplication logic.

How to test: Assert primary key uniqueness after every load. Check for duplicate composite keys. Compare COUNT(*) vs COUNT(DISTINCT pk) on target tables. Build idempotent pipelines that produce the same result regardless of how many times they run.

6. Validity

Does the data conform to defined formats, types, and business rules? An email column that contains phone numbers is a validity failure. A status column that contains values outside the allowed enum is a validity failure. Validity checks enforce the contract between producers and consumers.

How to test: Validate column types and formats at ingestion. Check enum columns against allowed values. Validate date formats, numeric ranges, and string patterns (regex for email, phone, UUID). Use schema registries for event streams to enforce validity at the producer.
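A sketch of ingestion-time validity rules. The allowed status set mirrors the examples used elsewhere in this guide; the patterns and function name are otherwise illustrative, and production rules should come from your data contract:

```python
import re

# Illustrative patterns -- real rules belong in a shared data contract
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "uuid": re.compile(
        r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
        re.IGNORECASE,
    ),
}
ALLOWED_STATUS = {"pending", "shipped", "delivered", "cancelled"}

def validity_errors(record):
    """Return the names of the validity rules this record violates."""
    errors = []
    if not PATTERNS["email"].match(record.get("email") or ""):
        errors.append("email_format")
    if record.get("status") not in ALLOWED_STATUS:
        errors.append("status_enum")
    return errors
```

Records with a non-empty error list are candidates for the quarantine pattern described later, rather than silent drops.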

Data Quality Testing Patterns

Data quality tests run at different points in the pipeline. Some catch problems at ingestion before bad data enters the warehouse. Others validate transformation output before downstream consumers see it. The best pipelines test at both points.

Great Expectations

Great Expectations is a Python library for declaring data validation rules as “expectations.” You define what you expect (column X should never be null, column Y should be between 0 and 100), run the expectations against a dataset, and get a pass/fail report. Expectations are versioned alongside your pipeline code.

# Great Expectations: define and run checks
import great_expectations as gx

context = gx.get_context()

# batch_request is assumed to be defined earlier; it points the suite
# at a batch of the orders table
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="orders_suite"
)

# Completeness check
validator.expect_column_values_to_not_be_null("order_id")

# Validity check
validator.expect_column_values_to_be_in_set(
    "status", ["pending", "shipped", "delivered", "cancelled"]
)

# Uniqueness check
validator.expect_column_values_to_be_unique("order_id")

# Accuracy check (range)
validator.expect_column_values_to_be_between(
    "amount", min_value=0, max_value=100000
)

results = validator.validate()

dbt Tests

dbt has built-in tests (unique, not_null, accepted_values, relationships) and supports custom tests written as SQL queries that return failing rows. Tests run after models build, catching quality issues before downstream models consume the data.

# schema.yml
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'shipped', 'delivered', 'cancelled']
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: id

Custom SQL Checks

Not every check fits a framework. Custom SQL queries that run after pipeline completion handle domain-specific rules. These queries return rows that violate the rule. Zero rows means the check passes.

-- Reconciliation: source vs target row count
WITH source_count AS (
  SELECT COUNT(*) AS cnt FROM raw.orders
  WHERE loaded_at >= CURRENT_DATE
),
target_count AS (
  SELECT COUNT(*) AS cnt FROM warehouse.orders
  WHERE order_date >= CURRENT_DATE
)
SELECT
  s.cnt AS source_rows,
  t.cnt AS target_rows,
  ABS(s.cnt - t.cnt) AS diff
FROM source_count s CROSS JOIN target_count t
WHERE ABS(s.cnt - t.cnt) > s.cnt * 0.01;
-- Returns a row only if diff exceeds 1%

Interview note: When discussing data quality in a system design interview, mention specific check types: schema validation at ingestion, row count reconciliation after load, business rule assertions after transformation, and freshness monitoring for timeliness. This shows you think in layers, not just “we will add some tests.”

Monitoring and Alerting

Tests catch known failure modes. Monitoring catches unknown ones. A good data quality setup combines explicit tests with statistical monitoring that detects anomalies without predefined rules.

Freshness monitoring: Track the timestamp of the most recent row in each table. Alert when freshness exceeds the expected SLA. Tools like Monte Carlo, Bigeye, and Elementary do this automatically.

Volume monitoring: Track row counts over time. A sudden drop (from 1M daily rows to 100K) indicates a source problem. A sudden spike might indicate duplicates. Alert on deviations beyond 2 standard deviations from the rolling average.
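The 2-standard-deviation rule can be sketched directly with the standard library. A minimal illustration (the function name is an assumption, and it needs at least two days of history to compute a standard deviation):

```python
from statistics import mean, stdev

def volume_anomaly(history, todays_count, sigma=2.0):
    """Alert when today's row count deviates more than `sigma` standard
    deviations from the rolling average of recent daily counts.

    `history` is a list of at least two recent daily row counts.
    """
    mu, sd = mean(history), stdev(history)
    if sd == 0:
        return todays_count != mu  # flat history: any change is a deviation
    return abs(todays_count - mu) > sigma * sd
```

Observability tools implement the same idea with seasonality adjustments (weekday vs. weekend volumes); the raw z-score is the starting point.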

Distribution monitoring: Track column value distributions (nulls, distinct counts, percentile values). A column that was 2% null yesterday and is 40% null today has a problem, even if no explicit test covers it.

Schema change detection: Monitor source schemas for added, removed, or altered columns. A schema change in a source API or database can break downstream pipelines silently. Detect it at ingestion and alert before the pipeline runs.
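A diff against the last-known source schema is enough to catch added, removed, and retyped columns before the pipeline runs. A minimal sketch; the column and type names are illustrative:

```python
def schema_diff(expected, observed):
    """Compare an expected column->type mapping against what the source
    currently exposes. Returns added, removed, and changed columns."""
    added = sorted(set(observed) - set(expected))
    removed = sorted(set(expected) - set(observed))
    changed = sorted(
        col for col in set(expected) & set(observed)
        if expected[col] != observed[col]
    )
    return {"added": added, "removed": removed, "changed": changed}
```

Persist the expected mapping alongside the pipeline (or in a schema registry) and alert whenever the diff is non-empty.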

Building Data Quality Into Pipelines

Every pipeline you inherit has a graveyard of quality bolt-ons. Spark jobs with a regex check glued on after the fact. Airflow DAGs that run tests in a task nobody looks at. The patterns below are the ones engineers add after something breaks at 2am and they swear it won't happen again. Copy them in from day one.

Idempotent pipelines: Make every pipeline step produce the same output regardless of how many times it runs. This prevents duplicates from retries and backfills. Use MERGE / upsert semantics instead of blind INSERT.
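The idempotency property is easy to see in miniature. The sketch below models a target table as a dict keyed by primary key; in a real warehouse the same semantics come from MERGE or INSERT ... ON CONFLICT, and the function name here is illustrative:

```python
def upsert(table, batch, key="order_id"):
    """Merge a batch into `table` (a dict keyed by primary key).

    Running the same batch twice leaves the table unchanged -- the
    property that makes retries and backfills safe.
    """
    for row in batch:
        table[row[key]] = row  # insert or overwrite, never duplicate
    return table
```

A blind INSERT would append the batch again on every retry; the keyed write is what makes the re-run a no-op.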

Schema enforcement at ingestion: Validate incoming data against a schema before writing it to the warehouse. Reject or quarantine records that do not conform. Schema registries (for Kafka/Avro) and contract tests (for APIs) catch validity issues at the boundary.

Data contracts: Define explicit agreements between data producers and consumers. A contract specifies schema, freshness SLA, allowed values, and quality thresholds. When a producer breaks the contract, the pipeline stops rather than propagating bad data.

Quarantine pattern: Route records that fail quality checks to a separate quarantine table instead of dropping them. This lets you investigate failures without losing data and without polluting production tables.

Circuit breaker: If quality checks fail beyond a threshold (e.g., more than 5% of rows fail validation), halt the pipeline instead of loading bad data. This prevents cascading quality failures in downstream tables.
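The quarantine and circuit-breaker patterns compose naturally: split the batch, and refuse to load anything when too much of it fails. A minimal sketch; the 5% threshold matches the example above, and everything else (names, exception type) is illustrative:

```python
class CircuitBreakerTripped(Exception):
    """Raised when a batch fails validation beyond the allowed rate."""

def load_with_quarantine(batch, is_valid, max_fail_rate=0.05):
    """Split a batch into clean and quarantined rows; halt the load
    entirely if the failure rate exceeds the circuit-breaker threshold."""
    clean = [row for row in batch if is_valid(row)]
    quarantined = [row for row in batch if not is_valid(row)]
    if batch and len(quarantined) / len(batch) > max_fail_rate:
        raise CircuitBreakerTripped(
            f"{len(quarantined)}/{len(batch)} rows failed validation"
        )
    return clean, quarantined
```

On success, write `clean` to the production table and `quarantined` to a quarantine table for investigation; on a trip, the pipeline halts and pages instead of propagating bad data downstream.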

Data Quality in Interviews

Interviewers evaluate data quality understanding in two ways: directly (asking what data quality means and how you ensure it) and indirectly (embedding quality concerns in system design problems). Being ready for both gives you an advantage.

System Design Angle

When designing a pipeline, proactively mention data quality. Say: “At the ingestion layer, I would validate the schema and reject malformed records. After transformation, I would run row count reconciliation and business rule checks. I would monitor freshness and alert if the pipeline misses its SLA.” Interviewers notice when you bring up quality without being asked.

Behavioral Angle

Prepare a story about a data quality incident. Structure it as: what happened, how you discovered it, what the root cause was, how you fixed it, and what you put in place to prevent recurrence. The best answers show that you implemented automated checks that caught the problem class permanently, not just a one-time fix.

Data Quality FAQ

What are the six dimensions of data quality?
The six standard dimensions are accuracy (does the data reflect reality), completeness (are all expected records and fields present), consistency (do related datasets agree with each other), timeliness (is the data available when it is needed), uniqueness (are there no unwanted duplicates), and validity (does the data conform to defined formats, ranges, and business rules). Different frameworks use slightly different names, but these six cover the territory that data engineers are responsible for in production pipelines.
How do you test data quality in a pipeline?
Three approaches are common. First, schema-level checks: validate column types, nullability, and primary key uniqueness before loading. Second, statistical checks: compare row counts, value distributions, and aggregates against historical baselines to catch anomalies. Third, business-rule checks: encode domain-specific rules (like 'revenue should never be negative' or 'every order must have a customer_id') as assertions. Tools like Great Expectations, dbt tests, Soda, and Monte Carlo automate these checks at different points in the pipeline.
What is the difference between data quality and data observability?
Data quality is about defining and enforcing correctness rules on your data. Data observability is about monitoring your data systems for anomalies, freshness issues, and schema changes, often without predefined rules. Think of data quality as unit tests (you write specific assertions) and data observability as monitoring and alerting (the system detects unusual patterns automatically). In practice, mature data teams use both: explicit quality checks for known rules and observability tools for catching unknown issues.
How does data quality come up in data engineering interviews?
Interviewers ask about data quality in system design rounds and behavioral rounds. In system design, they ask how you would ensure data correctness in a pipeline you are designing. They want to hear about validation at ingestion, idempotent transformations, reconciliation checks between source and target, and alerting on quality regressions. In behavioral rounds, they ask about a time you caught or fixed a data quality issue. Have a specific example ready: what broke, how you detected it, what you fixed, and what you put in place to prevent recurrence.

Build The Story Before The Page At 2am Writes It For You

Every behavioral round will ask for a data quality war story. Practice the SQL reconciliation queries and system design patterns that give you one worth telling.