Data observability is the practice of monitoring the health of your data pipelines and the data flowing through them. It covers five pillars: freshness, quality, volume, schema, and lineage. When a table goes stale, a row count drops 80%, or a column type changes without warning, observability catches it before your stakeholders do.
This page covers what data observability is, the five pillars in detail, how it differs from data quality, the tools market, and how interviewers test observability concepts in system design rounds.
Source: DataDriven analysis of 1,042 verified data engineering interview rounds.
These five pillars were originally articulated by Barr Moses at Monte Carlo. They provide a structured framework for thinking about pipeline health. Each pillar answers a different question, and together they cover the full surface area of what can go wrong with your data.
Freshness measures how up-to-date your data is. If a dashboard shows yesterday's numbers at 3 PM, something is wrong. Freshness monitoring tracks when each table was last updated and alerts when it falls behind its expected schedule. This is the most common observability failure: the pipeline ran, it produced data, but it produced stale data because an upstream source delivered late.
In practice, freshness checks compare the maximum timestamp in a table against the current time. If the gap exceeds the SLA threshold (say, 2 hours for a table that should refresh hourly), an alert fires. Simple to implement, hard to calibrate. Set thresholds too tight and you get alert fatigue. Set them too loose and you miss real issues. The best teams set freshness SLAs per table based on downstream consumer requirements, not arbitrary defaults.
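As a concrete sketch, a minimal freshness check in Python might look like the following. The table names, timestamps, and the 2-hour SLA are illustrative assumptions, not tied to any particular tool:

```python
from datetime import datetime, timedelta, timezone

def check_freshness(max_timestamp, sla, now=None):
    """Return (is_fresh, lag) for a table whose newest row is max_timestamp.

    max_timestamp: latest update time observed in the table
    sla: maximum acceptable lag as a timedelta
    """
    now = now or datetime.now(timezone.utc)
    lag = now - max_timestamp
    return lag <= sla, lag

# A table expected to refresh hourly, checked against a 2-hour SLA:
now = datetime(2024, 1, 15, 15, 0, tzinfo=timezone.utc)
last = datetime(2024, 1, 15, 12, 30, tzinfo=timezone.utc)
fresh, lag = check_freshness(last, timedelta(hours=2), now=now)
# fresh is False: the table is 2.5 hours behind, past its SLA
```

In a real deployment, `max_timestamp` would come from a `SELECT MAX(updated_at)` query against the monitored table, and the SLA would be stored per table, as the text suggests.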
Data quality monitoring checks whether values in your tables meet expectations: NULL rates, duplicate counts, value distributions, referential integrity, and custom business rules. A pipeline can run on time, produce the right number of rows, and still be broken if 40% of a required column is NULL because the source changed its API contract.
Quality checks fall into two categories: statistical (is the NULL rate for this column within its historical range?) and rule-based (every order must have a positive amount, every user must have a valid email format). Statistical checks catch drift over time. Rule-based checks catch hard failures immediately. The best observability setups use both. Tools like Great Expectations and dbt tests encode quality checks as code that runs inside the pipeline, not as afterthoughts.
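A minimal sketch of both check types in Python. The history window, the 3-sigma threshold, and the email rule are illustrative assumptions:

```python
import re
import statistics

def null_rate_anomaly(current_rate, history, n_sigma=3.0):
    """Statistical check: is today's NULL rate outside its historical band?"""
    mean = statistics.mean(history)
    sigma = statistics.pstdev(history)
    return abs(current_rate - mean) > n_sigma * max(sigma, 1e-9)

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def order_violations(orders):
    """Rule-based check: hard business rules that must always hold."""
    return [
        o["id"] for o in orders
        if o["amount"] <= 0 or not EMAIL_RE.match(o["email"])
    ]

history = [0.01, 0.012, 0.011, 0.013, 0.01]  # last five days' NULL rates
assert not null_rate_anomaly(0.012, history)  # within normal drift
assert null_rate_anomaly(0.40, history)       # 40% NULL: alert fires

orders = [
    {"id": 1, "amount": 25.0, "email": "a@example.com"},
    {"id": 2, "amount": -5.0, "email": "b@example.com"},  # breaks the rule
]
# order_violations(orders) flags order 2
```

Note how the statistical check needs history to be meaningful, while the rule-based check fires on a single bad row; that is why the two complement each other.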
Volume monitoring tracks how many rows or bytes each table receives per load. A sudden 80% drop in row count usually means the source is broken. A sudden 10x spike might mean duplicate data or a schema change that expanded a one-to-many relationship. Volume anomalies are the most reliable early-warning signal because they catch problems before quality checks even run.
Track row counts over time and flag deviations beyond a configurable threshold (often 2 to 3 standard deviations from the rolling average). Simple implementations use a daily COUNT(*) comparison. Sophisticated implementations track volume at the partition level: did today's partition arrive with the expected row count? Volume monitoring is cheap to implement and catches a surprisingly large percentage of pipeline failures. Start here if you are building observability from scratch.
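The rolling-average approach can be sketched as follows; the window values and the 3-standard-deviation threshold are illustrative:

```python
import statistics

def volume_anomaly(today_count, recent_counts, n_sigma=3.0):
    """Flag today's row count if it deviates from the rolling window
    by more than n_sigma standard deviations."""
    mean = statistics.mean(recent_counts)
    sigma = statistics.pstdev(recent_counts)
    if sigma == 0:                       # perfectly stable history:
        return today_count != mean       # any change is an anomaly
    return abs(today_count - mean) > n_sigma * sigma

window = [98_000, 101_000, 99_500, 100_200, 100_800, 99_900, 100_600]
assert not volume_anomaly(100_300, window)  # normal daily variation
assert volume_anomaly(20_000, window)       # 80% drop: source broken?
assert volume_anomaly(1_000_000, window)    # 10x spike: duplicates?
```

The same function works at the partition level: pass the counts for the last seven daily partitions instead of whole-table counts.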
Schema monitoring detects changes in table structure: added columns, dropped columns, type changes, renamed fields. A single column rename in the source can cascade through 20 downstream tables if your pipeline does not catch it at the ingestion boundary. Schema monitoring provides the first line of defense against breaking changes.
There are two approaches: reactive (compare the current schema to the last known schema on each pipeline run and alert on differences) and proactive (use a schema registry that requires explicit approval before a breaking change can propagate). Reactive is easier to implement. Proactive is safer for production systems. Avro and Protobuf schemas with a Confluent-style registry enforce backward/forward compatibility rules at the serialization layer, catching breaks before data enters the pipeline.
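A reactive schema diff fits in a few lines of Python; the column names and types below are hypothetical:

```python
def diff_schemas(old, new):
    """Reactive schema check: compare the current schema to the last
    known one and report added, dropped, and type-changed columns."""
    added = sorted(set(new) - set(old))
    dropped = sorted(set(old) - set(new))
    retyped = sorted(c for c in set(old) & set(new) if old[c] != new[c])
    return {"added": added, "dropped": dropped, "retyped": retyped}

last_known = {"user_id": "BIGINT", "email": "VARCHAR", "signup_ts": "TIMESTAMP"}
current = {"user_id": "BIGINT", "email_address": "VARCHAR",
           "signup_ts": "VARCHAR"}  # renamed column plus a type change

changes = diff_schemas(last_known, current)
# A rename surfaces as one drop plus one add; alert on any non-empty field.
```

One design note: a reactive diff cannot distinguish a rename from a drop-and-add, which is one reason a proactive schema registry is safer for breaking changes.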
Lineage tracks how data flows from source to destination: which tables feed which transformations, which transformations produce which outputs, and which dashboards consume which tables. When a freshness alert fires, lineage tells you where the bottleneck is. When a quality issue appears in a Gold table, lineage traces it back to the specific Bronze or Silver table where the problem originated.
Column-level lineage is the gold standard: knowing not just that Table A feeds Table B, but that Table A's user_email column maps to Table B's email column after a LOWER() transformation. This level of granularity lets you do impact analysis before making changes (if I drop this column, what breaks?) and root cause analysis when something goes wrong. dbt builds lineage automatically from its SQL models. For non-dbt pipelines, tools like OpenLineage standardize lineage collection across different orchestrators and compute engines.
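Column-level lineage reduces to a directed graph, and impact analysis is a graph traversal. The sketch below uses a hypothetical edge list, not dbt's or OpenLineage's actual data model:

```python
from collections import defaultdict, deque

# Column-level edges: (table, column) -> downstream (table, column).
edges = defaultdict(list)
def add_edge(src, dst):
    edges[src].append(dst)

add_edge(("bronze.users", "user_email"), ("silver.users", "email"))
add_edge(("silver.users", "email"), ("gold.signups", "email"))
add_edge(("silver.users", "email"), ("gold.marketing", "contact_email"))

def impacted(node):
    """Impact analysis: every column downstream of node (BFS)."""
    seen, queue = set(), deque([node])
    while queue:
        for nxt in edges[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# "If I drop bronze.users.user_email, what breaks?"
hit = impacted(("bronze.users", "user_email"))
```

Reversing the edge direction gives root cause analysis: from a broken Gold column, the same traversal walks upstream toward the source.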
Data observability and data quality get conflated constantly. They overlap but serve different purposes. Data quality is a pillar within observability; observability is the broader system that also covers freshness, volume, schema, and lineage.
| Aspect | Data Observability | Data Quality |
|---|---|---|
| Scope | Full pipeline health: freshness, volume, schema, lineage, and quality | Data values only: null rates, duplicates, distributions, business rules |
| When it runs | Continuously, across every pipeline run | At specific checkpoints, usually after transformation |
| What it answers | Is the pipeline healthy? Where is the problem? | Does the data meet expectations? |
| Root cause | Uses lineage to trace issues to their source | Identifies the symptom but not always the cause |
| Typical tools | Monte Carlo, Bigeye, Metaplane, custom solutions | Great Expectations, dbt tests, Soda, custom SQL checks |
The observability space has matured rapidly since 2021. Tools fall into three categories: full-platform observability, quality-focused testing, and custom/open-source solutions. The right choice depends on your pipeline complexity, team size, and budget.
Full-platform observability tools are end-to-end platforms that monitor all five pillars: freshness, volume, schema, quality, and lineage. They connect to your warehouse, ingest metadata, and use ML-based anomaly detection to flag issues automatically. Typical deployment takes days to weeks. These platforms are strongest for organizations with hundreds of tables and complex lineage, where writing custom checks for every table is not feasible.
Quality-focused testing tools validate expectations against data: null rates, uniqueness, distributions, and custom SQL assertions. Great Expectations, dbt tests, and Soda fall into this category. These integrate directly into your pipeline code and run as part of the DAG. They are strong for quality but do not cover freshness, volume, schema, or lineage natively. Many teams use a quality testing tool inside their pipeline plus a platform tool for the broader observability layer.
Many teams build observability with SQL checks, cron jobs, and dashboards (Grafana, Datadog). Write a query to check each table's freshness, row count, and null rates. Store results in a metrics table. Build dashboards with alerting rules. Open-source options like Elementary (built on dbt) and OpenMetadata add lineage and anomaly detection. This approach is cheaper but requires ongoing maintenance. It works well for teams with strong engineering culture and moderate pipeline complexity.
Observability concepts appear in system design rounds. When an interviewer asks you to design a data pipeline, they expect you to address monitoring. Candidates who only describe the happy path and skip failure detection and debugging leave a weaker impression than those who build observability into their design from the start.
How to approach this
Start with volume checks at the ingestion layer: verify row counts per partition match expected ranges. Add freshness monitoring: set SLAs for each table based on downstream consumer needs. Add schema validation at the boundary between raw and cleaned layers. Add quality checks (null rates, value distributions) after the transformation step. Pipe all metrics to a central dashboard (Grafana, Datadog, or a custom solution) with alerting rules. Emphasize that observability is built into the pipeline, not bolted on after the fact.
How to approach this
Follow the lineage. Start from the dashboard, trace back to the Gold table it queries, check freshness (did it update on time?), check volume (did the row count drop?), check quality (is the revenue column populated?). If the Gold table looks normal, trace back to the Silver and Bronze layers. Check the source system. Often the root cause is one of three things: a source delivered late, a schema change broke a transformation, or a filter condition accidentally excluded valid data. Observability lets you narrow the search in minutes instead of hours.
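The triage walk described above can be sketched as an upstream traversal. The table names and check results below are hypothetical; in practice they would come from your freshness, volume, and quality metrics store:

```python
# Hypothetical lineage (each node's single upstream parent) and check results.
UPSTREAM = {
    "dash.revenue": "gold.revenue",
    "gold.revenue": "silver.orders",
    "silver.orders": "bronze.orders_raw",
}
STATUS = {
    "gold.revenue":      {"fresh": True,  "volume_ok": True,  "quality_ok": False},
    "silver.orders":     {"fresh": True,  "volume_ok": False, "quality_ok": False},
    "bronze.orders_raw": {"fresh": False, "volume_ok": False, "quality_ok": True},
}

def triage(start):
    """Walk upstream from a broken dashboard; the deepest unhealthy
    node on the path is the likely root cause."""
    node, root_cause = UPSTREAM.get(start), None
    while node:
        checks = STATUS.get(node, {})
        if not all(checks.values()):
            root_cause = node            # keep walking: prefer the deepest failure
        node = UPSTREAM.get(node)
    return root_cause

# triage("dash.revenue") points at bronze.orders_raw: the source delivered late.
```

The payoff of observability is exactly this: each hop is a metrics lookup instead of an ad hoc investigation.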
How to approach this
Store metadata for each table: expected update frequency, SLA threshold, and the query to check the max timestamp. Run a scheduled job (every 15 minutes) that checks each table against its SLA. Group tables by priority tier. Tier 1 (executive dashboards) gets a 30-minute SLA with PagerDuty alerts. Tier 2 (analyst-facing tables) gets a 2-hour SLA with Slack alerts. Tier 3 (internal staging) gets a daily check with email notifications. This tiered approach prevents alert fatigue while still catching critical issues.
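The tiered routing can be sketched as follows; the tiers, SLAs, and channels mirror the text but are illustrative configuration, not a prescribed setup:

```python
from datetime import timedelta

# Illustrative tier config; tune SLAs and channels to your org.
TIERS = {
    1: {"sla": timedelta(minutes=30), "channel": "pagerduty"},  # exec dashboards
    2: {"sla": timedelta(hours=2),    "channel": "slack"},      # analyst tables
    3: {"sla": timedelta(days=1),     "channel": "email"},      # internal staging
}

def route_alert(table, tier, lag):
    """Check a table's lag against its tier SLA; return (channel, message)
    for a breach, or None if the table is within SLA."""
    cfg = TIERS[tier]
    if lag <= cfg["sla"]:
        return None
    return cfg["channel"], f"{table} is {lag} behind (tier {tier} SLA {cfg['sla']})"

assert route_alert("gold.exec_kpis", 1, timedelta(minutes=10)) is None
alert = route_alert("gold.exec_kpis", 1, timedelta(minutes=45))
# A tier-1 breach routes to "pagerduty" and pages the on-call.
```

The scheduled job then just loops over the metadata table, computes each table's lag, and calls `route_alert`.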
How to approach this
Data quality checks whether values are correct. Data observability checks whether the entire pipeline is healthy. Quality is a subset of observability. Invest in quality first if you have a small number of high-value tables with known business rules. Invest in observability first if you have hundreds of tables, complex lineage, and frequent pipeline failures where the root cause is hard to find. Most mature data teams need both: quality checks inside transformations and observability monitoring across the full pipeline.
You do not need a six-figure observability platform on day one. Start with three checks that catch the majority of pipeline failures: freshness, volume, and null rates. Here is a practical starting point.
Step 1: Freshness table. Create a table that stores the last update timestamp for each monitored table. After every pipeline run, insert the current timestamp. A cron job checks every 15 minutes: if any table is past its SLA threshold, fire an alert. This takes an afternoon to build and catches the most common pipeline failure (data not arriving on time).
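A minimal sketch of the freshness table, using SQLite as a stand-in for the warehouse; table and column names are illustrative:

```python
import sqlite3
from datetime import datetime, timedelta

# In production this table lives in your warehouse and the check runs
# from cron every 15 minutes; SQLite keeps the sketch self-contained.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE freshness (
        table_name   TEXT PRIMARY KEY,
        last_updated TEXT NOT NULL,    -- ISO-8601 timestamp
        sla_minutes  INTEGER NOT NULL
    )
""")

def record_run(table, ts, sla_minutes=60):
    """Called at the end of every pipeline run."""
    con.execute(
        "INSERT INTO freshness (table_name, last_updated, sla_minutes) "
        "VALUES (?, ?, ?) ON CONFLICT(table_name) DO UPDATE "
        "SET last_updated = excluded.last_updated",
        (table, ts.isoformat(), sla_minutes),
    )

def stale_tables(now):
    """The cron check: tables whose last update is older than their SLA."""
    rows = con.execute("SELECT table_name, last_updated, sla_minutes FROM freshness")
    return [
        name for name, ts, sla in rows
        if now - datetime.fromisoformat(ts) > timedelta(minutes=sla)
    ]

now = datetime(2024, 1, 15, 15, 0)
record_run("gold.revenue", now - timedelta(minutes=30))  # fresh
record_run("gold.signups", now - timedelta(hours=3))     # past its 60-min SLA
```

Here `stale_tables(now)` would return only `gold.signups`, and the alerting step is a loop over that list.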
Step 2: Volume checks. After each pipeline run, record the row count for each target table. Compare against a 7-day rolling average. If the count deviates by more than 50% (calibrate this threshold per table), flag it. Volume anomalies are the second most common signal.
Step 3: Column-level quality. For your top 10 most important tables, add null rate checks on required columns and uniqueness checks on primary keys. Run these after the transformation step. Store results in a quality metrics table. Build a simple dashboard that shows quality trends over time.
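The two column-level checks can be written as plain SQL, sketched here against SQLite with toy data; the `orders` schema is hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, user_email TEXT)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "a@example.com"), (2, None),
     (3, "c@example.com"), (3, "c@example.com")],  # NULL email + duplicate PK
)

# NULL-rate check on a required column
null_rate, = con.execute(
    "SELECT AVG(CASE WHEN user_email IS NULL THEN 1.0 ELSE 0.0 END) FROM orders"
).fetchone()

# Uniqueness check on the primary key
dupes, = con.execute(
    "SELECT COUNT(*) - COUNT(DISTINCT order_id) FROM orders"
).fetchone()

results = {"null_rate": null_rate, "pk_duplicates": dupes}
# With this toy data, null_rate is 0.25 and pk_duplicates is 1; store
# these in a quality metrics table and alert when they breach thresholds.
```

Running the same two queries per table per day and appending the results gives you the quality-trend dashboard described above almost for free.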
Step 4: Lineage (later). Lineage is the hardest to build from scratch. If you use dbt, you get it for free. If you do not, consider OpenLineage as a standard format and build lineage collection into your orchestrator. Lineage becomes critical when you have enough tables that you cannot hold the full dependency graph in your head.
System design questions test your ability to build observable, fault-tolerant pipelines. Practice on DataDriven.