Data observability is the practice of monitoring the health of your data pipelines and the data flowing through them. It covers five pillars: freshness, quality, volume, schema, and lineage. When a table goes stale, a row count drops 80%, or a column type changes without warning, observability catches it before your stakeholders do.
This page covers what data observability is, the five pillars in detail, how it differs from data quality, the tools market, and how interviewers test observability concepts in system design rounds.
Source: DataDriven analysis of 1,042 verified data engineering interview rounds.
These five pillars were originally articulated by Barr Moses at Monte Carlo. They provide a structured framework for thinking about pipeline health. Each pillar answers a different question, and together they cover the full surface area of what can go wrong with your data.
Freshness measures how up-to-date your data is. If a dashboard shows yesterday's numbers at 3 PM, something is wrong. Freshness monitoring tracks when each table was last updated and alerts when it falls behind its expected schedule. This is the most common observability failure: the pipeline ran, it produced data, but it produced stale data because an upstream source delivered late.
In practice, freshness checks compare the maximum timestamp in a table against the current time. If the gap exceeds the SLA threshold (say, 2 hours for a table that should refresh hourly), an alert fires. Simple to implement, hard to calibrate. Set thresholds too tight and you get alert fatigue. Set them too loose and you miss real issues. The best teams set freshness SLAs per table based on downstream consumer requirements, not arbitrary defaults.
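As a concrete sketch, a minimal freshness check in Python might look like the following. The table names, timestamps, and the 2-hour SLA are illustrative assumptions, not tied to any particular tool:

```python
from datetime import datetime, timedelta, timezone

def check_freshness(max_timestamp, sla, now=None):
    """Return (is_fresh, lag) for a table whose newest row is max_timestamp.

    max_timestamp: latest update time observed in the table
    sla: maximum acceptable lag as a timedelta
    """
    now = now or datetime.now(timezone.utc)
    lag = now - max_timestamp
    return lag <= sla, lag

# A table expected to refresh hourly, checked against a 2-hour SLA:
now = datetime(2024, 1, 15, 15, 0, tzinfo=timezone.utc)
last = datetime(2024, 1, 15, 12, 30, tzinfo=timezone.utc)
fresh, lag = check_freshness(last, timedelta(hours=2), now=now)
# fresh is False: the table is 2.5 hours behind, past its SLA
```

In a real deployment, `max_timestamp` would come from a `SELECT MAX(updated_at)` query against the monitored table, and the SLA would be stored per table, as the text suggests.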
Data quality monitoring checks whether values in your tables meet expectations: NULL rates, duplicate counts, value distributions, referential integrity, and custom business rules. A pipeline can run on time, produce the right number of rows, and still be broken if 40% of a required column is NULL because the source changed its API contract.
Quality checks fall into two categories: statistical (is the NULL rate for this column within its historical range?) and rule-based (every order must have a positive amount, every user must have a valid email format). Statistical checks catch drift over time. Rule-based checks catch hard failures immediately. The best observability setups use both. Tools like Great Expectations and dbt tests encode quality checks as code that runs inside the pipeline, not as afterthoughts.
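A minimal sketch of both check types in Python. The history window, the 3-sigma threshold, and the email rule are illustrative assumptions:

```python
import re
import statistics

def null_rate_anomaly(current_rate, history, n_sigma=3.0):
    """Statistical check: is today's NULL rate outside its historical band?"""
    mean = statistics.mean(history)
    sigma = statistics.pstdev(history)
    return abs(current_rate - mean) > n_sigma * max(sigma, 1e-9)

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def order_violations(orders):
    """Rule-based check: hard business rules that must always hold."""
    return [
        o["id"] for o in orders
        if o["amount"] <= 0 or not EMAIL_RE.match(o["email"])
    ]

history = [0.01, 0.012, 0.011, 0.013, 0.01]  # last five days' NULL rates
assert not null_rate_anomaly(0.012, history)  # within normal drift
assert null_rate_anomaly(0.40, history)       # 40% NULL: alert fires

orders = [
    {"id": 1, "amount": 25.0, "email": "a@example.com"},
    {"id": 2, "amount": -5.0, "email": "b@example.com"},  # breaks the rule
]
# order_violations(orders) flags order 2
```

Note how the statistical check needs history to be meaningful, while the rule-based check fires on a single bad row; that is why the two complement each other.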
Volume monitoring tracks how many rows or bytes each table receives per load. A sudden 80% drop in row count usually means the source is broken. A sudden 10x spike might mean duplicate data or a schema change that expanded a one-to-many relationship. Volume anomalies are the most reliable early-warning signal because they catch problems before quality checks even run.
Track row counts over time and flag deviations beyond a configurable threshold (often 2 to 3 standard deviations from the rolling average). Simple implementations use a daily COUNT(*) comparison. Sophisticated implementations track volume at the partition level: did today's partition arrive with the expected row count? Volume monitoring is cheap to implement and catches a surprisingly large percentage of pipeline failures. Start here if you are building observability from scratch.
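The rolling-average approach can be sketched as follows; the window values and the 3-standard-deviation threshold are illustrative:

```python
import statistics

def volume_anomaly(today_count, recent_counts, n_sigma=3.0):
    """Flag today's row count if it deviates from the rolling window
    by more than n_sigma standard deviations."""
    mean = statistics.mean(recent_counts)
    sigma = statistics.pstdev(recent_counts)
    if sigma == 0:                       # perfectly stable history:
        return today_count != mean       # any change is an anomaly
    return abs(today_count - mean) > n_sigma * sigma

window = [98_000, 101_000, 99_500, 100_200, 100_800, 99_900, 100_600]
assert not volume_anomaly(100_300, window)  # normal daily variation
assert volume_anomaly(20_000, window)       # 80% drop: source broken?
assert volume_anomaly(1_000_000, window)    # 10x spike: duplicates?
```

The same function works at the partition level: pass the counts for the last seven daily partitions instead of whole-table counts.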
Schema monitoring detects changes in table structure: added columns, dropped columns, type changes, renamed fields. A single column rename in the source can cascade through 20 downstream tables if your pipeline does not catch it at the ingestion boundary. Schema monitoring provides the first line of defense against breaking changes.
There are two approaches: reactive (compare the current schema to the last known schema on each pipeline run and alert on differences) and proactive (use a schema registry that requires explicit approval before a breaking change can propagate). Reactive is easier to implement. Proactive is safer for production systems. Avro and Protobuf schemas with a Confluent-style registry enforce backward/forward compatibility rules at the serialization layer, catching breaks before data enters the pipeline.
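A reactive schema diff fits in a few lines of Python; the column names and types below are hypothetical:

```python
def diff_schemas(old, new):
    """Reactive schema check: compare the current schema to the last
    known one and report added, dropped, and type-changed columns."""
    added = sorted(set(new) - set(old))
    dropped = sorted(set(old) - set(new))
    retyped = sorted(c for c in set(old) & set(new) if old[c] != new[c])
    return {"added": added, "dropped": dropped, "retyped": retyped}

last_known = {"user_id": "BIGINT", "email": "VARCHAR", "signup_ts": "TIMESTAMP"}
current = {"user_id": "BIGINT", "email_address": "VARCHAR",
           "signup_ts": "VARCHAR"}  # renamed column plus a type change

changes = diff_schemas(last_known, current)
# A rename surfaces as one drop plus one add; alert on any non-empty field.
```

One design note: a reactive diff cannot distinguish a rename from a drop-and-add, which is one reason a proactive schema registry is safer for breaking changes.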
Lineage tracks how data flows from source to destination: which tables feed which transformations, which transformations produce which outputs, and which dashboards consume which tables. When a freshness alert fires, lineage tells you where the bottleneck is. When a quality issue appears in a Gold table, lineage traces it back to the specific Bronze or Silver table where the problem originated.
Column-level lineage is the gold standard: knowing not just that Table A feeds Table B, but that Table A's user_email column maps to Table B's email column after a LOWER() transformation. This level of granularity lets you do impact analysis before making changes (if I drop this column, what breaks?) and root cause analysis when something goes wrong. dbt builds lineage automatically from its SQL models. For non-dbt pipelines, tools like OpenLineage standardize lineage collection across different orchestrators and compute engines.
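Column-level lineage reduces to a directed graph, and impact analysis is a graph traversal. The sketch below uses a hypothetical edge list, not dbt's or OpenLineage's actual data model:

```python
from collections import defaultdict, deque

# Column-level edges: (table, column) -> downstream (table, column).
edges = defaultdict(list)
def add_edge(src, dst):
    edges[src].append(dst)

add_edge(("bronze.users", "user_email"), ("silver.users", "email"))
add_edge(("silver.users", "email"), ("gold.signups", "email"))
add_edge(("silver.users", "email"), ("gold.marketing", "contact_email"))

def impacted(node):
    """Impact analysis: every column downstream of node (BFS)."""
    seen, queue = set(), deque([node])
    while queue:
        for nxt in edges[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# "If I drop bronze.users.user_email, what breaks?"
hit = impacted(("bronze.users", "user_email"))
```

Reversing the edge direction gives root cause analysis: from a broken Gold column, the same traversal walks upstream toward the source.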
Data observability and data quality get conflated constantly. They overlap but serve different purposes. Data quality is a pillar within observability; observability is the broader system that also covers freshness, volume, schema, and lineage.
| Aspect | Data Observability | Data Quality |
|---|---|---|
| Scope | Full pipeline health: freshness, volume, schema, lineage, and quality | Data values only: null rates, duplicates, distributions, business rules |
| When it runs | Continuously, across every pipeline run | At specific checkpoints, usually after transformation |
| What it answers | Is the pipeline healthy? Where is the problem? | Does the data meet expectations? |
| Root cause | Uses lineage to trace issues to their source | Identifies the symptom but not always the cause |
| Typical tools | Monte Carlo, Bigeye, Metaplane, custom solutions | Great Expectations, dbt tests, Soda, custom SQL checks |
The observability space has matured rapidly since 2021. Tools fall into three categories: full-platform observability, quality-focused testing, and custom/open-source solutions. The right choice depends on your pipeline complexity, team size, and budget.
Full-platform observability tools are end-to-end platforms that monitor all five pillars: freshness, volume, schema, quality, and lineage. They connect to your warehouse, ingest metadata, and use ML-based anomaly detection to flag issues automatically. Typical deployment takes days to weeks. These platforms are strongest for organizations with hundreds of tables and complex lineage, where writing custom checks for every table is not feasible.
Quality-focused testing tools validate expectations against data: null rates, uniqueness, distributions, and custom SQL assertions. Great Expectations, dbt tests, and Soda fall into this category. These integrate directly into your pipeline code and run as part of the DAG. They are strong for quality but do not cover freshness, volume, schema, or lineage natively. Many teams use a quality testing tool inside their pipeline plus a platform tool for the broader observability layer.
Many teams build observability with SQL checks, cron jobs, and dashboards (Grafana, Datadog). Write a query to check each table's freshness, row count, and null rates. Store results in a metrics table. Build dashboards with alerting rules. Open-source options like Elementary (built on dbt) and OpenMetadata add lineage and anomaly detection. This approach is cheaper but requires ongoing maintenance. It works well for teams with strong engineering culture and moderate pipeline complexity.
Observability concepts appear in system design rounds. When an interviewer asks you to design a data pipeline, they expect you to address monitoring. Candidates who only describe the happy path and skip failure detection and debugging leave a weaker impression than those who build observability into their design from the start.
How to approach this
Start with volume checks at the ingestion layer: verify row counts per partition match expected ranges. Add freshness monitoring: set SLAs for each table based on downstream consumer needs. Add schema validation at the boundary between raw and cleaned layers. Add quality checks (null rates, value distributions) after the transformation step. Pipe all metrics to a central dashboard (Grafana, Datadog, or a custom solution) with alerting rules. Emphasize that observability is built into the pipeline, not bolted on after the fact.
How to approach this
Follow the lineage. Start from the dashboard, trace back to the Gold table it queries, check freshness (did it update on time?), check volume (did the row count drop?), check quality (is the revenue column populated?). If the Gold table looks normal, trace back to the Silver and Bronze layers. Check the source system. Often the root cause is one of three things: a source delivered late, a schema change broke a transformation, or a filter condition accidentally excluded valid data. Observability lets you narrow the search in minutes instead of hours.
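The triage walk described above can be sketched as an upstream traversal. The table names and check results below are hypothetical; in practice they would come from your freshness, volume, and quality metrics store:

```python
# Hypothetical lineage (each node's single upstream parent) and check results.
UPSTREAM = {
    "dash.revenue": "gold.revenue",
    "gold.revenue": "silver.orders",
    "silver.orders": "bronze.orders_raw",
}
STATUS = {
    "gold.revenue":      {"fresh": True,  "volume_ok": True,  "quality_ok": False},
    "silver.orders":     {"fresh": True,  "volume_ok": False, "quality_ok": False},
    "bronze.orders_raw": {"fresh": False, "volume_ok": False, "quality_ok": True},
}

def triage(start):
    """Walk upstream from a broken dashboard; the deepest unhealthy
    node on the path is the likely root cause."""
    node, root_cause = UPSTREAM.get(start), None
    while node:
        checks = STATUS.get(node, {})
        if not all(checks.values()):
            root_cause = node            # keep walking: prefer the deepest failure
        node = UPSTREAM.get(node)
    return root_cause

# triage("dash.revenue") points at bronze.orders_raw: the source delivered late.
```

The payoff of observability is exactly this: each hop is a metrics lookup instead of an ad hoc investigation.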
How to approach this
Store metadata for each table: expected update frequency, SLA threshold, and the query to check the max timestamp. Run a scheduled job (every 15 minutes) that checks each table against its SLA. Group tables by priority tier. Tier 1 (executive dashboards) gets a 30-minute SLA with PagerDuty alerts. Tier 2 (analyst-facing tables) gets a 2-hour SLA with Slack alerts. Tier 3 (internal staging) gets a daily check with email notifications. This tiered approach prevents alert fatigue while still catching critical issues.
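The tiered routing can be sketched as follows; the tiers, SLAs, and channels mirror the text but are illustrative configuration, not a prescribed setup:

```python
from datetime import timedelta

# Illustrative tier config; tune SLAs and channels to your org.
TIERS = {
    1: {"sla": timedelta(minutes=30), "channel": "pagerduty"},  # exec dashboards
    2: {"sla": timedelta(hours=2),    "channel": "slack"},      # analyst tables
    3: {"sla": timedelta(days=1),     "channel": "email"},      # internal staging
}

def route_alert(table, tier, lag):
    """Check a table's lag against its tier SLA; return (channel, message)
    for a breach, or None if the table is within SLA."""
    cfg = TIERS[tier]
    if lag <= cfg["sla"]:
        return None
    return cfg["channel"], f"{table} is {lag} behind (tier {tier} SLA {cfg['sla']})"

assert route_alert("gold.exec_kpis", 1, timedelta(minutes=10)) is None
alert = route_alert("gold.exec_kpis", 1, timedelta(minutes=45))
# A tier-1 breach routes to "pagerduty" and pages the on-call.
```

The scheduled job then just loops over the metadata table, computes each table's lag, and calls `route_alert`.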
How to approach this
Data quality checks whether values are correct. Data observability checks whether the entire pipeline is healthy. Quality is a subset of observability. Invest in quality first if you have a small number of high-value tables with known business rules. Invest in observability first if you have hundreds of tables, complex lineage, and frequent pipeline failures where the root cause is hard to find. Most mature data teams need both: quality checks inside transformations and observability monitoring across the full pipeline.
You do not need a six-figure observability platform on day one. Start with three checks that catch the majority of pipeline failures: freshness, volume, and null rates. Here is a practical starting point.
Step 1: Freshness table. Create a table that stores the last update timestamp for each monitored table. After every pipeline run, insert the current timestamp. A cron job checks every 15 minutes: if any table is past its SLA threshold, fire an alert. This takes an afternoon to build and catches the most common pipeline failure (data not arriving on time).
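A minimal sketch of the freshness table, using SQLite as a stand-in for the warehouse; table and column names are illustrative:

```python
import sqlite3
from datetime import datetime, timedelta

# In production this table lives in your warehouse and the check runs
# from cron every 15 minutes; SQLite keeps the sketch self-contained.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE freshness (
        table_name   TEXT PRIMARY KEY,
        last_updated TEXT NOT NULL,    -- ISO-8601 timestamp
        sla_minutes  INTEGER NOT NULL
    )
""")

def record_run(table, ts, sla_minutes=60):
    """Called at the end of every pipeline run."""
    con.execute(
        "INSERT INTO freshness (table_name, last_updated, sla_minutes) "
        "VALUES (?, ?, ?) ON CONFLICT(table_name) DO UPDATE "
        "SET last_updated = excluded.last_updated",
        (table, ts.isoformat(), sla_minutes),
    )

def stale_tables(now):
    """The cron check: tables whose last update is older than their SLA."""
    rows = con.execute("SELECT table_name, last_updated, sla_minutes FROM freshness")
    return [
        name for name, ts, sla in rows
        if now - datetime.fromisoformat(ts) > timedelta(minutes=sla)
    ]

now = datetime(2024, 1, 15, 15, 0)
record_run("gold.revenue", now - timedelta(minutes=30))  # fresh
record_run("gold.signups", now - timedelta(hours=3))     # past its 60-min SLA
```

Here `stale_tables(now)` would return only `gold.signups`, and the alerting step is a loop over that list.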
Step 2: Volume checks. After each pipeline run, record the row count for each target table. Compare against a 7-day rolling average. If the count deviates by more than 50% (calibrate this threshold per table), flag it. Volume anomalies are the second most common signal.
Step 3: Column-level quality. For your top 10 most important tables, add null rate checks on required columns and uniqueness checks on primary keys. Run these after the transformation step. Store results in a quality metrics table. Build a simple dashboard that shows quality trends over time.
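The two column-level checks can be written as plain SQL, sketched here against SQLite with toy data; the `orders` schema is hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, user_email TEXT)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "a@example.com"), (2, None),
     (3, "c@example.com"), (3, "c@example.com")],  # NULL email + duplicate PK
)

# NULL-rate check on a required column
null_rate, = con.execute(
    "SELECT AVG(CASE WHEN user_email IS NULL THEN 1.0 ELSE 0.0 END) FROM orders"
).fetchone()

# Uniqueness check on the primary key
dupes, = con.execute(
    "SELECT COUNT(*) - COUNT(DISTINCT order_id) FROM orders"
).fetchone()

results = {"null_rate": null_rate, "pk_duplicates": dupes}
# With this toy data, null_rate is 0.25 and pk_duplicates is 1; store
# these in a quality metrics table and alert when they breach thresholds.
```

Running the same two queries per table per day and appending the results gives you the quality-trend dashboard described above almost for free.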
Step 4: Lineage (later). Lineage is the hardest to build from scratch. If you use dbt, you get it for free. If you do not, consider OpenLineage as a standard format and build lineage collection into your orchestrator. Lineage becomes critical when you have enough tables that you cannot hold the full dependency graph in your head.
System design questions test your ability to build observable, fault-tolerant pipelines. Practice on DataDriven.