Data Lineage: Interview Questions & Deep Dive (2026)

Data lineage tracks where data comes from, how it transforms, and where it goes. It creates a dependency graph across tables, columns, and jobs so engineers can debug issues, plan migrations, and prove compliance without guessing.

What this guide actually says

Data lineage answers three questions: where did this data come from, what happened to it, and where did it go. Table-level lineage is the floor; column-level is what makes lineage useful in practice. Without lineage, debugging a wrong dashboard is grep + Slack. With it, it's a deterministic trace. Most lineage projects fail not on technology but on automation: manual lineage goes stale within weeks.

What is data lineage

1. The core idea

Data lineage tracks where data comes from, how it moves, and what transforms it along the way. Think of it as a directed graph: sources are upstream nodes, destinations are downstream nodes, edges represent transformations. When a dashboard number looks wrong, lineage tells you which tables, jobs, and transformations touched that data so you can trace the problem to its origin.

2. Upstream vs downstream

Upstream lineage answers 'where did this data come from?' Starting from a reporting table, you trace backward through transformations to raw source systems. Downstream lineage answers 'what breaks if this source changes?' Starting from a source table, you trace forward to every dashboard, model, and export that depends on it. Use both terms precisely.
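A minimal sketch of upstream and downstream tracing over a lineage graph, assuming table-level edges are already available as (source, destination) pairs. The table names and the `upstream` / `downstream` helpers are hypothetical and only illustrate the two traversal directions.

```python
from collections import defaultdict, deque

# Hypothetical table-level edges: (upstream, downstream).
EDGES = [
    ("raw.orders", "staging.orders"),
    ("raw.customers", "staging.customers"),
    ("staging.orders", "marts.revenue_summary"),
    ("staging.customers", "marts.revenue_summary"),
    ("marts.revenue_summary", "dashboard.exec_revenue"),
]

def build_index(edges):
    """Return (parents, children) adjacency maps for the lineage graph."""
    parents, children = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)
        children[src].add(dst)
    return parents, children

def reachable(start, adjacency):
    """Breadth-first walk returning every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

PARENTS, CHILDREN = build_index(EDGES)

def upstream(node):
    """Everything the node depends on, directly or transitively."""
    return reachable(node, PARENTS)

def downstream(node):
    """Everything that depends on the node, directly or transitively."""
    return reachable(node, CHILDREN)

print(sorted(upstream("dashboard.exec_revenue")))
print(sorted(downstream("raw.orders")))
```

The same walk serves both questions; only the direction of the adjacency map changes.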

3. Column-level vs table-level

Table-level lineage shows which tables feed into which other tables. Column-level lineage is more granular: it shows that revenue_summary.total_revenue is derived from orders.unit_price multiplied by orders.quantity. Column-level is harder to capture but far more useful for debugging and impact analysis.

Types of lineage

Granularity exists on a spectrum. Knowing these three levels and their trade-offs shows interviewers you understand the practical reality.

Easiest

Table-level lineage

Tracks dependencies between tables and views. The simplest form: 'table B reads from table A.' Captured by parsing SQL or inspecting DAG definitions. dbt provides this natively through ref(); Airflow natively knows task dependencies and typically relies on a lineage integration such as OpenLineage to report the datasets each task reads and writes. Strength: easy to capture automatically. Weakness: limited diagnostic value for column-level issues.
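As a rough illustration of capturing table-level lineage by parsing SQL, the sketch below pulls table names that follow FROM or JOIN. It is deliberately naive: the regular expression, the `table_edges` helper, and the sample query are assumptions for illustration, and the pattern does not handle CTEs, subqueries, or quoted identifiers. A real SQL parser is the better choice in production.

```python
import re

# Naive pattern: the identifier that follows FROM or JOIN.
# Assumes no CTEs, no subqueries in FROM, and no quoted identifiers.
TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)

def table_edges(target, sql):
    """Return sorted (source_table, target) edges for one model's SQL."""
    sources = {m.group(1).lower() for m in TABLE_REF.finditer(sql)}
    return sorted((src, target) for src in sources)

sql = """
    SELECT c.region, SUM(o.unit_price * o.quantity) AS total_revenue
    FROM staging.orders AS o
    LEFT JOIN staging.customers AS c ON o.customer_id = c.customer_id
    GROUP BY c.region
"""

print(table_edges("marts.revenue_summary", sql))
```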

Most useful

Column-level lineage

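Tracks how individual columns flow through transformations. Shows that report.total_revenue comes from SUM(orders.unit_price * orders.quantity). Requires SQL parsing or runtime instrumentation. SQLGlot can derive it from SQL text, OpenLineage carries it in a column-lineage dataset facet populated by integrations such as Spark, and dbt exposes it in its commercial platform (open-source dbt Core tracks model-level dependencies). Strength: precise enough for debugging and impact analysis. Weakness: harder to capture, especially across languages.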
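To make the column-level idea concrete, here is a small sketch that resolves a derived column back to the source-system columns it depends on. The `COLUMN_SOURCES` mapping and the column names are hypothetical; in practice the mapping would come from a SQL parser or a runtime integration, and the sketch assumes the mapping has no cycles.

```python
# Hypothetical column-level lineage: each derived column maps to the columns it reads.
COLUMN_SOURCES = {
    "report.total_revenue": ["marts.revenue_summary.total_revenue"],
    "marts.revenue_summary.total_revenue": [
        "staging.orders.unit_price",
        "staging.orders.quantity",
    ],
    "staging.orders.unit_price": ["raw.orders.unit_price"],
    "staging.orders.quantity": ["raw.orders.qty"],
}

def root_columns(column, sources=COLUMN_SOURCES):
    """Return the source columns a column ultimately derives from (acyclic input assumed)."""
    parents = sources.get(column)
    if not parents:
        # No recorded parents: treat the column as a source-system column.
        return {column}
    roots = set()
    for parent in parents:
        roots |= root_columns(parent, sources)
    return roots

print(sorted(root_columns("report.total_revenue")))
```

Note that the rename from `qty` to `quantity` between layers is invisible at table level but explicit here.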

Most complete

Transformation-level lineage

Captures the exact logic applied at each step. Knows that a column was filtered (WHERE status = 'active'), aggregated (SUM), or joined (LEFT JOIN on customer_id). Enables answering 'why is this row missing?' by showing which filter excluded it. Strength: most complete picture. Weakness: few tools provide this fully; requires deep SQL parsing.
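The 'why is this row missing?' question can be sketched as follows, assuming each recorded step carries its operation type and, for filters, an evaluable predicate. The `STEPS` structure and `first_excluding_filter` helper are hypothetical; real tools would derive the predicates from parsed SQL.

```python
# Hypothetical transformation steps recorded for one pipeline, in execution order.
STEPS = [
    {"step": "stg_orders", "op": "filter", "desc": "status = 'active'",
     "predicate": lambda row: row["status"] == "active"},
    {"step": "stg_orders", "op": "filter", "desc": "quantity > 0",
     "predicate": lambda row: row["quantity"] > 0},
    {"step": "revenue_summary", "op": "aggregate",
     "desc": "SUM(unit_price * quantity)", "predicate": None},
]

def first_excluding_filter(row, steps=STEPS):
    """Return a description of the first filter that drops the row, else None."""
    for step in steps:
        if step["op"] == "filter" and not step["predicate"](row):
            return f"{step['step']}: WHERE {step['desc']}"
    return None

missing_row = {"status": "cancelled", "quantity": 2, "unit_price": 9.5}
print(first_excluding_filter(missing_row))
```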

Why interviewers care about lineage

Not academic. Real operational problems interviewers have dealt with firsthand.

Debugging production issues

When a metric drops 40% overnight, the first question is 'what changed upstream?' Without lineage, engineers grep through DAG code and Slack threads. With lineage, they trace the metric back through every transformation to the source that broke.

Impact analysis before changes

Before deprecating a column or changing a table schema, you need to know every downstream consumer. Lineage provides this automatically. Without it, you send a Slack message asking 'does anyone use this table?' and hope for the best.

Regulatory compliance

Regulations such as GDPR, CCPA, and HIPAA make organizations accountable for how personal data is processed, which in practice means documenting where it flows. Lineage metadata answers audit questions like 'which systems store customer email addresses?' and 'can you prove this PII was deleted from all downstream tables?'

Governance and trust

Data consumers need to trust the numbers they see. Lineage connects a dashboard metric to its exact SQL definition and source tables. When a VP asks 'where does this revenue number come from?', lineage provides a verifiable answer.

Real lineage systems and tools

Know the ecosystem. Name the tool, but more importantly, explain where it fits in the architecture.

Open Standard

OpenLineage

An open specification for lineage metadata events. Defines a JSON schema for run, job, and dataset events. Supported by Airflow, Spark, dbt, Flink. The lingua franca for interoperable lineage.

Open Source

Marquez

Reference implementation of OpenLineage. Stores lineage events and provides a REST API and UI for exploring lineage graphs. Lightweight, easy to self-host.

Open Source

DataHub

Metadata platform that originated at LinkedIn and is commercially backed by Acryl Data. Captures lineage alongside data discovery, quality, and ownership. Modern architecture with a GraphQL API.

Open Source

Apache Atlas

Metadata governance platform originally built for Hadoop. Captures lineage from Hive, Spark, Hadoop ecosystem tools. Common in legacy on-prem environments.

Transformation

dbt

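Captures model-level (table-level) lineage as a byproduct of compiling SQL models: the ref() and source() functions create explicit dependencies. Column-level lineage is available in dbt's commercial platform or by parsing the compiled SQL with an external tool. The most common lineage source in modern data stacks.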

Databricks

Unity Catalog

Databricks governance layer capturing lineage across notebooks, jobs, SQL queries automatically. Column-level lineage for Spark and SQL workloads.

Azure

Microsoft Purview

Azure unified data governance service. Scans data sources across Azure, AWS, on-prem to build lineage maps. Integrates with Synapse, Data Factory, Power BI.

Enterprise

Collibra

Enterprise data intelligence platform combining lineage with catalog, glossary, policy management. Targeted at large enterprises with complex governance requirements.

How interviewers test this

Scenarios that require you to apply lineage thinking, not definitions.

Broken dashboard

Revenue on the executive dashboard dropped 30% yesterday. Walk me through how you investigate.

Start with lineage: trace the metric back to source tables, check each transformation for changes, identify which upstream job or schema change caused the drop. Candidates who jump straight to 'check the SQL' miss the systematic approach lineage enables.

Design

How would you build lineage tracking for a platform with 500 dbt models and 200 Spark jobs?

OpenLineage as the event format, a metadata store (Marquez or DataHub), integration points with the scheduler (Airflow emitting lineage events), column-level parsing via SQL analysis. Bonus for discussing Python transformation lineage.

Schema migration

We need to rename customer_id to cust_id in our source system. What's the process?

Use lineage to find every downstream table, view, dashboard, ML model that references customer_id. Then plan the migration with versioning, backward compatibility, or coordinated changes.

Compliance

How do you prove to an auditor that customer email addresses aren't stored in any analytics tables?

Lineage shows every downstream destination of the email column. Combined with classification tags, you verify PII columns are masked, hashed, or excluded from analytics layers.
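One way to sketch this check is to propagate a PII tag along column-level edges and flag anything tagged that lands in the analytics layer. The edge list, the `SAFE_TRANSFORMS` set, and the `analytics.` prefix are assumptions for illustration; whether a transform such as hashing is sufficient is a policy decision, not something the graph can decide.

```python
from collections import deque

# Hypothetical column-level edges: (source column, derived column, transform).
EDGES = [
    ("raw.customers.email", "staging.customers.email", "copy"),
    ("staging.customers.email", "analytics.customers.email_hash", "sha256"),
    ("staging.customers.email", "analytics.marketing_list.email", "copy"),
    ("raw.customers.region", "analytics.customers.region", "copy"),
]

# Assumption: these transforms are treated as removing the PII classification.
SAFE_TRANSFORMS = {"sha256", "mask", "drop"}

def exposed_pii(pii_columns, edges, layer_prefix="analytics."):
    """Propagate PII tags downstream; return tagged columns inside the given layer."""
    tagged, queue = set(pii_columns), deque(pii_columns)
    while queue:
        col = queue.popleft()
        for src, dst, transform in edges:
            if src == col and transform not in SAFE_TRANSFORMS and dst not in tagged:
                tagged.add(dst)
                queue.append(dst)
    return sorted(c for c in tagged if c.startswith(layer_prefix))

print(exposed_pii(["raw.customers.email"], EDGES))
```

An empty result is the evidence an auditor wants; a non-empty one is the list of tables to fix.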

Legacy modernization

We have a pipeline with no documentation: raw files, stored procedures, reporting tables. What would you change?

Incremental lineage adoption: start with SQL parsing for table-level lineage, add OpenLineage events to the orchestrator, build a catalog UI, then iterate toward column-level granularity. Interviewers reward a pragmatic rollout plan over a list of tool names.

Interview questions with guidance

Ten questions covering lineage concepts, tools, implementation, and governance.

What is data lineage and why does it matter?

Tracks origin, movement, and transformation of data across systems. Matters for debugging (trace a wrong metric to root cause), impact analysis (know what breaks before changing), compliance (prove where PII flows), trust (connect dashboard metrics to their source definitions). Anchor your answer to a concrete scenario.

Column-level vs table-level lineage?

Table-level tracks which tables depend on which others. Column-level tracks how individual columns flow through transformations, including specific operations (SUM, JOIN, FILTER). Column-level is harder to capture but significantly more useful for debugging and impact analysis.

How would you implement lineage in a dbt project?

dbt captures table-level lineage natively: each ref() or source() call in a model creates an explicit dependency, recorded in the manifest.json artifact. For column-level, parse the compiled SQL (for example with SQLGlot) or use the column-level lineage in dbt's commercial platform. Expose the graph via dbt docs, ingest the dbt artifacts into a catalog such as DataHub, or emit OpenLineage events through the OpenLineage dbt integration.

How does OpenLineage work?

Defines a standard JSON schema for lineage events. Each event identifies a run (one execution) and its job (the transformation), lists input and output datasets, and can attach facets (schema, row count, SQL text). Integrations emit events at job start and completion. A backend like Marquez collects them. Key insight: decouples lineage emission from storage.
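A hedged sketch of what a start/complete event pair might look like, built with the standard library only. The top-level field names follow the OpenLineage run event structure as commonly documented; the `make_run_event` helper, the `demo` namespace, the job and dataset names, and the `producer` and `schemaURL` values are placeholders, and no facets are attached. A real integration would use an OpenLineage client rather than hand-built dicts.

```python
import json
import uuid
from datetime import datetime, timezone

def make_run_event(event_type, run_id, job_name, inputs, outputs, namespace="demo"):
    """Build a dict shaped like an OpenLineage run event (no facets attached)."""
    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": n} for n in inputs],
        "outputs": [{"namespace": namespace, "name": n} for n in outputs],
        # Placeholder values: a real emitter sets its own producer URI and the
        # schemaURL of the spec version it targets.
        "producer": "https://example.com/lineage-sketch",
        "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent",
    }

# Both events share one runId so a backend can pair start with completion.
run_id = str(uuid.uuid4())
start = make_run_event("START", run_id, "build_revenue_summary",
                       ["staging.orders"], ["marts.revenue_summary"])
complete = make_run_event("COMPLETE", run_id, "build_revenue_summary",
                          ["staging.orders"], ["marts.revenue_summary"])

print(json.dumps(start, indent=2))
```

Because the payload is plain JSON, the emitting job does not need to know which backend stores it.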

How do you handle lineage for Python or Spark transformations?

For PySpark, OpenLineage has a Spark integration that inspects the query plan at runtime and emits lineage events. For arbitrary Python (pandas, custom scripts), lineage is harder. Options: instrument DataFrame operations, require explicit input/output declarations, or parse the Python AST. Acknowledge that Python lineage is unsolved at full generality.

A column is being deprecated. Walk me through the process.

Query lineage for every downstream consumer. Notify owners. Add the new column alongside the old one (backward compatibility). Migrate consumers. Verify via lineage that no active references remain. Drop the column. Lineage converts this from hope-based to verifiable.
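The verification step can be sketched as a simple gate, assuming lineage can report which columns each consumer currently reads. The `READS` snapshot, the consumer names, and the `safe_to_drop` helper are hypothetical.

```python
# Hypothetical lineage snapshot: consumer -> set of columns it currently reads.
READS = {
    "marts.revenue_summary": {"orders.customer_id", "orders.unit_price"},
    "ml.churn_features": {"orders.cust_id"},
    "dashboard.exec_revenue": {"marts.revenue_summary.total_revenue"},
}

def remaining_consumers(column, reads):
    """Return the consumers that still reference the column."""
    return sorted(name for name, cols in reads.items() if column in cols)

def safe_to_drop(column, reads):
    """A column is safe to drop only when no consumer references it."""
    return not remaining_consumers(column, reads)

print(remaining_consumers("orders.customer_id", READS))
print(safe_to_drop("orders.customer_id", READS))

# After the last consumer migrates to the new column, the gate opens.
READS["marts.revenue_summary"] = {"orders.cust_id", "orders.unit_price"}
print(safe_to_drop("orders.customer_id", READS))
```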

How does lineage relate to data quality?

Lineage is the backbone of root cause analysis for quality issues. When a quality check fails, lineage tells you which upstream transformation or source caused the problem. Without lineage, quality alerts are alarms with no investigation path. Detect with quality checks, diagnose with lineage.
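'Detect with quality checks, diagnose with lineage' can be illustrated with a short upstream walk: starting from a failing node, follow only failing parents until none remain. The `PARENTS` map, the `CHECK_PASSED` results, and the `root_causes` helper are hypothetical, and the sketch assumes an acyclic graph with one check result per node.

```python
# Hypothetical lineage: node -> direct upstream nodes.
PARENTS = {
    "dashboard.exec_revenue": ["marts.revenue_summary"],
    "marts.revenue_summary": ["staging.orders", "staging.customers"],
    "staging.orders": ["raw.orders"],
    "staging.customers": ["raw.customers"],
}

# Hypothetical latest quality-check result per node (True = passed).
CHECK_PASSED = {
    "dashboard.exec_revenue": False,
    "marts.revenue_summary": False,
    "staging.orders": False,
    "staging.customers": True,
    "raw.orders": True,
    "raw.customers": True,
}

def root_causes(node, parents, passed):
    """Walk upstream from a failing node; return failing nodes with no failing parent."""
    failing_parents = [p for p in parents.get(node, []) if not passed.get(p, True)]
    if not failing_parents:
        return {node}
    causes = set()
    for parent in failing_parents:
        causes |= root_causes(parent, parents, passed)
    return causes

print(root_causes("dashboard.exec_revenue", PARENTS, CHECK_PASSED))
```

Here the raw table passes its checks while its staging model fails, which points the investigation at the staging transformation rather than the source.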

What is active vs passive lineage collection?

Passive: captured by parsing SQL or DAG definitions. Shows intended data flow but may miss runtime behavior. Active: captured during job execution by instrumenting the engine (Spark listener, Airflow OpenLineage plugin). Reflects actual data flow. The strongest systems combine both.
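Combining both sources can be as simple as a set comparison. In this sketch the `declared` edges stand in for what a static parser found and the `observed` edges for what a runtime listener reported; both sets and the `reconcile` helper are hypothetical.

```python
# Declared edges, as a static parser might extract them from SQL or DAG code.
declared = {
    ("staging.orders", "marts.revenue_summary"),
    ("staging.customers", "marts.revenue_summary"),
}

# Observed edges, as a runtime listener might report them for recent runs.
observed = {
    ("staging.orders", "marts.revenue_summary"),
    ("tmp.manual_fixes", "marts.revenue_summary"),
}

def reconcile(declared, observed):
    """Split edges into confirmed, declared-only, and observed-only groups."""
    return {
        "confirmed": sorted(declared & observed),
        "declared_only": sorted(declared - observed),  # intended but not seen at runtime
        "observed_only": sorted(observed - declared),  # seen at runtime but undeclared
    }

for group, edges in reconcile(declared, observed).items():
    print(group, edges)
```

The observed-only group is usually the most interesting: it surfaces dependencies nobody wrote down.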

How would you evaluate lineage tools?

Granularity (column vs table), coverage (which engines and languages), integration (orchestrator, warehouse, BI tools), freshness (real-time or batch collection), scalability, cost. Consider build vs buy: OpenLineage + Marquez vs Collibra or Atlan.

How does lineage support data mesh?

In a mesh, domain teams own their data products. Lineage provides cross-domain visibility that prevents the mesh from becoming a maze. When team A changes a schema, lineage shows which products in B, C, D are affected. Lineage metadata is a key component of a data product contract alongside schema and SLAs.
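Cross-domain visibility can be sketched by grouping downstream data products by owning domain. The `CHILDREN` map and product names are hypothetical, and the sketch assumes the domain is the prefix before the first dot, which is a naming convention rather than anything a lineage tool guarantees.

```python
from collections import defaultdict, deque

# Hypothetical data products: product -> direct downstream products.
CHILDREN = {
    "sales.orders": ["finance.revenue", "marketing.attribution"],
    "finance.revenue": ["finance.board_report"],
    "marketing.attribution": ["growth.ltv_model"],
}

def affected_by_domain(product, children):
    """Group every downstream data product by its owning domain (name prefix)."""
    seen, queue = set(), deque([product])
    while queue:
        for nxt in children.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    grouped = defaultdict(list)
    for name in sorted(seen):
        grouped[name.split(".", 1)[0]].append(name)
    return dict(grouped)

print(affected_by_domain("sales.orders", CHILDREN))
```

The grouped output doubles as a notification list: one entry per domain team that needs to hear about the schema change.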

Common interview mistakes

Confusing lineage with a data catalog

A catalog describes what data exists (schema, descriptions, owners). Lineage describes how data flows (origins, transformations, dependencies). Complementary but solve different problems.

Only mentioning table-level lineage

Table-level is the minimum. Column-level is what makes lineage useful in practice. If you only discuss table-level, interviewers may conclude you have surface knowledge.

Treating lineage as a one-time documentation effort

Lineage must be automated and continuously updated. Manual documentation becomes stale within weeks. Instrument pipelines to emit lineage events automatically.

Naming tools without explaining the architecture

'We use DataHub' is not an answer. Explain the flow: pipelines emit OpenLineage events, metadata store indexes them, UI renders the graph, teams query lineage programmatically.

Ignoring lineage for non-SQL transformations

Many pipelines include Python or Spark code that doesn't go through a SQL parser. Acknowledging this gap and describing mitigation strategies shows real-world experience.

Frequently asked questions

What is data lineage?

The record of where data comes from, how it transforms, and where it goes. Creates a graph of dependencies across your data platform, enabling debugging, impact analysis, and compliance auditing.

What are the best data lineage tools?

Open source: OpenLineage (standard), Marquez (backend), DataHub (full platform). Cloud native: Unity Catalog (Databricks), Purview (Azure). Enterprise: Collibra, Atlan. dbt provides lineage as a built-in feature.

What is column-level lineage?

Tracks how individual columns flow through transformations. Shows that output_table.total_revenue is derived from SUM(source_table.unit_price * source_table.quantity). More granular and more useful than table-level lineage.

How is lineage different from a data catalog?

A catalog describes what data exists: schemas, descriptions, owners, tags. Lineage describes how data flows: origins, transformations, dependencies. Most modern platforms combine both capabilities.

What is OpenLineage?

An open standard for lineage metadata events. Defines a JSON schema that pipelines emit at job start and completion. Supported by Airflow, Spark, dbt, Flink. Decouples lineage collection from storage.

How does lineage help with GDPR?

GDPR requires documenting where personal data is processed. Lineage shows every downstream table and system that receives PII columns. Combined with data classification, it proves PII is handled correctly.

Can you have lineage without dbt?

Yes. OpenLineage integrations exist for Airflow, Spark, Flink, and other engines. You can also parse SQL from any tool using SQLGlot. dbt makes lineage easy, but it's not the only path.
