
Data Lineage: What Interviewers Test

Data lineage tracks where data comes from, how it transforms, and where it goes. It creates a dependency graph across tables, columns, and jobs so engineers can debug issues, plan migrations, and prove compliance without guessing.

According to Gartner, organizations with active metadata management (including lineage) reduce data incident resolution time by up to 60%. In a 2025 dbt Labs survey, 72% of data teams cited lineage as their top priority for improving data trust.

What Is Data Lineage

Data lineage answers three questions: where did this data come from, what happened to it, and where did it go.

1. The Core Idea

Data lineage tracks where data comes from, how it moves, and what transforms it along the way. Think of it as a directed graph: sources are upstream nodes, destinations are downstream nodes, and edges represent transformations. When a dashboard number looks wrong, lineage tells you which tables, jobs, and transformations touched that data so you can trace the problem to its origin.

2. Upstream vs Downstream

Upstream lineage answers 'where did this data come from?' Starting from a reporting table, you trace backward through transformations to raw source systems. Downstream lineage answers 'what breaks if this source changes?' Starting from a source table, you trace forward to every dashboard, model, and export that depends on it. Interviewers expect you to use both terms precisely.
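The two traversal directions can be sketched as breadth-first searches over a dependency graph. This is a minimal illustration, not a real system: the table names and edge list are hypothetical, and an edge A -> B means "B reads from A".

```python
from collections import deque

# Hypothetical table-level lineage: edge A -> B means "B reads from A".
EDGES = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["marts.revenue_summary"],
    "raw.customers": ["staging.customers"],
    "staging.customers": ["marts.revenue_summary"],
    "marts.revenue_summary": ["dashboard.exec_revenue"],
}

def downstream(node):
    """Everything that depends on `node` (forward / impact analysis)."""
    seen, queue = set(), deque([node])
    while queue:
        for child in EDGES.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

def upstream(node):
    """Everything `node` depends on (backward / debugging)."""
    reversed_edges = {}
    for src, dsts in EDGES.items():
        for dst in dsts:
            reversed_edges.setdefault(dst, []).append(src)
    seen, queue = set(), deque([node])
    while queue:
        for parent in reversed_edges.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

print(upstream("dashboard.exec_revenue"))   # debugging: trace back to sources
print(downstream("raw.orders"))             # impact analysis: find consumers
```

Debugging a broken dashboard is an `upstream` query from the metric; assessing the blast radius of a source change is a `downstream` query from the table.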

3. Column-Level vs Table-Level

Table-level lineage shows which tables feed into which other tables. Column-level lineage is more granular: it shows that revenue_summary.total_revenue is derived from orders.unit_price multiplied by orders.quantity. Column-level lineage is harder to capture but far more useful for debugging and impact analysis.
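Column-level lineage is often stored as a mapping from each output column to the input columns it is derived from; resolving a column transitively walks that mapping back to raw sources. A toy sketch with hypothetical column names:

```python
# Hypothetical column-level lineage: output column -> derived-from columns.
COLUMN_LINEAGE = {
    "revenue_summary.total_revenue": {"orders.unit_price", "orders.quantity"},
    "orders.unit_price": {"raw_orders.price_cents"},
}

def column_origins(column):
    """Transitively resolve a column back to its rawest known source columns."""
    origins = set()
    for parent in COLUMN_LINEAGE.get(column, set()):
        resolved = column_origins(parent)
        origins |= resolved if resolved else {parent}
    return origins

print(column_origins("revenue_summary.total_revenue"))
# -> {'raw_orders.price_cents', 'orders.quantity'} (set order may vary)
```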

How Lineage Flows

Upstream Sources

PostgreSQL, Kafka, S3, APIs

Transformations

dbt models, Spark jobs, SQL views

Downstream Consumers

Dashboards, ML models, exports

Lineage connects every node in this graph, enabling both upstream debugging and downstream impact analysis.

Types of Lineage

Lineage granularity exists on a spectrum. Knowing these three levels and their trade-offs shows interviewers you understand the practical reality.

Table-Level

Tracks dependencies between tables and views. The simplest form: 'table B reads from table A.' Captured by parsing SQL or inspecting DAG definitions. Most orchestrators (Airflow, dbt) provide this level natively.

Pros: easy to capture automatically. Cons: limited diagnostic value for column-level issues.

Column-Level

Tracks how individual columns flow through transformations. Shows that report.total_revenue comes from SUM(orders.unit_price * orders.quantity). Requires SQL parsing or runtime instrumentation. dbt, SQLGlot, and OpenLineage can extract this.

Pros: precise enough for debugging and impact analysis. Cons: harder to capture, especially across languages.

Transformation-Level

Captures the exact logic applied at each step. Knows that a column was filtered (WHERE status = 'active'), aggregated (SUM), or joined (LEFT JOIN on customer_id). Enables answering 'why is this row missing?' by showing which filter excluded it.

Pros: most complete picture of data flow. Cons: few tools provide this fully; requires deep SQL parsing.

Why Interviewers Care About Lineage

Lineage is not academic. It solves real operational problems that interviewers have dealt with firsthand.

Debugging Production Issues

When a metric drops 40% overnight, the first question is 'what changed upstream?' Without lineage, engineers grep through DAG code and Slack threads. With lineage, they trace the metric back through every transformation to the source that broke.

Impact Analysis Before Changes

Before deprecating a column or changing a table schema, you need to know every downstream consumer. Lineage provides this automatically. Without it, you send a Slack message asking 'does anyone use this table?' and hope for the best.

Regulatory Compliance

GDPR, CCPA, and HIPAA require organizations to document where personal data flows. Lineage metadata answers audit questions like 'which systems store customer email addresses?' and 'can you prove this PII was deleted from all downstream tables?'

Governance and Trust

Data consumers need to trust the numbers they see. Lineage connects a dashboard metric to its exact SQL definition and source tables. When a VP asks 'where does this revenue number come from?', lineage provides a verifiable answer.

Real Lineage Systems and Tools

Interviewers expect you to know the ecosystem. Name the tool, but more importantly, explain where it fits in the architecture.

OpenLineage (Open Standard)

An open specification for lineage metadata events. Defines a JSON schema for run, job, and dataset events. Supported by Airflow, Spark, dbt, and Flink. The lingua franca for interoperable lineage.
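The shape of an OpenLineage run event can be sketched with the standard library alone. This is an abbreviated illustration of the spec's structure, not a complete event: the real schema also carries `producer`, `schemaURL`, and facet payloads, and the namespace and job names here are hypothetical.

```python
import json
import uuid
from datetime import datetime, timezone

# Abbreviated OpenLineage-style run event (illustrative names, trimmed fields).
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "analytics", "name": "build_revenue_summary"},
    "inputs": [{"namespace": "postgres://prod", "name": "public.orders"}],
    "outputs": [{"namespace": "postgres://prod", "name": "marts.revenue_summary"}],
}

# An integration would POST this JSON to a backend such as Marquez.
payload = json.dumps(event)
print(payload)
```

The key design point the spec encodes: the event names a job run plus its input and output datasets, so any backend that understands the schema can assemble the lineage graph.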

Marquez (Open Source)

Reference implementation of OpenLineage. Stores lineage events and provides a REST API and UI for exploring lineage graphs. Lightweight, easy to self-host.

DataHub (Open Source)

Metadata platform by LinkedIn (now Acryl Data). Captures lineage alongside data discovery, quality, and ownership. Modern architecture with a GraphQL API.

Apache Atlas (Open Source)

Metadata governance platform originally built for Hadoop. Captures lineage from Hive, Spark, and Hadoop ecosystem tools. Common in legacy on-prem environments.

dbt (Transformation)

Captures table-level and column-level lineage as a byproduct of compiling SQL models. The ref() function creates explicit dependencies. The most common lineage source in modern data stacks.
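dbt materializes these dependencies in its compiled manifest, where each node lists its parents. A hedged sketch of extracting table-level edges from a manifest-shaped dict (heavily simplified: real manifests carry many more fields per node, and the model names here are hypothetical):

```python
# Simplified fragment shaped like dbt's manifest.json (illustrative names).
manifest = {
    "nodes": {
        "model.shop.stg_orders": {
            "depends_on": {"nodes": ["source.shop.raw_orders"]}
        },
        "model.shop.revenue_summary": {
            "depends_on": {"nodes": ["model.shop.stg_orders"]}
        },
    }
}

def table_lineage(manifest):
    """Build a (parent, child) edge list from node dependencies."""
    edges = []
    for child, node in manifest["nodes"].items():
        for parent in node["depends_on"]["nodes"]:
            edges.append((parent, child))
    return edges

for parent, child in table_lineage(manifest):
    print(f"{parent} -> {child}")
```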

Unity Catalog (Databricks)

Databricks governance layer capturing lineage across notebooks, jobs, and SQL queries automatically. Column-level lineage for Spark and SQL workloads.

Microsoft Purview (Azure)

Azure unified data governance service. Scans data sources across Azure, AWS, and on-prem to build lineage maps. Integrates with Synapse, Data Factory, and Power BI.

Collibra (Enterprise)

Enterprise data intelligence platform combining lineage with data catalog, glossary, and policy management. Targeted at large enterprises with complex governance requirements.

How Interviewers Test This

Interviewers rarely ask “define data lineage.” They create scenarios that require you to apply lineage thinking.

Scenario 1: A broken dashboard

Revenue on the executive dashboard dropped 30% yesterday. Walk me through how you investigate.

What they want to hear

Start with lineage: trace the metric back to its source tables, check each transformation for changes, and identify which upstream job or schema change caused the drop. Candidates who jump straight to 'check the SQL' miss the systematic approach that lineage enables.

Scenario 2: Design a lineage system

How would you build lineage tracking for a platform with 500 dbt models and 200 Spark jobs?

What they want to hear

OpenLineage as the event format, a metadata store (Marquez or DataHub), integration points with the scheduler (Airflow emitting lineage events), and column-level parsing via SQL analysis. Bonus for discussing Python transformation lineage.

Scenario 3: Schema migration blast radius

We need to rename the customer_id column to cust_id in our source system. What is the process?

What they want to hear

Use lineage to find every downstream table, view, dashboard, and ML model that references customer_id. Then plan the migration with versioning, backward compatibility, or coordinated changes.

Scenario 4: Compliance audit

How do you prove to an auditor that customer email addresses are not stored in any analytics tables?

What they want to hear

Lineage shows every downstream destination of the email column. Combined with data classification tags, you can verify PII columns are masked, hashed, or excluded from analytics layers.

Scenario 5: Legacy pipeline modernization

We have a pipeline with no documentation: raw files, stored procedures, reporting tables. What would you change?

What they want to hear

Incremental lineage adoption: start with SQL parsing for table-level lineage, add OpenLineage events to the orchestrator, build a catalog UI, then iterate toward column-level granularity. Pragmatic rollout over naming tools.

Interview Questions with Guidance

Q1

What is data lineage and why does it matter?

A strong answer includes:

Data lineage tracks the origin, movement, and transformation of data across systems. It matters for debugging (trace a wrong metric to its root cause), impact analysis (know what breaks before changing a table), compliance (prove where PII flows), and trust (connect dashboard metrics to their source definitions). Anchor your answer to a concrete scenario.

Q2

Difference between column-level and table-level lineage?

A strong answer includes:

Table-level lineage tracks which tables depend on which other tables. Column-level lineage tracks how individual columns flow through transformations, including specific operations (SUM, JOIN, FILTER). Column-level is harder to capture but significantly more useful for debugging and impact analysis.

Q3

How would you implement lineage in a dbt project?

A strong answer includes:

dbt captures lineage natively by parsing SQL model references. The ref() function creates explicit table-level dependencies. For column-level lineage, recent dbt versions parse SQL to trace column origins. Expose the lineage graph via dbt docs or integrate with DataHub by emitting OpenLineage events.

Q4

How does OpenLineage work?

A strong answer includes:

OpenLineage defines a standard JSON schema for lineage events. Each event contains a job (the transformation), input datasets, output datasets, and facets (schema, row count, SQL text). Integrations emit events at job start and completion. A backend like Marquez collects them. Key insight: it decouples lineage emission from storage.

Q5

How do you handle lineage for Python or Spark transformations?

A strong answer includes:

For PySpark, OpenLineage has a Spark integration that intercepts the query plan and emits lineage events. For arbitrary Python (pandas, custom scripts), lineage is harder. Options: instrument DataFrame operations, require explicit input/output declarations, or parse AST. Acknowledge Python lineage is unsolved at full generality.
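One of the options above, requiring explicit input/output declarations, can be sketched as a decorator that records lineage each time a job runs. The decorator name, log structure, and dataset names are all hypothetical illustrations, not a real library API:

```python
import functools

LINEAGE_LOG = []  # lineage records collected at runtime

def declares_lineage(inputs, outputs):
    """Record the declared inputs/outputs every time the wrapped job runs."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            LINEAGE_LOG.append(
                {"job": fn.__name__, "inputs": inputs, "outputs": outputs}
            )
            return fn(*args, **kwargs)
        return inner
    return wrap

@declares_lineage(inputs=["raw.events"], outputs=["staging.sessions"])
def sessionize():
    # pandas or custom transformation logic would live here
    return "ok"

sessionize()
print(LINEAGE_LOG)
```

The trade-off to name in an interview: declarations are easy to adopt but only capture intended flow, and they drift if the code changes while the declarations do not.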

Q6

A column is being deprecated. Walk me through the process.

A strong answer includes:

Query lineage for every downstream consumer. Notify owners. Add the new column alongside the old one (backward compatibility). Migrate consumers. Verify via lineage that no active references remain. Drop the column. Lineage converts this from hope-based to verifiable.
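The final verification step amounts to a simple lineage query: the column is droppable only when no downstream consumers remain. A toy sketch with illustrative column names:

```python
# Illustrative downstream edges for columns being deprecated.
DOWNSTREAM = {
    "orders.legacy_discount": ["marts.discount_report"],  # still referenced
    "orders.old_flag": [],                                # no consumers left
}

def safe_to_drop(column):
    """True only when lineage shows zero remaining consumers."""
    return len(DOWNSTREAM.get(column, [])) == 0

print(safe_to_drop("orders.legacy_discount"))  # False: a report still reads it
print(safe_to_drop("orders.old_flag"))         # True: migration is complete
```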

Q7

How does lineage relate to data quality?

A strong answer includes:

Lineage is the backbone of root cause analysis for quality issues. When a quality check fails, lineage tells you which upstream transformation or source caused the problem. Without lineage, quality alerts are alarms with no investigation path. Connect lineage to observability: detect with quality checks, diagnose with lineage.

Q8

What is active vs passive lineage collection?

A strong answer includes:

Passive lineage is captured by parsing SQL or DAG definitions. Shows intended data flow but may miss runtime behavior. Active lineage is captured during job execution by instrumenting the engine (Spark listener, Airflow OpenLineage plugin). Reflects actual data flow. The strongest systems combine both.

Q9

How would you evaluate lineage tools?

A strong answer includes:

Key criteria: granularity (column vs table), coverage (which engines and languages), integration (orchestrator, warehouse, BI tools), freshness (real-time or batch collection), scalability, and cost. Also consider build vs buy: OpenLineage plus Marquez vs Collibra or Atlan.

Q10

How does lineage support data mesh?

A strong answer includes:

In a data mesh, domain teams own their data products. Lineage provides cross-domain visibility that prevents the mesh from becoming a maze. When team A changes a schema, lineage shows which products in team B, C, D are affected. Lineage metadata is a key component of a data product contract alongside schema and SLAs.

Common Interview Mistakes

Confusing lineage with a data catalog

A catalog describes what data exists (schema, descriptions, owners). Lineage describes how data flows (origins, transformations, dependencies). They are complementary but solve different problems.

Only mentioning table-level lineage

Table-level is the minimum. Column-level is what makes lineage actually useful. If you only discuss table-level, interviewers may conclude you have surface knowledge.

Treating lineage as a one-time documentation effort

Lineage must be automated and continuously updated. Manual documentation becomes stale within weeks. Instrument pipelines to emit lineage events automatically.

Naming tools without explaining the architecture

Saying 'we use DataHub' is not an answer. Explain the flow: pipelines emit OpenLineage events, metadata store indexes them, UI renders the graph, teams query lineage programmatically.

Ignoring lineage for non-SQL transformations

Many pipelines include Python or Spark code that does not go through a SQL parser. Acknowledging this gap and describing mitigation strategies shows real-world experience.

Frequently Asked Questions

What is data lineage?
Data lineage is the record of where data comes from, how it transforms, and where it goes. It creates a graph of dependencies across your data platform, enabling debugging, impact analysis, and compliance auditing.
What are the best data lineage tools?
Open source: OpenLineage (standard), Marquez (backend), DataHub (full platform). Cloud native: Unity Catalog (Databricks), Purview (Azure). Enterprise: Collibra, Atlan. dbt provides lineage as a built-in feature.
Do interviews ask about lineage?
Yes, increasingly so. Lineage questions appear in system design rounds, debugging scenarios, and governance discussions. It signals senior-level thinking about data operations.
What is column-level lineage?
Column-level lineage tracks how individual columns flow through transformations. It shows that output_table.total_revenue is derived from SUM(source_table.unit_price * source_table.quantity). More granular and more useful than table-level lineage.
How is lineage different from a data catalog?
A catalog describes what data exists: schemas, descriptions, owners, tags. Lineage describes how data flows: origins, transformations, dependencies. Most modern platforms combine both capabilities.
What is OpenLineage?
An open standard for lineage metadata events. Defines a JSON schema that pipelines emit at job start and completion. Supported by Airflow, Spark, dbt, and Flink. Decouples lineage collection from storage.
How does lineage help with GDPR?
GDPR requires documenting where personal data is processed. Lineage shows every downstream table and system that receives PII columns. Combined with data classification, it proves PII is handled correctly.
Can you have lineage without dbt?
Absolutely. OpenLineage integrations exist for Airflow, Spark, Flink, and other engines. You can also parse SQL from any tool using SQLGlot. dbt makes lineage easy, but it is not the only path.
