Concepts
Data lineage tracks where data comes from, how it transforms, and where it goes. It creates a dependency graph across tables, columns, and jobs so engineers can debug issues, plan migrations, and prove compliance without guessing.
According to Gartner, organizations with active metadata management (including lineage) reduce data incident resolution time by up to 60%. In a 2025 dbt Labs survey, 72% of data teams cited lineage as their top priority for improving data trust.
Data lineage answers three questions: where did this data come from, what happened to it along the way, and where did it go. Think of it as a directed graph: sources are upstream nodes, destinations are downstream nodes, and edges represent transformations. When a dashboard number looks wrong, lineage tells you which tables, jobs, and transformations touched that data so you can trace the problem to its origin.
Upstream lineage answers 'where did this data come from?' Starting from a reporting table, you trace backward through transformations to raw source systems. Downstream lineage answers 'what breaks if this source changes?' Starting from a source table, you trace forward to every dashboard, model, and export that depends on it. Interviewers expect you to use both terms precisely.
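The two traversal directions are easy to see in code. Below is a minimal sketch, assuming a toy table-level graph (the table names are invented for illustration): an edge A → B means "B reads from A", and upstream and downstream lineage are just breadth-first walks over reversed and forward adjacency lists.

```python
from collections import defaultdict, deque

# Hypothetical table-level lineage: edge (src, dst) means "dst reads from src".
EDGES = [
    ("orders", "stg_orders"),
    ("customers", "stg_customers"),
    ("stg_orders", "revenue_summary"),
    ("stg_customers", "revenue_summary"),
    ("revenue_summary", "exec_dashboard"),
]

downstream = defaultdict(set)
upstream = defaultdict(set)
for src, dst in EDGES:
    downstream[src].add(dst)
    upstream[dst].add(src)

def trace(node, graph):
    """Breadth-first walk returning every node reachable from `node`."""
    seen, queue = set(), deque([node])
    while queue:
        current = queue.popleft()
        for neighbor in graph[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

# Upstream lineage: where did the dashboard's data come from?
sources = trace("exec_dashboard", upstream)
# Downstream lineage: what breaks if the source table changes?
impacted = trace("orders", downstream)
```

The same two functions answer both interview questions: debugging walks `upstream`, impact analysis walks `downstream`.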
Table-level lineage shows which tables feed into which other tables. Column-level lineage is more granular: it shows that revenue_summary.total_revenue is derived from orders.unit_price multiplied by orders.quantity. Column-level lineage is harder to capture but far more useful for debugging and impact analysis.
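Column-level lineage can be represented as a mapping from each output column to its input columns plus the deriving expression. A minimal sketch, reusing the revenue example above (the structure is illustrative; real tools derive this by parsing SQL):

```python
# Column-level lineage as data: output column -> (input columns, deriving expression).
COLUMN_LINEAGE = {
    "revenue_summary.total_revenue": (
        ["orders.unit_price", "orders.quantity"],
        "SUM(orders.unit_price * orders.quantity)",
    ),
}

def column_sources(column):
    """Return the input columns and the expression that derive `column`."""
    inputs, expr = COLUMN_LINEAGE[column]
    return inputs, expr

inputs, expr = column_sources("revenue_summary.total_revenue")
```

This is exactly the granularity that makes impact analysis precise: dropping `orders.quantity` demonstrably breaks `total_revenue`, while an unrelated column change does not.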
Upstream sources (PostgreSQL, Kafka, S3, APIs) → Transformations (dbt models, Spark jobs, SQL views) → Downstream consumers (dashboards, ML models, exports).
Lineage connects every node in this graph, enabling both upstream debugging and downstream impact analysis.
Lineage granularity exists on a spectrum. Knowing these three levels and their trade-offs shows interviewers you understand the practical reality.
Table-level: tracks dependencies between tables and views. The simplest form: 'table B reads from table A.' Captured by parsing SQL or inspecting DAG definitions. Tools like Airflow and dbt provide this level natively.
Column-level: tracks how individual columns flow through transformations. Shows that report.total_revenue comes from SUM(orders.unit_price * orders.quantity). Requires SQL parsing or runtime instrumentation. dbt, SQLGlot, and OpenLineage integrations can extract this.
Transformation-level: captures the exact logic applied at each step. Knows that a column was filtered (WHERE status = 'active'), aggregated (SUM), or joined (LEFT JOIN on customer_id). Enables answering 'why is this row missing?' by showing which filter excluded it.
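To make "captured by parsing SQL" concrete, here is a deliberately naive sketch of table-level extraction. It only grabs identifiers after FROM/JOIN keywords; production systems use a real SQL parser (e.g. SQLGlot), since a regex breaks on subqueries, CTEs, and quoted identifiers.

```python
import re

def naive_table_lineage(sql, target):
    """Toy table-level lineage: find identifiers after FROM/JOIN.
    A regex breaks on subqueries, CTEs, and quoted identifiers;
    real systems use a proper SQL parser such as SQLGlot."""
    sources = set(re.findall(r"\b(?:from|join)\s+([\w.]+)", sql, flags=re.IGNORECASE))
    return sorted((source, target) for source in sources)  # edges: source -> target

sql = """
SELECT o.order_id, SUM(o.unit_price * o.quantity) AS total_revenue
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
GROUP BY o.order_id
"""
edges = naive_table_lineage(sql, "revenue_summary")
# [('customers', 'revenue_summary'), ('orders', 'revenue_summary')]
```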
Lineage is not academic. It solves real operational problems that interviewers have dealt with firsthand.
Debugging: when a metric drops 40% overnight, the first question is 'what changed upstream?' Without lineage, engineers grep through DAG code and Slack threads. With lineage, they trace the metric back through every transformation to the source that broke.
Impact analysis: before deprecating a column or changing a table schema, you need to know every downstream consumer. Lineage provides this automatically. Without it, you send a Slack message asking 'does anyone use this table?' and hope for the best.
Compliance: GDPR, CCPA, and HIPAA require organizations to document where personal data flows. Lineage metadata answers audit questions like 'which systems store customer email addresses?' and 'can you prove this PII was deleted from all downstream tables?'
Trust: data consumers need to trust the numbers they see. Lineage connects a dashboard metric to its exact SQL definition and source tables. When a VP asks 'where does this revenue number come from?', lineage provides a verifiable answer.
Interviewers expect you to know the ecosystem. Name the tool, but more importantly, explain where it fits in the architecture.
OpenLineage: an open specification for lineage metadata events. Defines a JSON schema for run, job, and dataset events. Supported by Airflow, Spark, dbt, and Flink integrations. The lingua franca for interoperable lineage.
Marquez: the reference implementation of an OpenLineage backend. Stores lineage events and provides a REST API and UI for exploring lineage graphs. Lightweight and easy to self-host.
DataHub: metadata platform open-sourced by LinkedIn (now commercialized by Acryl Data). Captures lineage alongside data discovery, quality, and ownership. Modern architecture with a GraphQL API.
Apache Atlas: metadata governance platform originally built for the Hadoop ecosystem. Captures lineage from Hive, Spark, and related tools. Common in legacy on-prem environments.
dbt: captures table-level and column-level lineage as a byproduct of compiling SQL models. The ref() function creates explicit dependencies. The most common lineage source in modern data stacks.
Unity Catalog: Databricks governance layer that captures lineage across notebooks, jobs, and SQL queries automatically. Provides column-level lineage for Spark and SQL workloads.
Microsoft Purview: Azure's unified data governance service. Scans data sources across Azure, AWS, and on-prem to build lineage maps. Integrates with Synapse, Data Factory, and Power BI.
Collibra: enterprise data intelligence platform combining lineage with a data catalog, glossary, and policy management. Targeted at large enterprises with complex governance requirements.
Interviewers rarely ask “define data lineage.” They create scenarios that require you to apply lineage thinking.
“Revenue on the executive dashboard dropped 30% yesterday. Walk me through how you investigate.”
Start with lineage: trace the metric back to its source tables, check each transformation for changes, and identify which upstream job or schema change caused the drop. Candidates who jump straight to 'check the SQL' miss the systematic approach that lineage enables.
“How would you build lineage tracking for a platform with 500 dbt models and 200 Spark jobs?”
OpenLineage as the event format, a metadata store (Marquez or DataHub), integration points with the scheduler (Airflow emitting lineage events), and column-level parsing via SQL analysis. Bonus for discussing Python transformation lineage.
“We need to rename the customer_id column to cust_id in our source system. What is the process?”
Use lineage to find every downstream table, view, dashboard, and ML model that references customer_id. Then plan the migration with versioning, backward compatibility, or coordinated changes.
“How do you prove to an auditor that customer email addresses are not stored in any analytics tables?”
Lineage shows every downstream destination of the email column. Combined with data classification tags, you can verify PII columns are masked, hashed, or excluded from analytics layers.
“We have a pipeline with no documentation: raw files, stored procedures, reporting tables. What would you change?”
Incremental lineage adoption: start with SQL parsing for table-level lineage, add OpenLineage events to the orchestrator, build a catalog UI, then iterate toward column-level granularity. Pragmatic rollout over naming tools.
What is data lineage and why does it matter?
Data lineage tracks the origin, movement, and transformation of data across systems. It matters for debugging (trace a wrong metric to its root cause), impact analysis (know what breaks before changing a table), compliance (prove where PII flows), and trust (connect dashboard metrics to their source definitions). Anchor your answer to a concrete scenario.
Difference between column-level and table-level lineage?
Table-level lineage tracks which tables depend on which other tables. Column-level lineage tracks how individual columns flow through transformations, including specific operations (SUM, JOIN, FILTER). Column-level is harder to capture but significantly more useful for debugging and impact analysis.
How would you implement lineage in a dbt project?
dbt captures lineage natively by parsing SQL model references. The ref() function creates explicit table-level dependencies. For column-level lineage, recent dbt versions parse SQL to trace column origins. Expose the lineage graph via dbt docs or integrate with DataHub by emitting OpenLineage events.
How does OpenLineage work?
OpenLineage defines a standard JSON schema for lineage events. Each event contains a job (the transformation), input datasets, output datasets, and facets (schema, row count, SQL text). Integrations emit events at job start and completion. A backend like Marquez collects them. Key insight: it decouples lineage emission from storage.
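The event shape described above can be sketched as a plain dict. Field names follow the OpenLineage spec, but the namespaces, job names, and producer URL here are placeholder assumptions; consult the spec for required facets and versioning.

```python
import json
import uuid
from datetime import datetime, timezone

# Minimal sketch of an OpenLineage run event (placeholder names and URLs);
# consult the OpenLineage spec for required fields and facet schemas.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "analytics", "name": "build_revenue_summary"},
    "inputs": [{"namespace": "postgres://prod-db", "name": "public.orders"}],
    "outputs": [{"namespace": "postgres://prod-db", "name": "public.revenue_summary"}],
    "producer": "https://example.com/pipelines/v1",  # identifies the emitting integration
}
payload = json.dumps(event)  # serialized and POSTed to a backend such as Marquez
```

Note how the event couples one job run to its input and output datasets; the backend stitches many such events into the global graph.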
How do you handle lineage for Python or Spark transformations?
For PySpark, OpenLineage's Spark integration intercepts the query plan and emits lineage events. For arbitrary Python (pandas, custom scripts), lineage is harder. Options: instrument DataFrame operations, require explicit input/output declarations, or parse the code's AST. Acknowledge that fully general Python lineage is an unsolved problem.
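The second option, explicit input/output declarations, can be sketched with a decorator. Everything here is hypothetical scaffolding; a real integration would emit OpenLineage events to a backend rather than append to a list.

```python
import functools

LINEAGE_EVENTS = []  # stand-in for an OpenLineage backend

def lineage(inputs, outputs):
    """Hypothetical decorator: the author declares inputs/outputs explicitly,
    and a lineage record is emitted every time the transformation runs."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            LINEAGE_EVENTS.append(
                {"job": fn.__name__, "inputs": inputs, "outputs": outputs}
            )
            return result
        return wrapper
    return decorate

@lineage(inputs=["raw.orders"], outputs=["staging.stg_orders"])
def build_stg_orders(rows):
    # stand-in for a pandas or custom transformation
    return [r for r in rows if r.get("status") == "active"]

active = build_stg_orders([{"status": "active"}, {"status": "cancelled"}])
```

The trade-off to mention in an interview: declarations are easy to adopt but can drift from what the code actually reads and writes, which is why runtime instrumentation is the stronger (and harder) option.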
A column is being deprecated. Walk me through the process.
Query lineage for every downstream consumer. Notify owners. Add the new column alongside the old one (backward compatibility). Migrate consumers. Verify via lineage that no active references remain. Drop the column. Lineage converts this from hope-based to verifiable.
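The final verification step becomes mechanical when the lineage store is queryable. A sketch against a hypothetical in-memory snapshot of column consumers:

```python
# Hypothetical snapshot of a lineage store: column -> active downstream consumers.
COLUMN_CONSUMERS = {
    "orders.customer_id": ["stg_orders", "revenue_summary", "churn_model"],
    "orders.legacy_flag": [],
}

def safe_to_drop(column):
    """A column is droppable only once lineage shows no active consumers."""
    return not COLUMN_CONSUMERS.get(column, [])

can_drop_customer_id = safe_to_drop("orders.customer_id")  # False: three consumers remain
can_drop_legacy_flag = safe_to_drop("orders.legacy_flag")  # True: no references left
```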
How does lineage relate to data quality?
Lineage is the backbone of root cause analysis for quality issues. When a quality check fails, lineage tells you which upstream transformation or source caused the problem. Without lineage, quality alerts are alarms with no investigation path. Connect lineage to observability: detect with quality checks, diagnose with lineage.
What is active vs passive lineage collection?
Passive lineage is captured by parsing SQL or DAG definitions. Shows intended data flow but may miss runtime behavior. Active lineage is captured during job execution by instrumenting the engine (Spark listener, Airflow OpenLineage plugin). Reflects actual data flow. The strongest systems combine both.
How would you evaluate lineage tools?
Key criteria: granularity (column vs table), coverage (which engines and languages), integration (orchestrator, warehouse, BI tools), freshness (real-time or batch collection), scalability, and cost. Also consider build vs buy: OpenLineage plus Marquez vs Collibra or Atlan.
How does lineage support data mesh?
In a data mesh, domain teams own their data products. Lineage provides the cross-domain visibility that prevents the mesh from becoming a maze. When team A changes a schema, lineage shows which products in teams B, C, and D are affected. Lineage metadata is a key component of a data product contract, alongside schema and SLAs.
Confusing lineage with a data catalog
A catalog describes what data exists (schema, descriptions, owners). Lineage describes how data flows (origins, transformations, dependencies). They are complementary but solve different problems.
Only mentioning table-level lineage
Table-level is the minimum. Column-level is what makes lineage actually useful. If you only discuss table-level, interviewers may conclude you have surface knowledge.
Treating lineage as a one-time documentation effort
Lineage must be automated and continuously updated. Manual documentation becomes stale within weeks. Instrument pipelines to emit lineage events automatically.
Naming tools without explaining the architecture
Saying 'we use DataHub' is not an answer. Explain the flow: pipelines emit OpenLineage events, metadata store indexes them, UI renders the graph, teams query lineage programmatically.
Ignoring lineage for non-SQL transformations
Many pipelines include Python or Spark code that does not go through a SQL parser. Acknowledging this gap and describing mitigation strategies shows real-world experience.