Design a Data Lineage Graph System

Concepts: pyLineageSystem, pyImpactAnalysis, pyLineageDAGDesign

This is the capstone system design problem for tree traversal in data engineering. Data lineage answers two questions: where did this data come from (upstream), and what would break if it changed (downstream impact analysis). Building a system that stores, queries, and validates a lineage graph requires choosing the right graph representation, implementing efficient traversal for both upstream and downstream queries, detecting cycles so the graph stays a valid DAG, and handling scale. This is exactly the kind of design question you get in final rounds at Databricks, dbt Labs, teams built around Airflow, or any company with a serious data platform team.

Representation: Adjacency List (Bidirectional)
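One way to picture a bidirectional adjacency list is the minimal in-memory sketch below. It keeps two maps, one per direction, so upstream and downstream queries are both a plain breadth-first search, and it rejects any edge that would close a cycle. The class name `LineageGraph`, its method names, and the table names in the usage lines are hypothetical and chosen only for illustration. A production system would also need persistence, metadata on nodes and edges, and a strategy for graphs too large for memory.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Illustrative in-memory lineage store (hypothetical names throughout)."""

    def __init__(self):
        self.downstream = defaultdict(set)  # node -> nodes that consume it
        self.upstream = defaultdict(set)    # node -> nodes it is derived from

    def _reachable(self, start, edges):
        """Breadth-first search along one direction; excludes `start` itself."""
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            # .get avoids creating empty entries in the defaultdict
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def add_edge(self, source, target):
        """Record that `target` is derived from `source`; reject cycles."""
        # The new edge closes a cycle exactly when `source` is already
        # reachable from `target` (or the edge is a self-loop).
        if source == target or source in self._reachable(target, self.downstream):
            raise ValueError(f"edge {source} -> {target} would create a cycle")
        self.downstream[source].add(target)
        self.upstream[target].add(source)

    def upstream_of(self, node):
        """Everything `node` depends on, directly or transitively."""
        return self._reachable(node, self.upstream)

    def downstream_of(self, node):
        """Everything that could break if `node` changes."""
        return self._reachable(node, self.downstream)


# Usage with made-up table names
graph = LineageGraph()
graph.add_edge("raw.orders", "staging.orders")
graph.add_edge("staging.orders", "marts.revenue")
graph.add_edge("staging.orders", "marts.churn")
impacted = graph.downstream_of("raw.orders")
sources = graph.upstream_of("marts.revenue")
```

Storing each edge twice doubles the memory used for edges, but either query then costs O(V + E) over the reachable subgraph only, with no scan of the full edge list. The cycle check in `add_edge` runs one traversal per insert, which is a simple choice for a sketch. Batch validation with a topological sort is a common alternative when edges are loaded in bulk.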