Loading lesson...
DAGs, segment trees, distributed lineage — staff-level problems, staff-level answers.
DAGs, segment trees, distributed lineage — staff-level problems, staff-level answers.
Topics covered: DAG Traversal with Cycle Detection, Segment Tree for Range Queries, Parallel Tree Traversal and Distributed Lineage, Tree-Based Query Optimization, Design a Data Lineage Graph System
Pure trees cannot have cycles. Data pipeline DAGs can — accidentally. A cyclic dependency in an Airflow DAG (table A depends on B, B depends on C, C depends on A) causes an infinite loop and must be caught before execution. The industry-standard algorithm is 3-color DFS: white means unvisited, gray means currently in the DFS call stack (in-progress), black means fully processed. If you encounter a gray node during DFS, you found a cycle. This is exactly what Airflow's DAG validator runs on start
A segment tree is a binary tree built over an array where each node stores an aggregate (sum, min, max) for a contiguous range of the array. The leaves are the individual elements. Internal nodes aggregate their children. This structure supports both range queries (what is the sum of elements 3 to 7?) and point updates (update element 5 to a new value) in O(log n) time. The critical advantage over a prefix sum array: prefix sums are O(1) for queries but O(n) for updates. Segment trees are O(log
In a standard topological sort, you process nodes one at a time in order. But nodes at the same topological depth have no dependencies on each other — they can be processed in parallel. This is the key insight behind Spark's DAG scheduler, Airflow's task parallelism, and dbt's parallel model execution. The algorithm: compute all nodes with zero in-degree (no dependencies), execute them all in parallel, decrement the in-degree of their dependents, and repeat for the next wave. This is BFS-style t
Every SQL query you write gets parsed into a tree. SELECT, FROM, WHERE, JOIN, GROUP BY — these become nodes in an abstract syntax tree (AST). The database's query optimizer traverses this tree, applies transformation rules, and rewrites it into a more efficient form. The two most important transformations — predicate pushdown and join reordering — are both tree traversal algorithms. Understanding this at a conceptual level is staff-level data engineering knowledge. Interviewers at companies with
This is the capstone system design problem for tree traversal in data engineering. Data lineage answers two questions: where did this data come from (upstream), and what would break if this changes (downstream impact analysis). Building a system that stores, queries, and validates a lineage graph requires choosing the right graph representation, implementing efficient traversal for both upstream and downstream queries, detecting cycles so the graph stays valid, and handling scale. This is exactl