
Merge Intervals: Advanced

From O(n^2) temporal joins to petabyte compaction — this is where intervals get serious.

Category: Python
Difficulty: Advanced
Duration: 40 minutes
Challenges: 0 hands-on challenges

Topics covered: Interval Scheduling with DAG Dependencies, Segment Trees: Dynamic Interval Queries at O(log n), Temporal Joins: From O(n^2) to O(n log n), Sessionization at Scale, Design: Time-Series Compaction System

Lesson Sections

  1. Interval Scheduling with DAG Dependencies (concepts: pyDAGScheduling, pyTopologicalIntervals, pyDualHeap)

    The meeting rooms problem assumes each task can start at any valid time. Real pipeline scheduling adds dependency constraints: task B cannot start until task A finishes. You now have both interval constraints (task i occupies [start_i, end_i] time slots on a worker) and DAG constraints (topological ordering must be respected). Solving both simultaneously is a staff-level problem. The approach: topological sort first (to establish a valid execution order), then assign time intervals greedily while respecting the finish times of each task's dependencies.
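The two-phase approach described above can be sketched as follows. This is a minimal illustration, not the lesson's implementation: it assumes unlimited workers, starts each task as early as its dependencies allow, and then uses the meeting-rooms sweep to count how many workers the resulting schedule actually needs. All names are illustrative.

```python
from collections import defaultdict, deque

def schedule(tasks, deps):
    """tasks: {name: duration}; deps: list of (a, b) meaning b waits for a.
    Returns {name: (start, end)}, starting each task as early as its
    dependencies allow (Kahn's topological sort + greedy assignment)."""
    succ = defaultdict(list)
    indeg = {t: 0 for t in tasks}
    for a, b in deps:
        succ[a].append(b)
        indeg[b] += 1
    ready = deque(t for t, d in indeg.items() if d == 0)
    earliest = {t: 0 for t in tasks}
    out = {}
    while ready:
        t = ready.popleft()
        end = earliest[t] + tasks[t]
        out[t] = (earliest[t], end)
        for nxt in succ[t]:
            earliest[nxt] = max(earliest[nxt], end)  # DAG constraint
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                ready.append(nxt)
    return out

def min_workers(intervals):
    """Minimum concurrent workers = maximum overlap (meeting-rooms sweep)."""
    events = []
    for s, e in intervals:
        events.append((s, 1))
        events.append((e, -1))
    events.sort()
    cur = best = 0
    for _, delta in events:
        cur += delta
        best = max(best, cur)
    return best
```

With tasks A (duration 2) feeding B (3) and C (1), A runs at [0, 2] while B and C both start at 2, so two workers suffice.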

  2. Segment Trees: Dynamic Interval Queries at O(log n) (concepts: pySegmentTree, pyRangeCoverageQuery, pyLazyPropagation)

    Prefix sums and sorted-list scans handle static and append-only workloads. But when you need O(log n) range updates AND O(log n) range queries on a dataset where both insertions and deletions happen continuously, you need a segment tree. The segment tree partitions the value space (not the data) into a binary tree of ranges, allowing you to update a range and query a range in O(log n) each. This is the data structure behind range max, range sum, and coverage queries in dynamic systems.
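A minimal sketch of the structure described above (range-add updates with range-max queries, using lazy propagation); this is an illustrative implementation, not the lesson's code. Adding +1 over an interval and later −1 over the same interval models insertion and deletion, and the range max answers coverage queries:

```python
class SegmentTree:
    """Range-add updates and range-max queries, both O(log n),
    with lazy propagation over a fixed value space [0, n)."""

    def __init__(self, n):
        self.n = n
        self.mx = [0] * (4 * n)    # max of each node's range
        self.lazy = [0] * (4 * n)  # pending add not yet pushed down

    def _push(self, node):
        for child in (2 * node, 2 * node + 1):
            self.mx[child] += self.lazy[node]
            self.lazy[child] += self.lazy[node]
        self.lazy[node] = 0

    def add(self, lo, hi, val, node=1, l=0, r=None):
        """Add val to every position in [lo, hi)."""
        if r is None:
            r = self.n
        if hi <= l or r <= lo:
            return
        if lo <= l and r <= hi:        # node fully covered: defer to lazy
            self.mx[node] += val
            self.lazy[node] += val
            return
        self._push(node)
        m = (l + r) // 2
        self.add(lo, hi, val, 2 * node, l, m)
        self.add(lo, hi, val, 2 * node + 1, m, r)
        self.mx[node] = max(self.mx[2 * node], self.mx[2 * node + 1])

    def query(self, lo, hi, node=1, l=0, r=None):
        """Max value over [lo, hi)."""
        if r is None:
            r = self.n
        if hi <= l or r <= lo:
            return float('-inf')
        if lo <= l and r <= hi:
            return self.mx[node]
        self._push(node)
        m = (l + r) // 2
        return max(self.query(lo, hi, 2 * node, l, m),
                   self.query(lo, hi, 2 * node + 1, m, r))
```

For example, adding the intervals [0, 4) and [2, 6) gives a peak coverage of 2; removing [2, 6) (adding −1 over it) drops the peak back to 1.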

  3. Temporal Joins: From O(n^2) to O(n log n) (concepts: pyTemporalJoin, pyMergeAsof, pySparkRangeJoin)

    The temporal join is one of the most common and most expensive operations in data engineering. 'Join events from table A with events from table B that occurred within ±5 minutes.' The naive nested loop is O(n^2): for every event in A, scan all events in B and check the time condition. On tables with 10 million rows, that is 100 trillion comparisons. The sort + sweep approach brings this to O(n log n), and understanding WHY is a key staff-level interview signal.
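The sort + sweep idea can be sketched as a sliding window over two sorted streams: sort both sides once, then for each left event advance a window pointer past right events that are too old to ever match again. Each pointer only moves forward, so the cost is O(n log n + m log m) for the sorts plus the matches emitted. (In production this is what pandas' `merge_asof` and Spark range joins implement; the sketch below is illustrative.)

```python
def temporal_join(a, b, tol):
    """Pair each timestamp in `a` with every timestamp in `b`
    within +/- tol, via sort + sliding window."""
    a = sorted(a)
    b = sorted(b)
    out = []
    lo = 0
    for t in a:
        # advance the window start past events too old to match t
        # (or any later t, since a is sorted -- this is why lo never rewinds)
        while lo < len(b) and b[lo] < t - tol:
            lo += 1
        j = lo
        while j < len(b) and b[j] <= t + tol:
            out.append((t, b[j]))
            j += 1
    return out
```

For instance, joining [10, 20] against [8, 14, 21] with tolerance 5 matches 10 with both 8 and 14, and 20 with 21.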

  4. Sessionization at Scale (concepts: pySessionization, pyLateEvents, pyFlinkSessions, pySparkSessionize)

    Sessionization is one of the most common data engineering problems at consumer tech companies and one of the least well understood. A session is a group of user events that are 'close enough' in time. The boundary conditions: gap-based (session ends after 30 minutes of inactivity), activity-based (session ends after N events regardless of time), or hybrid (whichever comes first). Each has a different algorithm, different correctness guarantees, and critically, different behavior under late-arriving data.
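The gap-based variant reduces to a single sorted pass: a new session starts whenever the gap since the previous event exceeds the threshold. A minimal single-user sketch (illustrative names; real pipelines partition by user and must handle late events re-splitting or merging sessions):

```python
def sessionize(timestamps, gap):
    """Gap-based sessionization for one user's event timestamps.
    A new session starts when the inter-event gap exceeds `gap`."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= gap:
            sessions[-1].append(t)  # within gap: extend current session
        else:
            sessions.append([t])    # gap exceeded: open a new session
    return sessions
```

Note the subtlety the lesson points at: because a late-arriving event can bridge two previously separate sessions into one, any precomputed session table may need to be reprocessed when late data lands.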

  5. Design: Time-Series Compaction System (concepts: pyCompaction, pyTombstones, pyMergeAmplification, pyLSMTree)

    This is the staff-level system design version of merge intervals. A time-series database accumulates write files (SSTables, segments, L0 files) continuously. Over time, these files overlap in time range: file A covers [1000, 2000], file B covers [1500, 2500], file C covers [800, 1200]. Reads that touch any of these time ranges must scan all overlapping files. As file count grows, read amplification explodes. Compaction merges overlapping files into fewer, non-overlapping ones — restoring read performance.
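Planning which files to compact together is the classic merge-intervals sweep: sort files by start time, then group transitively overlapping ranges so each group compacts into one non-overlapping output file. A minimal sketch (illustrative; a real compactor also weighs file sizes, levels, and write amplification):

```python
def plan_compaction(files):
    """files: list of (start, end) time ranges of on-disk segments.
    Returns a list of (merged_range, member_files) compaction groups,
    built with the standard sort-and-merge-intervals sweep."""
    groups = []  # each entry: [merged_start, merged_end, members]
    for s, e in sorted(files):
        if groups and s <= groups[-1][1]:
            # overlaps the current group: extend it
            groups[-1][1] = max(groups[-1][1], e)
            groups[-1][2].append((s, e))
        else:
            groups.append([s, e, [(s, e)]])
    return [((s, e), members) for s, e, members in groups]
```

On the example above, files covering [1000, 2000], [1500, 2500], and [800, 1200] all fall into one group spanning [800, 2500], so a single compaction pass replaces three overlapping files with one, and a read now touches one file instead of three.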
