
Merge Intervals: Advanced

From O(n^2) temporal joins to petabyte compaction — this is where intervals get serious.

Category: Python
Difficulty: Advanced
Duration: 40 minutes
Challenges: 0 hands-on challenges

Topics covered: Interval Scheduling with DAG Dependencies, Segment Trees: Dynamic Interval Queries at O(log n), Temporal Joins: From O(n^2) to O(n log n), Sessionization at Scale, Design: Time-Series Compaction System

Lesson Sections

  1. Interval Scheduling with DAG Dependencies (concepts: pyDAGScheduling, pyTopologicalIntervals, pyDualHeap)

    The meeting rooms problem assumes each task can start at any valid time. Real pipeline scheduling adds dependency constraints: task B cannot start until task A finishes. You now have both interval constraints (task i occupies [start_i, end_i] time slots on a worker) and DAG constraints (topological ordering must be respected). Solving both simultaneously is a staff-level problem. The approach: topological sort first (to establish a valid execution order), then assign time intervals greedily while respecting the finish times of each task's dependencies.
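The two-phase approach described above can be sketched as follows. This is a minimal illustration, not the lesson's implementation: it assumes unlimited workers, starts each task as early as its dependencies allow, and then uses the meeting-rooms sweep to count how many workers the resulting schedule actually needs. All names are illustrative.

```python
from collections import defaultdict, deque

def schedule(tasks, deps):
    """tasks: {name: duration}; deps: list of (a, b) meaning b waits for a.
    Returns {name: (start, end)}, starting each task as early as its
    dependencies allow (Kahn's topological sort + greedy assignment)."""
    succ = defaultdict(list)
    indeg = {t: 0 for t in tasks}
    for a, b in deps:
        succ[a].append(b)
        indeg[b] += 1
    ready = deque(t for t, d in indeg.items() if d == 0)
    earliest = {t: 0 for t in tasks}
    out = {}
    while ready:
        t = ready.popleft()
        end = earliest[t] + tasks[t]
        out[t] = (earliest[t], end)
        for nxt in succ[t]:
            earliest[nxt] = max(earliest[nxt], end)  # DAG constraint
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                ready.append(nxt)
    return out

def min_workers(intervals):
    """Minimum concurrent workers = maximum overlap (meeting-rooms sweep)."""
    events = []
    for s, e in intervals:
        events.append((s, 1))
        events.append((e, -1))
    events.sort()
    cur = best = 0
    for _, delta in events:
        cur += delta
        best = max(best, cur)
    return best
```

With tasks A (duration 2) feeding B (3) and C (1), A runs at [0, 2] while B and C both start at 2, so two workers suffice.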

  2. Segment Trees: Dynamic Interval Queries at O(log n) (concepts: pySegmentTree, pyRangeCoverageQuery, pyLazyPropagation)

    Prefix sums and sorted-list scans handle static and append-only workloads. But when you need O(log n) range updates AND O(log n) range queries on a dataset where both insertions and deletions happen continuously, you need a segment tree. The segment tree partitions the value space (not the data) into a binary tree of ranges, allowing you to update a range and query a range in O(log n) each. This is the data structure behind range max, range sum, and coverage queries in dynamic systems.
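A minimal sketch of the structure described above (range-add updates with range-max queries, using lazy propagation); this is an illustrative implementation, not the lesson's code. Adding +1 over an interval and later −1 over the same interval models insertion and deletion, and the range max answers coverage queries:

```python
class SegmentTree:
    """Range-add updates and range-max queries, both O(log n),
    with lazy propagation over a fixed value space [0, n)."""

    def __init__(self, n):
        self.n = n
        self.mx = [0] * (4 * n)    # max of each node's range
        self.lazy = [0] * (4 * n)  # pending add not yet pushed down

    def _push(self, node):
        for child in (2 * node, 2 * node + 1):
            self.mx[child] += self.lazy[node]
            self.lazy[child] += self.lazy[node]
        self.lazy[node] = 0

    def add(self, lo, hi, val, node=1, l=0, r=None):
        """Add val to every position in [lo, hi)."""
        if r is None:
            r = self.n
        if hi <= l or r <= lo:
            return
        if lo <= l and r <= hi:        # node fully covered: defer to lazy
            self.mx[node] += val
            self.lazy[node] += val
            return
        self._push(node)
        m = (l + r) // 2
        self.add(lo, hi, val, 2 * node, l, m)
        self.add(lo, hi, val, 2 * node + 1, m, r)
        self.mx[node] = max(self.mx[2 * node], self.mx[2 * node + 1])

    def query(self, lo, hi, node=1, l=0, r=None):
        """Max value over [lo, hi)."""
        if r is None:
            r = self.n
        if hi <= l or r <= lo:
            return float('-inf')
        if lo <= l and r <= hi:
            return self.mx[node]
        self._push(node)
        m = (l + r) // 2
        return max(self.query(lo, hi, 2 * node, l, m),
                   self.query(lo, hi, 2 * node + 1, m, r))
```

For example, adding the intervals [0, 4) and [2, 6) gives a peak coverage of 2; removing [2, 6) (adding −1 over it) drops the peak back to 1.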

  3. Temporal Joins: From O(n^2) to O(n log n) (concepts: pyTemporalJoin, pyMergeAsof, pySparkRangeJoin)

    The temporal join is one of the most common and most expensive operations in data engineering. 'Join events from table A with events from table B that occurred within ±5 minutes.' The naive nested loop is O(n^2): for every event in A, scan all events in B and check the time condition. On tables with 10 million rows, that is 100 trillion comparisons. The sort + sweep approach brings this to O(n log n), and understanding WHY is a key staff-level interview signal.
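The sort + sweep idea can be sketched as a sliding window over two sorted streams: sort both sides once, then for each left event advance a window pointer past right events that are too old to ever match again. Each pointer only moves forward, so the cost is O(n log n + m log m) for the sorts plus the matches emitted. (In production this is what pandas' `merge_asof` and Spark range joins implement; the sketch below is illustrative.)

```python
def temporal_join(a, b, tol):
    """Pair each timestamp in `a` with every timestamp in `b`
    within +/- tol, via sort + sliding window."""
    a = sorted(a)
    b = sorted(b)
    out = []
    lo = 0
    for t in a:
        # advance the window start past events too old to match t
        # (or any later t, since a is sorted -- this is why lo never rewinds)
        while lo < len(b) and b[lo] < t - tol:
            lo += 1
        j = lo
        while j < len(b) and b[j] <= t + tol:
            out.append((t, b[j]))
            j += 1
    return out
```

For instance, joining [10, 20] against [8, 14, 21] with tolerance 5 matches 10 with both 8 and 14, and 20 with 21.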

  4. Sessionization at Scale (concepts: pySessionization, pyLateEvents, pyFlinkSessions, pySparkSessionize)

    Sessionization is one of the most common data engineering problems at consumer tech companies and one of the least well understood. A session is a group of user events that are 'close enough' in time. The boundary conditions: gap-based (session ends after 30 minutes of inactivity), activity-based (session ends after N events regardless of time), or hybrid (whichever comes first). Each has a different algorithm, different correctness guarantees, and critically, different behavior under late-arriving data.
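The gap-based variant reduces to a single sorted pass: a new session starts whenever the gap since the previous event exceeds the threshold. A minimal single-user sketch (illustrative names; real pipelines partition by user and must handle late events re-splitting or merging sessions):

```python
def sessionize(timestamps, gap):
    """Gap-based sessionization for one user's event timestamps.
    A new session starts when the inter-event gap exceeds `gap`."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= gap:
            sessions[-1].append(t)  # within gap: extend current session
        else:
            sessions.append([t])    # gap exceeded: open a new session
    return sessions
```

Note the subtlety the lesson points at: because a late-arriving event can bridge two previously separate sessions into one, any precomputed session table may need to be reprocessed when late data lands.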

  5. Design: Time-Series Compaction System (concepts: pyCompaction, pyTombstones, pyMergeAmplification, pyLSMTree)

    This is the staff-level system design version of merge intervals. A time-series database accumulates write files (SSTables, segments, L0 files) continuously. Over time, these files overlap in time range: file A covers [1000, 2000], file B covers [1500, 2500], file C covers [800, 1200]. Reads that touch any of these time ranges must scan all overlapping files. As file count grows, read amplification explodes. Compaction merges overlapping files into fewer, non-overlapping ones — restoring read performance.
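Planning which files to compact together is the classic merge-intervals sweep: sort files by start time, then group transitively overlapping ranges so each group compacts into one non-overlapping output file. A minimal sketch (illustrative; a real compactor also weighs file sizes, levels, and write amplification):

```python
def plan_compaction(files):
    """files: list of (start, end) time ranges of on-disk segments.
    Returns a list of (merged_range, member_files) compaction groups,
    built with the standard sort-and-merge-intervals sweep."""
    groups = []  # each entry: [merged_start, merged_end, members]
    for s, e in sorted(files):
        if groups and s <= groups[-1][1]:
            # overlaps the current group: extend it
            groups[-1][1] = max(groups[-1][1], e)
            groups[-1][2].append((s, e))
        else:
            groups.append([s, e, [(s, e)]])
    return [((s, e), members) for s, e, members in groups]
```

On the example above, files covering [1000, 2000], [1500, 2500], and [800, 1200] all fall into one group spanning [800, 2500], so a single compaction pass replaces three overlapping files with one, and a read now touches one file instead of three.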
