Temporal Joins: From O(n^2) to O(n log n)
Concepts: pyTemporalJoin, pyMergeAsof, pySparkRangeJoin
The temporal join is one of the most common and most expensive operations in data engineering: "Join events from table A with events from table B that occurred within ±5 minutes." The naive nested loop is O(n^2): for every event in A, scan all events in B and check the time condition. On two tables of 10 million rows each, that is 100 trillion comparisons. The sort + sweep approach brings this to O(n log n), and understanding why is a key staff-level interview signal.

The Naive O(n^2) Problem

Sort + Sweep: O(n log n)

This is the merge-intervals pattern applied to a join. Each event in A defines an interval [ts_a - W, ts_a + W], where W is the window half-width (5 minutes in the example above). We want all events from B that fall within this interval. Instead of scanning all of B for each event in A, we sort both tables by timestamp, maintain a sliding window of B events, and slide it as we advance through A. Each