Sessionization at Scale
Concepts: pySessionization, pyLateEvents, pyFlinkSessions, pySparkSessionize
Sessionization is one of the most common data engineering problems at consumer tech companies and one of the least well understood. A session is a group of user events that are "close enough" in time. The boundary condition can be gap-based (the session ends after 30 minutes of inactivity), activity-based (the session ends after N events regardless of time), or hybrid (whichever comes first). Each has a different algorithm, different correctness guarantees, and, critically, different behavior under late-arriving events.

Gap-Based Sessionization: Python Batch

Spark SQL Equivalent

Late-Arriving Events: The Hard Problem

Batch sessionization is clean because you have all events before you start. Streaming sessionization is hard because events arrive late. An event with timestamp 10:25 might arrive at 11:15,