Data that never forgets
Topics covered: Event-Driven Architecture, Immutable Append-Only Logs, Event Sourcing, Clickstream Modeling, Handling Late-Arriving Data
State vs Events: Two Ways to Model Reality
A state-based system stores the current truth: account_balance = $1,000. An event-based system stores what happened: deposit($500), withdrawal($200), deposit($700). The current balance is derived by replaying the events: 500 - 200 + 700 = 1,000. Both representations agree on the current balance, but the event log carries more information, because you can reconstruct any past state, not just the current one. This is the fundamental insight of event-driven data modeling: events are the source o
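
To make the replay idea concrete, here is a minimal sketch using the deposit and withdrawal figures from the paragraph above. The names `Event`, `apply_event`, and `balance_after` are hypothetical, chosen only for illustration, and amounts are assumed to be whole dollars.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # "deposit" or "withdrawal" (assumed event kinds)
    amount: int  # whole dollars, an assumption to keep the arithmetic exact


def apply_event(balance: int, event: Event) -> int:
    """Return the balance after applying a single event."""
    if event.kind == "deposit":
        return balance + event.amount
    if event.kind == "withdrawal":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def balance_after(events, n=None):
    """Replay the first n events (all of them when n is None)."""
    balance = 0
    for event in events[:n]:
        balance = apply_event(balance, event)
    return balance


events = [Event("deposit", 500), Event("withdrawal", 200), Event("deposit", 700)]

current = balance_after(events)       # state after every event
after_two = balance_after(events, 2)  # a past state, recoverable only from events
```

A state-based store would hold only `current`. The event list can also answer what the balance was after the second event, which is the extra information the event model keeps.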
The Power of Never Deleting
An immutable log is a sequence of events that can only be appended to. You can add new events but never modify or delete existing ones. Kafka topics, database write-ahead logs, and Git commit histories all follow this append-only model in normal operation (retention policies, log compaction, or a history rewrite can still discard old entries, so immutability is a discipline as much as a guarantee). This immutability gives you three superpowers: replay, audit, and debugging. Replay: if your downstream aggregation is wrong, fix the logic and replay the log. The events are still there. Audit: every action is recorded with a timestamp and actor
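
The replay property can be sketched with a small in-memory log. This is an illustration under stated assumptions, not how Kafka or a write-ahead log is implemented: `AppendOnlyLog`, `buggy_total`, and `fixed_total` are hypothetical names, and the events are made-up examples.

```python
class AppendOnlyLog:
    """In-memory log that exposes append and replay, but no update or delete."""

    def __init__(self):
        self._events = []

    def append(self, event: dict) -> int:
        """Store a copy of the event and return its offset in the log."""
        self._events.append(dict(event))
        return len(self._events) - 1

    def replay(self, from_offset: int = 0):
        """Yield copies of the events in append order, starting at from_offset."""
        for event in self._events[from_offset:]:
            yield dict(event)


log = AppendOnlyLog()
log.append({"actor": "alice", "type": "deposit", "amount": 500})
log.append({"actor": "alice", "type": "withdrawal", "amount": 200})
log.append({"actor": "bob", "type": "deposit", "amount": 700})


def buggy_total(events):
    # Bug: withdrawals are added instead of subtracted.
    return sum(e["amount"] for e in events)


def fixed_total(events):
    # Corrected logic: withdrawals count as negative amounts.
    return sum(e["amount"] if e["type"] == "deposit" else -e["amount"] for e in events)


wrong = buggy_total(log.replay())
right = fixed_total(log.replay())  # same log, replayed through the fixed logic
```

Because the log itself was never touched, correcting the aggregation only requires rerunning it over `log.replay()`. Returning copies from `replay` is one simple way to keep consumers from mutating stored events.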
Deriving State from Events
Event sourcing is the pattern where events are the source of truth and all state is derived by replaying them. Instead of storing 'account balance = $1,000,' you store every deposit and withdrawal event. The balance is computed by summing the signed amounts of all events for that account (deposits positive, withdrawals negative). This is powerful but expensive. Replaying 10 years of events on every read to compute a current balance is impractical. The solution: snapshots. Periodically compute the current state and save it. To get the balance, st
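
A snapshot scheme might look like the sketch below. It assumes the common approach of starting from the most recent snapshot and replaying only the events recorded after it. `Snapshot`, `take_snapshot`, and `current_balance` are hypothetical names, and events are represented as simple `(kind, amount)` tuples for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    balance: int
    last_offset: int  # offset of the last event folded in; -1 means no events yet


def signed_amount(event):
    """Deposits count as positive, withdrawals as negative."""
    kind, amount = event
    return amount if kind == "deposit" else -amount


def take_snapshot(events, previous=Snapshot(0, -1)):
    """Fold every event after the previous snapshot into a new snapshot."""
    tail = events[previous.last_offset + 1:]
    balance = previous.balance + sum(signed_amount(e) for e in tail)
    return Snapshot(balance, len(events) - 1)


def current_balance(events, snapshot):
    """Start from the snapshot and replay only the events recorded after it."""
    tail = events[snapshot.last_offset + 1:]
    return snapshot.balance + sum(signed_amount(e) for e in tail)


events = [("deposit", 500), ("withdrawal", 200)]
snap = take_snapshot(events)             # covers the first two events
events.append(("deposit", 700))          # arrives after the snapshot was taken
balance = current_balance(events, snap)  # replays one event instead of three
```

The snapshot is only a cache: discarding it and replaying from offset 0 must give the same answer, which is a useful consistency check.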
Modeling User Behavior as Events
Clickstream data is one of the most common event streams in data engineering. Every page view, button click, scroll, and search is captured as an event. The volume is massive (millions to billions of events per day) and the schema is semi-structured (each event type has different properties). Clickstream events typically share a common schema: event_id, user_id, session_id, event_type, event_timestamp, page_url, and a properties payload with event-specific data. The prop
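
The common schema listed above could be modeled as in the sketch below. The field names come from the text. The `ClickEvent` class, the `new_event` helper, the example URLs, and the property keys are hypothetical, and storing the timestamp as an ISO 8601 UTC string is an assumption.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ClickEvent:
    event_id: str
    user_id: str
    session_id: str
    event_type: str
    event_timestamp: str  # ISO 8601 in UTC (an assumed encoding)
    page_url: str
    properties: dict = field(default_factory=dict)  # event-specific payload


def new_event(user_id, session_id, event_type, page_url, properties=None, when=None):
    """Build an event with a fresh event_id; 'when' defaults to the current UTC time."""
    when = when or datetime.now(timezone.utc)
    return ClickEvent(
        event_id=str(uuid.uuid4()),
        user_id=user_id,
        session_id=session_id,
        event_type=event_type,
        event_timestamp=when.isoformat(),
        page_url=page_url,
        properties=dict(properties or {}),
    )


# Two event types share the fixed columns but carry different properties.
search = new_event("u1", "s1", "search", "https://example.com/search",
                   {"query": "shoes", "results": 42})
view = new_event("u1", "s1", "page_view", "https://example.com/p/1",
                 {"referrer": "search"})

line = json.dumps(asdict(search))  # one JSON object per event, ready to append to a log
```

Keeping the shared fields as fixed columns and pushing everything else into `properties` is what makes the schema semi-structured: new event types can be added without changing the common columns.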
When Events Arrive After the Window Closes
In the real world, events do not always arrive in order. A mobile app queues clicks while offline and sends them hours later. A payment gateway batches settlements daily. A sensor loses connectivity and dumps a backlog. If your pipeline processes events by processing time (the wall-clock time at which the pipeline sees them), all of these produce wrong results. The mobile clicks land in the wrong hour. The settlements land on the wrong day. The fix: process by event time (when the ev
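
The difference between the two clocks can be shown with a small hourly count. This is a sketch with made-up timestamps: the `event_time` and `arrival_time` field names, the `utc` helper, and the 2024-01-01 date are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def hour_bucket(ts: datetime) -> str:
    """Truncate a timestamp to its hour, formatted as a readable bucket key."""
    return ts.strftime("%Y-%m-%d %H:00")


def utc(hour: int, minute: int) -> datetime:
    """Helper for the example: a UTC timestamp on an arbitrary, made-up day."""
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


events = [
    {"event_time": utc(9, 15), "arrival_time": utc(9, 15)},  # delivered immediately
    {"event_time": utc(9, 40), "arrival_time": utc(13, 5)},  # queued while offline
    {"event_time": utc(13, 2), "arrival_time": utc(13, 2)},  # delivered immediately
]

# Bucketing by arrival puts the offline click in the 13:00 hour.
by_arrival = Counter(hour_bucket(e["arrival_time"]) for e in events)

# Bucketing by event time puts it back in the 09:00 hour where it happened.
by_event_time = Counter(hour_bucket(e["event_time"]) for e in events)
```

Bucketing by event time means a bucket can still change after it has been computed, so a pipeline built this way needs some policy for how long it keeps accepting late events for a given hour.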