Event Streams

Traditional databases store the current state: the customer's address is Seattle. Event streams store what happened: the customer moved from Portland to Seattle on March 15. This is a fundamentally different way to model data. Events are immutable, append-only, and ordered. They enable replay, audit trails, and reconstructing state at any point in time. This lesson teaches you event-driven architecture, immutable logs, event sourcing, clickstream modeling, and how to handle the messy reality of events that arrive late.

Event-Driven Architecture

Daily Life
Interviews

Model systems that communicate via events

State vs Events: Two Ways to Model Reality

A state-based system stores the current truth: account_balance = $1,000. An event-based system stores what happened: deposit($500), withdrawal($200), deposit($700). The current balance is derived by replaying the events. Both representations contain the same information, but events are more powerful because you can reconstruct ANY past state, not just the current one.
This is the fundamental insight of event-driven data modeling: events are the source of truth. State is a projection of events at a point in time. If you store the events, you can always recompute the state. If you store only the state, the events are lost.
State-Based (OLTP)
  • Stores current value: balance = $1,000
  • Updates overwrite: SET balance = balance + 500
  • Simple queries: SELECT balance WHERE account_id = 1
  • History lost unless separately tracked
Event-Based (Stream)
  • Stores what happened: deposit($500)
  • Append-only: never update or delete
  • State is derived: SUM(amount) GROUP BY account_id
  • Full history preserved by default
In data engineering, you work with both. Source systems are state-based (PostgreSQL, MySQL). Your pipeline captures the changes as events (CDC, Kafka). Your analytical model stores either events (fact tables) or derived state (periodic snapshots). Understanding both paradigms is essential.
EventStateCommand
Event
Something That Happened
Immutable, timestamped, self-contained. 'User alice clicked button checkout at 2024-03-15T14:30:00Z.' Cannot be modified after creation.
State
Current Truth Derived from Events
'Alice has 3 items in her cart.' Computed by replaying cart_add and cart_remove events. Changes when new events arrive.
Command
Request to Do Something
'Add item X to cart.' May be accepted or rejected. If accepted, produces an event. Commands are transient; events are permanent.
TIP
When designing a data model, ask: 'Am I storing what happened, or what is currently true?' If the answer is 'what happened,' you are building an event model. If 'what is true,' you are building a state model. Most analytical systems need both: events for analysis, state for dashboards.

Immutable Append-Only Logs

Daily Life
Interviews

Store events that never get modified

The Power of Never Deleting

An immutable log is a sequence of events that can only be appended to. You can add new events but never modify or delete existing ones. Kafka topics, database write-ahead logs, and Git commit histories are all immutable logs. This immutability gives you three superpowers: replay, audit, and debugging.
Replay: if your downstream aggregation is wrong, fix the logic and replay the log. The events are still there. Audit: every action is recorded with a timestamp and actor. Debugging: reproduce any bug by replaying events up to the point of failure.

Designing an Event Schema

A well-designed event has: a unique event_id (UUID), an event_type ('order_placed', 'payment_received'), a timestamp (when it happened, in UTC), an actor (who or what caused it), and a payload (the data associated with the event).
eventsPKevent_id (UUID)event_typetimestamp_utcactor_idpayload (JSONB)source_system
The payload is typically semi-structured (JSON) because different event types have different fields. An order_placed event has order_id, items, total. A payment_received event has payment_id, amount, method. Storing the payload as JSONB or VARIANT allows a single table to hold all event types.

Corrections in an Immutable Log

If events are immutable, how do you correct mistakes? You do not modify the original event. You append a new event that compensates for it. 'Order O1 amount was $100' followed by 'Correction: Order O1 amount adjusted to $110.' The SUM of all events for O1 is the correct value. The full history is preserved.
Mutable State (DELETE/UPDATE)
  • Overwrite the wrong value
  • History of the error is lost
  • Cannot audit what happened
  • Simple but destroys information
Immutable Log (Compensating Events)
  • Append a correction event
  • Original error is visible for audit
  • SUM of all events = correct value
  • More complex but preserves everything
TIP
In financial systems, immutable logs with compensating events are not just a nice pattern. They are a regulatory requirement. Auditors must be able to see every change, including mistakes and corrections, with timestamps.

Event Sourcing

Daily Life
Interviews

Rebuild state from a sequence of events

Deriving State from Events

Event sourcing is the pattern where events are the source of truth and all state is derived by replaying them. Instead of storing 'account balance = $1,000,' you store every deposit and withdrawal event. The balance is computed by summing all events for that account.
This is powerful but expensive. Replaying 10 years of events to compute a current balance is impractical. The solution: snapshots. Periodically compute the current state and save it. To get the balance, start from the last snapshot and replay only events since then.

When Event Sourcing Makes Sense

Audit requirements: financial transactions, healthcare records, compliance systems where every change must be traceable.
Complex state derivation: systems where the current state depends on the full sequence of events (game state, workflow engines).
Temporal queries: 'What was the account balance on March 15?' requires replaying events up to that date.
Overkill for simple CRUD: a settings page where the user toggles preferences does not benefit from event sourcing.
Expensive to implement: requires event store, snapshot mechanism, replay logic, and schema versioning for events.

Event Sourcing in Data Engineering

As a data engineer, you rarely build an event-sourced application. But you frequently consume event-sourced data. A payments team stores all transactions as events. Your pipeline reads those events, derives the current state (balances, totals, counts), and loads it into the warehouse as periodic snapshots or materialized views.
Understanding event sourcing helps you design pipelines that are reprocessable. If your pipeline consumes events and produces state, you can always reprocess the events to fix bugs in your state derivation logic. This is the same principle.

Clickstream Modeling

Daily Life
Interviews

Track user behavior as structured events

Modeling User Behavior as Events

Clickstream data is the most common event stream in data engineering. Every page view, button click, scroll, and search is captured as an event. The volume is massive (millions to billions of events per day) and the schema is semi-structured (each event type has different properties).
Clickstream events typically share a common schema: event_id, user_id, session_id, event_type, event_timestamp, page_url, and a properties payload with event-specific data. The properties vary by event type: a 'purchase' event has amount and product_id, a 'search' event has query_text and result_count.
clickstream_eventsPKevent_iduser_idsession_idevent_typeevent_timestamppage_urlproperties (JSONB)

Sessionization

Raw clickstream events do not have a session_id. A session is a derived concept: a group of events by the same user with no gap longer than N minutes (typically 30). Assigning session IDs requires computing the time gap between consecutive events per user and starting a new session when the gap exceeds the threshold.
The pattern: LAG to compute the gap, CASE WHEN to flag session boundaries, cumulative SUM of flags to assign session IDs. This is one of the most common streaming and batch transformations in data engineering.

Event Properties: Typed vs Semi-Structured

Single Table + JSONB Properties
  • All event types in one table
  • properties column holds event-specific data
  • Schema-flexible: new event types need no DDL
  • Harder to query: JSON path access is slower
Table Per Event Type
  • Separate table for clicks, purchases, searches
  • Typed columns: amount DECIMAL, query_text VARCHAR
  • Fast queries: columnar encoding works natively
  • Schema-rigid: new event types need new tables
Most production systems use a hybrid: a single raw event table (bronze) with JSONB properties, then ETL extracts high-value event types into typed tables (silver/gold). The raw table preserves everything; the typed tables make common queries fast.
TIP
Partition clickstream tables by event_date and cluster by user_id. This makes both time-range queries ('events this week') and user-centric queries ('all events for user X') fast. Without this, every query scans the entire table.

Handling Late-Arriving Data

Daily Life
Interviews

Process events that arrive out of order

When Events Arrive After the Window Closes

In the real world, events do not arrive in order. A mobile app queues clicks while offline and sends them hours later. A payment gateway batches settlements daily. A sensor loses connectivity and dumps a backlog. If your pipeline processes events by wall-clock time (when the pipeline sees them), all of these produce wrong results. The mobile clicks land in the wrong hour. The settlements land on the wrong day.
The fix: process by event time (when the event happened), not processing time (when the pipeline sees it). But this creates a new problem: how do you know when all events for a given hour have arrived? The answer is: you do not. You use a watermark, a heuristic that says 'I believe all events before time T have arrived.' Events arriving after the watermark are late.

Strategies for Late Data

Strategy 1Strategy 2Strategy 3
Strategy 1
Insert into the historical partition
Reopen the closed partition, insert the late event at its correct event_time, recompute downstream aggregates. Most correct, but requires idempotent downstream pipelines.
Strategy 2
Side output for batch reconciliation
Route late events to a separate store (S3 bucket, DLQ topic). A daily batch job merges them into the main tables. Simpler operationally but introduces delay.
Strategy 3
Allowed lateness with grace period
Keep windows open for a grace period after the watermark passes. Events within the grace period update the result. Events after the grace period go to a side output.

Event Time vs Processing Time

Process by Event Time (Correct)
  • Event at 2:00 PM counted in the 2:00 PM window
  • Late events land in the correct historical window
  • Requires watermarks and late-data handling
  • Correct for analytics and reporting
Process by Processing Time (Wrong)
  • Event at 2:00 PM arriving at 5:00 PM counted in the 5:00 PM window
  • Late events corrupt the current window
  • Simple but produces wrong results
  • Only correct when events arrive in real-time order
Store BOTH timestamps on every event: event_time (when it happened) and processing_time (when the pipeline saw it). Aggregate by event_time for correctness. Use processing_time for pipeline monitoring: the gap between them is your lateness distribution.
TIP
Late data is not an edge case. It is the normal case for any system with mobile clients, IoT sensors, or batch upstream feeds. Design for it from day one. The simplest rule: store all events with event_time, partition by event_date, and never use NOW() in a pipeline query.
PUTTING IT ALL TOGETHER

> You are building a pipeline for a mobile app's clickstream data. Events arrive via Kafka, some up to 24 hours late due to offline queuing.

You store raw events in an append-only bronze table partitioned by event_date, not processing_date. Each event has event_id, user_id, event_type, event_timestamp, and a JSONB properties column.
Late events are inserted into their correct historical partition. Downstream daily aggregates are marked for recomputation when late events arrive.
You extract high-value event types (purchase, signup) into typed silver tables with proper columns. The raw bronze table preserves everything for replay.
KEY TAKEAWAYS
Events are the source of truth: state is derived by replaying events; if you store events, you can always recompute state
Immutable logs: append-only, never modify; corrections use compensating events
Event sourcing: powerful for audit and temporal queries; use snapshots to avoid full replay
Clickstream: raw events in JSONB, sessionization via LAG + cumulative SUM, partition by date + cluster by user
Late data is normal: process by event_time, use watermarks, store both event_time and processing_time

Event Streams

Data that never forgets

Category
Data Modeling
Duration
27 minutes
Challenges
12 hands-on challenges

Topics covered: Event-Driven Architecture, Immutable Append-Only Logs, Event Sourcing, Clickstream Modeling, Handling Late-Arriving Data

Lesson Sections

  1. Event-Driven Architecture (concepts: dmEventSourcing)

    State vs Events: Two Ways to Model Reality A state-based system stores the current truth: account_balance = $1,000. An event-based system stores what happened: deposit($500), withdrawal($200), deposit($700). The current balance is derived by replaying the events. Both representations contain the same information, but events are more powerful because you can reconstruct ANY past state, not just the current one. This is the fundamental insight of event-driven data modeling: events are the source o

  2. Immutable Append-Only Logs (concepts: dmImmutableLogs)

    The Power of Never Deleting An immutable log is a sequence of events that can only be appended to. You can add new events but never modify or delete existing ones. Kafka topics, database write-ahead logs, and Git commit histories are all immutable logs. This immutability gives you three superpowers: replay, audit, and debugging. Replay: if your downstream aggregation is wrong, fix the logic and replay the log. The events are still there. Audit: every action is recorded with a timestamp and actor

  3. Event Sourcing

    Deriving State from Events Event sourcing is the pattern where events are the source of truth and all state is derived by replaying them. Instead of storing 'account balance = $1,000,' you store every deposit and withdrawal event. The balance is computed by summing all events for that account. This is powerful but expensive. Replaying 10 years of events to compute a current balance is impractical. The solution: snapshots. Periodically compute the current state and save it. To get the balance, st

  4. Clickstream Modeling

    Modeling User Behavior as Events Clickstream data is the most common event stream in data engineering. Every page view, button click, scroll, and search is captured as an event. The volume is massive (millions to billions of events per day) and the schema is semi-structured (each event type has different properties). Clickstream events typically share a common schema: event_id, user_id, session_id, event_type, event_timestamp, page_url, and a properties payload with event-specific data. The prop

  5. Handling Late-Arriving Data (concepts: dmLateArriving)

    When Events Arrive After the Window Closes In the real world, events do not arrive in order. A mobile app queues clicks while offline and sends them hours later. A payment gateway batches settlements daily. A sensor loses connectivity and dumps a backlog. If your pipeline processes events by wall-clock time (when the pipeline sees them), all of these produce wrong results. The mobile clicks land in the wrong hour. The settlements land on the wrong day. The fix: process by event time (when the ev