Clickstream pipeline interview questions for data engineer roles. 10B-event-per-day ingestion patterns. Web SDK to local buffer to CDN-fronted ingest to Kafka to Spark Structured Streaming to Parquet on S3 to dbt to Snowflake. Sessionization with 30-minute gap. Deduplication with composite natural key. Late-arriving events with watermarks and idempotent MERGE.

Clickstream pipeline design is one of the most-asked scenarios in 2026 data engineer interviews at consumer-internet companies (Meta, Pinterest, Spotify, Airbnb, Uber, DoorDash). The canonical architecture starts with the web or mobile SDK and ends with a gold star schema in the warehouse. Each component has its own failure modes and the senior data engineer is expected to drill on each.

The canonical 10B-event-per-day clickstream architecture. Step 1: web SDK writes events to a local buffer (sessionStorage or IndexedDB) so the app does not block on network. Step 2: local buffer flushes to a CDN-fronted ingest endpoint (Cloudflare Workers or AWS CloudFront with Lambda@Edge) that validates the event schema and writes to Kafka. Step 3: Kafka topic with 24 partitions for 100k-events-per-second throughput, replication factor 3, retention 7 days for replay. Step 4: Spark Structured Streaming consumer with 1-minute trigger reads Kafka, dedups on (event_id, source), writes Parquet to S3 partitioned by date and hour. Step 5: dbt micro-batch on hourly cadence reads S3, applies sessionization (LAG on event_ts, new session when gap exceeds 30 minutes), and writes to silver tables. Step 6: dbt on the silver tables produces gold star schemas in Snowflake with conformed dim_user, dim_page, dim_device.

Five failure modes the L5+ rubric explicitly tests. SDK buffer overflow: the device loses network for hours; buffer fills; oldest events get dropped. Mitigation: bounded buffer (10MB max), drop-oldest policy, telemetry counter for dropped events. CDN edge failure: the CDN region serving the user fails; events fail to ingest. Mitigation: multi-CDN failover, retry with exponential backoff from the SDK side. Kafka broker dies: replication factor 3 handles N-1 failures; alerting on ISR shrinkage. Spark Structured Streaming consumer lag: events accumulate in Kafka faster than Spark can consume. Mitigation: autoscale executors, increase parallelism, alert on consumer lag exceeding threshold. Late-arriving events: a phone offline for 3 hours surfaces 1000 events with old event_time. Mitigation: watermark configured to allow 24 hours of lateness, dbt sessionization re-runs on late partitions, gold tables MERGE on (event_id) so late events update past aggregates.

Sessionization is the SQL-heavy part of the clickstream design. The 30-minute-gap session definition: group consecutive events into sessions where the gap between consecutive events does not exceed 30 minutes. SQL pattern: LAG(event_ts) to get the previous event's timestamp; CASE WHEN current minus lagged exceeds 30 minutes THEN 1 ELSE 0 AS new_session; SUM(new_session) OVER PARTITION BY user_id ORDER BY event_ts AS session_id; GROUP BY (user_id, session_id) for per-session aggregates. Edge cases: the first event for each user has NULL LAG; COALESCE to a large value to start a new session. Same-millisecond events: secondary sort by event_id within ORDER BY to make session assignment deterministic.

Deduplication is the other recurring concern. Web SDKs sometimes send the same event twice due to retry-on-network-error logic. Server-side dedup uses the composite key (event_id, source) with the source identifying the SDK version or ingest endpoint. ROW_NUMBER OVER PARTITION BY (event_id, source) ORDER BY ingest_ts ASC, filter rn = 1 keeps the first arrival. Downstream gold tables MERGE on the same composite key so re-runs are idempotent.

Clickstream Pipeline Interview Questions

Clickstream ingestion pipeline design for data engineer interview prep.

123 practice problems matching this filter. Difficulty: medium (57), hard (66).

Pipeline Architecture (123)

Common questions

What is the canonical architecture for a 10B-event-per-day clickstream pipeline?
Web SDK to local buffer to CDN-fronted ingest endpoint to Kafka (24 partitions, replication factor 3) to Spark Structured Streaming (1-minute trigger, dedup on event_id+source) to Parquet on S3 (partitioned by date/hour) to dbt micro-batch (hourly sessionization) to gold star schemas in Snowflake. Failure modes drilled per component.
How does a data engineer handle clickstream deduplication?
Composite key (event_id, source) where event_id is generated SDK-side (UUID v7 for time-orderable) and source identifies the SDK version. Server-side dedup with ROW_NUMBER OVER PARTITION BY (event_id, source) ORDER BY ingest_ts ASC, filter rn=1. Downstream gold tables MERGE on the same key so re-runs are idempotent. SDKs sometimes resend events due to network-retry logic; dedup is unavoidable.
How does sessionization work in a clickstream pipeline?
30-minute-gap session definition typically. SQL pattern: LAG(event_ts) to get previous event timestamp, CASE WHEN current minus lagged exceeds 30 minutes THEN 1 ELSE 0 AS new_session, SUM(new_session) OVER PARTITION BY user_id ORDER BY event_ts AS session_id, GROUP BY (user_id, session_id) for per-session aggregates. Edge cases: first event per user has NULL LAG (COALESCE to large value), same-millisecond events need secondary sort.
How does a clickstream pipeline handle late-arriving events?
Watermark configured to allow up to 24 hours of lateness in the streaming consumer. Spark Structured Streaming with watermark + foreachBatch sink for late-data handling. dbt sessionization re-runs on late partitions. Gold tables MERGE on (event_id) so late events update past aggregates without overwriting. Allowed lateness beyond the watermark goes to a side-output for inspection.
What is the role of the local SDK buffer?
Avoids blocking the app on network calls. Events write synchronously to a local store (sessionStorage, IndexedDB, or platform-specific equivalent). A background flusher batches and ships to the ingest endpoint. Bounded buffer size (10MB max) with drop-oldest policy on overflow. Telemetry counter for dropped events alerts when buffer policy needs revisiting.
How many Kafka partitions does a 10B-event-per-day pipeline need?
Roughly 24 partitions for 100k events-per-second throughput with 1KB events. Math: 10B/day = 116k events/sec average, peak ~5x = 580k events/sec, 1KB each = 580 MB/sec peak. Kafka safely handles 10-20 MB/sec per partition for low-latency. 580 / 20 = 29 partitions for peak headroom. Round up to 24 or 32. Partition count is hard to change post-creation; size for 2-3x growth.
How does a data engineer prevent the thundering-herd problem on the ingest endpoint?
CDN-fronted with edge-cached schema validation (Cloudflare Workers or Lambda@Edge). SDK adds random jitter to retry-after timestamps to avoid synchronized retries. Rate-limit per-client at the edge. Circuit breaker on the ingest endpoint to shed load gracefully when downstream Kafka is backed up.