What is the canonical architecture for a 10B-event-per-day clickstream pipeline?

Web SDK to local buffer to CDN-fronted ingest endpoint to Kafka (24 partitions, replication factor 3) to Spark Structured Streaming (1-minute trigger, dedup on event_id+source) to Parquet on S3 (partitioned by date/hour) to dbt micro-batch (hourly sessionization) to gold star schemas in Snowflake. Failure modes drilled per component.

How does a clickstream pipeline handle late-arriving events?

Watermark configured to allow up to 24 hours of lateness in the streaming consumer. Spark Structured Streaming with watermark + foreachBatch sink for late-data handling. dbt sessionization re-runs on late partitions. Gold tables MERGE on (event_id) so late events update past aggregates without overwriting. Allowed lateness beyond the watermark goes to a side-output for inspection.

What is the role of the local SDK buffer?

Avoids blocking the app on network calls. Events write synchronously to a local store (sessionStorage, IndexedDB, or platform-specific equivalent). A background flusher batches and ships to the ingest endpoint. Bounded buffer size (10MB max) with drop-oldest policy on overflow. Telemetry counter for dropped events alerts when buffer policy needs revisiting.

How many Kafka partitions does a 10B-event-per-day pipeline need?

Roughly 24 partitions for 100k events-per-second throughput with 1KB events. Math: 10B/day = 116k events/sec average, peak ~5x = 580k events/sec, 1KB each = 580 MB/sec peak. Kafka safely handles 10-20 MB/sec per partition for low-latency. 580 / 20 = 29 partitions for peak headroom. Round up to 24 or 32. Partition count is hard to change post-creation; size for 2-3x growth.

How does a data engineer prevent the thundering-herd problem on the ingest endpoint?

CDN-fronted with edge-cached schema validation (Cloudflare Workers or Lambda@Edge). SDK adds random jitter to retry-after timestamps to avoid synchronized retries. Rate-limit per-client at the edge. Circuit breaker on the ingest endpoint to shed load gracefully when downstream Kafka is backed up.

Clickstream Pipeline Interview Questions (10B/Day)

Clickstream Pipeline Interview Questions

Clickstream ingestion pipeline design for data engineer interview prep.

Clickstream pipeline interview questions for data engineer roles. 10B-event-per-day ingestion patterns. Web SDK to local buffer to CDN-fronted ingest to Kafka to Spark Structured Streaming to Parquet on S3 to dbt to Snowflake. Sessionization with 30-minute gap. Deduplication with composite natural key. Late-arriving events with watermarks and idempotent MERGE.

Clickstream pipeline design is one of the most-asked scenarios in 2026 data engineer interviews at consumer-internet companies (Meta, Pinterest, Spotify, Airbnb, Uber, DoorDash). The canonical architecture starts with the web or mobile SDK and ends with a gold star schema in the warehouse. Each component has its own failure modes and the senior data engineer is expected to drill on each.

The canonical 10B-event-per-day clickstream architecture. Step 1: web SDK writes events to a local buffer (sessionStorage or IndexedDB) so the app does not block on network. Step 2: local buffer flushes to a CDN-fronted ingest endpoint (Cloudflare Workers or AWS CloudFront with Lambda@Edge) that validates the event schema and writes to Kafka. Step 3: Kafka topic with 24 partitions for 100k-events-per-second throughput, replication factor 3, retention 7 days for replay. Step 4: Spark Structured Streaming consumer with 1-minute trigger reads Kafka, dedups on (event_id, source), writes Parquet to S3 partitioned by date and hour. Step 5: dbt micro-batch on hourly cadence reads S3, applies sessionization (LAG on event_ts, new session when gap exceeds 30 minutes), and writes to silver tables. Step 6: dbt on the silver tables produces gold star schemas in Snowflake with conformed dim_user, dim_page, dim_device.

Five failure modes the L5+ rubric explicitly tests. SDK buffer overflow: the device loses network for hours; buffer fills; oldest events get dropped. Mitigation: bounded buffer (10MB max), drop-oldest policy, telemetry counter for dropped events. CDN edge failure: the CDN region serving the user fails; events fail to ingest. Mitigation: multi-CDN failover, retry with exponential backoff from the SDK side. Kafka broker dies: replication factor 3 handles N-1 failures; alerting on ISR shrinkage. Spark Structured Streaming consumer lag: events accumulate in Kafka faster than Spark can consume. Mitigation: autoscale executors, increase parallelism, alert on consumer lag exceeding threshold. Late-arriving events: a phone offline for 3 hours surfaces 1000 events with old event_time. Mitigation: watermark configured to allow 24 hours of lateness, dbt sessionization re-runs on late partitions, gold tables MERGE on (event_id) so late events update past aggregates.

Sessionization is the SQL-heavy part of the clickstream design. The 30-minute-gap session definition: group consecutive events into sessions where the gap between consecutive events does not exceed 30 minutes. SQL pattern: LAG(event_ts) to get the previous event's timestamp; CASE WHEN current minus lagged exceeds 30 minutes THEN 1 ELSE 0 AS new_session; SUM(new_session) OVER PARTITION BY user_id ORDER BY event_ts AS session_id; GROUP BY (user_id, session_id) for per-session aggregates. Edge cases: the first event for each user has NULL LAG; COALESCE to a large value to start a new session. Same-millisecond events: secondary sort by event_id within ORDER BY to make session assignment deterministic.

Deduplication is the other recurring concern. Web SDKs sometimes send the same event twice due to retry-on-network-error logic. Server-side dedup uses the composite key (event_id, source) with the source identifying the SDK version or ingest endpoint. ROW_NUMBER OVER PARTITION BY (event_id, source) ORDER BY ingest_ts ASC, filter rn = 1 keeps the first arrival. Downstream gold tables MERGE on the same composite key so re-runs are idempotent.

Data pipeline interview questions - The pipeline pillar: end-to-end design plus operational rounds, backfills, retries, schema drift.
Data engineer system design interview prep - Full design prep including clickstream and other scenarios.
Streaming system design interview questions - Kafka + Flink + Spark Structured Streaming patterns.
Data engineering system design questions - Scenario catalog including clickstream variants.
Data engineer system design problems - Practice problems with rubric-scored verdicts.
Kafka system design interview questions - Partition strategy for high-throughput ingest.
CDC pipeline interview questions - Complementary pattern for transactional source data.
Advanced SQL: sessionization - LAG + SUM OVER for clickstream sessionization.
Meta data engineer interview questions - Meta's clickstream and ads attribution pipelines.
PySpark for clickstream pipelines - Structured streaming code for clickstream ingest.

What is the canonical architecture for a 10B-event-per-day clickstream pipeline?: Web SDK to local buffer to CDN-fronted ingest endpoint to Kafka (24 partitions, replication factor 3) to Spark Structured Streaming (1-minute trigger, dedup on event_id+source) to Parquet on S3 (partitioned by date/hour) to dbt micro-batch (hourly sessionization) to gold star schemas in Snowflake. Failure modes drilled per component.
How does a data engineer handle clickstream deduplication?: Composite key (event_id, source) where event_id is generated SDK-side (UUID v7 for time-orderable) and source identifies the SDK version. Server-side dedup with ROW_NUMBER OVER PARTITION BY (event_id, source) ORDER BY ingest_ts ASC, filter rn=1. Downstream gold tables MERGE on the same key so re-runs are idempotent. SDKs sometimes resend events due to network-retry logic; dedup is unavoidable.
How does sessionization work in a clickstream pipeline?: 30-minute-gap session definition typically. SQL pattern: LAG(event_ts) to get previous event timestamp, CASE WHEN current minus lagged exceeds 30 minutes THEN 1 ELSE 0 AS new_session, SUM(new_session) OVER PARTITION BY user_id ORDER BY event_ts AS session_id, GROUP BY (user_id, session_id) for per-session aggregates. Edge cases: first event per user has NULL LAG (COALESCE to large value), same-millisecond events need secondary sort.
How does a clickstream pipeline handle late-arriving events?: Watermark configured to allow up to 24 hours of lateness in the streaming consumer. Spark Structured Streaming with watermark + foreachBatch sink for late-data handling. dbt sessionization re-runs on late partitions. Gold tables MERGE on (event_id) so late events update past aggregates without overwriting. Allowed lateness beyond the watermark goes to a side-output for inspection.
What is the role of the local SDK buffer?: Avoids blocking the app on network calls. Events write synchronously to a local store (sessionStorage, IndexedDB, or platform-specific equivalent). A background flusher batches and ships to the ingest endpoint. Bounded buffer size (10MB max) with drop-oldest policy on overflow. Telemetry counter for dropped events alerts when buffer policy needs revisiting.
How many Kafka partitions does a 10B-event-per-day pipeline need?: Roughly 24 partitions for 100k events-per-second throughput with 1KB events. Math: 10B/day = 116k events/sec average, peak ~5x = 580k events/sec, 1KB each = 580 MB/sec peak. Kafka safely handles 10-20 MB/sec per partition for low-latency. 580 / 20 = 29 partitions for peak headroom. Round up to 24 or 32. Partition count is hard to change post-creation; size for 2-3x growth.
How does a data engineer prevent the thundering-herd problem on the ingest endpoint?: CDN-fronted with edge-cached schema validation (Cloudflare Workers or Lambda@Edge). SDK adds random jitter to retry-after timestamps to avoid synchronized retries. Rate-limit per-client at the edge. Circuit breaker on the ingest endpoint to shed load gracefully when downstream Kafka is backed up.

144 practice problems matching this filter. Difficulty: medium (71), hard (72), easy (1).

Pipeline Architecture (144)

45 Minutes Turned Into 3.5 Hours - medium - Spark jobs are running. Just not fast enough.
600 Million Events a Day - hard - 600 million events a day. Two years of retention.
A Clean Number for Every Merchant - hard - Raw payment logs in. Clean merchant summaries out.
A Million Cars Phoning Home - hard - Every vehicle is a sensor. Deploy the pipeline to catch it all.
A Million Moving Dots - medium
Analysts Are Slowing the Store Down - medium - Orders placed. Data warehouse hungry.
A New Column on a Billion Rows - hard - Add and backfill a new column to a billion-row production table with zero downtime.
A Shared Drive Full of Contracts - medium - Buried in PDFs. The data is in there somewhere.
A Stream All Day and a File at Midnight - hard - Real-time and batch. Same pipeline. No compromises.
Badging Items That Already Sold Out - hard - Same-day delivery. The features have to be faster.
Basel, CCAR, and Monday Morning - medium - The regulator does not accept 'eventually consistent.'
Before the Batch Is Lost - hard
Bikes Before Rush Hour - hard - Bikes in, bikes out. The city needs to predict demand.
Credit for Every Touch - medium - They saw the ad, clicked the email, then bought. Who gets credit?
Disappearing Ink - easy
Everything Lands, Then It Ships - medium
Doubling Every Six Months - hard - Tuesdays are quiet. Black Friday is not.
Eight-Hour-Old Positions - hard - Positions shift by the second. The math cannot lag.
Eight Teams, Eight Latencies - medium - Millions of gamers. The architecture decision changes everything.
End of Day Is Too Late - medium - Every swipe tells a story.
Equities, ETFs, and the SEC - hard - Fractional shares, multi-currency, point-in-time. All of it.
Event System for Multiple Consumers - hard - One event, many hungry consumers.
Every Dataset Needs a Paper Trail - hard - The FDA has opinions about your data pipeline.
Every Deal Is a Financial Transaction - hard - Real money on the table. Reconstruct every hand.
Every Device, Every Impression - hard - Every ad seen. Every second watched. Real-time.
Every Device Has Its Own Dialect - medium - Three sources. Three formats. Same workout.
Every Firm Formats It Differently - medium - The regulator changed the format. Again. Handle it.
Every Format Imaginable - hard - PDFs, HL7, JSON. All of it lands in the same lake.
Everyone Wants the Same Data, Differently - hard - How you store it decides how fast you can read it.
Every Region Exports Its Own Way - medium - Sales data, BigQuery, Dataflow. Make it all sing.
Every Scan, Every Parcel, Every Pin Code - medium - Out for delivery. Delivered. Except the events arrived backwards.
Every Version of You - medium
Fifty Thousand Retailers - medium - Retail data at CPG scale. Every SKU, every store.
Five Times the Traffic, Five Times the Bill - hard - Scale up when needed. Do not bankrupt the team.
Five Years of Cron Jobs - hard - Half the jobs run on cron. Half run on events. All of it has to move.
Flying Blind Until Midnight - hard - Intraday risk, full lineage. The regulator is watching.
Four Teams, One Topic, No Agreement - hard - Everybody is writing to it. Nobody documented it. Now production is fragile.
Fresh and Forever - medium
Greenfield Build for Six Sources - hard - Infrastructure as code. Meaning as a service.
Half a Million Rental Cars - medium - Every vehicle is reporting. Every rental matters.
The Identity Problem - hard - Old systems. New demands. The same customer appears under three different names.
Listens From Everywhere, Counted Once - hard - Phones, tablets, laptops. And some of them report late.
Live Viewers, Live Billing - hard - The stream is live. The data cannot wait.
Mark to Market - medium
Near-Real-Time Trending Dishes Dashboard - hard - The dish rankings update faster than the kitchen.
Nested Docs, Flat Reports - medium - Two databases. One direction. No data left behind.
Nightly Exports Are Too Slow - medium - Healthcare claims change constantly. The warehouse cannot fall behind.
4,500 Stores Before Sunrise - medium - The shelves open at 7. The data better be there.
Not Every Team Can See Every Row - hard - Everyone can see the bucket. Not everyone should.
One Bill Across Three Clouds - medium - AWS, Azure, GCP. Three bills. One truth.
One Earthquake, Ten Thousand Tweets - hard - The firehose is on. Separate signal from noise.
Out of the Data Center - medium - The on-prem servers are not getting any younger.
The Speed Layer - medium - Dashboards can't wait for raw logs. Something has to happen upstream.
Prove the Number Is Right - hard - Bad data in fintech is not just messy. It is expensive.
Real Data, Fake Patients - hard - Dev needs production data. HIPAA says absolutely not.
The Register Never Sleeps - medium - Every swipe lands in the warehouse. The table has to stay current without breaking.
Recommendations Now, Royalties Later - medium - The catalog updated. Did anyone notice?
Replicate It Without Breaking It - hard - The source changed. The lake needs to know immediately.
Risk Models on Week-Old Data - medium - Loan approved. Loan denied. Every decision is an event.
SaaS API Connector with Incremental Sync - medium - The API has rate limits. You have deadlines.
Same-Day Sales, Every Store - medium - The cash register data needs to be queryable by morning.
The Living Table - medium - Data lands continuously. History must survive every update.
Score It Before It Clears - hard - The fraudsters move fast. Your pipeline has to move faster.
Seconds and Months - medium
Seconds to Trend - medium
Ship Before Fraud Finishes Checking - hard - The claim looks clean. The fraud model disagrees.
Six Hours to Miss a Deadline - medium - The rebuild works. It just doesn't finish in time.
Six Hours to Refresh Every Number - medium - Ratings change. The incremental model has to keep pace.
Six Million Rows Before the Market Opens - medium - One massive CSV. Millions of timestamps.
Six Sources, One Platform - medium - ADF orchestrates. Unity Catalog governs. Nothing leaks.
Sixty Minutes, Every Hour - medium - Every hour, on the hour. No excuses.
Someone Else's Server - hard
Stores and the Site, Together - hard - The registers never stop ringing.
Store, Site, and Distributor - medium - Sales data is piling up. Someone has to make sense of it.
The Acquisition Still Taking Bookings - hard - Two systems, two schemas. One truth.
The Agency That Changes the Columns - medium - The schema changed overnight. Again.
The Analysts Cannot Touch Production - medium - Production is the source. Analytics needs its own copy.
The Analyst Who Saw the Salary Data - hard - Two incidents. One shared lake. The access model was never designed, just assumed.
The API Drip Feed - medium - The API gives you 100 records at a time. You need millions.
The Bad Row That Broke the Dashboard - medium - Bad records cannot reach the warehouse.
The Binding and the Claim - medium - Policies are instant. Claims take their time.
The Booking That Came Three Ways - hard - PMS, OTA, and website all think they took the reservation first.
The Boutique That Sold in Six Currencies - hard - Every sale is real. The rate it was converted at depends on who is asking.
The Bucket Full of Resumes - medium - A thousand resumes. Structured data inside each one.
The Carrier Moving to Azure - medium - Claims arrive messy. The medallion cleans them up.
The Claim That Picks Its Own Lane - medium - Three entry points. Different workflows. All must route correctly.
The Clicks We Throw Away - hard - Every tap, swipe, and scroll. At scale.
The Clock That Runs Two Ways - hard - Nightly batch and live events. One dashboard.
The Consent Stitcher - medium - Consent was given. Or was it? Stitch the records together.
The Dashboard and the Attribution Model - hard - Streaming and batch. One pipeline to rule them.
The Decision Before the Door Closes - hard - The window to stop it is smaller than you think.
The Distributor Filing Problem - medium - Hundreds of suppliers. One warehouse. One deadline.
The Early Warning - medium
The Event Pile - hard - 600 million clicks a day. The budget is not infinite.
The Fare Aggregator - medium - Airfares shift every minute. Catch the best ones.
The Firehose and the Ledger - hard
The Fleet That Never Stops - hard - Every truck is talking. Not everyone can hear them yet.
The Leaderboard That Costs $25K a Month - hard - Product wants it live. Engineering has a price tag.
The Ledger and the Live Wire - hard
The Meal Kit That Knows You - medium - What they ordered says a lot about what they want next.
The Metric That Moved - medium
The Migration That Cannot Break Morning - hard - It all works today. Moving it without losing a single report is the hard part.
The Models Going Stale - hard - The model is only as good as what you feed it.
The Morning File - medium
The Next Track - medium
The Panel and the Set-Top Boxes - hard - Set-top boxes tell you who watched. Projection tells you how many.
The Patients We Cannot Move - hard - Patient data stays local. Insights have to be global.
The Points Arrive Two Days Late - medium - The bank data shows up late. The rewards were already sent.
The Provider That Sometimes Sleeps - medium - The models run at dawn. The data has to be there first.
The Query That Used to Be Fast - medium - Queries used to be fast. Something changed.
The Queue That Wouldn't Stop Growing - medium - 500,000 messages behind and the number keeps climbing.
The Revenue That Was Wrong for Two Weeks - medium - Nobody caught it until the CFO asked a question. Design the system that catches it first.
The Sale That Needs to Land Now - medium - Three channels feeding one view. Not all of them speak the same language.
The Same Stream Twice - hard
The Signals That Power Recommendations - medium - Fresh signals, many teams, one pipeline.
The Thirty-Second Rule - medium
Counted Once, Remembered Forever - hard
The User Who Asked to Be Forgotten - hard - Users want their data erased. Completely.
The Vendor Who Never Warns You - medium - Every month, something is different. The dashboards have no idea.
The What-If Machine - hard - A million slots. A thousand campaigns. Every combination matters.
The Whiteboard Exercise - medium - Marker in hand. Draw the whole thing.
Thirty Cities, One Forecast - hard - Five cities. Five data formats. One prediction.
Thirty Countries, One Solvency Number - hard - Premiums collected globally. Losses happen locally.
Thirty Million Unique Jobs a Year - hard - One press run, many orders. Group them right.
Thousands of Practices, One Dataset - hard - Patient records in, operational insights out.
Three Providers, One Workout - hard - The same ride, reported three times.
Three Regions, One Finance Team - hard - Payments from everywhere. One consistent report.
Three Regions, One Report - hard - Three regions, billions of payments, one merchant summary by 6 AM.
Towers and Phones, Same Story - hard - Tower signals meet app events. Somewhere in between is the truth.
Traders, Risk, and the Regulators - medium - Markets move in milliseconds. The pipeline has to keep up.
Two Hundred Million Redirects - medium - Billions of clicks. One tiny code. Two very different clocks.
Two Million Boxes by Monday Morning - hard - Shipped, maybe. Delivered, debatable.
Two Sources of Truth - medium
Two Systems, One Room Count - hard - Two booking systems. Rooms do not duplicate themselves.
Two Ways to Catch a Change - medium - Two ways to watch the database. Each has a cost.
Two Years of Every Click - hard - Every click, every aisle, every day for two years.
Two Years of Clicks, Cheap - hard - Two years of clicks. Every query has to be affordable.
What Everyone Is Watching - hard - Someone is watching. Capture everything.
What Should We Recommend Tonight - hard - They ordered pad thai twice. That means something.
Where Is Every Truck, Right Now - medium - Trucks are moving. Every ping counts.
Where the Crowd Goes - medium
Which Promotion Is Actually Working - hard - Was the promotion worth it? The data knows.
Who Is Churning and Why - medium - Subscribers churn. The pipeline cannot.
Who Saw the Ad Twice - hard - TV and digital. Same viewer, two measurement worlds.