Clickstream Pipeline Interview Questions
Clickstream Pipeline Interview Questions
Clickstream ingestion pipeline design for data engineer interview prep.
Clickstream pipeline interview questions for data engineer roles. 10B-event-per-day ingestion patterns. Web SDK to local buffer to CDN-fronted ingest to Kafka to Spark Structured Streaming to Parquet on S3 to dbt to Snowflake. Sessionization with 30-minute gap. Deduplication with composite natural key. Late-arriving events with watermarks and idempotent MERGE.
Clickstream pipeline design is one of the most-asked scenarios in 2026 data engineer interviews at consumer-internet companies (Meta, Pinterest, Spotify, Airbnb, Uber, DoorDash). The canonical architecture starts with the web or mobile SDK and ends with a gold star schema in the warehouse. Each component has its own failure modes and the senior data engineer is expected to drill on each.
The canonical 10B-event-per-day clickstream architecture. Step 1: web SDK writes events to a local buffer (sessionStorage or IndexedDB) so the app does not block on network. Step 2: local buffer flushes to a CDN-fronted ingest endpoint (Cloudflare Workers or AWS CloudFront with Lambda@Edge) that validates the event schema and writes to Kafka. Step 3: Kafka topic with 24 partitions for 100k-events-per-second throughput, replication factor 3, retention 7 days for replay. Step 4: Spark Structured Streaming consumer with 1-minute trigger reads Kafka, dedups on (event_id, source), writes Parquet to S3 partitioned by date and hour. Step 5: dbt micro-batch on hourly cadence reads S3, applies sessionization (LAG on event_ts, new session when gap exceeds 30 minutes), and writes to silver tables. Step 6: dbt on the silver tables produces gold star schemas in Snowflake with conformed dim_user, dim_page, dim_device.
Five failure modes the L5+ rubric explicitly tests. SDK buffer overflow: the device loses network for hours; buffer fills; oldest events get dropped. Mitigation: bounded buffer (10MB max), drop-oldest policy, telemetry counter for dropped events. CDN edge failure: the CDN region serving the user fails; events fail to ingest. Mitigation: multi-CDN failover, retry with exponential backoff from the SDK side. Kafka broker dies: replication factor 3 handles N-1 failures; alerting on ISR shrinkage. Spark Structured Streaming consumer lag: events accumulate in Kafka faster than Spark can consume. Mitigation: autoscale executors, increase parallelism, alert on consumer lag exceeding threshold. Late-arriving events: a phone offline for 3 hours surfaces 1000 events with old event_time. Mitigation: watermark configured to allow 24 hours of lateness, dbt sessionization re-runs on late partitions, gold tables MERGE on (event_id) so late events update past aggregates.
Sessionization is the SQL-heavy part of the clickstream design. The 30-minute-gap session definition: group consecutive events into sessions where the gap between consecutive events does not exceed 30 minutes. SQL pattern: LAG(event_ts) to get the previous event's timestamp; CASE WHEN current minus lagged exceeds 30 minutes THEN 1 ELSE 0 AS new_session; SUM(new_session) OVER PARTITION BY user_id ORDER BY event_ts AS session_id; GROUP BY (user_id, session_id) for per-session aggregates. Edge cases: the first event for each user has NULL LAG; COALESCE to a large value to start a new session. Same-millisecond events: secondary sort by event_id within ORDER BY to make session assignment deterministic.
Deduplication is the other recurring concern. Web SDKs sometimes send the same event twice due to retry-on-network-error logic. Server-side dedup uses the composite key (event_id, source) with the source identifying the SDK version or ingest endpoint. ROW_NUMBER OVER PARTITION BY (event_id, source) ORDER BY ingest_ts ASC, filter rn = 1 keeps the first arrival. Downstream gold tables MERGE on the same composite key so re-runs are idempotent.
- What is the canonical architecture for a 10B-event-per-day clickstream pipeline?
- Web SDK to local buffer to CDN-fronted ingest endpoint to Kafka (24 partitions, replication factor 3) to Spark Structured Streaming (1-minute trigger, dedup on event_id+source) to Parquet on S3 (partitioned by date/hour) to dbt micro-batch (hourly sessionization) to gold star schemas in Snowflake. Failure modes drilled per component.
- How does a data engineer handle clickstream deduplication?
- Composite key (event_id, source) where event_id is generated SDK-side (UUID v7 for time-orderable) and source identifies the SDK version. Server-side dedup with ROW_NUMBER OVER PARTITION BY (event_id, source) ORDER BY ingest_ts ASC, filter rn=1. Downstream gold tables MERGE on the same key so re-runs are idempotent. SDKs sometimes resend events due to network-retry logic; dedup is unavoidable.
- How does sessionization work in a clickstream pipeline?
- 30-minute-gap session definition typically. SQL pattern: LAG(event_ts) to get previous event timestamp, CASE WHEN current minus lagged exceeds 30 minutes THEN 1 ELSE 0 AS new_session, SUM(new_session) OVER PARTITION BY user_id ORDER BY event_ts AS session_id, GROUP BY (user_id, session_id) for per-session aggregates. Edge cases: first event per user has NULL LAG (COALESCE to large value), same-millisecond events need secondary sort.
- How does a clickstream pipeline handle late-arriving events?
- Watermark configured to allow up to 24 hours of lateness in the streaming consumer. Spark Structured Streaming with watermark + foreachBatch sink for late-data handling. dbt sessionization re-runs on late partitions. Gold tables MERGE on (event_id) so late events update past aggregates without overwriting. Allowed lateness beyond the watermark goes to a side-output for inspection.
- What is the role of the local SDK buffer?
- Avoids blocking the app on network calls. Events write synchronously to a local store (sessionStorage, IndexedDB, or platform-specific equivalent). A background flusher batches and ships to the ingest endpoint. Bounded buffer size (10MB max) with drop-oldest policy on overflow. Telemetry counter for dropped events alerts when buffer policy needs revisiting.
- How many Kafka partitions does a 10B-event-per-day pipeline need?
- Roughly 24 partitions for 100k events-per-second throughput with 1KB events. Math: 10B/day = 116k events/sec average, peak ~5x = 580k events/sec, 1KB each = 580 MB/sec peak. Kafka safely handles 10-20 MB/sec per partition for low-latency. 580 / 20 = 29 partitions for peak headroom. Round up to 24 or 32. Partition count is hard to change post-creation; size for 2-3x growth.
- How does a data engineer prevent the thundering-herd problem on the ingest endpoint?
- CDN-fronted with edge-cached schema validation (Cloudflare Workers or Lambda@Edge). SDK adds random jitter to retry-after timestamps to avoid synchronized retries. Rate-limit per-client at the edge. Circuit breaker on the ingest endpoint to shed load gracefully when downstream Kafka is backed up.
124 practice problems matching this filter. Difficulty: medium (57), hard (67).
Pipeline Architecture (124)
- 45 Minutes Turned Into 3.5 Hours - medium - Spark jobs are running. Just not fast enough.
- 600 Million Events a Day - hard - 600 million events a day. Two years of retention.
- A Clean Number for Every Merchant - hard - Raw payment logs in. Clean merchant summaries out.
- A Million Cars Phoning Home - hard - Every vehicle is a sensor. Deploy the pipeline to catch it all.
- Analysts Are Slowing the Store Down - medium - Orders placed. Data warehouse hungry.
- A New Column on a Billion Rows - hard - Add and backfill a new column to a billion-row production table with zero downtime.
- A Shared Drive Full of Contracts - medium - Buried in PDFs. The data is in there somewhere.
- A Stream All Day and a File at Midnight - hard - Real-time and batch. Same pipeline. No compromises.
- Badging Items That Already Sold Out - hard - Same-day delivery. The features have to be faster.
- Basel, CCAR, and Monday Morning - medium - The regulator does not accept 'eventually consistent.'
- Bikes Before Rush Hour - hard - Bikes in, bikes out. The city needs to predict demand.
- Credit for Every Touch - medium - They saw the ad, clicked the email, then bought. Who gets credit?
- Doubling Every Six Months - hard - Tuesdays are quiet. Black Friday is not.
- Eight-Hour-Old Positions - hard - Positions shift by the second. The math cannot lag.
- Eight Teams, Eight Latencies - medium - Millions of gamers. The architecture decision changes everything.
- End of Day Is Too Late - medium - Every swipe tells a story.
- Equities, ETFs, and the SEC - hard - Fractional shares, multi-currency, point-in-time. All of it.
- Event System for Multiple Consumers - hard - One event, many hungry consumers.
- Every Dataset Needs a Paper Trail - hard - The FDA has opinions about your data pipeline.
- Every Deal Is a Financial Transaction - hard - Real money on the table. Reconstruct every hand.
- Every Device, Every Impression - hard - Every ad seen. Every second watched. Real-time.
- Every Device Has Its Own Dialect - medium - Three sources. Three formats. Same workout.
- Every Firm Formats It Differently - medium - The regulator changed the format. Again. Handle it.
- Every Format Imaginable - hard - PDFs, HL7, JSON. All of it lands in the same lake.
- Everyone Wants the Same Data, Differently - hard - How you store it decides how fast you can read it.
- Every Region Exports Its Own Way - medium - Sales data, BigQuery, Dataflow. Make it all sing.
- Every Scan, Every Parcel, Every Pin Code - medium - Out for delivery. Delivered. Except the events arrived backwards.
- Fifty Thousand Retailers - medium - Retail data at CPG scale. Every SKU, every store.
- The Box That Won't Fit the Data - hard
- Five Times the Traffic, Five Times the Bill - hard - Scale up when needed. Do not bankrupt the team.
- Five Years of Cron Jobs - hard - Half the jobs run on cron. Half run on events. All of it has to move.
- Flying Blind Until Midnight - hard - Intraday risk, full lineage. The regulator is watching.
- Four Teams, One Topic, No Agreement - hard - Everybody is writing to it. Nobody documented it. Now production is fragile.
- Greenfield Build for Six Sources - hard - Infrastructure as code. Meaning as a service.
- Half a Million Rental Cars - medium - Every vehicle is reporting. Every rental matters.
- The Identity Problem - hard - Old systems. New demands. The same customer appears under three different names.
- Listens From Everywhere, Counted Once - hard - Phones, tablets, laptops. And some of them report late.
- Live Viewers, Live Billing - hard - The stream is live. The data cannot wait.
- Near-Real-Time Trending Dishes Dashboard - hard - The dish rankings update faster than the kitchen.
- Nested Docs, Flat Reports - medium - Two databases. One direction. No data left behind.
- Nightly Exports Are Too Slow - medium - Healthcare claims change constantly. The warehouse cannot fall behind.
- 4,500 Stores Before Sunrise - medium - The shelves open at 7. The data better be there.
- Not Every Team Can See Every Row - hard - Everyone can see the bucket. Not everyone should.
- One Bill Across Three Clouds - medium - AWS, Azure, GCP. Three bills. One truth.
- One Earthquake, Ten Thousand Tweets - hard - The firehose is on. Separate signal from noise.
- Out of the Data Center - medium - The on-prem servers are not getting any younger.
- The Speed Layer - medium - Dashboards can't wait for raw logs. Something has to happen upstream.
- Prove the Number Is Right - hard - Bad data in fintech is not just messy. It is expensive.
- Real Data, Fake Patients - hard - Dev needs production data. HIPAA says absolutely not.
- The Register Never Sleeps - medium - Every swipe lands in the warehouse. The table has to stay current without breaking.
- Recommendations Now, Royalties Later - medium - The catalog updated. Did anyone notice?
- Replicate It Without Breaking It - hard - The source changed. The lake needs to know immediately.
- Risk Models on Week-Old Data - medium - Loan approved. Loan denied. Every decision is an event.
- SaaS API Connector with Incremental Sync - medium - The API has rate limits. You have deadlines.
- Same-Day Sales, Every Store - medium - The cash register data needs to be queryable by morning.
- The Living Table - medium - Data lands continuously. History must survive every update.
- Score It Before It Clears - hard - The fraudsters move fast. Your pipeline has to move faster.
- Ship Before Fraud Finishes Checking - hard - The claim looks clean. The fraud model disagrees.
- Six Hours to Miss a Deadline - medium - The rebuild works. It just doesn't finish in time.
- Six Hours to Refresh Every Number - medium - Ratings change. The incremental model has to keep pace.
- Six Million Rows Before the Market Opens - medium - One massive CSV. Millions of timestamps.
- Six Sources, One Platform - medium - ADF orchestrates. Unity Catalog governs. Nothing leaks.
- Sixty Minutes, Every Hour - medium - Every hour, on the hour. No excuses.
- Stores and the Site, Together - hard - The registers never stop ringing.
- Store, Site, and Distributor - medium - Sales data is piling up. Someone has to make sense of it.
- The Acquisition Still Taking Bookings - hard - Two systems, two schemas. One truth.
- The Agency That Changes the Columns - medium - The schema changed overnight. Again.
- The Analysts Cannot Touch Production - medium - Production is the source. Analytics needs its own copy.
- The Analyst Who Saw the Salary Data - hard - Two incidents. One shared lake. The access model was never designed, just assumed.
- The API Drip Feed - medium - The API gives you 100 records at a time. You need millions.
- The Bad Row That Broke the Dashboard - medium - Bad records cannot reach the warehouse.
- The Binding and the Claim - medium - Policies are instant. Claims take their time.
- The Booking That Came Three Ways - hard - PMS, OTA, and website all think they took the reservation first.
- The Boutique That Sold in Six Currencies - hard - Every sale is real. The rate it was converted at depends on who is asking.
- The Bucket Full of Resumes - medium - A thousand resumes. Structured data inside each one.
- The Carrier Moving to Azure - medium - Claims arrive messy. The medallion cleans them up.
- The Claim That Picks Its Own Lane - medium - Three entry points. Different workflows. All must route correctly.
- The Clicks We Throw Away - hard - Every tap, swipe, and scroll. At scale.
- The Clock That Runs Two Ways - hard - Nightly batch and live events. One dashboard.
- The Consent Stitcher - medium - Consent was given. Or was it? Stitch the records together.
- The Dashboard and the Attribution Model - hard - Streaming and batch. One pipeline to rule them.
- The Decision Before the Door Closes - hard - The window to stop it is smaller than you think.
- The Distributor Filing Problem - medium - Hundreds of suppliers. One warehouse. One deadline.
- The Event Pile - hard - 600 million clicks a day. The budget is not infinite.
- The Fare Aggregator - medium - Airfares shift every minute. Catch the best ones.
- The Fleet That Never Stops - hard - Every truck is talking. Not everyone can hear them yet.
- The Leaderboard That Costs $25K a Month - hard - Product wants it live. Engineering has a price tag.
- The Meal Kit That Knows You - medium - What they ordered says a lot about what they want next.
- The Migration That Cannot Break Morning - hard - It all works today. Moving it without losing a single report is the hard part.
- The Models Going Stale - hard - The model is only as good as what you feed it.
- The Panel and the Set-Top Boxes - hard - Set-top boxes tell you who watched. Projection tells you how many.
- The Patients We Cannot Move - hard - Patient data stays local. Insights have to be global.
- The Points Arrive Two Days Late - medium - The bank data shows up late. The rewards were already sent.
- The Provider That Sometimes Sleeps - medium - The models run at dawn. The data has to be there first.
- The Query That Used to Be Fast - medium - Queries used to be fast. Something changed.
- The Queue That Wouldn't Stop Growing - medium - 500,000 messages behind and the number keeps climbing.
- The Revenue That Was Wrong for Two Weeks - medium - Nobody caught it until the CFO asked a question. Design the system that catches it first.
- The Sale That Needs to Land Now - medium - Three channels feeding one view. Not all of them speak the same language.
- The Signals That Power Recommendations - medium - Fresh signals, many teams, one pipeline.
- The User Who Asked to Be Forgotten - hard - Users want their data erased. Completely.
- The Vendor Who Never Warns You - medium - Every month, something is different. The dashboards have no idea.
- The What-If Machine - hard - A million slots. A thousand campaigns. Every combination matters.
- The Whiteboard Exercise - medium - Marker in hand. Draw the whole thing.
- Thirty Cities, One Forecast - hard - Five cities. Five data formats. One prediction.
- Thirty Countries, One Solvency Number - hard - Premiums collected globally. Losses happen locally.
- Thirty Million Unique Jobs a Year - hard - One press run, many orders. Group them right.
- Thousands of Practices, One Dataset - hard - Patient records in, operational insights out.
- Three Providers, One Workout - hard - The same ride, reported three times.
- Three Regions, One Finance Team - hard - Payments from everywhere. One consistent report.
- Three Regions, One Report - hard - Three regions, billions of payments, one merchant summary by 6 AM.
- Towers and Phones, Same Story - hard - Tower signals meet app events. Somewhere in between is the truth.
- Traders, Risk, and the Regulators - medium - Markets move in milliseconds. The pipeline has to keep up.
- Two Hundred Million Redirects - medium - Billions of clicks. One tiny code. Two very different clocks.
- Two Million Boxes by Monday Morning - hard - Shipped, maybe. Delivered, debatable.
- Two Systems, One Room Count - hard - Two booking systems. Rooms do not duplicate themselves.
- Two Ways to Catch a Change - medium - Two ways to watch the database. Each has a cost.
- Two Years of Every Click - hard - Every click, every aisle, every day for two years.
- Two Years of Clicks, Cheap - hard - Two years of clicks. Every query has to be affordable.
- What Everyone Is Watching - hard - Someone is watching. Capture everything.
- What Should We Recommend Tonight - hard - They ordered pad thai twice. That means something.
- Where Is Every Truck, Right Now - medium - Trucks are moving. Every ping counts.
- Which Promotion Is Actually Working - hard - Was the promotion worth it? The data knows.
- Who Is Churning and Why - medium - Subscribers churn. The pipeline cannot.
- Who Saw the Ad Twice - hard - TV and digital. Same viewer, two measurement worlds.