# The Dashboard and the Attribution Model

> Streaming and batch. One pipeline to rule them.

Canonical URL: <https://datadriven.io/problems/the_dashboard_and_the_attribution_model>

Domain: Pipeline Design · Difficulty: hard · Seniority: L6

## Problem

Our digital marketing platform generates a continuous stream of ad impression and conversion events that need to feed both a real-time campaign performance dashboard and a daily attribution model. We have been running separate streaming and batch pipelines that have drifted out of sync, causing discrepancies between the live dashboard and the daily report. Design a unified architecture on Azure Databricks that eliminates the discrepancy.

## Worked solution and explanation

### Why this problem exists in real interviews

Two consumers read the same impression-and-conversion events: a dashboard refreshed every five minutes and a daily attribution report that billing runs against. Late events arrive hours late, attribution looks back over a window, and duplicates have to count once in both views. The trap is two pipelines that drift; what's needed is one computation feeding both views.

The default move is a streaming pipeline for the dashboard and a separate batch pipeline for the daily report. The dashboard's approximate dedup misses duplicates that the batch removes, so the dashboard total runs higher than the report. Late mobile impressions land in the dashboard's arrival hour and miss the batch's event-time bucket. Attribution windows differ between the two pipelines because nobody owns one definition.

> **Trick to Solving**
>
> One unified computation for streaming and batch, event-time everywhere, dedup once on a stable id, the same attribution logic for both views.
> 
> 1. One computation produces both views: a streaming aggregator emits per-period rollups continuously into the warehouse; the daily report is the closed-window subset of those same rollups.
> 2. Event-time partitioning everywhere; late events land in the bucket they belong to, both for the dashboard and the report.
> 3. Dedup runs once on a stable event id at ingest; the dedup contract is shared by both views by virtue of being upstream of both.
> 4. Attribution logic lives in shared code that both the streaming aggregator and the daily report invoke; no two definitions, no drift.

---

### Walk the requirements

#### Step 1: One computation feeds both the dashboard and the daily report

A streaming aggregator runs over impression and conversion events and writes per-period rollups (per campaign, per period, per attribution window) into the warehouse. The dashboard reads the latest periods' rollups; the daily report reads yesterday's closed window from the same rollups. Both views are derivatives of the same source computation; drift is structurally impossible because the rollups are the source of both. Without a streaming tier the dashboard is too slow; without a shared warehouse there's nothing both views point at.
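The "one computation, two views" shape can be sketched in plain Python. This is a minimal illustration, not the Databricks implementation: the in-memory `Counter` stands in for the warehouse rollup table, the hourly period is an assumed granularity, and the names `aggregate`, `dashboard_view`, and `daily_report` are hypothetical.

```python
from collections import Counter
from datetime import date, datetime, timedelta, timezone


def hour_bucket(ts: datetime) -> datetime:
    """Truncate an event timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def aggregate(events) -> Counter:
    """The single computation: count impressions per (campaign, event-time hour)."""
    rollups = Counter()
    for campaign_id, event_time in events:
        rollups[(campaign_id, hour_bucket(event_time))] += 1
    return rollups


def dashboard_view(rollups, campaign_id, now, hours=3) -> int:
    """Dashboard: sum the latest `hours` hourly buckets, including the open one."""
    newest = hour_bucket(now)
    oldest = newest - timedelta(hours=hours - 1)
    return sum(n for (c, h), n in rollups.items()
               if c == campaign_id and oldest <= h <= newest)


def daily_report(rollups, campaign_id, day: date) -> int:
    """Daily report: sum one closed day's buckets from the same rollups."""
    return sum(n for (c, h), n in rollups.items()
               if c == campaign_id and h.date() == day)


UTC = timezone.utc
rollups = aggregate([
    ("c1", datetime(2024, 5, 1, 23, 10, tzinfo=UTC)),
    ("c1", datetime(2024, 5, 1, 23, 50, tzinfo=UTC)),
    ("c1", datetime(2024, 5, 2, 0, 20, tzinfo=UTC)),
    ("c1", datetime(2024, 5, 2, 1, 5, tzinfo=UTC)),
])
live = dashboard_view(rollups, "c1", now=datetime(2024, 5, 2, 1, 30, tzinfo=UTC))
closed = daily_report(rollups, "c1", date(2024, 5, 1))
```

Both read functions are filters over the same `rollups`; neither recomputes anything from raw events, which is the property that keeps the two views from drifting.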

#### Step 2: Event-time everywhere so late events land in the right bucket

Mobile impressions arrive hours late. The aggregator runs on event-time windows with a configured lateness allowance; events arriving inside the allowance update the bucket they belong to. Late updates rewrite the affected bucket's rollup; both the dashboard and the daily report see the same restated number for that period. Bucketing on arrival time is the version where the dashboard's hour and the report's hour disagree because the late events landed in different buckets.
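A simplified sketch of event-time bucketing with a lateness allowance follows. It assumes a watermark defined as the maximum event time seen minus the allowance, which mirrors how streaming engines commonly define it but is not any engine's actual API. The six-hour allowance and the `EventTimeAggregator` class are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


class EventTimeAggregator:
    """Hourly event-time rollups with a lateness allowance (illustrative only)."""

    def __init__(self, allowed_lateness: timedelta = timedelta(hours=6)):
        self.allowed_lateness = allowed_lateness  # assumed value; tune per source
        self.max_event_time = None                # drives the watermark
        self.rollups = Counter()                  # (campaign_id, hour) -> count
        self.past_allowance = []                  # events for a separate late path

    def ingest(self, campaign_id: str, event_time: datetime) -> bool:
        """Add the event to its event-time bucket; return False if it is too late."""
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        watermark = self.max_event_time - self.allowed_lateness
        bucket = event_time.replace(minute=0, second=0, microsecond=0)
        # A bucket is closed once its end falls at or behind the watermark.
        if bucket + timedelta(hours=1) <= watermark:
            self.past_allowance.append((campaign_id, event_time))
            return False
        # Inside the allowance: restate the bucket the event belongs to.
        self.rollups[(campaign_id, bucket)] += 1
        return True
```

An impression with event time 12:40 that arrives after a 15:00 event still updates the 12:00 bucket, never the 15:00 one. The same event arriving once the stream has advanced to 22:00 is set aside, because the 12:00 bucket closed when the watermark passed 13:00.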

#### Step 3: Attribution reads the full lookback window for each conversion

Each conversion attributes credit to the last impression in the lookback window before the conversion. The aggregator's per-conversion lookup reads the impression history for that user inside the window from the warehouse. The daily report's attribution and the dashboard's attribution use the same lookup because the logic lives in shared code. A 'today's data only' attribution is the version where conversions don't credit yesterday's impression; the lookback window in shared code is what makes attribution match what the model expects.
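The last-touch rule described above can be written as one shared function that both the streaming aggregator and the daily report call. This is a sketch under stated assumptions: the seven-day lookback, the `Impression` record, and the function name are hypothetical, the impression history is an in-memory list rather than a warehouse read, and "before the conversion" is taken as strictly earlier.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional

LOOKBACK = timedelta(days=7)  # assumed window; the source does not give a value


@dataclass(frozen=True)
class Impression:
    user_id: str
    campaign_id: str
    event_time: datetime


def attribute_last_touch(
    user_id: str,
    conversion_time: datetime,
    impressions: Iterable[Impression],
    lookback: timedelta = LOOKBACK,
) -> Optional[str]:
    """Return the campaign of the user's last impression inside the lookback window.

    The window is [conversion_time - lookback, conversion_time); returns None
    when the user has no impression in it.
    """
    window_start = conversion_time - lookback
    candidates = [
        imp for imp in impressions
        if imp.user_id == user_id
        and window_start <= imp.event_time < conversion_time
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda imp: imp.event_time).campaign_id
```

Because the window is measured back from the conversion's event time rather than from the start of the day, a conversion early today can still credit an impression from yesterday.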

#### Step 4: Dedup once on a stable id; both views inherit

Each event carries a stable event id. Dedup runs at ingest before the aggregator; the rollups read the deduped stream. Both views inherit the dedup contract because they read the same rollups. A 'streaming approximate dedup, batch exact dedup' design is what made the original drift; pulling dedup upstream of both views is what eliminates it.
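A minimal sketch of exact dedup at ingest, assuming each event is a dict with an `event_id` key (a hypothetical field name). The unbounded in-memory `seen` set is a simplification; a real stream would need bounded state, for example ids retained only for the lateness allowance.

```python
from typing import Iterable, Iterator, Optional


def dedup_by_event_id(
    events: Iterable[dict], seen: Optional[set] = None
) -> Iterator[dict]:
    """Yield each event once, keyed on its stable event id.

    Pass the same `seen` set across batches so a redelivery in a later
    batch is still dropped.
    """
    seen = set() if seen is None else seen
    for event in events:
        event_id = event["event_id"]
        if event_id in seen:
            continue  # duplicate delivery: drop before any aggregation
        seen.add(event_id)
        yield event


seen_ids = set()
first_batch = list(dedup_by_event_id(
    [{"event_id": "e1"}, {"event_id": "e2"}, {"event_id": "e1"}], seen_ids))
second_batch = list(dedup_by_event_id(
    [{"event_id": "e2"}, {"event_id": "e3"}], seen_ids))
```

Everything downstream (rollups, dashboard, report) consumes only the output of this step, so there is exactly one dedup definition to get right.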

---

### The shape that fits

> **What this design gives up**
>
> One unified computation is a single point of failure where two pipelines were independent; event-time windows hold open longer than processing-time windows; shared attribution code has to be tested against both view shapes. Implementation cost is the price; the win is a dashboard and a daily report that always agree, late events that count for the right hour in both views, and an attribution model that uses the full lookback every time.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - One computation produces both the dashboard and the daily report so the two can't drift.
> - Late events apply to the time bucket they belong to in both views.
> - Dedup runs once on a stable event id; both views inherit the deduped stream.
> - Attribution reads the full lookback window for each conversion.

> **The mistake that ships**
>
> What gets shipped runs separate streaming and batch pipelines for the dashboard and the daily report. The dashboard's approximate dedup lets through duplicates that the batch's exact dedup removes, so the dashboard total runs higher and the daily report bills against the lower number. Late mobile impressions land in different hour buckets between the two views. Attribution windows differ because nobody owns one definition. The eventual rebuild collapses the two paths into one event-time aggregator with shared attribution logic, a design that was reachable up front if 'unified architecture' had been treated as one computation rather than two synchronized pipelines.

---

## Common follow-up questions

- A late impression arrives a week later, well past the lateness allowance. What does this design do, and what do the dashboard and the daily report show? _(Tests whether the candidate sees that events past the lateness allowance route to a late-data path that updates the affected past period's rollup; both views eventually reflect the corrected period. The dashboard's recent windows aren't impacted; the report's prior period restates on the next read.)_
- Marketing wants a per-campaign view that updates faster than the unified aggregator's window. What in this design lets that happen, and what's the cost? _(Tests whether the candidate sees that adding a faster aggregator on top of the deduped stream is the surface for the new view; the existing rollups continue at their cadence. The cost is the additional streaming compute for the per-campaign aggregator; the gain is faster per-campaign visibility without disturbing the unified rollups.)_
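For the first follow-up, one possible shape of the late-data path is a correction job that applies events past the allowance to the rollups of the period they belong to. This is a sketch under assumptions: the `Counter` again stands in for the warehouse rollup table, the hourly bucket is an assumed granularity, and `apply_late_corrections` is a hypothetical name.

```python
from collections import Counter
from datetime import datetime, timezone


def apply_late_corrections(rollups: Counter, late_events) -> set:
    """Apply events past the lateness allowance to their event-time buckets.

    Mutates `rollups` in place and returns the set of restated
    (campaign_id, hour) keys so downstream readers know what changed.
    """
    restated = set()
    for campaign_id, event_time in late_events:
        bucket = event_time.replace(minute=0, second=0, microsecond=0)
        rollups[(campaign_id, bucket)] += 1
        restated.add((campaign_id, bucket))
    return restated


UTC = timezone.utc
rollups = Counter({
    ("c1", datetime(2024, 5, 1, 12, tzinfo=UTC)): 10,  # closed a week ago
    ("c1", datetime(2024, 5, 8, 9, tzinfo=UTC)): 3,    # recent window
})
restated = apply_late_corrections(
    rollups, [("c1", datetime(2024, 5, 1, 12, 30, tzinfo=UTC))])
```

Only the week-old bucket changes; the recent window is untouched, and any view reading that past period picks up the restated count on its next read.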

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.