# A Number for the Seller

> They want a total. Give them the right schema first.

Canonical URL: <https://datadriven.io/problems/a_number_for_the_seller>

Domain: Data Modeling · Difficulty: easy · Seniority: L3

## Problem

A seller on our marketplace wants to see their total products listed and total revenue earned per day. Design the data model that supports this view, and choose between a dimensional model or an ER model. Justify your choice. Then write the SQL to produce the report.

## Worked solution and explanation

### Why this problem exists in real interviews

This is a compact probe of **grain awareness** and the dimensional versus ER trade-off. Two metrics (products listed and revenue) look similar but live at different grains, and a strong candidate spots the fanout before joining them.

> **Trick to Solving**
>
> Before drawing tables, a strong candidate asks: what is the atomic fact being counted? Products listed is a count over a snapshot of the product dimension. Revenue is a sum over a sale fact. Different grains, different tables. Joining them directly creates a many-to-many fanout.
> 
> 1. Name the two grains out loud
> 2. Pick a dimensional model so analysts can slice by date and seller
> 3. Compute each metric in its own subquery
> 4. Join the subqueries on `(seller_id, date)`

---

### Break down the requirements

#### Step 1: Pick dimensional over ER

ER normalizes for write correctness; dimensional denormalizes for read speed. A seller daily dashboard is a read-heavy, additive aggregation pattern, which points to a star schema with `dim_sellers`, `dim_products`, `dim_date`, and `fact_sales`.

#### Step 2: Declare the grain of fact_sales

One row per line item per order. This is the most atomic grain, which preserves the ability to roll up to daily revenue per seller without losing resolution.

#### Step 3: Treat `products listed` as dimension counting

`products listed` is a property of `dim_products`, not `fact_sales`. It is counted with `COUNT(DISTINCT product_id)` filtered to active listings, not summed from a revenue row.

#### Step 4: Never join the two metrics at row level

If a seller has 50 active listings and 200 sales, joining sales to products on `(seller_id)` produces 10,000 rows. Use two subqueries and join on `(seller_id, date)` only after aggregation.

---

### The solution

Below is one defensible model. The grain of `fact_sales` is one row per order line, which keeps revenue cleanly additive.

> **Why this works**
>
> Two metrics live at two grains and are joined only after each is aggregated to `(seller_key, date_key)`. The trade-off is a small join cost in exchange for a model that scales to thousands of sellers without row explosion.

> **Interviewers watch for**
>
> Strong candidates call out the fanout risk within the first 30 seconds. They also justify their dimensional pick with one sentence about read patterns. Weak candidates reach for a single flat table and then realize halfway through the query that the counts are wrong.

> **Common pitfall**
>
> Adding `is_active` directly onto `fact_sales` to compute listing counts. That conflates a dimension attribute with a fact measure and breaks the moment a seller deactivates a product that has historical sales.

---

### The analysis pattern

**Daily listings and revenue per seller**

```sql
SELECT
    s.seller_name,
    d.full_date,
    listings.active_products,
    COALESCE(sales.daily_revenue, 0) AS daily_revenue
FROM dim_sellers s
CROSS JOIN dim_date d
LEFT JOIN (
    SELECT seller_key, COUNT(*) AS active_products
    FROM dim_products
    WHERE is_active = TRUE
    GROUP BY seller_key
) listings ON listings.seller_key = s.seller_key
LEFT JOIN (
    SELECT seller_key, date_key, SUM(gross_revenue) AS daily_revenue
    FROM fact_sales
    GROUP BY seller_key, date_key
) sales ON sales.seller_key = s.seller_key AND sales.date_key = d.date_key
WHERE d.full_date = CURRENT_DATE
```

---

### Trade-offs and alternatives

**Dimensional star**

Fast reads, additive measures, easy for BI tools. Cost: ETL must keep `dim_products.is_active` fresh and surrogate keys add an indirection layer.

**3NF ER model**

Strong write correctness and referential integrity. Cost: every seller dashboard pays join cost, and analysts write more CTEs to answer the same question.

---

## Common follow-up questions

- What if a seller soft-deletes a product mid-day and the dashboard needs the count at a specific hour? _(Tests whether the candidate introduces a daily snapshot fact for listing counts.)_
- How would you handle refunds so that `daily_revenue` reconciles to finance? _(Tests signed measures and whether refunds are a separate fact or negative revenue rows.)_
- What if the product changes seller mid-year on the same `product_id`? _(Tests SCD thinking on dim_products and whether the natural key is stable.)_
- How would your model change if the business added a tax component with per-jurisdiction rates? _(Tests degenerate versus conformed dimension modeling for tax.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/a_number_for_the_seller)
- [Data Modeling Interview Questions](https://datadriven.io/data-modeling-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.