# Housing Marketplace Analytics

> Sellers want buyers. Buyers want deals.

Canonical URL: <https://datadriven.io/problems/housing_marketplace_analytics>

Domain: Data Modeling · Difficulty: medium · Seniority: L4

## Problem

We run a housing marketplace. Sellers list properties, buyers view listings and submit leads. We need to measure conversion rate from view to lead by location and property type. Design the data model.

## Worked solution and explanation

### Why this problem exists in real interviews

This probes whether a candidate separates facts by **grain and volume** instead of overloading a single event log. Views arrive at orders of magnitude more than leads, and conversion analytics need both sides joinable on a conformed listing dimension.

> **Trick to Solving**
>
> Whenever a prompt says “conversion rate from X to Y,” the trick is to notice you have two facts at two grains, not one. Before designing, ask: what is the cardinality ratio between the two, and do they share a common dimension?
> 
> 1. Build separate fact tables for views and leads
> 2. Share dim_listings and dim_geography across both facts
> 3. Compute conversion at query time via a JOIN on listing_id
> 4. Keep geography as its own hierarchy dimension

---

### Break down the requirements

#### Step 1: Separate facts by grain

`fact_views` is one row per impression; `fact_leads` is one row per contact request. Combining them forces a `fact_type` column and breaks additivity.

#### Step 2: Model listings as a dimension

`dim_listings` carries price, bedrooms, square footage, property type, and a FK to geography. Listing attributes change slowly and are shared by both facts.

#### Step 3: Model geography with a hierarchy

`dim_geography` stores neighborhood, city, state in one row per neighborhood. Roll-ups become GROUP BY without joins.

#### Step 4: Connect users as a conformed dimension

`dim_users` is referenced by both facts so that cross-fact user behavior (did this viewer also submit a lead) is a single join away.

---

### The solution

Below is one defensible model. The anchor is that views and leads are two distinct facts sharing the same dimensions, which is the purest form of a conformed star schema.

> **Why this works**
>
> Two fact tables at different grains let the warehouse honor the 100:1 or 1000:1 view-to-lead ratio without bloating a combined events log with nulls. Conformed dimensions mean conversion, cohort, and geo roll-ups are all one JOIN away.

> **Interviewers watch for**
>
> A strong candidate names the conformed-dimension pattern out loud and justifies why views and leads are not the same fact. They also raise whether to partition `fact_views` by day given the volume ratio.

> **Common pitfall**
>
> An overloaded `events` table with `event_type` = 'view' or 'lead'. It feels flexible but destroys query performance: every view query pays the cost of every lead column, and the cardinality imbalance wrecks partition pruning.

---

### The analysis pattern

**View-to-lead conversion by neighborhood**

```sql
SELECT
    g.neighborhood,
    COUNT(DISTINCT v.view_id) AS views,
    COUNT(DISTINCT l.lead_id) AS leads,
    COUNT(DISTINCT l.lead_id) * 1.0 / NULLIF(COUNT(DISTINCT v.view_id), 0) AS conversion_rate
FROM dim_listings d
JOIN dim_geography g ON g.geography_id = d.geography_id
LEFT JOIN fact_views v ON v.listing_id = d.listing_id
LEFT JOIN fact_leads l ON l.listing_id = d.listing_id
WHERE d.listed_at >= NOW() - INTERVAL '30 days'
GROUP BY g.neighborhood
ORDER BY conversion_rate DESC
```

---

### Trade-offs and alternatives

**Separate fact tables**

fact_views and fact_leads split.

* Partition pruning works cleanly
* Schema columns reflect what each event actually has
* Conversion requires a two-way JOIN

**Unified event log**

One events table with event_type discriminator.

* Single ingest path
* Sparse columns and null-heavy rows
* Partition pruning blurred by the high-low volume mix

---

## Common follow-up questions

- How would you model listings that change price mid-campaign? _(Tests whether the candidate reaches for a Type 2 SCD on dim_listings or a price history table.)_
- How do you attribute a lead to the view that drove it when a user visits five times? _(Tests last-touch vs first-touch attribution and whether a session key spans both facts.)_
- What changes when fact_views grows to 100M per day? _(Tests partitioning, clustering, and whether a rolled-up daily aggregate is introduced.)_
- How do you support GDPR deletion of a user who submitted leads a year ago? _(Tests whether user_id is a hashable pseudonym and whether lookups are tombstone-driven.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/housing_marketplace_analytics)
- [Data Modeling Interview Questions](https://datadriven.io/data-modeling-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.