# Search Success by User Tenure

> Compare search click-through rates between new and existing users.

Canonical URL: <https://datadriven.io/problems/search_success_by_user_tenure>

Domain: SQL · Difficulty: hard · Seniority: L4

## Problem

We want to compare search quality between new and existing users. A successful search is one where a click occurred within 30 seconds of the query time. Classify users as 'new' (signup date within the last 30 days of the dataset) or 'existing'. Return one row per segment with total searches, successful searches, and success rate.

## Worked solution and explanation

### Why this problem exists in real interviews

Built around the `search_queries` and `users` tables, this challenge probes your ability to apply conditional aggregation via CASE in a search behavior setting. Correctly referencing columns like `user_id`, `clicked_result`, and `signup_date` is essential to a working solution.

> **Trick to Solving**
>
> When the prompt asks for multiple metrics split by a condition (e.g., resolved vs. unresolved), conditional aggregation avoids multiple passes.
> 
> 1. Spot the split: two or more categories in one output row
> 2. Use `SUM(CASE WHEN condition THEN 1 ELSE 0 END)` for each bucket
> 3. Group by the common dimension

---

### Break down the requirements

#### Step 1: Structure the logic with 3 CTEs

Break the problem into named stages: `max_date`, `user_segment`, `search_stats`. Each CTE isolates one transformation, making the query readable and debuggable.

#### Step 2: Join `users` to `max_date`

The join connects the two tables on their shared key. This brings the columns needed for filtering and aggregation into a single row set.

#### Step 3: Use conditional aggregation with CASE

A `CASE` expression inside the aggregate function splits rows into buckets without multiple passes over the data. Each condition maps to one output column.

---

### The solution

**Case pivot for search success by user tenure**

```sql
WITH max_date AS (
    SELECT MAX(signup_date) AS latest
    FROM users
)
, user_segment AS (
    SELECT u.user_id, CASE WHEN julianday((SELECT latest FROM max_date)) - julianday(u.signup_date) <= 30 THEN 'new' ELSE 'existing' END AS segment
    FROM users u
)
, search_stats AS (
    SELECT us.segment, COUNT(*) AS total_searches, SUM(CASE WHEN sq.clicked_result = 1 THEN 1 ELSE 0 END) AS successful_searches
    FROM search_queries sq
    INNER
    JOIN user_segment us ON sq.user_id = us.user_id
    GROUP BY us.segment
)
SELECT segment, total_searches, successful_searches, CAST(successful_searches AS DOUBLE) / total_searches AS success_rate
FROM search_stats
```

> **Cost Analysis**
>
> With ~90M rows, the GROUP BY reduces the working set before any downstream operations; the join cost depends on the smaller table's cardinality; CTEs materialize intermediate results, which can be beneficial or costly depending on the engine. An index on the filter/join columns would reduce the scan to a seek.

> **Interviewers Watch For**
>
> Interviewers watch for whether you decompose the problem into named, testable stages rather than nesting everything; whether you can pivot data with conditional aggregation in a single pass instead of multiple queries; how you handle date arithmetic and whether you account for edge cases like month boundaries.

> **Common Pitfall**
>
> Using string comparison instead of proper date arithmetic for date ranges can miss edge cases at midnight boundaries.

---

## Common follow-up questions

- What would happen to your result if `users.username` contained duplicate values that you did not expect? _(Tests whether the candidate considers data quality issues in `username` and uses DISTINCT or deduplication where needed.)_
- If `search_queries` grew to contain billions of rows, which part of your query would become the bottleneck given the cardinality of `query_time`? _(Tests ability to identify performance hotspots related to `search_queries.query_time` at scale.)_
- If `search_queries` contained late-arriving rows that were inserted after your query ran, how would you design an incremental update instead of re-aggregating? _(Tests understanding of incremental aggregation patterns.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/search_success_by_user_tenure)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.