# Negative Outcome Rate for New Users

> New users have a rough first two weeks.

Canonical URL: <https://datadriven.io/problems/negative_outcome_rate_for_new_users>

Domain: SQL · Difficulty: medium · Seniority: L5

## Problem

Given a users table with signup dates and an event_data table with event types and timestamps, calculate the fraction of events with a negative outcome out of all events for users who signed up in June 2026, restricted to events that occurred within 14 days of signup. Negative outcomes are events where the event_type is error, timeout, or crash.

## Worked solution and explanation

### The mental model

Product analytics teams use this to probe whether you can hold a per-user date window in your head while writing aggregate SQL. The trap is writing the 14-day filter as a fixed global range instead of `event - signup BETWEEN 0 AND 14` per user. They also want to hear you ask 'event-weighted or user-weighted rate?' before you write, because the two answers diverge.

---

### The three traps

#### Step 1: Event-weighted versus user-weighted

The query computes `negative_events / total_events`, so a single power user with 1000 events dominates the rate. Picture two June signups. User A fires 999 clean events and 1 error. User B fires 1 event, an error. The event-weighted rate is 2 / 1001, roughly 0.002. The user-weighted rate is 1.0, because 50 percent of users had a bad first experience. The prompt says **fraction of events**, so event-weighted is the right answer here. A senior candidate still calls out the ambiguity aloud, because product teams almost always want the user-weighted version when they say negative outcome rate.

#### Step 2: The 14-day window is per-user, not global

`julianday(ed.event_timestamp) - julianday(u.signup_date) BETWEEN 0 AND 14` evaluates against each user's own signup date. A user who signed up June 1 has a window ending June 15. A user who signed up June 30 has a window ending July 14. The tempting shortcut, `event_timestamp BETWEEN '2024-06-01' AND '2024-07-14'`, looks similar but is wrong: it credits a June 1 signup with events all the way through July 14, more than 6 weeks past their signup. The per-user date arithmetic on each row is non-negotiable.

#### Step 3: Integer division silently destroys the answer

`CAST(SUM(...) AS REAL) / COUNT(*)` forces floating-point division. Without the CAST, SQLite (and Postgres with two integer operands, and MySQL in strict mode) returns an integer. SUM = 50, COUNT = 100, result = 0. Every rate below 1.0 collapses to zero and every rate at or above 1.0 collapses to 1. The query runs cleanly, the column type looks fine, and the metric is wrong. You only notice when someone benchmarks the dashboard against a known cohort.

---

### The solution

**Conditional rate over June signups, first 14 days**

```sql
SELECT
    CAST(SUM(CASE WHEN ed.event_type IN ('error', 'timeout', 'crash') THEN 1 ELSE 0 END) AS REAL)
        / COUNT(*) AS negative_rate
FROM users u
JOIN event_data ed ON u.user_id = ed.user_id
WHERE u.signup_date >= '2026-06-01'
  AND u.signup_date <  '2026-07-01'
  AND julianday(ed.event_timestamp) - julianday(u.signup_date) BETWEEN 0 AND 14
```

> **The join is the cost driver**
>
> `users` is 15M rows; the June predicate trims that to roughly 500k. `event_data` is 200M rows and partitioned by `event_timestamp`. The expression `julianday(ed.event_timestamp) - julianday(u.signup_date) BETWEEN 0 AND 14` wraps the partition column in a function, so the planner cannot range-seek per partition. Rewrite as `ed.event_timestamp >= u.signup_date AND ed.event_timestamp < datetime(u.signup_date, '+15 days')` and also bound the outer range to `event_timestamp >= '2026-06-01' AND event_timestamp < '2026-07-15'` so the planner prunes to the 6-week slice that can possibly match instead of scanning 200M rows.

> **What a senior candidate raises**
>
> Three clarifications worth surfacing before writing SQL. (a) **Weighting**: event-weighted or user-weighted, because the answer can differ by orders of magnitude. (b) **Empty users**: someone who signed up in June but fired zero events in the first 14 days drops out of the denominator entirely through the INNER JOIN. That is usually desired, since a rate over zero events is undefined, but the interviewer wants to hear you say it. (c) **Type of `signup_date`**: if it is a `DATE`, day 14 ends at midnight 14 days later; if it is a `TIMESTAMP`, day 14 is exactly 14 * 86400 seconds out and may exclude events the business considers same-day.

> **The integer-division trap, two ways**
>
> `SUM(CASE WHEN ed.event_type IN ('error','timeout','crash') THEN 1 ELSE 0 END) / COUNT(*)` returns 0 in SQLite, Postgres, and MySQL when both sides are integers. The same bug hides inside `COUNT(CASE WHEN ed.event_type IN ('error','timeout','crash') THEN 1 END) / COUNT(*)`. The fix is one of: cast the numerator to REAL or NUMERIC, multiply by `1.0`, or use the `AVG(CASE ... THEN 1.0 ELSE 0.0 END)` form. Pick one and stay consistent across the codebase, or every new metric is a fresh bug waiting to ship.

> **AVG with float literals is shorter**
>
> `AVG(CASE WHEN ed.event_type IN ('error','timeout','crash') THEN 1.0 ELSE 0.0 END) AS negative_rate` says the same thing in one expression and the `1.0` literal forces floating-point throughout, so there is no CAST to forget. It is exactly equivalent to `SUM / COUNT` on the same condition, and many teams prefer it for readability in dashboards where rate columns multiply.

---

## Common follow-up questions

- Rewrite this as a user-weighted rate where each June signup contributes one observation regardless of their event count. How does the answer change for the two-user toy example in trap 1? _(Tests whether the candidate can transform an event-weighted aggregate into a user-weighted one, typically via a per-user subquery that produces one row per user, then AVG over those rows.)_
- Why does the query use `CAST(... AS REAL) / COUNT(*)` instead of just dividing the SUM by the COUNT directly? _(Probes whether the candidate knows the integer-division behavior of SQLite, Postgres, and MySQL, and can name at least one alternative (multiply by 1.0, AVG with float literals, NUMERIC cast).)_
- How would you rewrite the 14-day window so the planner can range-seek the `event_timestamp` partitions on `event_data` instead of evaluating julianday on every row? _(Tests SARGability awareness: replace the function-on-column with `ed.event_timestamp >= u.signup_date AND ed.event_timestamp < datetime(u.signup_date, '+15 days')`, and bound the outer range to the 6-week slice.)_
- If a user signed up on June 30, does their 14-day window extend into mid-July, or does the cohort definition truncate it at June 30? _(Probes whether the candidate distinguishes the cohort filter (signup_date in June) from the window filter (14 days from each user's signup), and recognizes that the window can spill outside June.)_
- If 5 percent of events have NULL event_type, do they appear in the denominator? Do they appear in the numerator? Is that the behavior you want? _(Tests three-valued logic intuition. `NULL IN ('error','timeout','crash')` evaluates to NULL, so the CASE returns 0 and they do not count in the numerator. `COUNT(*)` counts the row regardless, so they do count in the denominator. The candidate should decide whether to filter NULLs out explicitly.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/negative_outcome_rate_for_new_users)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.