# Presence vs. Participation

> Being in the region and being active are two very different things.

Canonical URL: <https://datadriven.io/problems/active_vs_regional_user_count>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

The platform team is doing a quick capacity check. Compare the number of active user accounts against the number of infrastructure nodes deployed in us-east-1. If active accounts outnumber us-east-1 nodes, return 'More active'; otherwise return 'More us-east-1'.

## Worked solution and explanation

### Why this problem exists in real interviews

Capacity planning conversations often reduce to a single comparison: are we serving more users than we have machines for, or vice versa, in a given region? The prompt asks for a literal verdict string, not a count, which is the giveaway: this is a scalar query, not an aggregation by group. Interviewers like this question because it tests whether you over-engineer simple comparisons. Many candidates reach for a JOIN or GROUP BY by reflex; the right answer is two scalar subqueries inside a CASE.

> **Trick to Solving**
>
> When the prompt asks "which is bigger, A or B" and wants a verdict back, write it as two scalar subqueries inside a CASE expression. No JOIN, no GROUP BY, no UNION. The two filters are independent, the comparison is scalar, the output is one row.
> 
> 1. `(SELECT COUNT(*) FROM users WHERE account_status = 'active')`
> 2. `(SELECT COUNT(*) FROM infra_nodes WHERE region = 'us-east-1')`
> 3. Compare with `CASE WHEN ... THEN 'More active' ELSE 'More us-east-1' END`

---

### Break down the requirements

#### Step 1: Recognize there is no join

`users` and `infra_nodes` share no key relevant to this question. They live in the same warehouse, but a user is not deployed to a node and a node does not belong to a user. JOINing them would either fan out the row count or quietly compute the wrong number. Recognize the trap: the prompt tests whether you can resist a reflex JOIN.

#### Step 2: Compute scalar #1: active users

Active users: `SELECT COUNT(*) FROM users WHERE account_status = 'active'`. `account_status` has cardinality 4 with single-hot skew, so 'active' covers most of the 500K rows; the COUNT is essentially a scan. The cast/comparison is exact - no LIKE, no case-folding needed if the column stores a controlled vocabulary.

#### Step 3: Compute scalar #2: us-east-1 nodes

us-east-1 nodes: `SELECT COUNT(*) FROM infra_nodes WHERE region = 'us-east-1'`. `region` has cardinality 8 with zipf skew; us-east-1 is typically the largest region. 5K total rows, so this is trivial regardless.

#### Step 4: Compare and return the verdict string

`CASE WHEN active_users > us_east_1_nodes THEN 'More active' ELSE 'More us-east-1' END`. The prompt's two literal output strings are non-negotiable - returning '> active' or 'active wins' would be marked wrong by an automated grader. Read the prompt for exact wording before typing.

---

### The solution

**Scalar comparison with CASE**

```sql
SELECT CASE
    WHEN (SELECT COUNT(*) FROM users WHERE account_status = 'active')
       > (SELECT COUNT(*) FROM infra_nodes WHERE region = 'us-east-1')
    THEN 'More active'
    ELSE 'More us-east-1'
END AS result
```

**Equivalent form with explicit tie handling**

```sql
/* CTE form, easy to extend with a tie branch */
WITH counts AS (
  SELECT
    (SELECT COUNT(*) FROM users WHERE account_status = 'active') AS active_users,
    (SELECT COUNT(*) FROM infra_nodes WHERE region = 'us-east-1') AS us_east_nodes
)
SELECT CASE
  WHEN active_users > us_east_nodes THEN 'More active'
  WHEN active_users < us_east_nodes THEN 'More us-east-1'
  ELSE 'Tied'
END AS result
FROM counts;
```

> **Cost Analysis**
>
> Each subquery is one filtered count. On `users` (500K rows, 128 MB) the COUNT runs over the filtered subset; if `account_status` had a btree, the planner could use an index-only scan for `'active'`. On `infra_nodes` (5K rows) it does not matter. The two subqueries are independent and run in parallel on engines that support it. Total cost is two filtered scans plus one comparison - no join, no sort, no aggregation state. Expected plan: two parallel Aggregate(Filter(Scan)) -> CASE evaluation -> one-row Result.

> **Interviewers Watch For**
>
> The interviewer is watching whether you write a JOIN. There is no key to join on, and reaching for JOIN here signals you matched on the table names instead of reading the prompt. Strong candidates name the trap out loud ("these tables don't share a key, so I'll use two scalar subqueries"). They also ask about ties before writing the CASE, even if the prompt says only two branches - it shows they think about edge cases without slowing down.

> **Common Pitfall**
>
> Returning a count instead of the verdict string. The prompt's exact words are 'More active' / 'More us-east-1', and an automated grader will reject any other output. The other pitfall is JOINing `users` to `infra_nodes` on something nonsensical (`user_id = node_id` is a real example of what novice candidates write), which produces a Cartesian-ish result and a wrong COUNT.

---

## Common follow-up questions

- Add a 'Tied' branch when the two counts are equal, and a 'Within 5%' branch when they differ by less than 5%. How does your CASE change? _(Tests whether the candidate can extend the CASE with a third branch. Production capacity dashboards almost always have a Tied or Within-N% bucket because exact equality is rare but near-equality is operationally meaningful.)_
- How would your answer change if 'active' is one of four account_status values, and the product team considers 'trialing' equivalent to 'active' for capacity? _(Tests cardinality intuition on what 'active' actually means. `account_status` has 4 values; one of them is likely 'active', but production data often distinguishes 'active', 'trialing', 'paused'. The right answer is to ask the product owner before assuming.)_
- Region has zipf skew, with us-east-1 likely the dominant region. If the team asked the same question for ap-south-1, would the query template still work, and what would change about the expected result? _(Tests skew awareness. `region` is zipf, so us-east-1 likely has 50%+ of nodes and the comparison is lopsided. The follow-up is whether the same query template works for, say, ap-south-1, where the count is much smaller.)_
- If this CASE were embedded in a larger query that returns one row per region, would the two scalar subqueries be re-evaluated for each region row? How would you avoid that? _(Tests whether the candidate knows scalar subqueries are evaluated once and cached. Some engines (older MySQL) re-evaluate them per row in a larger query; modern engines do not. This matters when the CASE is embedded in a larger query.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/active_vs_regional_user_count)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.