# The Online Elite

> The top performers are hiding in the data.

Canonical URL: <https://datadriven.io/problems/the_online_elite>

Domain: Python · Difficulty: medium · Seniority: L4

## Problem

Given merchants (list of dicts with merchant_id and name), payments (list of dicts with merchant_id, amount, mode), and integer top_n, filter payments to mode = 'online'. Per merchant_id, sum amount. Rank merchants by that sum descending. Take the top_n merchants. Return the average of all online payment amounts belonging to those top-n merchants as a single float.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests **multi-step data transformation**: filtering, joining, aggregating, ranking, and computing derived metrics. It mirrors real analytics workflows where multiple operations must be chained in the correct order.

---

### Break down the requirements

#### Step 1: Filter to online transactions

Keep only transactions where the channel is 'online' (or equivalent flag).

#### Step 2: Join transactions to merchants

Match each transaction to its merchant record using a shared identifier.

#### Step 3: Rank merchants by total volume

Sum transaction amounts per merchant and sort descending to find the top N.

#### Step 4: Compute the average transaction value for top merchants

Filter transactions to those belonging to top-N merchants, then compute the average.

---

### The solution

**Filter, join, rank, then average**

```python
def top_merchant_avg(merchants, transactions, n):
    merchant_map = {}
    for m in merchants:
        merchant_map[m['id']] = m
    online_txns = []
    for txn in transactions:
        if txn.get('channel') == 'online':
            online_txns.append(txn)
    volume_by_merchant = {}
    for txn in online_txns:
        mid = txn['merchant_id']
        if mid in volume_by_merchant:
            volume_by_merchant[mid] += txn['amount']
        else:
            volume_by_merchant[mid] = txn['amount']
    ranked = []
    for mid in volume_by_merchant:
        ranked.append((volume_by_merchant[mid], mid))
    ranked.sort(reverse=True)
    top_ids = set()
    for i in range(min(n, len(ranked))):
        top_ids.add(ranked[i][1])
    total_amount = 0
    count = 0
    for txn in online_txns:
        if txn['merchant_id'] in top_ids:
            total_amount += txn['amount']
            count += 1
    if count == 0:
        return 0
    avg = total_amount / count
    return avg
```

> **Time and Space Complexity**
>
> **Time:** O(t + m + k log k) where t is the number of transactions, m is merchants, and k is distinct merchant count for sorting.
> 
> **Space:** O(t + m) for the filtered transactions and lookup maps.

> **Interviewers Watch For**
>
> Do you build the merchant lookup map first, or do you join inline? Building a dict upfront is O(m) and makes each join O(1). Nested loops would be O(t * m).

> **Common Pitfall**
>
> Computing average transaction value across all merchants instead of only the top N. The final averaging step must filter by the top merchant set.

---

## Common follow-up questions

- How would you do this in SQL instead of Python? _(Tests window functions, CTEs, and GROUP BY with HAVING.)_
- What if transactions had timestamps and you needed top merchants per month? _(Tests adding a time-based grouping dimension.)_
- What if N were very large relative to the merchant count? _(Tests that min(n, len(ranked)) handles this gracefully.)_
- How would you handle transactions missing the merchant_id field? _(Tests defensive filtering and data quality logging.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_online_elite)
- [Python Interview Questions](https://datadriven.io/python-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.