# Prolific Authors in Largest Service Teams

> Senior leads in the biggest teams.

Canonical URL: <https://datadriven.io/problems/prolific_authors_in_largest_service_teams>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

Find authors whose name starts with 'a' (case-insensitive) who belong to the service(s) with the most unique authors. If multiple services tie for largest team, include matching authors from all of them. Return the service name and author.

## Worked solution and explanation

### Why this problem exists in real interviews

This challenge asks you to apply pattern matching with LIKE to the `deploy_logs` table, simulating a real analytics workflow. Pay attention to the `svc_name` and `author` columns as they drive the aggregation and output.

> **Trick to Solving**
>
> "Belong to the service(s) with the most unique authors" requires finding a max-of-counts, then filtering. This is a classic top-group-then-filter pattern.
> 
> 1. Count distinct authors per service
> 2. Find the maximum of those counts
> 3. Filter to services matching that maximum
> 4. Within those services, filter authors whose name starts with 'a'

---

### Break down the requirements

#### Step 1: Count distinct authors per service

In a CTE, `GROUP BY svc_name` with `COUNT(DISTINCT author)` gives team size per service.

#### Step 2: Find the maximum team size

In a second CTE, `SELECT MAX(author_count)` from the first CTE.

#### Step 3: Filter to largest-team services

Join the first CTE to the max and keep services where `author_count = max_count`.

#### Step 4: Filter to 'a' authors

From `deploy_logs`, filter `WHERE LOWER(author) LIKE 'a%'` and join to the qualifying services. Return `svc_name` and `author`.

---

### The solution

**Pattern-match for prolific authors in largest service**

```sql
WITH team_sizes AS (
    SELECT svc_name, COUNT(DISTINCT author) AS author_count
    FROM deploy_logs
    GROUP BY svc_name
),
max_size AS (
    SELECT MAX(author_count) AS max_count
    FROM team_sizes
),
top_services AS (
    SELECT svc_name
    FROM team_sizes, max_size
    WHERE author_count = max_count
)
SELECT DISTINCT dl.svc_name, dl.author
FROM deploy_logs dl
JOIN top_services ts ON dl.svc_name = ts.svc_name
WHERE LOWER(dl.author) LIKE 'a%'
```

> **Cost Analysis**
>
> The query scans `deploy_logs` (1,000,000 rows). A covering index on the filter and group columns would reduce I/O. At this scale, the full scan is acceptable but becomes costly if the table grows 10x.

> **Interviewers Watch For**
>
> Interviewers evaluate whether you translate the English requirements into the correct SQL clauses on the first attempt. They watch for clean syntax, correct column references, and whether you verify edge cases before declaring the query complete.

> **Common Pitfall**
>
> The most common mistake is misreading the prompt's filtering or grouping requirements. Double-check which columns to group by, which to aggregate, and whether the output should be filtered with `WHERE` (before grouping) or `HAVING` (after grouping).

---

## Common follow-up questions

- What result would you get if every value in `deploy_logs.dur_secs` were NULL? Would your query return an empty set or something unexpected? _(Tests extreme NULL scenarios and whether the candidate guards against edge cases in `dur_secs`.)_
- With 950,000 distinct values in `deploy_logs.deploy_at`, how would a composite index on the GROUP BY columns change the execution plan? _(Probes understanding of how cardinality in `deploy_at` affects grouping and sort operations.)_
- Does your database engine materialize the CTE or inline it? How would that affect repeated scans of `deploy_logs`? _(Tests understanding of CTE materialization semantics.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/prolific_authors_in_largest_service_teams)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.