# Shortest Unique Metric Tag

> One token per metric. Make it unambiguous.

Canonical URL: <https://datadriven.io/problems/shortest_unique_metric_tag>

Domain: Python · Difficulty: medium · Seniority: L5

## Problem

A metrics platform stores fully-qualified metric names like 'bookings.checkout.success_rate'. Dashboards autocomplete a metric from any contiguous substring of its name, so each metric needs the shortest substring of its own name that does not appear inside any other metric's name. Given a list of metric names, return a list of the same length where the i-th entry is the shortest contiguous substring of metrics[i] that is not a substring of any other metric in the list. If multiple substrings of the same metric tie for the shortest length, return the lexicographically smallest one. If every substring of metrics[i] also appears in some other metric's name, return an empty string for that metric.

## Worked solution and explanation

### Why this problem exists in real interviews

This is a common DE phone-screen pattern. It tests whether you can scan substrings of one string against the rest of a collection, manage a tiebreak rule, and reason about the worst-case complexity of substring enumeration. The metrics-platform framing is incidental; the underlying skill is string scanning with a uniqueness predicate across a list.

> **Trick to Solving**
>
> Iterate substring lengths from 1 upward. For each length, generate every substring of metrics[i], dedupe, sort lexicographically, and return the first one that does not appear in any other metric. The length-major outer loop guarantees shortest, the lex sort guarantees the tiebreak.

---

### Break down the requirements

#### Step 1: Per-metric scope: only substrings of metrics[i\] matter

For metric m, you only care about substrings of m. The 'others' set is every other metric joined into a haystack. Substring containment in that haystack is the cheap predicate.

#### Step 2: Length-major enumeration enforces 'shortest then lex'

Loop length L from 1 to len(metric). At each L, build the set of substrings of that length, sort lexicographically, and pick the first one not present in the others-haystack. Length-major order plus lex sort enforces both rules in one pass.

#### Step 3: Build the others-haystack with a sentinel

Concatenate the other metrics with a sentinel character that cannot appear in metric names (e.g., a NUL byte). This prevents a candidate from matching a substring that straddles two metrics in the haystack.

#### Step 4: Handle the no-unique-substring case

Duplicate metrics, or a metric that is itself a substring of another, will exhaust every length without finding a unique substring. Return '' for those indices.

---

### The solution

**Length-major substring enumeration with lex tiebreak**

```python
def shortest_unique_metric_tag(metrics: list[str]) -> list[str]:
    result = []
    for i, m in enumerate(metrics):
        others = '\x00'.join(metrics[:i] + metrics[i+1:])
        found = ''
        for length in range(1, len(m) + 1):
            candidates = sorted({m[s:s+length] for s in range(len(m) - length + 1)})
            unique = next((c for c in candidates if c not in others), None)
            if unique is not None:
                found = unique
                break
        result.append(found)
    return result
```

> **Time and Space Complexity**
>
> **Time:** O(n * m^2 * (n*m)) in the worst case, where n is the number of metrics and m is the longest metric. For each metric we examine O(m^2) substrings; each containment check is O(n*m).
> 
> **Space:** O(n*m) for the joined haystack plus O(m) per length for the candidate set.

> **Interviewers Watch For**
>
> Strong candidates volunteer the lex tiebreak before coding, and call out that duplicates or one-string-contains-another forces an empty-string return. Weak candidates rank by start-index instead of lex order, which silently passes small examples and fails on cases with multiple equal-length unique candidates.

> **Common Pitfall**
>
> Building the others-haystack with simple concatenation lets a candidate match a substring straddling two metric boundaries. Always join with a sentinel character that cannot appear in metric names.

---

## Common follow-up questions

- How would you scale this to a million metrics? _(Tests scaling reasoning. Naive enumeration is fine for a few hundred metrics; suffix automata or generalized suffix trees push it to near-linear in the total input size.)_
- If metrics are added one at a time, how would you avoid recomputing every answer from scratch? _(Tests incremental computation. A new metric only invalidates metrics whose unique substring is now non-unique; those are the only indices needing recomputation.)_
- What if a metric can appear in the list multiple times and you want the shortest substring that uniquely identifies it across the corpus? _(Tests broadening the predicate. Substring existence becomes substring count; you want the shortest substring whose occurrence count across the corpus is exactly one.)_
- How would you handle metric names with unicode characters or different normalization forms? _(Tests language-edge thinking. Unicode normalization (NFC vs NFD) can turn equal-looking strings into different byte sequences, breaking naive substring checks.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/shortest_unique_metric_tag)
- [Python Interview Questions](https://datadriven.io/python-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.