# The Tag Order

> Tags arrived in chaos. The system needs them in line.

Canonical URL: <https://datadriven.io/problems/alphabetical_tag_sort>

Domain: SQL · Difficulty: hard · Seniority: L3

## Problem

The data quality team is standardizing event metadata. For all signup events, normalize the tags column to lowercase. Return the event ID and the normalized tags, ordered by event ID.

## Worked solution and explanation

### What this is really asking

`tags` lives on a 200M-row table partitioned by `event_timestamp`, but the filter is on `event_type`. So partitioning does not help. The work is one LOWER, one signup filter, one sort by `event_id`.

---

### Break down the requirements

#### Step 1: Filter to signups

`event_type = 'signup'` in WHERE, before projecting. No reason to LOWER rows you discard.

#### Step 2: Normalize and sort

Wrap `tags` in LOWER, alias as `normalized_tags`, then ORDER BY `event_id`. Do not mutate the stored column; this is a read-side transform.

---

### The solution

**NORMALIZED SIGNUP TAGS**

```sql
SELECT
  event_id,
  LOWER(tags) AS normalized_tags
FROM event_data
WHERE event_type = 'signup'
ORDER BY event_id
```

> **Cost Analysis**
>
> Partitioning is by `event_timestamp`, so a signup-only filter prunes nothing. Expect a full 200M-row scan plus sort unless `event_type` is indexed.

> **Interviewers Watch For**
>
> Clean LOWER alias, knowing `event_timestamp` partitioning does not help here, and not mutating the source column. Bonus: asking whether `tags` is string or array.

> **Common Pitfall**
>
> Writing UPDATE to normalize in place. The prompt asks for a result set, not a schema change; UPDATE on 200M rows rewrites partitions.

> **The False Start**
>
> First instinct is `ORDER BY normalized_tags` because that column is the headline transform. That sorts alphabetically by tag string and the grader rejects it. Pivot to `ORDER BY event_id`; the prompt's last clause names the sort key.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- How would you persist the normalized form without rewriting 200M rows? _(Add a generated column `LOWER(tags)` or a materialized view scoped to signups.)_
- What if `tags` is comma-separated and you want lowercase per token? _(Split into rows, LOWER each token, then re-aggregate.)_
- How do you make this partition-prunable? _(Add a time predicate alongside `event_type = 'signup'`, or repartition by `event_type`.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/alphabetical_tag_sort)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.