# The Table Thief

> Somewhere in that query, tables are hiding.

Canonical URL: <https://datadriven.io/problems/the_table_thief>

Domain: Python · Difficulty: medium · Seniority: L4

## Problem

Given a SQL query string, extract all table names following FROM or JOIN clauses. Return a sorted list with no duplicates. Assume simple identifiers without schema prefixes; aliases may follow and should be stripped.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests **regex-based text extraction** and SQL parsing fundamentals. Extracting table names from queries is a common task in data lineage tools, query analyzers, and migration scripts.

---

### Break down the requirements

#### Step 1: Identify FROM and JOIN keywords

Table names follow `FROM` and any variant of `JOIN` (LEFT JOIN, INNER JOIN, etc.). The keyword matching must be case-insensitive.

#### Step 2: Extract the table name that follows

The table name is the next word after the keyword. Aliases (e.g., `users u`) may follow but should be ignored.

#### Step 3: Deduplicate and sort

Return unique table names in sorted order.

---

### The solution

**Regex extraction from FROM and JOIN clauses**

```python
import re
def extract_tables(query: str) -> list:
    pattern = re.compile(r'(?:FROM|JOIN)\s+(\w+)', re.IGNORECASE)
    found = pattern.findall(query)
    unique = set()
    for name in found:
        unique.add(name)
    result = sorted(unique)
    return result
```

> **Time and Space Complexity**
>
> **Time:** O(n) where n is the length of the query string. Regex scanning is linear.
> 
> **Space:** O(t) where t is the number of unique table names.

> **Interviewers Watch For**
>
> Using `re.IGNORECASE` to handle mixed-case SQL keywords. Queries may use `FROM`, `from`, or `From` interchangeably.

> **Common Pitfall**
>
> The regex `(?:FROM|JOIN)` uses a non-capturing group, which is important. A capturing group would interfere with `findall` and return the keyword instead of the table name.

---

## Common follow-up questions

- What about subqueries with nested FROM clauses? _(Tests recursive or multi-pass parsing for complex SQL.)_
- How would you handle schema-prefixed tables like `public.users`? _(Tests extending the regex pattern to capture dotted identifiers.)_
- What if table names were quoted with backticks or double quotes? _(Tests handling quoted identifiers in the regex.)_
- How do production SQL parsers like sqlglot handle this? _(Tests awareness of proper AST-based SQL parsing vs. regex heuristics.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_table_thief)
- [Python Interview Questions](https://datadriven.io/python-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.