# The Messy Pipeline

> The upstream API has no idea what a schema is.

Canonical URL: <https://datadriven.io/problems/the_messy_pipeline>

Domain: Python · Difficulty: easy · Seniority: L4

## Problem

Given a list of mixed items (numbers, numeric strings, nested lists of those, None) and a float multiplier, produce a flat list of floats: extract every numeric value (parse numeric strings), skip None, recurse into lists one level, multiply each extracted number by multiplier. Return the resulting floats in traversal order.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests **type checking, coercion, and list flattening**, all common in real ETL pipelines where upstream data arrives in mixed formats. Interviewers want to see careful type dispatch and defensive conversion logic.

---

### Break down the requirements

#### Step 1: Classify each element by type

Check whether each element is `None`, a number (`int` or `float`), a numeric string, or a list of numbers and numeric strings.
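
As a minimal sketch of this classification step, the hypothetical helper `classify` below (not part of the original solution) labels each element. Treating unparseable strings as `"other"` is an assumption, since the problem statement only promises numeric strings.

```python
def classify(item):
    """Return a coarse label for one pipeline element (illustrative helper)."""
    if item is None:
        return "none"
    if isinstance(item, list):
        return "list"
    if isinstance(item, (int, float)):
        return "number"
    if isinstance(item, str):
        try:
            float(item)
        except ValueError:
            return "other"  # assumption: non-numeric strings are not expected
        return "numeric_string"
    return "other"


labels = [classify(x) for x in [3, "4.5", None, [1, "2"], "n/a"]]
print(labels)
```

The `None` check comes first so that the later branches only ever see real values.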

#### Step 2: Extract and convert numeric values

Skip `None` values. Convert numeric strings to floats with `float()`. Flatten nested lists one level deep by extracting each element.
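
To make the conversion step concrete, the sketch below shows what the built-in `float()` accepts and what it rejects. The helper `conversion_error` is a hypothetical name introduced only for this illustration.

```python
# Each of these conversions succeeds with the built-in float().
assert float(3) == 3.0
assert float("4.5") == 4.5
assert float("  7 ") == 7.0      # surrounding whitespace is ignored
assert float("1e3") == 1000.0    # scientific notation is accepted


def conversion_error(value):
    """Return the name of the exception float() raises, or None on success."""
    try:
        float(value)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return None


# None and lists raise TypeError; malformed strings raise ValueError.
errors = {repr(v): conversion_error(v) for v in [None, "abc", "", [1, 2]]}
print(errors)
```

This is why `None` and lists have to be handled before the value ever reaches `float()`.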

#### Step 3: Apply the multiplier

Multiply each extracted float by the given multiplier.

#### Step 4: Return in traversal order

Collect all results into a flat list and return it without reordering, so the output follows the order in which the values were encountered, as the problem statement requires.

---

### The solution

**Type dispatch with flatten and multiply**

```python
def clean_pipeline(data, multiplier):
    values = []
    for item in data:
        if item is None:
            continue  # skip missing values
        if isinstance(item, list):
            # Recurse one level: nested elements are numbers or numeric strings
            for sub in item:
                if sub is not None:  # defensive: also skip None inside a nested list
                    values.append(float(sub) * multiplier)
        else:
            # float() handles ints, floats, and numeric strings alike
            values.append(float(item) * multiplier)
    # No sort: the problem asks for traversal order
    return values
```

> **Time and Space Complexity**
>
> **Time:** O(n), where n is the total number of numeric values extracted; each value is visited and converted once.
>
> **Space:** O(n) for the collected values list.

> **Interviewers Watch For**
>
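> Do you dispatch on container types such as `list` before attempting numeric conversion, and do you know how Python's numeric types relate? `bool` is a subclass of `int`, so `isinstance(True, (int, float))` is `True`; if booleans can appear and should not count as numbers, test for `bool` before the numeric check. Checking for `None` first with `is None` avoids treating falsy values such as `0` or `""` as missing.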
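
The `bool` subtlety can be checked directly. The helper `is_real_number` below is a hypothetical name, and excluding booleans is an assumption: the problem statement does not say whether they can appear in the input.

```python
def is_real_number(item):
    """True for int/float values, but not for booleans (hypothetical helper)."""
    # bool must be excluded explicitly because isinstance(True, int) is True.
    return isinstance(item, (int, float)) and not isinstance(item, bool)


assert isinstance(True, int)   # bool is a subclass of int
assert float(True) == 1.0      # so float() converts it without complaint

checks = [is_real_number(x) for x in [True, 0, 2.5, None, "3"]]
print(checks)
```

Note that `0` is still accepted as a number here, which is exactly what a truthiness test such as `if not item` would get wrong.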

> **Common Pitfall**
>
> Using `type(item) == int` instead of `isinstance`. This fails for subclasses and is not considered Pythonic. `isinstance` handles inheritance correctly.
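
A short sketch of the difference, using a hypothetical `int` subclass named `RowCount` to stand in for any numeric type defined elsewhere:

```python
class RowCount(int):
    """Hypothetical int subclass, standing in for a library-defined numeric type."""


n = RowCount(5)
exact_match = type(n) == int     # False: exact comparison ignores inheritance
inherits = isinstance(n, int)    # True: subclasses are accepted
print(exact_match, inherits, float(n) * 2.0)
```

A pipeline that relied on the exact-type comparison would silently drop `n`, even though it converts to a float without trouble.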

---

## Common follow-up questions

- What if strings could be non-numeric and should be silently skipped? _(Tests wrapping float() in try/except for graceful error handling.)_
- What if nested lists could be arbitrarily deep? _(Tests recursive flattening.)_
- How would you handle this at scale with millions of records? _(Tests streaming approach vs materializing the full list.)_
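
One possible way to address all three follow-ups at once is a recursive generator. This is a sketch under stated assumptions, not the reference answer: the name `iter_clean` is hypothetical, and silently dropping unparseable values is one policy among several.

```python
def iter_clean(data, multiplier):
    """Lazily yield cleaned floats from arbitrarily nested lists."""
    for item in data:
        if item is None:
            continue
        if isinstance(item, list):
            # Recurse to any depth instead of stopping at one level.
            yield from iter_clean(item, multiplier)
            continue
        try:
            value = float(item)
        except (TypeError, ValueError):
            # Assumption: unparseable values are dropped silently.
            continue
        yield value * multiplier


result = list(iter_clean([1, "2.5", "n/a", [3, [None, "4"]]], 2.0))
print(result)
```

Because values are yielded one at a time, a caller can consume them in a loop or write them out in batches without materializing the full list. Very deep nesting would still run into Python's recursion limit, so an explicit stack may be preferable for untrusted input.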

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.