# Stream-Process a Large CSV

> Too big to load. Read what you can.

Canonical URL: <https://datadriven.io/problems/stream_process_a_large_csv>

Domain: Python · Difficulty: hard · Seniority: L5

## Problem

Given an iterable of CSV lines (each line a string, no trailing newline) and a column name, stream the iterable lazily and return the sum of numeric values in that column. The first line is the header. Skip rows whose target-column value is missing or not numeric. Return a float. Process row-by-row without materializing the whole iterable in memory.

## Worked solution and explanation

### Why this problem exists in real interviews

Processing a multi-gigabyte CSV without loading it all into memory tests **streaming I/O**, **the csv module**, and **error handling** for dirty data. This is a core data engineering competency.

> **Trick to Solving**
>
> Use Python's `csv.DictReader`, which reads one row at a time. Wrap the value conversion in a `try`/`except` to skip non-numeric or missing values gracefully.

---

### Break down the requirements

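#### Step 1: Create a streaming reader over the input

Pass the input to `csv.DictReader`, which accepts any iterable of strings: the iterable of lines from the problem statement, or a file object returned by `open()`. It parses one row per iteration, so the whole input is never held in memory.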
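The lazy behavior can be observed directly. The following is a minimal sketch, assuming a hypothetical generator `generate_lines` that stands in for a large source and a `consumed` list that records which lines the reader has pulled so far:

```python
import csv

consumed = []  # records which lines the reader has actually pulled


def generate_lines():
    """Hypothetical stand-in for a large source; yields one CSV line at a time."""
    for line in ["id,amount", "1,10.5", "2,oops", "3,4.5"]:
        consumed.append(line)
        yield line


reader = csv.DictReader(generate_lines())
first_row = next(reader)  # pulls the header plus one data row, nothing more

print(first_row)
print(consumed)
```

After the single `next()` call, only the header and the first data line have been requested from the generator; the remaining lines are untouched until the loop asks for them.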

#### Step 2: Extract and convert the target column value

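For each row, look up the target column and convert the value with `float()`. `csv.DictReader` fills fields missing from a short row with `None`, and `float()` raises `TypeError` for `None` and `ValueError` for empty or non-numeric strings. Skip the row in either case.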
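The conversion step can be isolated in a small helper. This is a sketch only: `parse_number` and its `require_finite` flag are hypothetical names, not part of the original solution. One detail worth knowing is that `float()` also accepts strings such as `"nan"` and `"inf"`; the problem statement does not say whether those count as numeric, so the flag is off by default.

```python
import math


def parse_number(raw_value, require_finite=False):
    """Return raw_value as a float, or None if the row should be skipped.

    parse_number and require_finite are hypothetical names for this sketch.
    """
    try:
        value = float(raw_value)  # TypeError for None, ValueError for "abc" or ""
    except (TypeError, ValueError):
        return None
    # float() also parses "nan" and "inf"; optionally treat those as invalid.
    if require_finite and not math.isfinite(value):
        return None
    return value
```

For example, `parse_number("10.5")` returns `10.5`, while `parse_number("")`, `parse_number(None)`, and `parse_number("abc")` all return `None`.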

#### Step 3: Accumulate the sum

Add valid values to a running total.

---

### The solution

**Streaming CSV sum with row-by-row processing**

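```python
import csv


def sum_column(filepath, column_name):
    total = 0.0
    # newline='' is what the csv docs recommend when opening files for the csv module.
    with open(filepath, 'r', newline='') as f:
        reader = csv.DictReader(f)  # yields one dict per row, lazily
        for row in reader:
            raw_value = row.get(column_name)  # None if the column or field is missing
            if raw_value is None:
                continue
            try:
                total += float(raw_value)
            except (ValueError, TypeError):
                continue  # skip empty or non-numeric values
    return total
```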
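The problem statement describes the input as an iterable of lines rather than a file path. The same loop adapts directly, since `csv.DictReader` accepts any iterable of strings. A minimal sketch, assuming the hypothetical names `sum_column_from_lines` and `synthetic_lines` (a generator that produces rows on demand, so nothing is materialized):

```python
import csv


def sum_column_from_lines(lines, column_name):
    """Sum a numeric column from any iterable of CSV lines (header first)."""
    total = 0.0
    # DictReader consumes `lines` lazily, one row per loop iteration.
    for row in csv.DictReader(lines):
        try:
            total += float(row.get(column_name))
        except (ValueError, TypeError):
            # Missing column, short row (None), empty string, or non-numeric text.
            continue
    return total


def synthetic_lines(n_rows):
    """Hypothetical generator: yields a header and n_rows data lines on demand."""
    yield "id,amount"
    for i in range(n_rows):
        yield f"{i},1.5" if i % 2 == 0 else f"{i},n/a"


result = sum_column_from_lines(synthetic_lines(1000), "amount")
print(result)  # 750.0: 500 valid rows of 1.5, 500 skipped
```

Because a file object is itself an iterable of lines, the file-based version can be expressed as `sum_column_from_lines(f, column_name)` inside a `with open(...)` block.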

> **Time and Space Complexity**
>
> **Time:** O(n), where n is the number of rows (assuming bounded row length). Each row is parsed once.
>
> **Space:** O(1) with respect to the number of rows, beyond I/O buffers. Only the current row and a running total are held, not the full dataset.

> **Interviewers Watch For**
>
> Using `csv.DictReader` for streaming row-by-row reads, and `try`/`except` for non-numeric values. Candidates who call `f.read()` on the entire file fail the memory constraint.

> **Common Pitfall**
>
> Loading the entire file with `f.read()` or `f.readlines()`, or calling `list()` on the input iterable. For a multi-gigabyte file, this can exhaust memory. Always process row by row.

---

## Common follow-up questions

- What if the file is gzip-compressed? _(Tests using `gzip.open` in text mode, `'rt'`, in place of `open`; its default mode is binary.)_
- How would you process the file in parallel? _(Tests splitting by byte offset or using `multiprocessing.Pool` with chunk boundaries.)_
- What if the CSV uses a non-comma delimiter? _(Tests passing `delimiter` to `csv.DictReader`.)_
- How would you compute both sum and count without a second pass? _(Tests maintaining two accumulators in the same loop.)_
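Three of these follow-ups (gzip input, a custom delimiter, and sum plus count in one pass) can be combined in one short sketch. The function names `sum_and_count` and `sum_and_count_gzip` are hypothetical, and parallel processing is left out because it depends on how chunk boundaries are chosen.

```python
import csv
import gzip


def sum_and_count(lines, column_name, delimiter=","):
    """Return (sum, count) of numeric values in one pass over `lines`."""
    total, count = 0.0, 0
    for row in csv.DictReader(lines, delimiter=delimiter):
        try:
            value = float(row.get(column_name))
        except (ValueError, TypeError):
            continue  # skip missing or non-numeric values
        total += value
        count += 1
    return total, count


def sum_and_count_gzip(path, column_name, delimiter=","):
    """Stream a gzip-compressed CSV; 'rt' makes gzip.open yield str lines."""
    with gzip.open(path, "rt", newline="") as f:
        return sum_and_count(f, column_name, delimiter)
```

Keeping the count alongside the sum also makes the mean available as `total / count` once the loop finishes, provided `count` is non-zero.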

---

Source: DataDriven (https://datadriven.io).