# The Chunked Reader

> Too big for memory. Read in pieces.

Canonical URL: <https://datadriven.io/problems/the_chunked_reader>

Domain: Python · Difficulty: medium · Seniority: L3

## Problem

Given a list of lines and a positive integer `chunk_size`, write a generator that yields successive chunks, each a list of at most `chunk_size` lines. The test harness collects the yielded chunks into a list of lists.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests understanding of **generators and yield semantics**, which are essential for memory-efficient data processing. Interviewers use this to check whether a candidate knows how to process large datasets without loading everything into memory at once.

> **Trick to Solving**
>
> The key insight is using **index-based slicing** inside a while loop with `yield`. Advance a pointer by `chunk_size` on each iteration and slice the input list. The last slice naturally handles the remainder.

---

### Break down the requirements

#### Step 1: Use a position pointer starting at 0

Track your current read position in the list. Because you only read through this pointer, you never need to modify the input.

#### Step 2: Yield slices of chunk_size

On each iteration, yield `lines[pos:pos + chunk_size]`. Python slicing handles the case where fewer than `chunk_size` elements remain.
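
A quick check of that slicing behavior, as a minimal sketch. The five-element list and the variable names are made up for illustration:

```python
# Hypothetical sample data, used only to illustrate how slices are clamped.
lines = ["a", "b", "c", "d", "e"]
chunk_size = 2

full_chunk = lines[0:0 + chunk_size]  # both indices are in range
last_chunk = lines[4:4 + chunk_size]  # stop index 6 is past the end, so it is clamped
past_end = lines[6:6 + chunk_size]    # start index past the end gives an empty list

print(full_chunk)  # ['a', 'b']
print(last_chunk)  # ['e']
print(past_end)    # []
```

No `IndexError` is raised in any of these cases, which is why the remainder needs no special handling.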

#### Step 3: Advance the pointer

Increment `pos` by `chunk_size` after each yield. The loop ends when `pos` reaches or exceeds the list length.

---

### The solution

**Generator with index-based slicing**

```python
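def chunked_read(lines: list, chunk_size: int):
    """Yield successive chunks of at most chunk_size lines."""
    pos = 0  # index of the first line of the next chunk
    while pos < len(lines):
        # Slicing clamps the end index, so the final chunk may be shorter.
        yield lines[pos:pos + chunk_size]
        pos += chunk_size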
```
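
To see the lazy behavior in action, the sketch below repeats the function so that it runs on its own, then drives it with `next()` and `list()`. The sample input is invented for illustration.

```python
def chunked_read(lines: list, chunk_size: int):
    pos = 0
    while pos < len(lines):
        yield lines[pos:pos + chunk_size]
        pos += chunk_size


sample = ["l1", "l2", "l3", "l4", "l5"]  # hypothetical input

gen = chunked_read(sample, 2)
first = next(gen)  # runs the body only up to the first yield
rest = list(gen)   # drains the remaining chunks

print(first)  # ['l1', 'l2']
print(rest)   # [['l3', 'l4'], ['l5']]

# Edge cases: an empty input yields nothing, and an oversized
# chunk_size yields a single chunk holding every line.
empty_result = list(chunked_read([], 3))
oversized_result = list(chunked_read(sample, 10))
print(empty_result)      # []
print(oversized_result)  # [['l1', 'l2', 'l3', 'l4', 'l5']]
```

Calling `chunked_read(...)` does no work by itself. Each chunk is built only when the consumer asks for it.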

> **Time and Space Complexity**
>
> **Time:** O(n) total across all yields, where n is the number of elements. Each element is visited exactly once.
> 
> **Space:** O(chunk_size) per yielded chunk. The generator itself uses O(1) bookkeeping beyond the current chunk.

> **Interviewers Watch For**
>
> Do you use `yield` correctly? Candidates who collect all chunks into a list and return it miss the entire point of generators and memory efficiency.

> **Common Pitfall**
>
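> Using `range(0, len(lines), chunk_size)` with a for loop works too, but some candidates accidentally use `range(chunk_size)`, which counts from 0 to `chunk_size - 1` instead of stepping through the list in chunk-sized strides. Depending on the loop body, that yields only the first chunk or a handful of overlapping slices.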
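
A sketch of the `range`-based variant next to one possible form of the mistake. The function names `chunked_read_range` and `chunked_read_buggy` and the sample data are hypothetical, and the buggy version is only one way the error can show up:

```python
def chunked_read_range(lines: list, chunk_size: int):
    # range(start, stop, step) produces the start offset of every chunk.
    for start in range(0, len(lines), chunk_size):
        yield lines[start:start + chunk_size]


def chunked_read_buggy(lines: list, chunk_size: int):
    # Bug: range(chunk_size) counts 0, 1, ..., chunk_size - 1, so the
    # offsets advance by 1 instead of by chunk_size.
    for start in range(chunk_size):
        yield lines[start:start + chunk_size]


data = [1, 2, 3, 4, 5]  # hypothetical input

correct = list(chunked_read_range(data, 2))
wrong = list(chunked_read_buggy(data, 2))

print(correct)  # [[1, 2], [3, 4], [5]]
print(wrong)    # [[1, 2], [2, 3]]
```

In this form the buggy version produces overlapping slices and never reaches the end of the input.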

---

## Common follow-up questions

- What if the input were a file handle instead of a list? _(Tests whether you can adapt from list slicing to `readline()` or `read(n)` in a streaming context.)_
- How would you process chunks in parallel? _(Tests knowledge of `concurrent.futures` or multiprocessing patterns with generators.)_
- What happens if chunk_size is larger than the input? _(Tests edge case awareness: the generator yields one chunk containing all elements.)_
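
For the file-handle follow-up, one possible adaptation is sketched below. It swaps list slicing for `itertools.islice` over an iterator rather than calling `readline()` directly, and it uses `io.StringIO` as a stand-in for a real file. The name `chunked_read_stream` is hypothetical.

```python
import io
from itertools import islice
from typing import Iterable, Iterator, List


def chunked_read_stream(source: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    it = iter(source)  # works for file objects, generators, and lists
    while True:
        # Pull at most chunk_size items without reading the rest of the source.
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return  # source exhausted
        yield chunk


# io.StringIO stands in for an open text file in this sketch.
fake_file = io.StringIO("a\nb\nc\nd\ne\n")
chunks = list(chunked_read_stream(fake_file, 2))
print(chunks)  # [['a\n', 'b\n'], ['c\n', 'd\n'], ['e\n']]
```

Iterating a text file yields lines with their trailing newlines, so a caller may want to strip them. Only one chunk of lines is held in memory at a time, which slicing a fully loaded list cannot offer.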

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.