# The Trade Log Aggregator

> Every trade left a footprint.

Canonical URL: <https://datadriven.io/problems/the_trade_log_aggregator>

Domain: Python · Difficulty: medium · Seniority: L4

## Problem

Given CSV trade log lines (date,process,host,log_message,bytes), extract the exchange name from process (everything before the first underscore). Per (date, exchange), sum bytes. Return a list of dicts {'date', 'exchange', 'total_bytes'} sorted by (date ascending, exchange ascending).

## Worked solution and explanation

### Why this problem exists in real interviews

This tests **CSV parsing**, **string extraction**, and **nested dict aggregation**. Extracting structured fields from semi-structured log data and aggregating by multiple dimensions is a core data engineering skill.

---

### Break down the requirements

#### Step 1: Parse each CSV line into fields

Split on comma to get date, process, host, log_message, and bytes.

#### Step 2: Extract the exchange name from the process field

The exchange name is the portion before the first underscore in the process column.

#### Step 3: Accumulate bytes by day and by exchange-day

Maintain two dicts: one for total bytes per day, one for bytes per exchange per day.

---

### The solution

**CSV split with dual-level aggregation**

```python
def aggregate_trades(lines: list) -> tuple:
    daily_totals = {}
    exchange_daily = {}
    for line in lines:
        parts = line.split(",")
        date = parts[0]
        process = parts[1]
        bytes_val = int(parts[4])
        exchange = process.split("_")[0]
        daily_totals[date] = daily_totals.get(date, 0) + bytes_val
        if date not in exchange_daily:
            exchange_daily[date] = {}
        exchange_daily[date][exchange] = exchange_daily[date].get(exchange, 0) + bytes_val
    return daily_totals, exchange_daily
```

> **Time and Space Complexity**
>
> **Time:** O(n) where n is the number of log lines. Each line is parsed and aggregated in constant time.
> 
> **Space:** O(d * e) where d is the number of distinct dates and e is the number of distinct exchanges.

> **Interviewers Watch For**
>
> Splitting on the first underscore only (`split('_')[0]`) correctly handles process names like `NYSE_trader_1` by extracting just `NYSE`.

> **Common Pitfall**
>
> Assuming the bytes field is always valid. Production code should handle malformed lines with try/except, but the interview context typically guarantees clean input.

---

## Common follow-up questions

- How would you handle quoted CSV fields containing commas? _(Tests using the `csv` module instead of naive `split(',')`.)_
- What if logs arrived out of chronological order? _(Tests sorting by date before or after aggregation.)_
- How would you process a 10GB log file? _(Tests streaming line-by-line processing instead of loading the entire file.)_
- What if the process field had no underscore? _(Tests defensive parsing with a fallback when `split('_')` yields only one part.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_trade_log_aggregator)
- [Python Interview Questions](https://datadriven.io/python-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.