# Log Parsing Pipeline Schema

> Raw text files, terabytes of them, full of buried signals and cryptic error codes.

Canonical URL: <https://datadriven.io/problems/log_parsing_pipeline_schema>

Domain: Data Modeling · Difficulty: medium · Seniority: L5

## Problem

We ingest about 2TB of raw application logs daily. Right now they're just text files in S3. We need a structured schema so analysts can query error patterns, correlate events within sessions, and track error rates by service. Can you design the data model?

## Worked solution and explanation

### Why this problem exists in real interviews

This probes whether a candidate recognizes **cardinality skew** in log data and uses correlation keys (request_id, session_id) as the backbone for cross-service tracing. The design that ignores the 1000:1 ratio between debug and error events pays for it every day at query time.

> **Trick to Solving**
>
> When a prompt mixes “2TB per day” with “query error patterns,” the trick is to split the hot path (errors) from the cold path (debug/info) and to invest in a correlation key. Before drawing tables, a strong candidate asks: what is the ratio of error to info logs, and what key ties events across services?
> 
> 1. Split error_events from log_events
> 2. Pull services into a dimension with team and tier
> 3. Normalize error_types as a dimension
> 4. Keep request_id and session_id as first-class columns

---

### Break down the requirements

#### Step 1: Split by cardinality

`log_events` absorbs the firehose; `error_events` is a narrower, higher-value fact. Analysts who care about errors never pay the cost of scanning debug logs.

#### Step 2: Dimensionalize services

`services` is a conformed dimension with team, tier, and criticality. Every fact FK points to it so that ownership is one join away.

#### Step 3: Classify errors via a dimension

`error_types` maps raw message patterns to canonical categories. `error_events.error_type_id` is the FK, so taxonomy changes do not rewrite the fact.

#### Step 4: Preserve correlation keys

`request_id` and `session_id` live on the fact rows so that a single user action can be reconstructed across ten services with a WHERE clause.

---

### The solution

Below is one conceptually sound model. The split between log_events and error_events is the load-bearing decision: it is how a 2TB-per-day feed remains queryable.

> **Why this works**
>
> Two facts keyed by the same correlation columns let analysts slice errors cheaply and still join back to the firehose when they need full context. The error_types dimension turns taxonomy changes into a one-row UPDATE.

> **Interviewers watch for**
>
> A strong candidate raises the cardinality ratio early and reaches for a correlation key before schema specifics. They also name a retention policy: errors kept longer than info.

> **Common pitfall**
>
> A single `logs` table with a `level` column. Error queries scan the entire firehose and the partition strategy has to accommodate two very different access patterns. At 2TB per day this is the difference between seconds and minutes on every hot query.

---

### The analysis pattern

**Top error categories by service last hour**

```sql
SELECT
    s.name AS service,
    et.category,
    COUNT(*) AS error_count
FROM error_events e
JOIN services s ON s.service_id = e.service_id
JOIN error_types et ON et.error_type_id = e.error_type_id
WHERE e.event_ts >= NOW() - INTERVAL '1 hour'
GROUP BY s.name, et.category
ORDER BY error_count DESC
LIMIT 20
```

---

### Trade-offs and alternatives

**Split fact tables**

Hot error path separate from firehose.

* Error queries are fast and narrow
* Retention can differ per fact
* Producers must classify at write time

**Single logs table with level column**

Unified logs table filtered by level.

* One ingest path
* Every error query scans everything
* Retention is one-size-fits-all

---

## Common follow-up questions

- How would you enforce request_id propagation across 40 services? _(Tests whether the candidate names a tracing context standard and upstream instrumentation.)_
- Debug logs must be kept 3 days, errors 90 days, audit 7 years. How is that enforced? _(Tests per-fact partitioning and TTL policies.)_
- How do you detect a new error type the classifier has never seen? _(Tests whether error_type_id is nullable with an unclassified bucket and a backfill path.)_
- At 2TB per day, where does the ingest bottleneck move, and how do you shard? _(Tests scale awareness: batch vs streaming, partition keys, and hot-shard mitigation.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/log_parsing_pipeline_schema)
- [Data Modeling Interview Questions](https://datadriven.io/data-modeling-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.