# Machine Process Event Log Schema

> Machines fire events. Pair them up before they bury you.

Canonical URL: <https://datadriven.io/problems/machine_process_event_log_schema>

Domain: Data Modeling · Difficulty: medium · Seniority: L5

## Problem

We collect structured logs from a fleet of machines. Each machine runs many processes, and we need to track when each process runs and how long it takes. Data scientists need to query metrics like average elapsed time per process and plot process timelines across machines. Design the data model, and describe how you'd load this data via an ETL.

## Worked solution and explanation

### Why this problem exists in real interviews

This probes whether a candidate handles **composite keys on semi-unique identifiers** and reasons about unpaired start/end events as an ETL concern. process_id is unique per machine but not globally, which is the subtle trap that separates pass from fail.

> **Trick to Solving**
>
> When a prompt lists start/end events and a per-machine process_id, the trick is to recognize the uniqueness scope. Before drawing tables, a strong candidate asks: is process_id unique across machines, and what happens when an end event is missing?
> 
> 1. Make (machine_id, process_id, event_type, ts) the logical uniqueness boundary
> 2. Keep machines as a dimension
> 3. Treat process_events as an append-only fact
> 4. Pair start and end at query time, not ingest

---

### Break down the requirements

#### Step 1: Declare the event grain

One row in `process_events` equals one lifecycle event (start or end) of one process on one machine at one timestamp.

#### Step 2: Fix the composite key

`process_id` is local to a machine. The logical uniqueness is `(machine_id, process_id, event_type, ts)`, not `process_id` alone.

#### Step 3: Dimensionalize machines

`machines` carries region, hardware class, and service tier. Every fact FK points to it so fleet-level rollups are a single GROUP BY.

#### Step 4: Pair events at query time

Duration is a self-join or window function over start and end events. Persisting duration on ingest hides unpaired events.

---

### The solution

Below is one defensible approach. The grain is one event per row, and process_id is an attribute not a foreign key, because its uniqueness is scoped to the machine.

> **Why this works**
>
> Treating process_id as an attribute respects its actual uniqueness scope. Pairing events at query time means a missing end event is visible as data quality, not hidden by silent imputation at ingest.

> **Interviewers watch for**
>
> A strong candidate explicitly asks about uniqueness of process_id and refuses to make it a PK across the fleet. They also mention event time vs ingestion time when loaded_at lags event_ts.

> **Common pitfall**
>
> Declaring `process_id` as a primary key or FK to a `processes` table. The first time two machines reuse the same local counter, inserts silently collide and metrics become nonsense.

---

### The analysis pattern

**Average process duration by machine region**

```sql
WITH paired AS (
    SELECT
        s.machine_id,
        s.process_id,
        s.event_ts AS started_at,
        MIN(e.event_ts) AS ended_at
    FROM process_events s
    JOIN process_events e
      ON e.machine_id = s.machine_id
     AND e.process_id = s.process_id
     AND e.event_type = 'end'
     AND e.event_ts > s.event_ts
    WHERE s.event_type = 'start'
    GROUP BY s.machine_id, s.process_id, s.event_ts
)
SELECT
    m.region,
    AVG(EXTRACT(EPOCH FROM (p.ended_at - p.started_at))) AS avg_duration_sec
FROM paired p
JOIN machines m ON m.machine_id = p.machine_id
GROUP BY m.region
```

---

### Trade-offs and alternatives

**Append-only events**

process_events with start and end rows.

* Ingest is O(1) append
* Missing end events remain visible
* Duration queries pay self-join cost

**Pre-paired runs table**

ETL pairs events into a single process_runs row with start_ts and end_ts.

* Fast duration queries
* Missing end events become late-arriving updates
* Ingest has to buffer or wait

---

## Common follow-up questions

- How do you detect processes that never emitted an end event? _(Tests whether the candidate reaches for a LEFT JOIN or a missing-mate pattern.)_
- At 10M events per hour, what changes about this schema? _(Tests partitioning by event_ts and whether process pairing moves into a streaming job.)_
- Some end events arrive hours after their start because of buffering. How do you guard against it? _(Tests the distinction between event time and ingestion time, plus watermark thinking.)_
- How would you add CPU and memory measures per process without breaking the grain? _(Tests whether the candidate adds them to the end event or introduces a separate measurement fact.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/machine_process_event_log_schema)
- [Data Modeling Interview Questions](https://datadriven.io/data-modeling-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.