# Disappearing Ink

Canonical URL: <https://datadriven.io/problems/disappearing-ink-shape-comes-later>

Domain: Pipeline Design · Difficulty: easy · Seniority: mid

## Problem

We run a photo and video messaging app where about 10 billion engagement events land every day: opens, replays, story views, ad views. A dozen analytics teams each want the data shaped differently and their needs keep changing, so we cannot lock every transformation in before the data is stored. Design a pipeline that lands the raw events cheaply and rebuilds the curated tables those teams read on a daily schedule, gating the publish on a quality check so bad data never reaches them and paging on-call when a run is late or fails.

## Worked solution and explanation

### Why this problem exists in real interviews

This is an ELT-versus-ETL question wearing a scale costume. The tell is one line in the prompt: a dozen teams want the data shaped differently and their needs keep changing. The trap is the reflex ETL design that transforms events on the way in and stores only the shaped result. It looks clean until the first team asks for a field you dropped, and now you are re-ingesting 8 TB a day of history you no longer have in raw form. Land raw first and that request is a reprocessing job, not an incident.

The second trap is reaching for streaming because 10 billion events sounds like a firehose. Nobody here needs sub-minute data; the analysts work off yesterday. A daily batch over cheaply-stored raw events is both cheaper and simpler, and choosing it on purpose is what separates a candidate who designs for the requirement from one who defaults to the fanciest tool.

---

### Walk the requirements

#### Step 1: Land the raw events untouched and cheap

Ingest the full event stream into low-cost object storage as-is, partitioned by date. No shaping, no dropping fields. This is the L-before-T: storage is cheap, re-collection is impossible. When a team changes its mind or a schema drifts, the answer already sits in the landing zone waiting to be reprocessed.

#### Step 2: Transform in batch, on a daily cadence

A Spark job on Databricks reads yesterday's raw partition and builds the curated tables. Batch fits because the freshness target is T+1; a streaming design would cost far more to serve a need no consumer has. The transformation is where field mapping, sessionization, and de-duplication happen, downstream of storage where they are cheap to change.

#### Step 3: Publish curated tables the teams share

Write the shaped output into a lakehouse layer that every analytics team queries. Because the raw source is decoupled, one team's new column or reshaped table does not force a re-ingest and does not break another team. The curated layer is the contract; the raw zone is the safety net behind it.

#### Step 4: Gate quality before anyone reads

Run validation between the transformation and the published tables: row-count sanity, null and schema checks, partition completeness. If a day is malformed, the gate holds the publish and alerts, so the analysts are never the ones who discover bad data. Orchestration owns the whole daily chain and pages when the run is late or fails.

---

### The shape that fits

**The daily ELT run as an orchestrated DAG**

```python
# Airflow DAG: land-first, transform-late, gate-before-publish.
# The order of the >> chain IS the architecture: raw storage is
# upstream of the transform, and the quality gate sits between the
# transform and the curated publish so bad partitions never reach teams.
from airflow import DAG
from airflow.operators.python import PythonOperator, ShortCircuitOperator
from datetime import datetime

def transform_day(ds, **_):
    # Read ONLY yesterday's raw partition, not the whole lake, so
    # compute stays proportional to one day (~8 TB) not all history.
    raw = spark.read.json(f"s3://events-raw/dt={ds}/")
    curated = (
        raw
        .transform(map_known_fields)     # tolerate unknown fields on drift
        .transform(sessionize)
        .dropDuplicates(["event_id"])    # idempotent re-runs
    )
    # Stage to a side path; publish only after the gate passes.
    curated.write.format("delta").mode("overwrite").save(f"s3://curated-staging/dt={ds}/")

def quality_gate(ds, **_):
    staged = spark.read.format("delta").load(f"s3://curated-staging/dt={ds}/")
    checks = [
        staged.count() > 0,                                  # partition not empty
        staged.filter("user_id IS NULL").count() == 0,       # required field present
        set(EXPECTED_COLS).issubset(staged.columns),         # schema not broken
    ]
    return all(checks)  # ShortCircuit: False halts the DAG and alerts

def publish(ds, **_):
    spark.read.format("delta").load(f"s3://curated-staging/dt={ds}/") \
        .write.format("delta").mode("overwrite") \
        .partitionBy("dt").save("s3://curated/engagement/")

with DAG(
    "engagement_daily_elt",
    schedule="0 3 * * *",          # T+1 morning delivery
    start_date=datetime(2026, 1, 1),
    catchup=True,                  # backfill reprocesses raw into new shapes
    default_args={"retries": 2, "on_failure_callback": page_oncall},
) as dag:
    t = PythonOperator(task_id="transform", python_callable=transform_day)
    g = ShortCircuitOperator(task_id="quality_gate", python_callable=quality_gate)
    p = PythonOperator(task_id="publish", python_callable=publish)
    t >> g >> p
```

*Ingestion lands raw JSON into s3://events-raw independently; this DAG reads that raw partition, transforms in Spark, and publishes to the curated Delta layer only after the quality gate passes. catchup=True is the ELT payoff: reprocessing history into a new table shape is a backfill, not a re-ingest.*

> **Trick to Solving**
>
> Store raw, transform late. The single decision that cracks this problem is putting the T after the L: land every event unshaped, then build curated tables from storage. Everything the prompt worries about, changing team needs and drifting schemas, becomes a cheap reprocessing job instead of a re-ingestion emergency.

> **Interviewers Watch For**
>
> A candidate who says out loud why ELT beats ETL here, and who justifies batch over streaming against the T+1 freshness fact instead of defaulting to real-time. Bonus signal: naming date partitioning and file sizing on the raw zone so the daily Spark read scans a slice, not the whole lake.

> **Common Pitfall**
>
> Transforming on ingest and keeping only the shaped output. It passes the first demo and fails the first new requirement: a team asks for a field you discarded and there is no raw history to rebuild from. The other pitfall is streaming the whole 10B/day firehose for a report nobody reads before morning, paying real-time cost for batch value.

> **Scale + Cost**
>
> Roughly 10 billion events and 8 TB of raw JSON per day. Cost concentrates in two places: object storage for the raw zone (cheap per TB, so keeping raw is affordable) and the daily Spark cluster. Partition the raw zone by date and compact into right-sized files so the transform reads yesterday's partition only; that keeps the daily job's compute proportional to one day, not the whole history.

---

## Common follow-up questions

- A team asks for a metric that needs a field you never mapped into the curated tables, but which is present in the raw events. What do you do? _(Tests whether the candidate sees the payoff of ELT: reprocess the raw landing zone into a new or extended curated table without touching ingestion or the source.)_
- Event schemas drift weekly as the app ships features. How does the pipeline avoid breaking when a new field appears or an old one is deprecated? _(Tests schema-drift handling: raw preserves everything as received, the transform tolerates unknown fields, and a schema check in the quality gate flags breaking changes before curated tables are rebuilt.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/disappearing-ink-shape-comes-later)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.