Schema Evolution and Late Data: Beginner

A subscription analytics company at Series C scale ran a nightly job that aggregated checkout events from a Kafka topic into a Snowflake fact table. On a Tuesday afternoon a backend engineer added one new field to the checkout event, called promo_code, to support a new marketing experiment. The deploy went out at 3pm. By 6am Wednesday the nightly job had crashed in production with a schema mismatch error. The data team spent the morning patching the loader.

The same week, a separate pull request increased the upstream Kafka producer's buffer flush interval from 5 seconds to 30 seconds. Three days later analysts noticed the daily revenue numbers were quietly understating Tuesday by about 1.4 percent because some events stamped late Tuesday night had not arrived in time for the batch run that computed Tuesday's totals. Two unrelated changes, two production headaches, both rooted in the same fact: the data flowing through a pipeline is never as stable as the diagram suggests. Schemas drift. Events arrive out of order. This lesson explains why both happen and what the simplest defenses look like.

The Producer Added a Column Problem


Recognize the producer-added-a-column failure mode and name the three reactions a loader can have to an unexpected field.

Pipelines do not own the data flowing through them. The teams that produce events, write to operational databases, or push files into shared buckets own the upstream shape. Those teams ship code on their own cadence. Sooner or later, one of them adds a field, renames a column, or changes a type, and a pipeline that has been running fine for months suddenly fails. The producer-added-a-column problem is the most common variant of this story. It is so common that every senior data engineer has a personal version of it.

What Actually Breaks

| Symptom | Cause Upstream | Where It Surfaces |
| --- | --- | --- |
| Schema mismatch error on load | A new column appeared in the source payload | Loader rejects rows because the target table lacks the column |
| Silent type coercion | A field's type widened from int32 to int64 | Values truncate or wrap around without an error |
| Empty column downstream | A field was renamed and the pipeline still reads the old name | Dashboards show NULL for what used to be populated |
| Failed downstream join | A foreign key column was dropped | Joined fact table is missing rows or fans out incorrectly |

The shared trait of all four cases is that the change happened upstream, was not announced, and was not noticed until something broke. The fix is rarely a code change in the producer. The producer ships software and treats the schema as their own. The fix is teaching the pipeline to absorb the change, or to fail loudly enough that someone gets paged before customers do.

A Concrete Failure

    // Tuesday morning: the event the loader expects
    {
      "event_id": "evt_abc123",
      "user_id": 42,
      "checkout_amount_cents": 4999,
      "event_ts": "2026-04-25T10:14:02Z"
    }

    // Tuesday afternoon: a backend engineer ships this version
    {
      "event_id": "evt_abc124",
      "user_id": 42,
      "checkout_amount_cents": 4999,
      "event_ts": "2026-04-25T15:21:18Z",
      "promo_code": "SPRING20"
    }

The two payloads are nearly identical. The new one has one additional field. Whether that one extra field crashes the pipeline depends entirely on how the loader was written. A loader that does SELECT * INTO a strict schema will reject the new row. A loader that does INSERT INTO target(event_id, user_id, checkout_amount_cents, event_ts) VALUES(...) will silently drop the new field. A loader that lands the raw JSON in a single VARIANT column will accept both shapes without complaint, then surface the new field once a transform downstream chooses to read it.
The three reactions a loader can have to a new field:
  • Reject: refuse to process rows with the unexpected field, fail loudly
  • Silent drop: ignore the new field, lose the data, no error raised
  • Accept: store the field in a schemaless column, decide later what to do with it
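
The three reactions can be sketched side by side. This is a minimal illustration in Python, not any particular loader's implementation: the function names and the EXPECTED_FIELDS tuple are hypothetical, and the "table" is just a dictionary.

```python
import json

# Hypothetical column list for the target table (an assumption for this sketch).
EXPECTED_FIELDS = ("event_id", "user_id", "checkout_amount_cents", "event_ts")


def load_reject(event: dict) -> dict:
    """Strict loader: raise if the payload has any field the table lacks."""
    unexpected = sorted(set(event) - set(EXPECTED_FIELDS))
    if unexpected:
        raise ValueError(f"schema mismatch, unexpected fields: {unexpected}")
    return {name: event[name] for name in EXPECTED_FIELDS}


def load_silent_drop(event: dict) -> dict:
    """Named-column loader: keep known fields, discard the rest without notice."""
    return {name: event.get(name) for name in EXPECTED_FIELDS}


def load_accept(event: dict) -> dict:
    """Schemaless landing: store the whole payload as one JSON string."""
    return {"raw": json.dumps(event, sort_keys=True)}


new_event = {
    "event_id": "evt_abc124",
    "user_id": 42,
    "checkout_amount_cents": 4999,
    "event_ts": "2026-04-25T15:21:18Z",
    "promo_code": "SPRING20",
}
```

Calling load_reject(new_event) raises, load_silent_drop(new_event) returns a row with no promo_code, and load_accept(new_event) keeps the field inside the raw payload for a later transform to read.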

Why Producers Cannot Be Asked to Stop

A natural reaction is to ask the upstream team to coordinate every schema change with the data team. That works for a quarter, then breaks down. Producer teams have their own roadmaps. The number of teams producing data grows faster than the data team can attend coordination meetings. The pipeline has to be designed to survive ungated upstream change, not to prevent it. This shifts the question from political (who must coordinate with whom) to technical (what can the pipeline absorb, and what must it route to a human).

Schema drift is a property of the world, not a bug in the upstream team. A pipeline that assumes the schema is stable is a pipeline that has not yet broken in the way it inevitably will.

The right architectural response is to design the loader for tolerance and to push schema discipline back into the producer through automation rather than meetings. Tolerance at the loader means landing data in a flexible format like JSON, Avro with a schema registry, or a VARIANT column in Snowflake, then surfacing the new shape downstream once a transform decides what to do with it. Discipline through automation means a CI check on the producer's repository that registers schema changes against a registry and rejects pull requests whose schemas would break downstream. Together they let the pipeline absorb the additive case (new column, new value) automatically and surface the destructive case (rename, drop) loudly. Most schema conversations break down because the team conflates these two cases; the additive case does not need a meeting, and the destructive case needs a deliberate plan.

A well-instrumented pipeline emits a metric every time the loader encounters an unexpected field, an unexpected type, or a missing required field. The metric is a leading indicator. A spike in unexpected-field events on Tuesday afternoon is the signal that an upstream team shipped something. The data team does not have to wait for the dashboard to break to discover the change. The metric is the difference between learning about schema drift from telemetry and learning about it from an angry executive.
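
A drift metric of this kind might look like the following sketch. It assumes a hypothetical EXPECTED_TYPES contract for the checkout event and uses a plain Counter as a stand-in for a real metrics client; the metric names are made up for illustration.

```python
from collections import Counter

# Hypothetical contract for the checkout event (an assumption for this sketch).
EXPECTED_TYPES = {
    "event_id": str,
    "user_id": int,
    "checkout_amount_cents": int,
    "event_ts": str,
}

drift_counts = Counter()  # stand-in for a real metrics client


def record_drift(event: dict) -> None:
    """Increment one counter per drift signal found in the event."""
    for name in event:
        if name not in EXPECTED_TYPES:
            drift_counts[f"unexpected_field:{name}"] += 1
    for name, expected_type in EXPECTED_TYPES.items():
        if name not in event:
            drift_counts[f"missing_field:{name}"] += 1
        elif not isinstance(event[name], expected_type):
            drift_counts[f"unexpected_type:{name}"] += 1


# An event with a new field, then one with a wrong type and a missing field.
record_drift({"event_id": "evt_abc124", "user_id": 42,
              "checkout_amount_cents": 4999,
              "event_ts": "2026-04-25T15:21:18Z", "promo_code": "SPRING20"})
record_drift({"event_id": "evt_abc125", "user_id": "42",
              "checkout_amount_cents": 4999})
```

After these two calls the counter holds one unexpected_field:promo_code, one unexpected_type:user_id, and one missing_field:event_ts. An alert on any of the three going from zero to nonzero is the telemetry signal described above.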
  • The most common production schema break is an upstream team adding a column without coordination.
  • Loaders react to new fields in three ways: reject, silently drop, or accept and store.
  • The fix is in the pipeline's tolerance for change, not in the producer's discipline.

Reject (strict-schema loader): Refuses any row that does not match the registered schema. Loud and safe; appropriate for audit-grade pipelines that cannot tolerate silent drift.
Silent Drop (positional loader): Reads only the columns the loader was written to expect; ignores everything else. Convenient but dangerous because new fields are lost without notice.
Accept (schemaless landing): Stores the full payload in a flexible column (JSON, VARIANT, MAP). Downstream transforms decide when to read new fields. Most resilient to upstream change.

TIP
Before designing schema-handling logic, name the producer team and ask how they ship new fields. The answer often reveals which of the three reactions is the right default for that source.

Forward vs Backward Compatibility


Distinguish backward and forward compatibility, and classify a proposed schema change against both.

Two terms appear in nearly every conversation about schema change: backward compatible and forward compatible. They sound interchangeable. They are not. The distinction matters because it tells the producer and the consumer who can upgrade first without breaking the other. Confusing the two is a common source of schema-related production incidents in event-driven systems.

The Two Definitions, Plain

| Term | Plain Definition | Who It Protects |
| --- | --- | --- |
| Backward compatible | A new schema can read data written under the old schema | The consumer that upgrades first while old data is still in flight |
| Forward compatible | An old schema can read data written under the new schema | The consumer that has not yet upgraded when the producer ships first |
| Fully compatible | Both directions hold at the same time | Either side can upgrade first without coordination |

Backward compatibility says the new code reads the old data. Forward compatibility says the old code reads the new data. The two questions are independent. A schema change can be backward compatible, forward compatible, both, or neither. The fourth case, neither, is the dangerous one: it forces a coordinated deploy where producers and consumers all upgrade at the same instant, which is rarely possible at scale.
Backward Compatible Change
  • Adding a new optional field with a default value
  • Widening a numeric type from int32 to int64
  • Adding a new enum value at the end of a list
  • Old rows read fine after the new schema is deployed
Not Backward Compatible Change
  • Renaming a field that downstream code reads
  • Adding a required field with no default (old rows have nothing to fill it with)
  • Narrowing a type from string to fixed-length char(8)
  • Old rows fail or produce wrong results under the new schema

A Worked Pair

    // V1 schema: the producer ships this for six months
    { "user_id": 42, "plan": "pro" }

    // V2 schema: producer adds an optional field with a default
    { "user_id": 42, "plan": "pro", "trial_days_remaining": 0 }

    // V2 is backward compatible: the new schema reads V1 rows
    // (trial_days_remaining defaults to 0)
    // V2 is forward compatible: old V1 readers ignore the new field
    // (consumers that do not know about trial_days_remaining keep working)

The example above is the cleanest possible change. Adding an optional field with a default value is fully compatible: producer and consumer can upgrade in either order. This is why an enormous fraction of well-designed schema changes fall into the additive category. It is the only category where the two sides do not have to be coordinated.
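
Both directions of the V1/V2 pair can be exercised with two small reader functions. This is only a sketch: read_v1 and read_v2 are hypothetical names standing in for consumer code before and after the upgrade.

```python
def read_v1(record: dict) -> dict:
    """Old reader: knows only user_id and plan, ignores anything else."""
    return {"user_id": record["user_id"], "plan": record["plan"]}


def read_v2(record: dict) -> dict:
    """New reader: fills trial_days_remaining with its default of 0 when absent."""
    return {
        "user_id": record["user_id"],
        "plan": record["plan"],
        "trial_days_remaining": record.get("trial_days_remaining", 0),
    }


v1_row = {"user_id": 42, "plan": "pro"}
v2_row = {"user_id": 42, "plan": "pro", "trial_days_remaining": 14}

backward = read_v2(v1_row)  # new reader, old data
forward = read_v1(v2_row)   # old reader, new data
```

Neither call fails: the new reader supplies the default for the old row, and the old reader never looks at the field it does not know about.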

When Compatibility Breaks

    // V1 schema in production for six months
    { "user_id": 42, "signup_date": "2025-09-14" }

    // V2 schema renames signup_date to created_at
    { "user_id": 42, "created_at": "2025-09-14" }

    // V2 is NOT backward compatible: a V2 reader looking for created_at
    // finds no such field in V1 rows that are still in flight
    // V2 is NOT forward compatible: any V1 consumer expecting
    // signup_date sees the new payload as a missing field
    // This rename forces a coordinated deploy across producer and ALL consumers.

The compatibility cheat sheet:
  • Add an optional field with default: fully compatible, safest change
  • Add a required field with no default: backward incompatible (new readers cannot fill it in for rows from old writers)
  • Rename a field: incompatible in both directions, forces coordinated deploy
  • Remove a field: forward incompatible (old readers may still expect it)
  • Change a type: usually incompatible, depends on the type system
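
The field-level rows of the cheat sheet can be checked mechanically. The sketch below uses a simplified rule loosely modeled on how Avro-style readers resolve fields: a field the reader wants but the writer never wrote is readable only if the reader declares a default, and extra writer fields are ignored. It deliberately ignores type changes, and the schema representation (field name mapped to "has a default") is an assumption made for brevity.

```python
def can_read(reader: dict, writer: dict) -> bool:
    """True if the reader schema can decode data written with the writer schema.

    Each schema maps field name -> whether the field declares a default.
    """
    return all(has_default or name in writer
               for name, has_default in reader.items())


def classify(old: dict, new: dict) -> str:
    backward = can_read(reader=new, writer=old)  # new code reads old data
    forward = can_read(reader=old, writer=new)   # old code reads new data
    if backward and forward:
        return "full"
    if backward:
        return "backward only"
    if forward:
        return "forward only"
    return "neither"


v1 = {"user_id": False, "signup_date": False}
add_optional = {**v1, "promo_code": True}
add_required = {**v1, "promo_code": False}
rename = {"user_id": False, "created_at": False}
remove = {"user_id": False}

results = {
    "add optional": classify(v1, add_optional),
    "add required": classify(v1, add_required),
    "rename": classify(v1, rename),
    "remove": classify(v1, remove),
}
```

Under these assumptions the four changes classify as full, forward only, neither, and backward only, which lines up with the list above.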

Most schema registries enforce one of these modes by default. Confluent Schema Registry, for example, ships with BACKWARD as the default compatibility check, which means a consumer using the new schema must be able to read data written with the previously registered schema. Other modes include FORWARD, FULL, and NONE. The choice tells the team which deploy ordering is safe: under BACKWARD, consumers upgrade first; under FORWARD, producers do.

Most production systems care about backward compatibility most of the time, because stored and in-flight data outlives the code that wrote it. A consumer that upgrades must still read events written under the old schema that are still sitting in the topic or the warehouse; in a system where messages live for hours or days, that backward case comes up on every deploy. Forward compatibility covers the opposite ordering: a producer that ships before its consumers have upgraded, or a consumer rolled back to an earlier version after the producer has already moved forward. The two questions look symmetric on paper. In production, backward is the daily concern and forward is the safety net for producer-first deploys and rollbacks.
TIP
When proposing a schema change, name which compatibility mode it preserves. If the answer is neither, the change requires a coordinated deploy and that coordination cost should be weighed against the benefit.
[Diagram: Source (producer) -> compat check (registry) -> Kafka (topic) -> Transform (consumer) -> Storage (warehouse)]

Schema evolution stays safe when a compatibility check sits between producer and topic: adding a column is backward-compatible and passes; a rename or type change is breaking and is caught before it reaches consumers.

Adding Is Safe, Renaming Is Not


Classify a proposed schema change as additive, destructive, or coordinated, and name the safe rename pattern.

The compatibility framework above implies a practical rule that holds for almost every real-world schema change. Adding things is usually safe. Removing or renaming things is almost never safe without coordination. This is not a deep theoretical claim. It is an observation about the asymmetry between adding new information and removing or relabeling information that downstream code already depends on. The asymmetry is so reliable that it shows up as a default in serialization formats, in version control conventions, and in API design across the industry. Avro and Protobuf both make additive change effortless and destructive change deliberate. REST API versioning conventions allow new fields in any response without bumping the version number; removing an existing field requires a major version bump. The same shape applies inside data pipelines.

Why Adding Is the Safe Default

When a producer adds a new optional column with a default value, every existing consumer continues to function. Code that did not know about the column never asked for it. Code that did know about it gets the default for old rows. No downstream join breaks. No transform sees a NULL where a value used to be. The data team can adopt the new column on its own schedule, building dashboards and aggregations against it once the value of the new field is understood.

Why Renaming Is the Dangerous Default

Renaming a column does the opposite. The rename is invisible to the producer once the deploy is finished, but every consumer that read the old name now reads NULL or fails. The fix requires every consumer to deploy a code change that points at the new name, and the producer must wait for all consumers before retiring the old name. In a system with dozens of downstream consumers, this synchronization is hard to organize and easy to mishandle. Whole pipelines have been broken for days because a rename happened on Friday and the on-call engineer assumed the upstream team had coordinated with everyone.

A close cousin of the rename is the repurposed column, which produces a class of bug that is exceptionally hard to find: a column that exists, contains data, but means something different than it used to. A column called amount that originally tracked gross revenue and silently became net revenue is far more dangerous than a column that disappeared entirely. The disappeared column produces a NULL that someone notices. The repurposed column produces wrong numbers that look right.

| Change | Risk Level | Typical Outcome If Unannounced |
| --- | --- | --- |
| Add an optional column | Low | Pipelines keep working; new column is ignored until adopted |
| Add a required column with default | Low | Old rows get the default; new rows carry the value |
| Add an enum value at the end | Low to medium | Old readers may misclassify; new readers handle correctly |
| Widen a numeric type | Low | Old values still fit in the wider type |
| Drop a column | High | Any consumer reading that column produces NULL or fails |
| Rename a column | High | All consumers reading the old name break simultaneously |
| Narrow a type | High | Some old values no longer fit; data is silently truncated |
| Reorder positional fields | Critical | Type coincidences mask wrong data flowing into the wrong column |

The Stripe Example

Stripe's API documentation includes a long-standing convention: new fields can appear in any response without notice, but existing fields will not be removed or renamed within a major API version. That single convention is a large part of why thousands of integrations against Stripe rarely break. The producer commits to additive change as the default. Anything destructive happens through versioned endpoints that consumers opt into when they are ready. The same pattern applies inside a company between data producer teams and data pipelines: additive is free, destructive is expensive.
The additive default in practice:
  • New fields are added; existing fields are never repurposed
  • Removals happen only after a deprecation window with deprecation flags in the schema
  • Renames are treated as a remove plus an add, with both fields populated during the migration window
  • Type changes happen by adding a new column with the new type and migrating off the old one
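
The "remove plus add, with both fields populated" rule for renames can be sketched from both sides of the contract. The function names here are hypothetical, and the example assumes the signup_date to created_at rename used earlier in this lesson.

```python
def produce_during_migration(user_id: int, created: str) -> dict:
    """Producer in the migration window: write the old and the new name."""
    return {"user_id": user_id, "signup_date": created, "created_at": created}


def read_created_at(record: dict):
    """Migrated consumer: prefer created_at, fall back to signup_date."""
    if record.get("created_at") is not None:
        return record["created_at"]
    return record.get("signup_date")


# Written before the migration started.
old_row = {"user_id": 42, "signup_date": "2025-09-14"}
# Written during the migration window, with both names populated.
dual_row = produce_during_migration(43, "2025-09-15")
# Written after signup_date has been retired.
final_row = {"user_id": 44, "created_at": "2025-09-16"}

dates = [read_created_at(row) for row in (old_row, dual_row, final_row)]
```

A consumer that has not migrated yet still finds signup_date on every row written during the window, which is what lets each team move on its own schedule.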

Additive Change (the safe default): New optional fields with default values, new enum values at the end, widened numeric types. Producers ship without coordination; consumers adopt at their own pace.
Destructive Change (requires coordination): Drops, renames, narrowed types. Forces every consumer to upgrade. In well-run systems these go through a deprecation window with both old and new shapes available.

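    -- Illustrative statements, not one script: the in-place rename and the
    -- add/backfill/drop sequence below it are alternatives.

    -- Additive: safe to ship without coordination
    ALTER TABLE events ADD COLUMN promo_code VARCHAR DEFAULT NULL;

    -- Destructive: breaks every consumer that reads the old column
    ALTER TABLE events DROP COLUMN signup_source;
    ALTER TABLE events RENAME COLUMN signup_date TO created_at;

    -- Safe rename: add the new column, backfill it, drop the old one last
    ALTER TABLE events ADD COLUMN created_at TIMESTAMP;
    UPDATE events SET created_at = signup_date
    WHERE created_at IS NULL;
    ALTER TABLE events DROP COLUMN signup_date;
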
The case against destructive change is rarely about technical impossibility. It is about who pays the cost. An additive change places the cost on the producer team, who write the new column and the migration. A destructive change places the cost on every consumer team, who must change their code to keep working. A producer team that ships destructive changes to save themselves work is externalizing the cost onto the rest of the organization. Mature data platforms treat this externalization the same way they treat any other cross-team coupling: name it, price it, and prefer the path that keeps each team accountable for the cost of its own decisions.
Do
  • Default to additive changes for any cross-team schema
  • Treat renames as add-then-drop with a deprecation window
  • Document each new field's meaning before downstream teams start using it
Don't
  • Drop or rename a column without naming every downstream consumer first
  • Reorder positional fields in serialized formats; the cost is invisible breakage
  • Narrow a type in place; add a new column with the narrower type instead

    /* A demonstration of the safe rename pattern */
    /* Phase 1: add the new column without dropping the old */
    WITH events AS (
      SELECT
        1 AS user_id,
        DATE '2025-09-14' AS signup_date,
        CAST(NULL AS TIMESTAMP) AS created_at

      UNION ALL

      SELECT
        2,
        DATE '2025-09-15',
        CAST(NULL AS TIMESTAMP)

      UNION ALL

      SELECT
        3,
        DATE '2025-09-16',
        CAST(NULL AS TIMESTAMP)
    )

    SELECT
      user_id,
      signup_date,
      COALESCE(
        created_at,
        CAST(signup_date AS TIMESTAMP)
      ) AS created_at
    FROM events
    ORDER BY user_id

What Late Data Means


Recognize late data, distinguish event time from processing time, and name the three reactions a pipeline can have to a late event.

Schema drift is one half of the lesson. The other half is late data. Events do not always arrive in the order they were produced. A click happens on a phone with patchy reception on Tuesday morning, the SDK queues the event locally, and the event is uploaded Thursday afternoon when the phone reconnects to wifi. The event is timestamped Tuesday. It arrives Thursday. Every batch and streaming system in the industry has to decide what to do with that event.

The Two Timestamps That Matter

| Timestamp | What It Records | Owned By |
| --- | --- | --- |
| Event time | When the event actually happened in the world | The producer (or the user's device) |
| Ingestion time | When the pipeline first received the event | The pipeline |
| Processing time | When the pipeline acted on the event in a transform | The processing engine |

These three timestamps are usually within seconds of each other. When they diverge, the gap is what people call lateness. An event with event_time = Tuesday 09:14 and ingestion_time = Thursday 14:27 is more than two days late. The pipeline that aggregates daily totals has a choice: include this Tuesday event in the Tuesday bucket (correct, but it changes Tuesday's number after the fact), or include it in the Thursday bucket (wrong, but does not change history), or drop it (wrong, but keeps the math simple). All three choices show up in production systems, and each can be the right trade-off for a particular consumer.

The choice is not just an engineering decision. It is a contract with whoever reads the output. A finance dashboard that reports daily revenue must use event time to satisfy the audit; a fraud-detection system reading a real-time feed may prefer processing time because it operates on what the engine actually saw. Naming the consumer first, then the time domain, prevents the most common kind of disagreement between data team and stakeholder.

Why Lateness Happens

  • Mobile SDKs queue events offline and flush when connectivity returns; this is typically the largest single source of lateness.
  • Producer outages cause backlog; when the producer comes back online it ships hours of accumulated events at once.
  • Producers that batch events for efficiency (e.g., flushing every 100 ms) add delay by design; network congestion can stretch that flush delay to minutes or hours, and it compounds with downstream lag.
  • Downstream queue backlog: an event arrives at the broker on time but the consumer does not pull it for hours.

A Concrete Late Event

    // An event captured by the SDK on a phone that lost connectivity
    {
      "event_id": "evt_99x",
      "event_type": "checkout_completed",
      "user_id": 4271,
      "event_time": "2026-04-21T09:14:02Z",      // Tuesday morning
      "ingestion_time": "2026-04-23T14:27:51Z"   // Thursday afternoon
    }

    // The lateness: 2 days, 5 hours, 13 minutes, 49 seconds.
    // The Tuesday daily-revenue dashboard finished computing at 09:00 Wednesday.
    // At that moment, this event had not arrived. The dashboard understated Tuesday.

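
The lateness quoted in the comment above is simply ingestion time minus event time. A short sketch with Python's standard library shows the arithmetic; parse_utc is a hypothetical helper, not part of any SDK.

```python
from datetime import datetime


def parse_utc(ts: str) -> datetime:
    # The trailing "Z" is swapped for an explicit offset so that
    # datetime.fromisoformat also accepts it on Python versions before 3.11.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


event_time = parse_utc("2026-04-21T09:14:02Z")
ingestion_time = parse_utc("2026-04-23T14:27:51Z")

lateness = ingestion_time - event_time
event_bucket = event_time.date().isoformat()       # bucket by event time
arrival_bucket = ingestion_time.date().isoformat()  # bucket by arrival

print(lateness)  # 2 days, 5:13:49
```

The two bucket values differ (2026-04-21 versus 2026-04-23), which is exactly the decision the pipeline has to make for this event.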
The three things a pipeline can do with a late event:
  • Include it in its event-time bucket and update history (correct, but mutates yesterday)
  • Include it in the bucket of when it arrived (wrong, but cheap and stable)
  • Drop it (wrong, but predictable and bounded if lateness is rare)
Most batch pipelines silently choose option two or three because they were never explicitly designed to handle lateness. The system runs on processing time and does not look back. The cost is a quiet undercount in every reporting period that contains late events. For most consumer-facing dashboards this undercount is on the order of a percent or two and goes unnoticed. For financial reporting, billing, or any audit-relevant table, even a small undercount is a serious bug.
Event Time
  • Records when the event happened in reality
  • Stable property of the event
  • Required for accurate historical reports
  • Forces the pipeline to handle out-of-order arrivals
Processing Time
  • Records when the pipeline saw the event
  • Depends on broker and consumer lag
  • Easy to compute, monotonic by construction
  • Drifts from event time when the system is under load
Aggregating by Event Time
  • Tuesday late event lands in Tuesday's bucket
  • Reruns produce the same totals over time
  • Reflects what happened in the world
  • Forces idempotent writes and partition overwrites
Aggregating by Processing Time
  • Tuesday late event lands in Thursday's bucket
  • Reruns can produce different totals as lag changes
  • Reflects what the engine saw, not what occurred
  • Cheap to implement; misleading to read

    /* Show the cost of bucketing by ingestion time when lateness is real */
    WITH evts AS (
      SELECT
        'a' AS id,
        DATE '2026-04-21' AS event_date,
        DATE '2026-04-21' AS ingest_date,
        100 AS amount

      UNION ALL

      SELECT
        'b',
        DATE '2026-04-21',
        DATE '2026-04-21',
        200

      UNION ALL

      SELECT
        'c',
        DATE '2026-04-21',
        DATE '2026-04-23',
        50 /* 2-day-late event */

      UNION ALL

      SELECT
        'd',
        DATE '2026-04-22',
        DATE '2026-04-22',
        75
    )

    SELECT
      'by event_date' AS bucket_key,
      event_date AS bucket,
      SUM(amount) AS total
    FROM evts
    GROUP BY event_date

    UNION ALL

    SELECT
      'by ingest_date' AS bucket_key,
      ingest_date AS bucket,
      SUM(amount) AS total
    FROM evts
    GROUP BY ingest_date
    ORDER BY bucket_key, bucket

Lateness is not an exception condition. It is the expected behavior of any distributed system that includes mobile clients, intermittent networks, or producer batching. The question is not whether late data appears, but how the pipeline handles it.

The cost of getting lateness wrong is rarely visible on the day the bug is introduced. A pipeline that quietly drops late events looks correct in side-by-side checks during testing, because tests rarely include events that arrive on day three. The bug surfaces months later, when an analyst notices that Tuesday's revenue is consistently 1.2 percent below an external source of truth, or when a finance team reconciles a quarter-end and finds a small but persistent gap. The lag between cause and effect is what makes lateness bugs expensive. By the time the bug is found, the wrong numbers have shipped to customers, regulators, or executives, and unwinding the historical record requires explaining what was missed and why.
TIP
Before designing any aggregation, ask which timestamp the consumer cares about. Marketing reports almost always want event time. Operational metrics about pipeline health almost always want processing time. The two questions look identical and are not.

Late Data: Rerun Last 7 Days


Apply a daily-rerun window as the simplest workable defense against late data in a batch pipeline.

The simplest workable fix for late data in a batch pipeline is also the most common: every day, do not compute today alone; also recompute the last several days. The size of the window depends on how late events tend to arrive. Seven days is a typical default because it covers nearly all mobile SDK retry tail behavior without making the daily run prohibitively expensive.

Why a Rerun Window Works

If today's run also recomputes the last seven days, then any event whose event_time was within the last seven days and whose ingestion_time is today will be picked up in the right bucket. The dashboard for last Tuesday gets corrected today. The cost is computational: instead of processing one day's events, the pipeline processes eight (today plus the seven before it). The benefit is that history corrects itself within the window. As long as the pipeline is idempotent (running it twice produces the same answer), nothing breaks.

| Window Size | Catches | Misses | Daily Compute Cost (approx.) |
| --- | --- | --- | --- |
| 1 day (today only) | Events arriving the same day they were produced | Anything that arrives after that day's run | 1x baseline |
| 3 days | Most mobile retry traffic and brief producer outages | Long tail of stuck SDKs and weekend outages | 3x baseline |
| 7 days | Nearly all real-world lateness in event-driven systems | Multi-week outages and audit-grade backfills | 7x baseline |
| 30 days | Even uncommon long-tail late arrivals | Truly unbounded lateness | 30x baseline; rarely worth it for routine running |
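
Which partitions a given run rewrites, and therefore which late events it can still correct, follows directly from the run date and the lookback. The sketch below assumes the convention "today plus the N days before it"; the function names are hypothetical.

```python
from datetime import date, timedelta


def partitions_to_recompute(run_date: date, lookback_days: int = 7) -> list:
    """Event-date partitions one run rewrites: run_date plus the lookback days before it."""
    return [run_date - timedelta(days=offset)
            for offset in range(lookback_days, -1, -1)]


def is_corrected(event_date: date, run_date: date, lookback_days: int = 7) -> bool:
    """True if a late event for event_date lands in a partition this run rewrites."""
    return event_date in partitions_to_recompute(run_date, lookback_days)


run = date(2026, 4, 25)
window = partitions_to_recompute(run)

four_days_late = is_corrected(date(2026, 4, 21), run)   # inside the window
nine_days_late = is_corrected(date(2026, 4, 16), run)   # outside the window
```

With a seven-day lookback the run on April 25 rewrites eight partitions, April 18 through April 25. An event for April 21 is absorbed; one for April 16 is not.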

The Pattern in Code

    -- Hive/Spark-style dynamic-partition overwrite (assumed dialect);
    -- the partition column goes last in the SELECT list
    INSERT OVERWRITE TABLE analytics.daily_revenue PARTITION(event_date)
    SELECT
      SUM(amount_cents) AS revenue_cents,
      COUNT(DISTINCT user_id) AS paying_users,
      DATE(event_time) AS event_date
    FROM raw.checkout_events
    WHERE event_time >= CURRENT_DATE - INTERVAL '7 days'
      AND event_time < CURRENT_DATE + INTERVAL '1 day'
    GROUP BY DATE(event_time);

The query reads the last seven days of raw events, aggregates them by event_time, and overwrites the corresponding daily partitions. Today's partition is computed for the first time. Yesterday's, the day before's, and so on are recomputed. If a late event from five days ago arrived between the last run and this one, it now appears in the correct day's totals. The INSERT OVERWRITE clause makes the operation idempotent at the partition level.

    /* The same pattern, applied to a small example dataset */
    WITH raw AS (
      SELECT
        1 AS user_id,
        DATE '2026-04-20' AS event_date,
        4999 AS amount_cents

      UNION ALL

      SELECT
        2,
        DATE '2026-04-21',
        1999

      UNION ALL

      SELECT
        3,
        DATE '2026-04-21',
        2999

      UNION ALL

      SELECT
        4,
        DATE '2026-04-22',
        999

      UNION ALL

      SELECT
        5,
        DATE '2026-04-23',
        4999

      UNION ALL

      /* A late event for April 21 that arrived on April 25 */
      SELECT
        6,
        DATE '2026-04-21',
        3499
    )

    SELECT
      event_date,
      SUM(amount_cents) AS revenue_cents,
      COUNT(*) AS event_count
    FROM raw
    WHERE event_date >= DATE '2026-04-19'
    GROUP BY event_date
    ORDER BY event_date

April 21 now reflects three events totaling 8,497 cents. The late event for April 21 was absorbed because the rerun window included it. Without the window, April 21's number would have been frozen at 4,998 cents and the late event would either have landed in April 25's bucket or been dropped entirely. The behavior is what consumers expect from a daily report, even though most of them have never thought about why the number changes between Wednesday's view and Thursday's view of the same Tuesday. The pattern teaches the consumer's eye to expect minor backfill of the most recent few days; the data team learns to expect that backfill to be the late-data tail showing up; and the system as a whole stays calibrated without anyone needing to explain what is happening on a given day.
What the rerun window requires:
  • An idempotent write: re-running must produce the same answer, not duplicates
  • Partition-level overwrite (not append): yesterday's row gets replaced, not added to
  • Source data retained for at least the window length: raw events from 7 days ago must still be queryable
  • Compute budget that absorbs the window: 7 days of work, every day
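
The first two requirements are easiest to see by running the same rerun twice under both write strategies. This is a toy sketch with an in-memory dictionary standing in for the partitioned table; rerun_overwrite and rerun_append are hypothetical names, and the amounts reuse the April 21 figures from the example above.

```python
from collections import defaultdict


def aggregate(raw_events: list) -> dict:
    """Sum amount_cents per event_date from the retained raw events."""
    totals = defaultdict(int)
    for event in raw_events:
        totals[event["event_date"]] += event["amount_cents"]
    return dict(totals)


def rerun_overwrite(table: dict, raw_events: list) -> None:
    """Replace each recomputed partition with its fresh total."""
    table.update(aggregate(raw_events))


def rerun_append(table: dict, raw_events: list) -> None:
    """Anti-pattern: add fresh totals on top of what is already stored."""
    for event_date, total in aggregate(raw_events).items():
        table[event_date] = table.get(event_date, 0) + total


raw = [
    {"event_date": "2026-04-21", "amount_cents": 1999},
    {"event_date": "2026-04-21", "amount_cents": 2999},
]
late = {"event_date": "2026-04-21", "amount_cents": 3499}

overwritten, appended = {}, {}
rerun_overwrite(overwritten, raw)  # first run, before the late event exists
rerun_append(appended, raw)

raw.append(late)                   # the late event lands in raw storage
for _ in range(2):                 # two more daily reruns over the same window
    rerun_overwrite(overwritten, raw)
    rerun_append(appended, raw)
```

The overwrite table settles at 8,497 cents for April 21 no matter how many times the rerun fires. The append table keeps growing with every rerun, which is the double-counting that partition overwrite exists to prevent.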

When the Window Is Not Enough

A 7-day window catches the bulk of lateness in a healthy system but does not catch everything. A producer that goes offline for nine days will produce events that arrive outside the window and never reach their correct event-time bucket. The next-tier fixes (watermarks, allowed lateness in streaming, reconciliation passes) appear in the intermediate and advanced tiers. The point of this section is that the cheapest correct answer for many batch pipelines is not exotic infrastructure. It is a window.
The rerun window has a second-order benefit beyond catching late events: it acts as a safety net for any pipeline bug discovered within the window. If a transform was deployed on Monday with a subtle bug that produced incorrect totals, the bug is caught on Wednesday, and the fix is deployed on Wednesday afternoon, the rerun on Thursday morning will recompute Monday and Tuesday with the corrected logic. The historical record self-heals as long as the bug is found within the window. This property turns the rerun from a late-data fix into a general-purpose freshness guarantee: the pipeline's output for any day older than the window is final; output within the window is provisional and may shift as late events and bug fixes propagate.
Do
  • Pick a window size based on observed lateness, not on a hunch
  • Make the rerun idempotent at the partition level: overwrite, not append
  • Retain raw source data for at least the window length
Don't
  • Append late events to the bucket of when they arrived; this hides the late-data problem
  • Skip the rerun on weekends; producer outages happen on weekends too
  • Set the window to one day and assume the pipeline handles late data; it does not
The rerun window pattern is also the foundation that makes more sophisticated late-data designs possible. The streaming watermark concept covered in the next tier and the reconciliation pass concept covered in the advanced tier both inherit the same property: the streaming output is provisional within the lateness window, and a separate mechanism corrects it afterward. The seven-day rerun is the batch-only version of this idea, accessible without any streaming infrastructure. Many production systems run nothing more sophisticated than this and meet their accuracy targets, because the long tail of lateness past seven days is small enough to ignore for the consumer. Recognizing when the simple version is enough, and when the workload genuinely requires the streaming-plus-reconciliation composite, is one of the judgment calls that separates a senior data engineer from a junior one.
TIP
Measure the lateness distribution from production data before picking a window. If 99 percent of events arrive within four hours, a 24-hour window is overkill. If 5 percent still arrive after three days, the tail is heavy enough that even a 7-day window may leave money on the table.
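One way to run that measurement, sketched under assumptions: each event record carries both an `event_time` and an `arrival_time` (hypothetical field names), and the nearest-rank percentile is accurate enough for sizing a window. The helper names `lateness_percentile` and `window_days_for` are illustrative only.

```python
import math
from datetime import datetime, timedelta


def lateness_percentile(events, pct):
    """Return the lateness (arrival_time - event_time) at percentile `pct`.

    Uses the nearest-rank method on the sorted delays.
    """
    delays = sorted(e["arrival_time"] - e["event_time"] for e in events)
    if not delays:
        raise ValueError("no events to measure")
    rank = max(1, math.ceil(pct * len(delays) / 100))
    return delays[rank - 1]


def window_days_for(events, pct=99.0):
    """Smallest whole number of days that covers `pct` percent of events."""
    delay = lateness_percentile(events, pct)
    return max(1, math.ceil(delay / timedelta(days=1)))
```

Running `window_days_for` at a few percentiles (for example 95, 99, and 99.9) gives a rough picture of how much accuracy each extra day of window buys.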
PUTTING IT ALL TOGETHER

> A subscription company runs a nightly batch pipeline that aggregates checkout events from a Kafka topic into a Snowflake daily revenue table. Two issues land in the same week. On Monday a backend engineer adds a promo_code field to the checkout event, and on Wednesday the on-call engineer notices that Tuesday's revenue total quietly grew by 1.4 percent overnight. The new data engineer is asked to design the simplest set of changes that keeps both problems from recurring.

The promo_code addition is an additive change. The loader should land the raw event in a flexible column and let downstream transforms decide when to read the new field, rather than rejecting any row that does not match a strict schema.
The 1.4 percent overnight growth is the rerun window doing its job. Late events from Tuesday arrived between the Tuesday-morning run and the Wednesday-morning run, and the Wednesday run picked them up because it recomputed the last several days, not yesterday alone.
The rerun depends on idempotency, the property covered in Lesson 5 (idempotency and backfill). Without partition overwrite, the same late event would be double-counted on every rerun day.
Both fixes share a frame: design the pipeline to absorb upstream change, not to prevent it. The producer team will keep adding fields and the SDK will keep flushing late events. The pipeline meets both with tolerance.
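The first fix, landing the raw event in a flexible column, might look like the sketch below. It is an illustration under assumptions, not the company's loader: the landing table is a Python list, the field names `event_id`, `event_time`, and `amount` are assumed to be stable keys, and `land_raw` and `to_fact_row` are hypothetical helpers. In Snowflake the `raw` column would typically be a VARIANT.

```python
import json


def land_raw(message_bytes, landing_rows):
    """Store the event as-is in a flexible `raw` column plus a few stable keys."""
    event = json.loads(message_bytes)
    landing_rows.append({
        "event_id": event["event_id"],      # assumed stable key
        "event_time": event["event_time"],  # assumed stable key
        "raw": event,                       # unknown fields ride along here
    })


def to_fact_row(landing_row):
    """Downstream transform: read the fields it knows, tolerate the rest."""
    raw = landing_row["raw"]
    return {
        "event_id": landing_row["event_id"],
        "event_time": landing_row["event_time"],
        "amount": raw["amount"],
        # Optional field: events produced before the change do not carry it.
        "promo_code": raw.get("promo_code"),
    }
```

The loader never inspects `promo_code`, so the producer can ship the field at any time; the transform starts reading it whenever the downstream team is ready.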
KEY TAKEAWAYS
Schemas drift because producer teams ship on their own cadence: the pipeline must absorb upstream change rather than prevent it.
Backward and forward compatibility are independent: a schema change can be one, both, or neither. Knowing which decides who can deploy first.
Additive changes are nearly always safe; renames and drops are nearly always dangerous: the cheap default is add-then-deprecate, with both shapes coexisting during a migration window.
Late data is the rule, not the exception: event time and processing time diverge whenever mobile clients, batched producers, or unreliable networks are involved.
A rerun window is the simplest correct answer for batch lateness: recompute the last seven days every day, overwrite by partition, and history corrects itself within the window.

Data shapes shift and events arrive out of order; pipelines must absorb both without breaking

Category: Pipeline Architecture
Difficulty: beginner
Duration: 25 minutes
Challenges: 0 hands-on challenges

Topics covered: The Producer Added a Column Problem, Forward vs Backward Compatibility, Adding Is Safe, Renaming Is Not, What Late Data Means, Late Data: Rerun Last 7 Days
