A subscription analytics company at Series C scale ran a nightly job that aggregated checkout events from a Kafka topic into a Snowflake fact table. On a Tuesday afternoon a backend engineer added one new field to the checkout event, called promo_code, to support a new marketing experiment. The deploy went out at 3pm. By 6am Wednesday the nightly job had crashed in production with a schema mismatch error. The data team spent the morning patching the loader. The same week, a separate pull request increased the upstream Kafka producer's buffer flush from 5 seconds to 30 seconds. Three days later analysts noticed the daily revenue numbers were quietly understating Tuesday by about 1.4 percent because some events stamped late Monday night had not arrived in time for Tuesday's batch run. Two unrelated changes, two production headaches, both rooted in the same fact: the data flowing through a pipeline is never as stable as the diagram suggests. Schemas drift. Events arrive out of order. This lesson is the picture of why both happen and what the simplest defenses look like.
The Producer Added a Column Problem
Daily Life
Interviews
Recognize the producer-added-a-column failure mode and name the three reactions a loader can have to an unexpected field.
Pipelines do not own the data flowing through them. The teams that produce events, write to operational databases, or push files into shared buckets own the upstream shape. Those teams ship code on their own cadence. Sooner or later, one of them adds a field, renames a column, or changes a type, and a pipeline that has been running fine for months suddenly fails. The producer-added-a-column problem is the most common variant of this story. It is so common that every senior data engineer has a personal version of it.
What Actually Breaks
Symptom
Cause Upstream
Where It Surfaces
Schema mismatch error on load
A new column appeared in the source payload
Loader rejects rows because the target table lacks the column
Silent type coercion
A field's type widened from int32 to int64
Values truncate or wrap around without an error
Empty column downstream
A field was renamed and the pipeline still reads the old name
Dashboards show NULL for what used to be populated
Failed downstream join
A foreign key column was dropped
Joined fact table is missing rows or fans out incorrectly
The shared trait of all four cases is that the change happened upstream, was not announced, and was not noticed until something broke. The fix is rarely a code change in the producer. The producer ships software and treats the schema as their own. The fix is teaching the pipeline to absorb the change, or to fail loudly enough that someone gets paged before customers do.
The two payloads are nearly identical. The new one has one additional field. Whether that one extra field crashes the pipeline depends entirely on how the loader was written. A loader that does SELECT * INTO a strict schema will reject the new row. A loader that does INSERT INTO target(event_id, user_id, checkout_amount_cents, event_ts) VALUES(...) will silently drop the new field. A loader that lands the raw JSON in a single VARIANT column will accept both shapes without complaint, then surface the new field once a transform downstream chooses to read it.
The three reactions a loader can have to a new field:
▸Reject: refuse to process rows with the unexpected field, fail loudly
▸Silent drop: ignore the new field, lose the data, no error raised
▸Accept: store the field in a schemaless column, decide later what to do with it
Why Producers Cannot Be Asked to Stop
A natural reaction is to ask the upstream team to coordinate every schema change with the data team. That works for a quarter, then breaks down. Producer teams have their own roadmaps. The number of teams producing data grows faster than the data team can attend coordination meetings. The pipeline has to be designed to survive ungated upstream change, not to prevent it. This shifts the question from political (who must coordinate with whom) to technical (what can the pipeline absorb, and what must it route to a human).
Schema drift is a property of the world, not a bug in the upstream team. A pipeline that assumes the schema is stable is a pipeline that has not yet broken in the way it inevitably will.
The right architectural response is to design the loader for tolerance and to push schema discipline back into the producer through automation rather than meetings. Tolerance at the loader means landing data in a flexible format like JSON, Avro with schema registry, or a VARIANT column in Snowflake, then surfacing the new shape downstream once a transform decides what to do with it. Discipline through automation means a CI check on the producer's repository that registers schema changes against a registry and rejects pull requests whose schemas would break downstream. Together they let the pipeline absorb the additive case (new column, new value) automatically and surface the destructive case (rename, drop) loudly. Most schema conversations break down because the team conflates these two responses; the additive case does not need a meeting, and the destructive case needs a deliberate plan.
A well-instrumented pipeline emits a metric every time the loader encounters an unexpected field, an unexpected type, or a missing required field. The metric is a leading indicator. A spike in unexpected-field events on Tuesday afternoon is the signal that an upstream team shipped something. The data team does not have to wait for the dashboard to break to discover the change. The metric is the difference between learning about schema drift from telemetry and learning about it from an angry executive.
The most common production schema break is an upstream team adding a column without coordination.
Loaders react to new fields in three ways: reject, silently drop, or accept and store.
The fix is in the pipeline's tolerance for change, not in the producer's discipline.
RejectSilent DropAccept
Reject
Strict-schema loader
Refuses any row that does not match the registered schema. Loud and safe; appropriate for audit-grade pipelines that cannot tolerate silent drift.
Silent Drop
Positional loader
Reads only the columns the loader was written to expect; ignores everything else. Convenient but dangerous because new fields are lost without notice.
Accept
Schemaless landing
Stores the full payload in a flexible column (JSON, VARIANT, MAP). Downstream transforms decide when to read new fields. Most resilient to upstream change.
TIP
Before designing schema-handling logic, name the producer team and ask how they ship new fields. The answer often reveals which of the three reactions is the right default for that source.
Forward vs Backward Compatibility
Daily Life
Interviews
Distinguish backward and forward compatibility, and classify a proposed schema change against both.
Two terms appear in nearly every conversation about schema change: backward compatible and forward compatible. They sound interchangeable. They are not. The distinction matters because it tells the producer and the consumer who can upgrade first without breaking the other. Confusing the two is the source of half the schema-related production incidents in event-driven systems.
The Two Definitions, Plain
Term
Plain Definition
Who It Protects
Backward compatible
A new schema can read data written under the old schema
The consumer that upgrades first while old data is still in flight
Forward compatible
An old schema can read data written under the new schema
The consumer that has not yet upgraded when the producer ships first
Full compatible
Both directions hold at the same time
Either side can upgrade first without coordination
Backward compatibility says the new code reads the old data. Forward compatibility says the old code reads the new data. The two questions are independent. A schema change can be backward compatible, forward compatible, both, or neither. The fourth case, neither, is the dangerous one: it forces a coordinated deploy where producers and consumers all upgrade at the same instant, which is rarely possible at scale.
✓Backward Compatible Change
Adding a new optional field with a default value
Widening a numeric type from int32 to int64
Adding a new enum value at the end of a list
Old rows read fine after the new schema is deployed
•Not Backward Compatible Change
Renaming a field that downstream code reads
Removing a field that downstream code references
Narrowing a type from string to fixed-length char(8)
Old rows fail or produce wrong results under the new schema
The example above is the cleanest possible change. Adding an optional field with a default value is full compatible: producer and consumer can upgrade in either order. This is why an enormous fraction of well-designed schema changes fall into the additive category. It is the only category where the two sides do not have to be coordinated.
▸Add an optional field with default: full compatible, safest change
▸Add a required field with no default: backward incompatible for old writers
▸Rename a field: incompatible in both directions, forces coordinated deploy
▸Remove a field: forward incompatible (new readers may still want it)
▸Change a type: usually incompatible, depends on the type system
Most schema registries enforce one of these modes by default. Confluent Schema Registry, for example, ships with BACKWARD as the default compatibility check, which means a new schema must be readable by the most recent reader. Other modes include FORWARD, FULL, and NONE. The choice tells the team which deploy ordering is safe.
Most production systems care about backward compatibility most of the time, because the typical pattern is that producers ship code first and consumers catch up later. A consumer that has not yet upgraded must still read the old shape; that is the backward case. A consumer that upgrades early must still read events written under the old schema that are still in flight in the topic; that is also the backward case in a system where messages live for hours or days. Forward compatibility matters in narrower scenarios: a consumer that ships before the producer, or a consumer rolled back to an earlier version after the producer has already moved forward. The two questions look symmetric on paper. In production, backward is the daily concern and forward is the rollback safety net.
TIP
When proposing a schema change, name which compatibility mode it preserves. If the answer is neither, the change requires a coordinated deploy and that coordination cost should be weighed against the benefit.
Source
producer
compat check
registry
Kafka
topic
Transform
consumer
Storage
warehouse
Schema evolution stays safe when a compatibility check sits between producer and topic: adding a column is backward-compatible and passes; a rename or type change is breaking and is caught before it reaches consumers.
Adding Is Safe, Renaming Is Not
Daily Life
Interviews
Classify a proposed schema change as additive, destructive, or coordinated, and name the safe rename pattern.
The compatibility framework above implies a practical rule that holds for almost every real-world schema change. Adding things is usually safe. Removing or renaming things is almost never safe without coordination. This is not a deep theoretical claim. It is an observation about the asymmetry between adding new information and removing or relabeling information that downstream code already depends on. The asymmetry is so reliable that it shows up as a default in serialization formats, in version control conventions, and in API design across the industry. Avro and Protobuf both make additive change effortless and destructive change deliberate. REST API versioning conventions allow new fields in any response without bumping the version number; deprecating an existing field requires a major version bump. The same shape applies inside data pipelines.
Why Adding Is the Safe Default
When a producer adds a new optional column with a default value, every existing consumer continues to function. Code that did not know about the column never asked for it. Code that did know about it gets the default for old rows. No downstream join breaks. No transform sees a NULL where a value used to be. The data team can adopt the new column on its own schedule, building dashboards and aggregations against it once the value of the new field is understood.
Why Renaming Is the Dangerous Default
Renaming a column does the opposite. The rename is invisible to the producer once the deploy is finished, but every consumer that read the old name now reads NULL or fails. The fix requires every consumer to deploy a code change that points at the new name, and the producer must wait for all consumers before retiring the old name. In a system with dozens of downstream consumers, this synchronization is hard to organize and easy to mishandle. Whole pipelines have been broken for days because a rename happened on Friday and the on-call engineer assumed the upstream team had coordinated with everyone. Renames also produce a class of bug that is exceptionally hard to find: a column that exists, contains data, but means something different than it used to. A column called amount that originally tracked gross revenue and silently became net revenue is far more dangerous than a column that disappeared entirely. The disappeared column produces a NULL that someone notices. The repurposed column produces wrong numbers that look right.
Change
Risk Level
Typical Outcome If Unannounced
Add an optional column
Low
Pipelines keep working; new column is ignored until adopted
Add a required column with default
Low
Old rows get the default; new rows carry the value
Add an enum value at the end
Low to medium
Old readers may misclassify; new readers handle correctly
Widen a numeric type
Low
Old values still fit in the wider type
Drop a column
High
Any consumer reading that column produces NULL or fails
Rename a column
High
All consumers reading the old name break simultaneously
Narrow a type
High
Some old values no longer fit; data is silently truncated
Reorder positional fields
Critical
Type coincidences mask wrong data flowing into the wrong column
The Stripe Example
Stripe's API documentation includes a long-standing convention: new fields can appear in any response without notice, but existing fields will not be removed or renamed within a major API version. That single convention is why thousands of integrations against Stripe rarely break. The producer commits to additive change as the default. Anything destructive happens through versioned endpoints that consumers opt into when they are ready. The same pattern applies inside a company between data producer teams and data pipelines: additive is free, destructive is expensive.
The additive default in practice:
▸New fields are added; existing fields are never repurposed
▸Removals happen only after a deprecation window with deprecation flags in the schema
▸Renames are treated as a remove plus an add, with both fields populated during the migration window
▸Type changes happen by adding a new column with the new type and migrating off the old one
Additive ChangeDestructive Change
Additive Change
The safe default
New optional fields with default values, new enum values at the end, widened numeric types. Producers ship without coordination; consumers adopt at their own pace.
Destructive Change
Requires coordination
Drops, renames, narrowed types. Forces every consumer to upgrade. In well-run systems these go through a deprecation window with both old and new shapes available.
The case against destructive change is rarely about technical impossibility. It is about who pays the cost. An additive change places the cost on the producer team, who write the new column and the migration. A destructive change places the cost on every consumer team, who must change their code to keep working. A producer team that ships destructive changes to save themselves work is externalizing the cost onto the rest of the organization. Mature data platforms treat this externalization the same way they treat any other cross-team coupling: name it, price it, and prefer the path that keeps each team accountable for the cost of its own decisions.
✓Do
Default to additive changes for any cross-team schema
Treat renames as add-then-drop with a deprecation window
Document each new field's meaning before downstream teams start using it
✗Don't
Drop or rename a column without naming every downstream consumer first
Reorder positional fields in serialized formats; the cost is invisible breakage
Narrow a type in place; add a new column with the narrower type instead
1
/* A demonstration of the safe rename pattern */
2
/* Phase 1: add the new column without dropping the old */
3
WITHeventsAS(
4
SELECT
5
1ASuser_id,
6
DATE'2025-09-14'ASsignup_date,
7
CAST(NULLASTIMESTAMP)AScreated_at
8
9
UNIONALL
10
11
SELECT
12
2,
13
DATE'2025-09-15',
14
CAST(NULLASTIMESTAMP)
15
16
UNIONALL
17
18
SELECT
19
3,
20
DATE'2025-09-16',
21
CAST(NULLASTIMESTAMP)
22
)
23
24
SELECT
25
user_id,
26
signup_date,
27
COALESCE(
28
created_at,
29
CAST(signup_dateASTIMESTAMP)
30
)AScreated_at
31
FROMevents
32
ORDERBYuser_id
What Late Data Means
Daily Life
Interviews
Recognize late data, distinguish event time from processing time, and name the three reactions a pipeline can have to a late event.
Schema drift is one half of the lesson. The other half is late data. Events do not always arrive in the order they were produced. A click happens on a phone with patchy reception on Tuesday morning, the SDK queues the event locally, and the event is uploaded Thursday afternoon when the phone reconnects to wifi. The event is timestamped Tuesday. It arrives Thursday. Every batch and streaming system in the industry has to decide what to do with that event.
The Two Timestamps That Matter
Timestamp
What It Records
Owned By
Event time
When the event actually happened in the world
The producer (or the user's device)
Ingestion time
When the pipeline first received the event
The pipeline
Processing time
When the pipeline acted on the event in a transform
The processing engine
These three timestamps are usually within seconds of each other. When they diverge, the gap is what people call lateness. An event with event_time = Tuesday 09:14 and ingestion_time = Thursday 14:27 is two days late. The pipeline that aggregates daily totals has a choice: include this Tuesday event in the Tuesday bucket (correct, but it changes Tuesday's number after the fact), or include it in the Thursday bucket (wrong, but does not change history), or drop it (wrong, but keeps the math simple). All three choices show up in production systems and all three are sometimes correct, depending on the consumer. The choice is not just an engineering decision. It is a contract with whoever reads the output. A finance dashboard that reports daily revenue must use event time to satisfy the audit; a fraud-detection system reading a real-time feed may prefer processing time because it operates on what the engine actually saw. Naming the consumer first, then the time domain, prevents the most common kind of disagreement between data team and stakeholder.
Why Lateness Happens
Mobile SDKs queue events offline and flush when connectivity returns; this is the largest single source of lateness.
Producer outages cause backlog; when the producer comes back online it ships hours of accumulated events at once.
Network congestion in batched producers can delay flush by minutes to hours; Producers that batch events for efficiency (e.g., flushing every 100ms) compound with downstream lag.
Downstream queue backlog: an event arrives at the broker on time but the consumer does not pull it for hours.
The three things a pipeline can do with a late event:
▸Include it in its event-time bucket and update history (correct, but mutates yesterday)
▸Include it in the bucket of when it arrived (wrong, but cheap and stable)
▸Drop it (wrong, but predictable and bounded if lateness is rare)
Most batch pipelines silently choose option two or three because they were never explicitly designed to handle lateness. The system runs on processing time and does not look back. The cost is a quiet undercount in every reporting period that contains late events. For most consumer-facing dashboards this undercount is on the order of a percent or two and goes unnoticed. For financial reporting, billing, or any audit-relevant table, even a small undercount is a serious bug.
✓Event Time
Records when the event happened in reality
Stable property of the event
Required for accurate historical reports
Forces the pipeline to handle out-of-order arrivals
•Processing Time
Records when the pipeline saw the event
Depends on broker and consumer lag
Easy to compute, monotonic by construction
Drifts from event time when the system is under load
✓Aggregating by Event Time
Tuesday late event lands in Tuesday's bucket
Reruns produce the same totals over time
Reflects what happened in the world
Forces idempotent writes and partition overwrites
•Aggregating by Processing Time
Tuesday late event lands in Thursday's bucket
Reruns can produce different totals as lag changes
Reflects what the engine saw, not what occurred
Cheap to implement; misleading to read
1
/* Show the cost of bucketing by ingestion time when lateness is real */
2
WITHevtsAS(
3
SELECT
4
'a'ASid,
5
DATE'2026-04-21'ASevent_date,
6
DATE'2026-04-21'ASingest_date,
7
100ASamount
8
9
UNIONALL
10
11
SELECT
12
'b',
13
DATE'2026-04-21',
14
DATE'2026-04-21',
15
200
16
17
UNIONALL
18
19
SELECT
20
'c',
21
DATE'2026-04-21',
22
DATE'2026-04-23',
23
50/* 2-day-late event */
24
25
UNIONALL
26
27
SELECT
28
'd',
29
DATE'2026-04-22',
30
DATE'2026-04-22',
31
75
32
)
33
34
SELECT
35
'by event_date'ASbucket_key,
36
event_dateASbucket,
37
SUM(amount)AStotal
38
FROMevts
39
GROUPBYevent_date
40
41
UNIONALL
42
43
SELECT
44
'by ingest_date'ASbucket_key,
45
ingest_dateASbucket,
46
SUM(amount)AStotal
47
FROMevts
48
GROUPBYingest_date
49
ORDERBYbucket_key,bucket
Lateness is not an exception condition. It is the expected behavior of any distributed system that includes mobile clients, intermittent networks, or producer batching. The question is not whether late data appears, but how the pipeline handles it.
The cost of getting lateness wrong is rarely visible on the day the bug is introduced. A pipeline that quietly drops late events looks correct in side-by-side checks during testing, because tests rarely include events that arrive on day three. The bug surfaces months later, when an analyst notices that Tuesday's revenue is consistently 1.2 percent below an external source of truth, or when a finance team reconciles a quarter-end and finds a small but persistent gap. The lag between cause and effect is what makes lateness bugs expensive. By the time the bug is found, the wrong numbers have shipped to customers, regulators, or executives, and unwinding the historical record requires explaining what was missed and why.
TIP
Before designing any aggregation, ask which timestamp the consumer cares about. Marketing reports almost always want event time. Operational metrics about pipeline health almost always want processing time. The two questions look identical and are not.
Late Data: Rerun Last 7 Days
Daily Life
Interviews
Apply a daily-rerun window as the simplest workable defense against late data in a batch pipeline.
The simplest workable fix for late data in a batch pipeline is also the most common: every day, do not compute today alone; also recompute the last several days. The size of the window depends on how late events tend to arrive. Seven days is a typical default because it covers nearly all mobile SDK retry tail behavior without making the daily run prohibitively expensive.
Why a Rerun Window Works
If today's run also recomputes the last seven days, then any event whose event_time was within the last seven days and whose ingestion_time is today will be picked up in the right bucket. The dashboard for last Tuesday gets corrected today. The cost is computational: instead of processing one day's events, the pipeline processes eight. The benefit is that history corrects itself within the window. As long as the pipeline is idempotent (running it twice produces the same answer), nothing breaks.
Window Size
Catches
Misses
Daily Compute Cost
1 day (today only)
Events arriving the same day they were produced
Anything later than processing time
1x baseline
3 days
Most mobile retry traffic and brief producer outages
Long tail of stuck SDKs and weekend outages
3x baseline
7 days
Nearly all real-world lateness in event-driven systems
The query reads the last seven days of raw events, aggregates them by event_time, and overwrites the corresponding daily partitions. Today's partition is computed for the first time. Yesterday's, the day before's, and so on are recomputed. If a late event from five days ago arrived between the last run and this one, it now appears in the correct day's totals. The INSERT OVERWRITE clause makes the operation idempotent at the partition level.
1
/* The same pattern, applied to a small example dataset */
2
WITHrawAS(
3
SELECT
4
1ASuser_id,
5
DATE'2026-04-20'ASevent_date,
6
4999ASamount_cents
7
8
UNIONALL
9
10
SELECT
11
2,
12
DATE'2026-04-21',
13
1999
14
15
UNIONALL
16
17
SELECT
18
3,
19
DATE'2026-04-21',
20
2999
21
22
UNIONALL
23
24
SELECT
25
4,
26
DATE'2026-04-22',
27
999
28
29
UNIONALL
30
31
SELECT
32
5,
33
DATE'2026-04-23',
34
4999
35
36
UNIONALL
37
38
/* A late event for April 21 that arrived on April 25 */
39
SELECT
40
6,
41
DATE'2026-04-21',
42
3499
43
)
44
45
SELECT
46
event_date,
47
SUM(amount_cents)ASrevenue_cents,
48
COUNT(*)ASevent_count
49
FROMraw
50
WHEREevent_date>=DATE'2026-04-19'
51
GROUPBYevent_date
52
ORDERBYevent_date
April 21 now reflects three events totaling 8,497 cents. The late event for April 21 was absorbed because the rerun window included it. Without the window, April 21's number would have been frozen at 4,998 cents and the late event would either have landed in April 25's bucket or been dropped entirely. The behavior is what consumers expect from a daily report, even though most of them have never thought about why the number changes between Wednesday's view and Thursday's view of the same Tuesday. The pattern teaches the consumer's eye to expect minor backfill of the most recent few days; the data team learns to expect that backfill to be the late-data tail showing up; and the system as a whole stays calibrated without anyone needing to explain what is happening on a given day.
What the rerun window requires:
▸An idempotent write: re-running must produce the same answer, not duplicates
▸Partition-level overwrite (not append): yesterday's row gets replaced, not added to
▸Source data retained for at least the window length: raw events from 7 days ago must still be queryable
▸Compute budget that absorbs the window: 7 days of work, every day
When the Window Is Not Enough
A 7-day window catches the bulk of lateness in a healthy system but does not catch everything. A producer that goes offline for nine days will produce events that arrive outside the window and end up in the wrong bucket. The next-tier fixes (watermarks, allowed lateness in streaming, reconciliation passes) appear in the intermediate and advanced tiers. The point of this section is that the cheapest correct answer for many batch pipelines is not exotic infrastructure. It is a window.
The rerun window has a second-order benefit beyond catching late events: it acts as a safety net for any pipeline bug discovered within the window. If a transform was deployed on Monday with a subtle bug that produced incorrect totals, the bug is caught on Wednesday, and the fix is deployed on Wednesday afternoon, the rerun on Thursday morning will recompute Monday and Tuesday with the corrected logic. The historical record self-heals as long as the bug is found within the window. This property turns the rerun from a late-data fix into a general-purpose freshness guarantee: the pipeline's output for any day older than the window is final; output within the window is provisional and may shift as late events and bug fixes propagate.
✓Do
Pick a window size based on observed lateness, not on a hunch
Make the rerun idempotent at the partition level: overwrite, not append
Retain raw source data for at least the window length
✗Don't
Append late events to the bucket of when they arrived; this hides the late-data problem
Skip the rerun on weekends; producer outages happen on weekends too
Set the window to one day and assume the pipeline handles late data; it does not
The rerun window pattern is also the foundation that makes more sophisticated late-data designs possible. The streaming watermark concept covered in the next tier and the reconciliation pass concept covered in the advanced tier both inherit the same property: the streaming output is provisional within the lateness window, and a separate mechanism corrects it afterward. The seven-day rerun is the batch-only version of this idea, accessible without any streaming infrastructure. Many production systems run nothing more sophisticated than this and meet their accuracy targets, because the long tail of lateness past seven days is small enough to ignore for the consumer. Recognizing when the simple version is enough, and when the workload genuinely requires the streaming-plus-reconciliation composite, is one of the judgment calls that separates a senior data engineer from a junior one.
TIP
Measure the lateness distribution from production data before picking a window. If 99 percent of events arrive within four hours, a 24-hour window is overkill. If 5 percent arrive after three days, even a 7-day window leaves money on the table.
❯❯❯PUTTING IT ALL TOGETHER
> A subscription company runs a nightly batch pipeline that aggregates checkout events from a Kafka topic into a Snowflake daily revenue table. Two issues land in the same week. On Monday a backend engineer adds a promo_code field to the checkout event, and on Wednesday the on-call engineer notices that Tuesday's revenue total quietly grew by 1.4 percent overnight. The new data engineer is asked to design the simplest set of changes that keeps both problems from recurring.
The promo_code addition is an additive change. The loader should land the raw event in a flexible column and let downstream transforms decide when to read the new field, rather than rejecting any row that does not match a strict schema.
The 1.4 percent overnight growth is the rerun window doing its job. Late events from Tuesday arrived between the Tuesday-morning run and the Wednesday-morning run, and the Wednesday run picked them up because it recomputed the last several days, not yesterday alone.
The rerun depends on idempotency, the property covered in Lesson 5 (idempotency and backfill). Without partition overwrite, the same late event would be double-counted on every rerun day.
Both fixes share a frame: design the pipeline to absorb upstream change, not to prevent it. The producer team will keep adding fields and the SDK will keep flushing late events. The pipeline meets both with tolerance.
KEY TAKEAWAYS
Schemas drift because producer teams ship on their own cadence: the pipeline must absorb upstream change rather than prevent it.
Backward and forward compatibility are independent: a schema change can be one, both, or neither. Knowing which decides who can deploy first.
Additive changes are nearly always safe; renames and drops are nearly always dangerous: the cheap default is add-then-deprecate, with both shapes coexisting during a migration window.
Late data is the rule, not the exception: event time and processing time diverge whenever mobile clients, batched producers, or unreliable networks are involved.
A rerun window is the simplest correct answer for batch lateness: compute the last seven days every day, overwrite by partition, and history corrects itself within the window.
Data shapes shift and events arrive out of order; pipelines must absorb both without breaking
Category
Pipeline Architecture
Difficulty
beginner
Duration
25 minutes
Challenges
0 hands-on challenges
Topics covered: The Producer Added a Column Problem, Forward vs Backward Compatibility, Adding Is Safe, Renaming Is Not, What Late Data Means, Late Data: Rerun Last 7 Days
Pipelines do not own the data flowing through them. The teams that produce events, write to operational databases, or push files into shared buckets own the upstream shape. Those teams ship code on their own cadence. Sooner or later, one of them adds a field, renames a column, or changes a type, and a pipeline that has been running fine for months suddenly fails. The producer-added-a-column problem is the most common variant of this story. It is so common that every senior data engineer has a pe
Two terms appear in nearly every conversation about schema change: backward compatible and forward compatible. They sound interchangeable. They are not. The distinction matters because it tells the producer and the consumer who can upgrade first without breaking the other. Confusing the two is the source of half the schema-related production incidents in event-driven systems. The Two Definitions, Plain Backward compatibility says the new code reads the old data. Forward compatibility says the ol
The compatibility framework above implies a practical rule that holds for almost every real-world schema change. Adding things is usually safe. Removing or renaming things is almost never safe without coordination. This is not a deep theoretical claim. It is an observation about the asymmetry between adding new information and removing or relabeling information that downstream code already depends on. The asymmetry is so reliable that it shows up as a default in serialization formats, in version
Schema drift is one half of the lesson. The other half is late data. Events do not always arrive in the order they were produced. A click happens on a phone with patchy reception on Tuesday morning, the SDK queues the event locally, and the event is uploaded Thursday afternoon when the phone reconnects to wifi. The event is timestamped Tuesday. It arrives Thursday. Every batch and streaming system in the industry has to decide what to do with that event. The Two Timestamps That Matter These thre
The simplest workable fix for late data in a batch pipeline is also the most common: every day, do not compute today alone; also recompute the last several days. The size of the window depends on how late events tend to arrive. Seven days is a typical default because it covers nearly all mobile SDK retry tail behavior without making the daily run prohibitively expensive. Why a Rerun Window Works If today's run also recomputes the last seven days, then any event whose event_time was within the la