
Schema Evolution and Late Data: Beginner

Data shapes shift and events arrive out of order; pipelines must absorb both without breaking


Category: Pipeline Architecture
Difficulty: Beginner
Duration: 25 minutes
Challenges: 0 hands-on challenges

Topics covered: The Producer Added a Column Problem, Forward vs Backward Compatibility, Adding Is Safe, Renaming Is Not, What Late Data Means, Late Data: Rerun Last 7 Days

Lesson Sections

  1. The Producer Added a Column Problem (concepts: paSchemaDrift, paSchemaEvolution)

    Pipelines do not own the data flowing through them. The teams that produce events, write to operational databases, or push files into shared buckets own the upstream shape, and those teams ship code on their own cadence. Sooner or later one of them adds a field, renames a column, or changes a type, and a pipeline that has been running fine for months suddenly fails. The producer-added-a-column problem is the most common variant of this story.
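    One common defense against a producer adding a column is the "tolerant reader" pattern: the consumer validates only the fields it needs and ignores everything else. A minimal sketch (field names like ab_test_bucket are illustrative, not from a real schema):

    ```python
    # Tolerant reader: project events onto the fields we depend on,
    # so a column the producer adds later passes through harmlessly.
    REQUIRED = {"user_id", "event_type", "ts"}

    def parse_event(raw: dict) -> dict:
        missing = REQUIRED - raw.keys()
        if missing:
            raise ValueError(f"event missing required fields: {sorted(missing)}")
        # Keep only the fields we know about; unknown columns are ignored.
        return {k: raw[k] for k in REQUIRED}

    old_event = {"user_id": 1, "event_type": "click", "ts": "2024-01-02T03:04:05Z"}
    new_event = {**old_event, "ab_test_bucket": "B"}  # producer added a column

    # The added column does not change what the consumer sees.
    assert parse_event(new_event) == parse_event(old_event)
    ```

    The key design choice is validating a required subset rather than the exact shape: an exact-shape check is what turns an additive upstream change into a production incident.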

  2. Forward vs Backward Compatibility (concepts: paSchemaCompatibility)

    Two terms appear in nearly every conversation about schema change: backward compatible and forward compatible. They sound interchangeable; they are not. The distinction matters because it tells the producer and the consumer who can upgrade first without breaking the other, and confusing the two is behind many schema-related production incidents in event-driven systems. The two definitions, plainly: backward compatibility means the new code can read the old data; forward compatibility means the old code can read the new data.
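    The two directions can be made concrete with plain dicts standing in for serialized events (the channel field is illustrative):

    ```python
    # v2 of the consumer knows about a new optional 'channel' field.
    # Backward compatible: old (v1) events lack 'channel', so v2
    # supplies a default instead of failing.
    def read_v2(event: dict) -> tuple:
        return event["user_id"], event.get("channel", "unknown")

    # v1 of the consumer only knows 'user_id'. It stays forward
    # compatible as long as it ignores fields it does not recognize.
    def read_v1(event: dict) -> int:
        return event["user_id"]

    v1_event = {"user_id": 7}
    v2_event = {"user_id": 7, "channel": "email"}

    assert read_v2(v1_event) == (7, "unknown")   # new code reads old data
    assert read_v1(v2_event) == 7                # old code reads new data
    ```

    Note who can upgrade first in each case: backward compatibility lets consumers upgrade before producers, forward compatibility lets producers upgrade before consumers.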

  3. Adding Is Safe, Renaming Is Not (concepts: paAdditiveChange, paDestructiveChange)

    The compatibility framework above implies a practical rule that holds for almost every real-world schema change: adding things is usually safe; removing or renaming things is almost never safe without coordination. This is not a deep theoretical claim. It is an observation about the asymmetry between adding new information and removing or relabeling information that downstream code already depends on. The asymmetry is reliable enough that it shows up as a default in serialization formats.
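    A small sketch of the asymmetry, using a hypothetical rename of signup_ts to signed_up_at:

    ```python
    # Downstream code written against the original column name.
    def daily_signups(rows: list[dict]) -> int:
        return sum(1 for r in rows if r["signup_ts"] is not None)

    additive = [{"signup_ts": "2024-05-01", "referrer": "ads"}]  # new column added
    renamed = [{"signed_up_at": "2024-05-01"}]                   # column renamed

    # Additive change: the extra column is simply ignored.
    assert daily_signups(additive) == 1

    # Rename: the old name is gone, and the consumer breaks at runtime.
    try:
        daily_signups(renamed)
    except KeyError as exc:
        print("rename broke the consumer:", exc)  # prints: rename broke the consumer: 'signup_ts'
    ```

    The rename requires coordination precisely because it is a removal plus an addition in one step: every consumer of the old name has to move at the same time as the producer.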

  4. What Late Data Means (concepts: paLateData, paEventTimeVsProcessingTime)

    Schema drift is one half of the lesson. The other half is late data. Events do not always arrive in the order they were produced. A click happens on a phone with patchy reception on Tuesday morning, the SDK queues the event locally, and the event is uploaded Thursday afternoon when the phone reconnects to wifi. The event is timestamped Tuesday; it arrives Thursday. Every batch and streaming system in the industry has to decide what to do with that event. Two timestamps matter here: event time, when the event actually happened, and processing time, when the pipeline first sees it.
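    The Tuesday-click-arriving-Thursday story can be written down directly; the field names are illustrative:

    ```python
    from datetime import datetime, timezone

    # The same event carries both timestamps.
    event = {
        # Tuesday morning: when the click actually happened on the phone.
        "event_time": datetime(2024, 5, 7, 9, 15, tzinfo=timezone.utc),
        # Thursday afternoon: when the pipeline first saw the event.
        "processing_time": datetime(2024, 5, 9, 16, 40, tzinfo=timezone.utc),
    }

    lateness = event["processing_time"] - event["event_time"]
    print(f"this event arrived {lateness} late")  # prints: this event arrived 2 days, 7:25:00 late

    # A daily batch partitioned on processing_time credits the click to Thursday;
    # partitioned on event_time, it belongs to Tuesday.
    ```

    Which timestamp a pipeline partitions on determines whether late events land in the wrong day's numbers or in the right day's numbers after the fact.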

  5. Late Data: Rerun Last 7 Days (concepts: paLateDataRerun, paIdempotency)

    The simplest workable fix for late data in a batch pipeline is also the most common: every day, do not compute today alone; also recompute the last several days. The size of the window depends on how late events tend to arrive. Seven days is a typical default because it covers nearly all mobile SDK retry tail behavior without making the daily run prohibitively expensive. Why a rerun window works: if today's run also recomputes the last seven days, then any event whose event_time falls within that window is picked up and counted under the correct day, no matter when it actually arrived, as long as it arrived before the window closed.
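    A minimal sketch of the rerun window, assuming daily counts keyed on event_time and a job that overwrites each day's partition wholesale (which is what makes reruns idempotent: running it twice over the same inputs yields the same output):

    ```python
    from datetime import date, timedelta
    from collections import defaultdict

    WINDOW_DAYS = 7  # rerun window; tune to your late-arrival tail

    def recompute_window(events: list[dict], run_date: date) -> dict:
        """Recompute daily counts for run_date and the preceding 6 days.
        Each day in the window is rebuilt from scratch, so a rerun
        replaces stale partial counts rather than adding to them."""
        window_start = run_date - timedelta(days=WINDOW_DAYS - 1)
        counts: dict = defaultdict(int)
        for e in events:
            if window_start <= e["event_time"] <= run_date:
                counts[e["event_time"]] += 1
        return dict(counts)

    events = [
        {"event_time": date(2024, 5, 7)},  # Tuesday click...
        {"event_time": date(2024, 5, 7)},  # ...that only arrived Thursday
    ]
    # Thursday's run still credits both events to Tuesday's partition:
    print(recompute_window(events, date(2024, 5, 9)))  # → {datetime.date(2024, 5, 7): 2}
    ```

    Overwriting whole partitions instead of appending is the crucial design choice: an append-style job rerun over the same window would double-count.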