Late-Arriving Data: Advanced
What you will be able to do
"What If the Data Arrives After the Pipeline Ran?"
- ▸"What happens if an event arrives 3 days late?"
- ▸"The mobile app queues events when offline"
- ▸"We receive corrections from the source system after the partition is closed"
- ▸"A fact row references a dimension that doesn't exist yet"
- ▸"How do you handle out-of-order data in the pipeline?"
- ▸Any question involving event time vs processing time, retro-active updates, or data reconciliation
What They're Really Testing
The hidden rubric: does this candidate design for event-time semantics, or do they assume processing-time equals event-time? Every production pipeline at scale deals with the gap between when an event happened (event time) and when the pipeline sees it (processing time). Candidates who conflate the two will build pipelines that produce wrong results for any late data.
The Two Categories
- The event happened in the past
- The fact row needs to land in a historical partition
- Example: mobile click from 3 days ago
- Problem: the historical partition may be closed
- Fix: reopen and insert, or accumulate in a late-data partition
- The fact row arrives before its dimension row
- The dimension FK lookup fails
- Example: new product in a fact before dim_product is updated
- Problem: the fact row has no valid dimension SK
- Fix: inferred member (placeholder dimension row)
The 60-Second Framework
Why Companies Care
The strongest opening: 'Late data is not an edge case. It is the normal case for any system with mobile clients, distributed sources, or batch upstream feeds. I design for it from the start, not as a fix-it-later.' This one sentence tells the interviewer you have operated real pipelines.
Late-Arriving Facts: Inserting Into Closed Partitions
Three Strategies: Know All Three, Recommend One
The Cascade Problem Interviewers Probe
| Component | Before Late Fact | After Late Fact | Stale? |
|---|---|---|---|
| fact_orders (March 15 partition) | 100,000 rows | 100,047 rows | Updated |
| agg_daily_revenue (March 15) | $10,000,000 | $10,000,000 | STALE until recomputed |
| agg_monthly_revenue (March) | $310,000,000 | $310,000,000 | STALE until recomputed |
| dashboard (March 15 tile) | $10M | $10M | STALE until cache refresh |
The strong answer: 'Late fact insertion requires a cascade: insert the fact, mark affected aggregate partitions for recomputation, recompute, and invalidate dashboard caches. Without this cascade, the base data is correct but everything downstream is stale.'
The Follow-Up Trap
- Define a late-data SLA (e.g., 7-day window for automatic reprocessing)
- Use event_time for partitioning analytics, processing_time for pipeline operations
- Cascade recomputation to all downstream aggregates after late fact insertion
- Drop late events because 'the partition is closed'
- Assume processing_time equals event_time
- Reprocess the entire table when only one partition needs updating
Late-Arriving Dimensions: The Inferred Member
The Pattern: Inferred Members
Inferred member: a fact arrives for unknown customer C999 before the dimension. Insert a placeholder dim_customer row (is_inferred = TRUE) so the fact keeps its FK; backfill the real attributes when the dimension catches up.
Why 'Skip the Fact' Is a No-Hire Answer
The Interaction With Type 2: The Advanced Follow-Up
| customer_sk | customer_id | name | city | is_inferred | effective_from | effective_to | is_current |
|---|---|---|---|---|---|---|---|
| 5001 | C999 | Unknown | Unknown | true | 2024-03-15 | 2024-03-17 | false |
| 5002 | C999 | Alice Zhang | Seattle | false | 2024-03-18 | 9999-12-31 | true |
Fact rows from March 15 to 17 join to SK 5001 (inferred). Fact rows from March 18 onward join to SK 5002 (real). The historical record is preserved. No data is lost or retroactively changed.
What the Interviewer Writes
- "Skip the fact if the dimension doesn't exist"
- "Queue it and retry later"
- "It shouldn't happen if the pipeline is well-designed"
- No concept of inferred members
- "Insert an inferred member with is_inferred = TRUE"
- "Backfill when the real dimension arrives"
- "Monitor inferred member age as a data quality KPI"
- Knows how inferred members interact with SCD Type 2
Correction and Reversal Patterns
Why 'Just UPDATE It' Is a No-Hire Answer
The Reversal Pattern: The Strong-Hire Answer
| row_type | order_id | date | amount | notes |
|---|---|---|---|---|
| ORIGINAL | O1 | Mar 15 | $1,000 | Loaded by daily pipeline |
| REVERSAL | O1 | Mar 18 | -$1,000 | Cancels the original |
| CORRECTION | O1 | Mar 18 | $1,100 | Correct amount |
The Alternative: Version Columns
- Three rows: original, reversal, correction
- SUM naturally produces correct result
- Preserves full audit trail
- Best for: financial data, regulatory environments
- Multiple rows per entity, latest version wins
- Requires ROW_NUMBER filter for correct results
- Preserves all versions for audit
- Best for: high-correction-volume, non-financial data
Red Flag Phrases
The senior framing: 'In an immutable fact table, corrections are additive, not destructive. A reversal row plus a correction row preserves the audit trail and lets SUM produce the right answer without special handling.' This single sentence demonstrates immutability discipline.
Designing for Reprocessability
The Three Things the Interviewer Wants to Hear
The Pattern You Should Be Able to Write
The DELETE-INSERT pattern inside a transaction is idempotent: run it once or five times, same result. The loaded_at timestamp records the reprocessing time. The aggregate metadata table triggers downstream recomputation.
The Operational Checklist That Closes the Answer
Vocabulary That Signals Seniority
| Junior Phrasing | Senior Phrasing |
|---|---|
| "Data should arrive on time" | "Late data is the normal case. The pipeline is designed to handle it without manual intervention." |
| "I'd just reload the table" | "I'd reprocess the affected partition using idempotent DELETE-INSERT, then cascade to downstream aggregates." |
| "We'd fix the dimension and rerun" | "We'd create an inferred member immediately so the fact loads without blocking, then backfill the dimension when it arrives." |
| "Update the wrong row" | "Corrections use reversal rows: a negative row to cancel the original and a positive row with the correct value. The audit trail is preserved." |
| "Late data is a bug" | "Late data is an operational reality. We monitor inferred member age and late-fact volume as data quality KPIs." |
- ▸Percentage of events arriving more than 1 hour after event_time (late-fact volume)
- ▸Count of inferred dimension members older than 24 hours (upstream degradation signal)
> You are in an Uber data engineering interview. The interviewer asks: 'A ride event arrives 4 hours after the ride ended because the driver's phone was offline. How does your pipeline handle it?'
Data always arrives late; your model either handles it gracefully or corrupts silently
- Category
- Data Modeling
- Difficulty
- advanced
- Duration
- 25 minutes
- Challenges
- 0 hands-on challenges
Topics covered: "What If the Data Arrives After the Pipeline Ran?", Late-Arriving Facts: Inserting Into Closed Partitions, Late-Arriving Dimensions: The Inferred Member, Correction and Reversal Patterns, Designing for Reprocessability
Lesson Sections
- "What If the Data Arrives After the Pipeline Ran?" (concepts: dmLateArriving)
What They're Really Testing The interviewer is testing whether you design pipelines for the ideal case or the real case. The ideal case: data arrives in order, dimensions exist before facts, and no corrections are needed. The real case: mobile events arrive days late, upstream dimension changes are delayed, and corrections arrive after reports have been generated. The Two Categories The 60-Second Framework Why Companies Care Cite these in your answer: 'At Uber, 15% of ride events arrive more tha
- Late-Arriving Facts: Inserting Into Closed Partitions (concepts: dmLateArriving)
After you identify the late-arriving fact scenario, the interviewer will ask: 'Show me how your pipeline handles it.' This is where candidates who have only read about late data stall, and candidates who have operated it in production accelerate. The answer involves partition management, aggregate recomputation, and a cascade strategy. Three Strategies: Know All Three, Recommend One The Cascade Problem Interviewers Probe Narrate the cascade: 'I insert the late fact into the March 15 partition. B
- Late-Arriving Dimensions: The Inferred Member (concepts: dmLateArriving)
The interviewer says: 'A fact row arrives but the dimension it references does not exist yet. What do you do?' This is the inferred member question, and it separates candidates who have built dimension loading pipelines from candidates who have only designed schemas. The wrong answers are 'skip the fact' and 'queue it for later.' The Pattern: Inferred Members Your inferred member answer: 'I insert a placeholder row into dim_customer with customer_id = C999, name = Unknown, city = Unknown, is_inf
- Correction and Reversal Patterns (concepts: dmImmutableLogs)
The interviewer asks: 'The source system sent incorrect data yesterday and just sent a correction. How does your model handle it?' They are testing whether you destructively update fact rows (no-hire signal) or use reversal patterns that preserve the audit trail (strong-hire signal). In financial data modeling interviews, this question is worth 20% of the scorecard. Why 'Just UPDATE It' Is a No-Hire Answer The Reversal Pattern: The Strong-Hire Answer Your reversal answer: 'Two new rows: a revers
- Designing for Reprocessability (concepts: dmLateArriving)
The closing question in any late-data interview: 'Can you reprocess this pipeline from scratch?' This tests whether your entire design is built for reprocessability or whether it has hidden assumptions that break on rerun. Candidates who say 'yes, because every component is idempotent and parameterized by date' pass. Candidates who hesitate fail. The Three Things the Interviewer Wants to Hear The Pattern You Should Be Able to Write Idempotent DELETE-INSERT per Partition The DELETE-INSERT pattern