Late-Arriving Data: Advanced

In textbook data modeling, facts and dimensions arrive in order: the dimension row exists before the fact row references it. In production, this never happens. Mobile events arrive 72 hours late because the device was offline. A dimension update from an upstream system is delayed by a failed API. A fact partition was already closed when a correction event appears. Every FAANG modeling interview tests whether you design for the happy path or for reality. Candidates who say 'I would just load it' get filtered. Candidates who say 'late data is the normal case and here is how the pipeline handles it' advance.

What you will be able to do

Recognize late-arriving data triggers in interview prompts

Design inferred members for dimensions that have not arrived yet

Handle corrections without destructive updates using reversal patterns

"What If the Data Arrives After the Pipeline Ran?"

Daily Life

Interviews

You are being tested on late-arriving data when you hear:

▸"What happens if an event arrives 3 days late?"
▸"The mobile app queues events when offline"
▸"We receive corrections from the source system after the partition is closed"
▸"A fact row references a dimension that doesn't exist yet"
▸"How do you handle out-of-order data in the pipeline?"
▸Any question involving event time vs processing time, retro-active updates, or data reconciliation

What They're Really Testing

The interviewer is testing whether you design pipelines for the ideal case or the real case. The ideal case: data arrives in order, dimensions exist before facts, and no corrections are needed. The real case: mobile events arrive days late, upstream dimension changes are delayed, and corrections arrive after reports have been generated.

The hidden rubric: does this candidate design for event-time semantics, or do they assume processing-time equals event-time? Every production pipeline at scale deals with the gap between when an event happened (event time) and when the pipeline sees it (processing time). Candidates who conflate the two will build pipelines that produce wrong results for any late data.

The Two Categories

•Late-Arriving Facts

The event happened in the past
The fact row needs to land in a historical partition
Example: mobile click from 3 days ago
Problem: the historical partition may be closed
Fix: reopen and insert, or accumulate in a late-data partition

•Late-Arriving Dimensions

The fact row arrives before its dimension row
The dimension FK lookup fails
Example: new product in a fact before dim_product is updated
Problem: the fact row has no valid dimension SK
Fix: inferred member (placeholder dimension row)

The 60-Second Framework

Classify: 'Is this a late fact (event in a closed partition) or a late dimension (FK lookup fails)?'

For late facts: 'The pipeline must be able to insert into historical partitions, and downstream aggregates must be recomputed for the affected partitions.'

For late dimensions: 'Insert an inferred member with default values and a flag. Backfill when the real dimension arrives.'

State the tradeoff: 'Handling late data adds pipeline complexity but prevents silent data loss.'

Why Companies Care

Cite these in your answer: 'At Uber, 15% of ride events arrive more than 1 hour late due to driver app connectivity. A pipeline that closes hourly partitions loses 15% of rides. At Spotify, podcast listens from offline devices arrive up to 7 days late, causing royalty calculations to be systematically wrong.' Drop one of these in 10 seconds to show late data is not an edge case at scale.

The strongest opening: 'Late data is not an edge case. It is the normal case for any system with mobile clients, distributed sources, or batch upstream feeds. I design for it from the start, not as a fix-it-later.' This one sentence tells the interviewer you have operated real pipelines.

Late-Arriving Facts: Inserting Into Closed Partitions

Daily Life

Interviews

After you identify the late-arriving fact scenario, the interviewer will ask: 'Show me how your pipeline handles it.' This is where candidates who have only read about late data stall, and candidates who have operated it in production accelerate. The answer involves partition management, aggregate recomputation, and a cascade strategy.

Three Strategies: Know All Three, Recommend One

Strategy #1Strategy #2Strategy #3

Strategy #1

Insert into the historical partition

Reopen the partition and insert the late fact with its original event_date. Downstream aggregates for that partition must be recomputed. This is the most correct approach but requires idempotent downstream pipelines that can handle re-runs.

Strategy #2

Separate late-arriving partition

Insert into a 'late_data' staging partition. A reconciliation job periodically moves these rows to their correct partitions and recomputes aggregates. Less disruptive to daily runs but adds pipeline complexity.

Strategy #3

Insert with processing-time partition, keep event-time as a column

Partition by processing_date but store event_date as a column. Queries that need event-time accuracy filter on event_date. Simple pipeline but consumers must know to filter on event_date, not the partition date.

The Cascade Problem Interviewers Probe

Narrate the cascade: 'I insert the late fact into the March 15 partition. But the daily revenue aggregate for March 15 was already computed as $10M. It is now stale. The monthly aggregate is stale. The dashboard tile is stale. I must mark affected aggregate partitions for recomputation, recompute them, and invalidate dashboard caches. Without this cascade, the base data is correct but everything downstream is wrong.' The interviewer is checking whether you think about the downstream impact, not just the insertion.

Component	Before Late Fact	After Late Fact	Stale?
fact_orders (March 15 partition)	100,000 rows	100,047 rows	Updated
agg_daily_revenue (March 15)	$10,000,000	$10,000,000	STALE until recomputed
agg_monthly_revenue (March)	$310,000,000	$310,000,000	STALE until recomputed
dashboard (March 15 tile)	$10M	$10M	STALE until cache refresh

The strong answer: 'Late fact insertion requires a cascade: insert the fact, mark affected aggregate partitions for recomputation, recompute, and invalidate dashboard caches. Without this cascade, the base data is correct but everything downstream is stale.'

The Follow-Up Trap

Follow-Up #1Follow-Up #2

Follow-Up #1

"How late is too late?"

Strong answer: 'Define a late-data SLA. Events within 7 days go to the historical partition and trigger recomputation. Events older than 7 days go to a reconciliation queue for manual review. The cutoff is a business decision, not a technical one.'

Follow-Up #2

"What about streaming pipelines?"

Strong answer: 'Streaming uses watermarks. The watermark says: I believe all events before this timestamp have arrived. Late events past the watermark go to a side output (dead letter queue) for batch reconciliation. Flink, Spark Structured Streaming, and Dataflow all support this pattern.'

TIP

Mentioning watermarks and dead letter queues when discussing late facts shows you understand both batch and streaming paradigms. Most candidates only know the batch approach.

✓Do

Define a late-data SLA (e.g., 7-day window for automatic reprocessing)
Use event_time for partitioning analytics, processing_time for pipeline operations
Cascade recomputation to all downstream aggregates after late fact insertion

✗Don't

Drop late events because 'the partition is closed'
Assume processing_time equals event_time
Reprocess the entire table when only one partition needs updating

Late-Arriving Dimensions: The Inferred Member

Daily Life

Interviews

The interviewer says: 'A fact row arrives but the dimension it references does not exist yet. What do you do?' This is the inferred member question, and it separates candidates who have built dimension loading pipelines from candidates who have only designed schemas. The wrong answers are 'skip the fact' and 'queue it for later.'

The Pattern: Inferred Members

Your inferred member answer: 'I insert a placeholder row into dim_customer with customer_id = C999, name = Unknown, city = Unknown, is_inferred = TRUE. The fact loads immediately with a valid SK. When the real dimension arrives hours or days later, I update the placeholder with actual attributes and set is_inferred = FALSE.' Walk through both steps. The interviewer is checking whether you know the two-phase pattern: placeholder now, backfill later.

1
2	INSERT INTO dim_customer(customer_sk, customer_id, name, city, tier, is_inferred, inferred_at, effective_from) VALUES(NEXTVAL('customer_sk_seq'), 'C999', 'Unknown', 'Unknown', 'Unknown', TRUE, CURRENT_TIMESTAMP, CURRENT_DATE) ; UPDATE dim_customer SET name = 'Alice Zhang', city = 'Seattle', tier = 'Gold', is_inferred = FALSE
3
4
5
6
7	WHERE customer_id = 'C999' AND is_inferred = TRUE ;

dim_customer

PKcustomer_skBIGINT

customer_idVARCHAR

nameVARCHAR

is_inferredBOOLEAN

fact_orders

PKorder_idBIGINT

FKcustomer_skBIGINT

amountNUMERIC

Inferred member: a fact arrives for unknown customer C999 before the dimension. Insert a placeholder dim_customer row (is_inferred = TRUE) so the fact keeps its FK; backfill the real attributes when the dimension catches up.

Why 'Skip the Fact' Is a No-Hire Answer

Skipping the fact means losing data. The event happened. Revenue was generated. Dropping it because the dimension is late means your totals will never reconcile.

Queuing the fact for later reinsertion adds pipeline complexity and statefulness. Every queued fact needs retry logic, expiration, and monitoring.

The inferred member approach: the fact loads immediately with a valid SK. The dimension is a placeholder. When the real data arrives, the placeholder is updated. No data loss. No queue.

The Interaction With Type 2: The Advanced Follow-Up

Your Type 2 interaction answer: 'The inferred member is the first version. When the real data arrives, I expire the inferred row and insert a new row with real attributes. Facts from the inferred period point to the inferred SK. Facts after backfill point to the real SK. The historical record is preserved.' The interviewer tests this because most candidates only know inferred members as a standalone pattern, not how they interact with SCD.

customer_sk	customer_id	name	city	is_inferred	effective_from	effective_to	is_current
5001	C999	Unknown	Unknown	true	2024-03-15	2024-03-17	false
5002	C999	Alice Zhang	Seattle	false	2024-03-18	9999-12-31	true

Fact rows from March 15 to 17 join to SK 5001 (inferred). Fact rows from March 18 onward join to SK 5002 (real). The historical record is preserved. No data is lost or retroactively changed.

What the Interviewer Writes

•No Hire

"Skip the fact if the dimension doesn't exist"
"Queue it and retry later"
"It shouldn't happen if the pipeline is well-designed"
No concept of inferred members

✓Strong Hire

"Insert an inferred member with is_inferred = TRUE"
"Backfill when the real dimension arrives"
"Monitor inferred member age as a data quality KPI"
Knows how inferred members interact with SCD Type 2

Correction and Reversal Patterns

Daily Life

Interviews

The interviewer asks: 'The source system sent incorrect data yesterday and just sent a correction. How does your model handle it?' They are testing whether you destructively update fact rows (no-hire signal) or use reversal patterns that preserve the audit trail (strong-hire signal). In financial data modeling interviews, this question is worth 20% of the scorecard.

Why 'Just UPDATE It' Is a No-Hire Answer

Updating the original row destroys the audit trail. You cannot prove what the data said at the time a report was generated.

Downstream aggregates that were computed from the original value are now inconsistent. Did the daily aggregate use the old value or the new one?

If the pipeline fails mid-correction, some rows are updated and some are not. The table is in an inconsistent state.

The alternative: add a correction row that reverses the original and applies the new value. The original row is untouched. The audit trail is preserved.

The Reversal Pattern: The Strong-Hire Answer

Your reversal answer: 'Two new rows: a reversal that negates the original (-$1,000) and a correction with the right value ($1,100). The original row is untouched. SUM produces $1,100: correct. The audit trail shows exactly what happened and when.' Walk through the three rows on the whiteboard. The interviewer is checking whether you preserve the audit trail.

row_type	order_id	date	amount	notes
ORIGINAL	O1	Mar 15	$1,000	Loaded by daily pipeline
REVERSAL	O1	Mar 18	-$1,000	Cancels the original
CORRECTION	O1	Mar 18	$1,100	Correct amount

Walk through the math: 'SUM(amount) for order O1 = $1,000 + (-$1,000) + $1,100 = $1,100. Correct. The original is still visible for audit. The reversal shows when the correction was made. The correction has the right value.' The math takes 5 seconds on the whiteboard and is the entire proof that reversals work.

The Alternative: Version Columns

Your version-column alternative: 'For high-volume corrections where full reversal rows are excessive, I use a version column. Each correction inserts a new row with an incremented version. Queries filter with ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY version DESC) WHERE rn = 1. The tradeoff: reversals let SUM work naturally, versions require a window function on every query.' Name the tradeoff. The interviewer is checking whether you know both approaches.

1	/* Version approach: keep all versions, query latest */
2	SELECT
3	*
4	FROM (
5	SELECT
6	*,
7	ROW_NUMBER() OVER (
8	PARTITION BY order_id
9	ORDER BY version DESC
10	) AS rn
11	FROM fact_orders
12	)
13	WHERE rn = 1

•Reversal Pattern

Three rows: original, reversal, correction
SUM naturally produces correct result
Preserves full audit trail
Best for: financial data, regulatory environments

•Version Column

Multiple rows per entity, latest version wins
Requires ROW_NUMBER filter for correct results
Preserves all versions for audit
Best for: high-correction-volume, non-financial data

Red Flag Phrases

"Just UPDATE the original row" - Destroys audit trail and breaks downstream consistency. Immediate credibility hit in a financial modeling interview.

"Delete and reinsert" - Non-atomic. If the delete succeeds but the insert fails, the data is gone.

"Corrections shouldn't happen in a well-designed system" - Source systems correct data. Banks issue adjustments. Retailers process returns. Corrections are a business reality, not a design flaw.

The senior framing: 'In an immutable fact table, corrections are additive, not destructive. A reversal row plus a correction row preserves the audit trail and lets SUM produce the right answer without special handling.' This single sentence demonstrates immutability discipline.

Designing for Reprocessability

Daily Life

Interviews

The closing question in any late-data interview: 'Can you reprocess this pipeline from scratch?' This tests whether your entire design is built for reprocessability or whether it has hidden assumptions that break on rerun. Candidates who say 'yes, because every component is idempotent and parameterized by date' pass. Candidates who hesitate fail.

The Three Things the Interviewer Wants to Hear

Pillar #1Pillar #2Pillar #3

Pillar #1

Idempotent Loads

Running the pipeline twice for the same partition produces the same result. Use DELETE-INSERT (delete the partition, reinsert all rows) or MERGE with dedup keys. Never use INSERT alone, which would create duplicates on rerun.

Pillar #2

Partition-Level Reprocessing

The pipeline can target a specific date or hour partition without reprocessing the entire table. This means partition-independent logic: the transformation for March 15 does not depend on March 14 having been processed first (unless explicitly designed as a running calculation).

Pillar #3

Audit Timestamps

Every row has loaded_at and (optionally) reprocessed_at timestamps. These are not the business event time. They record when the pipeline touched the row. This lets you answer: 'Was this row part of the original load or a reprocessing run?'

The Pattern You Should Be Able to Write

1	BEGIN ; DELETE
2
3	FROM fact_orders
4	WHERE order_date = '2024-03-15' ; INSERT INTO fact_orders
5
6
7	SELECT *,
8	CURRENT_TIMESTAMP AS loaded_at
9	FROM staging_orders
10	WHERE event_date = '2024-03-15' ; UPDATE aggregate_metadata SET needs_recompute = TRUE
11
12
13	WHERE affected_date = '2024-03-15' ; COMMIT ;

The DELETE-INSERT pattern inside a transaction is idempotent: run it once or five times, same result. The loaded_at timestamp records the reprocessing time. The aggregate metadata table triggers downstream recomputation.

The Operational Checklist That Closes the Answer

Store event_time and processing_time as separate columns. Partition by processing_time for pipeline simplicity, but always filter and aggregate by event_time for correctness.

Define a late-data SLA: events up to 7 days late are automatically handled. Events older than 7 days trigger an alert for manual review.

Monitor inferred member age: COUNT(*) WHERE is_inferred = TRUE AND inferred_at < CURRENT_DATE - 7. A growing count means the upstream dimension source is degrading.

Build reconciliation reports: compare SUM(revenue) from the base fact table to the source system total for each date. Discrepancies indicate missed late data.

Vocabulary That Signals Seniority

Junior Phrasing	Senior Phrasing
"Data should arrive on time"	"Late data is the normal case. The pipeline is designed to handle it without manual intervention."
"I'd just reload the table"	"I'd reprocess the affected partition using idempotent DELETE-INSERT, then cascade to downstream aggregates."
"We'd fix the dimension and rerun"	"We'd create an inferred member immediately so the fact loads without blocking, then backfill the dimension when it arrives."
"Update the wrong row"	"Corrections use reversal rows: a negative row to cancel the original and a positive row with the correct value. The audit trail is preserved."
"Late data is a bug"	"Late data is an operational reality. We monitor inferred member age and late-fact volume as data quality KPIs."

The two KPIs that prove you monitor late data:

▸Percentage of events arriving more than 1 hour after event_time (late-fact volume)
▸Count of inferred dimension members older than 24 hours (upstream degradation signal)

❯❯❯PUTTING IT ALL TOGETHER

> You are in an Uber data engineering interview. The interviewer asks: 'A ride event arrives 4 hours after the ride ended because the driver's phone was offline. How does your pipeline handle it?'

You say: 'This is a late-arriving fact. The event_time is 4 hours ago but the processing_time is now. I would insert it into the historical partition by event_time, not into the current partition.'

You address aggregates: 'The hourly revenue aggregate for that partition is now stale. The pipeline marks it for recomputation, and the next aggregate run recalculates it.'

The interviewer asks about a new driver not yet in dim_driver. You say: 'Insert an inferred member with is_inferred = TRUE. The ride fact loads with a valid SK. When the driver profile syncs, we backfill the inferred row with real attributes.'

You bridge to monitoring: 'I would track two KPIs: percentage of events arriving more than 1 hour late, and count of inferred dimension members older than 24 hours. Both signal upstream degradation before it affects dashboards.'

KEY TAKEAWAYS

Late facts: insert into historical partitions and cascade recomputation to downstream aggregates

Late dimensions: inferred members (placeholder rows with is_inferred flag) backfilled when real data arrives

Corrections: reversal + correction rows preserve the audit trail; never UPDATE fact rows destructively

Reprocessability: idempotent DELETE-INSERT per partition, event_time vs processing_time columns, audit timestamps

Monitor: inferred member age, late-fact volume, and source reconciliation as data quality KPIs

Data always arrives late; your model either handles it gracefully or corrupts silently

Category: Data Modeling
Difficulty: advanced
Duration: 25 minutes
Challenges: 0 hands-on challenges

Topics covered: "What If the Data Arrives After the Pipeline Ran?", Late-Arriving Facts: Inserting Into Closed Partitions, Late-Arriving Dimensions: The Inferred Member, Correction and Reversal Patterns, Designing for Reprocessability

Lesson Sections

"What If the Data Arrives After the Pipeline Ran?" (concepts: dmLateArriving)
What They're Really Testing The interviewer is testing whether you design pipelines for the ideal case or the real case. The ideal case: data arrives in order, dimensions exist before facts, and no corrections are needed. The real case: mobile events arrive days late, upstream dimension changes are delayed, and corrections arrive after reports have been generated. The Two Categories The 60-Second Framework Why Companies Care Cite these in your answer: 'At Uber, 15% of ride events arrive more tha
Late-Arriving Facts: Inserting Into Closed Partitions (concepts: dmLateArriving)
After you identify the late-arriving fact scenario, the interviewer will ask: 'Show me how your pipeline handles it.' This is where candidates who have only read about late data stall, and candidates who have operated it in production accelerate. The answer involves partition management, aggregate recomputation, and a cascade strategy. Three Strategies: Know All Three, Recommend One The Cascade Problem Interviewers Probe Narrate the cascade: 'I insert the late fact into the March 15 partition. B
Late-Arriving Dimensions: The Inferred Member (concepts: dmLateArriving)
The interviewer says: 'A fact row arrives but the dimension it references does not exist yet. What do you do?' This is the inferred member question, and it separates candidates who have built dimension loading pipelines from candidates who have only designed schemas. The wrong answers are 'skip the fact' and 'queue it for later.' The Pattern: Inferred Members Your inferred member answer: 'I insert a placeholder row into dim_customer with customer_id = C999, name = Unknown, city = Unknown, is_inf
Correction and Reversal Patterns (concepts: dmImmutableLogs)
The interviewer asks: 'The source system sent incorrect data yesterday and just sent a correction. How does your model handle it?' They are testing whether you destructively update fact rows (no-hire signal) or use reversal patterns that preserve the audit trail (strong-hire signal). In financial data modeling interviews, this question is worth 20% of the scorecard. Why 'Just UPDATE It' Is a No-Hire Answer The Reversal Pattern: The Strong-Hire Answer Your reversal answer: 'Two new rows: a revers
Designing for Reprocessability (concepts: dmLateArriving)
The closing question in any late-data interview: 'Can you reprocess this pipeline from scratch?' This tests whether your entire design is built for reprocessability or whether it has hidden assumptions that break on rerun. Candidates who say 'yes, because every component is idempotent and parameterized by date' pass. Candidates who hesitate fail. The Three Things the Interviewer Wants to Hear The Pattern You Should Be Able to Write Idempotent DELETE-INSERT per Partition The DELETE-INSERT pattern