The Identity Problem
A hard Pipeline Design interview practice problem on DataDriven. Write and execute real pipeline design code with instant grading.
- Domain: Pipeline Design
- Difficulty: hard
- Seniority: L7
Problem
For 15 years, our client has run an Informatica ETL that populates a customer dimension with SCD Type 2 history. It breaks every time the source schema changes and takes 8 hours to run on a 10M-row table. We need to rewrite it in PySpark on Databricks while keeping the legacy system live during the migration. The hardest part: the same customer appears under different IDs across 40 source systems, and we need to unify those identities without losing the historical SCD trail.
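One way to think about the identity unification described above is as a crosswalk: every (source system, source ID) pair maps to a single master key, and that key is attached to the existing SCD Type 2 rows without rewriting their validity windows. The sketch below illustrates the idea in plain Python rather than PySpark so it stays self-contained. All names here (`DimRow`, `UnionFind`, `build_crosswalk`, `attach_master_key`) are hypothetical, the sample rows are made up for illustration, and the match pairs are assumed to come from some upstream matching step that the problem statement does not specify.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DimRow:
    """One SCD Type 2 version of a customer as seen by one source system."""
    source_system: str
    source_id: str
    name: str
    valid_from: date
    valid_to: Optional[date]  # None marks the current version


class UnionFind:
    """Groups (source_system, source_id) keys that refer to the same customer."""

    def __init__(self):
        self.parent = {}

    def find(self, key):
        self.parent.setdefault(key, key)
        while self.parent[key] != key:
            # Path halving: point this node at its grandparent as we walk up.
            self.parent[key] = self.parent[self.parent[key]]
            key = self.parent[key]
        return key

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            # Keep the smaller key as root so the result is deterministic.
            low, high = sorted((root_a, root_b))
            self.parent[high] = low


def build_crosswalk(rows, matches):
    """Map every source key to a master key shared by all of its matches."""
    uf = UnionFind()
    for row in rows:
        uf.find((row.source_system, row.source_id))  # register unmatched keys too
    for a, b in matches:
        uf.union(a, b)
    return {key: "|".join(uf.find(key)) for key in list(uf.parent)}


def attach_master_key(rows, crosswalk):
    """Pair each history row with its master key; the rows are not modified."""
    return [(crosswalk[(r.source_system, r.source_id)], r) for r in rows]


# Illustrative data only: three source systems, two of which share a customer.
rows = [
    DimRow("crm", "C-17", "Ann Lee", date(2012, 1, 1), date(2019, 6, 30)),
    DimRow("crm", "C-17", "Ann Lee-Park", date(2019, 7, 1), None),
    DimRow("billing", "9001", "A. Lee", date(2015, 3, 1), None),
    DimRow("web", "u42", "Bo Chen", date(2020, 1, 1), None),
]
# Assumed output of an upstream matching step (not specified by the problem).
matches = [(("crm", "C-17"), ("billing", "9001"))]

crosswalk = build_crosswalk(rows, matches)
keyed_history = attach_master_key(rows, crosswalk)
```

Keeping the crosswalk separate from the dimension means a later correction to a match only changes the mapping, and every original version row with its `valid_from` and `valid_to` stays intact. At the scale in the problem this grouping would need a distributed equivalent, which this sketch does not attempt.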
Summary
Old systems. New demands. The same customer appears under three different names.