DataDriven

The Identity Problem

A hard Pipeline Design interview practice problem on DataDriven. Write and execute real pipeline design code with instant grading.

Domain: Pipeline Design
Difficulty: hard
Seniority: L7

Problem

For 15 years, our client has run an Informatica ETL job that populates a customer dimension with SCD Type 2 history. It breaks every time the source schema changes and takes 8 hours to run on a 10M-row table. We need to rewrite it in PySpark on Databricks while keeping the legacy system live during the migration. The hardest part: the same customer appears under different IDs across 40 source systems, and we need to unify those IDs without losing the historical SCD trail.
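One possible way to think about the identity part is sketched below. It is not a reference solution: it assumes an upstream matching step (not described in the problem) has already produced pairs of source keys judged to be the same customer, and it uses plain Python instead of PySpark so it runs without a cluster. The names `ScdRow`, `UnionFind`, and `attach_unified_id` are hypothetical. The idea is to group source keys with union-find and stamp a unified ID onto each history row while leaving `valid_from` and `valid_to` untouched, so the SCD trail is preserved.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ScdRow:
    """One SCD Type 2 version of a customer as seen by one source system."""
    source_system: str
    source_id: str
    email: str
    valid_from: date
    valid_to: Optional[date]          # None marks the current version
    unified_id: Optional[str] = None  # filled in by attach_unified_id


class UnionFind:
    """Groups (source_system, source_id) keys that refer to the same customer."""

    def __init__(self):
        self.parent = {}

    def find(self, key):
        self.parent.setdefault(key, key)
        root = key
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited key straight at the root.
        while self.parent[key] != root:
            self.parent[key], key = root, self.parent[key]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Keep the smaller key as root so the unified ID is deterministic.
            low, high = sorted((ra, rb))
            self.parent[high] = low


def attach_unified_id(rows, matches):
    """Return copies of rows with unified_id set; history intervals are unchanged."""
    uf = UnionFind()
    for a, b in matches:
        uf.union(a, b)
    resolved = []
    for row in rows:
        system, sid = uf.find((row.source_system, row.source_id))
        resolved.append(replace(row, unified_id=f"{system}:{sid}"))
    return resolved


# Illustrative data only: three source keys, two of which are the same person.
history = [
    ScdRow("crm", "17", "ann@old.example", date(2015, 1, 1), date(2020, 6, 1)),
    ScdRow("crm", "17", "ann@new.example", date(2020, 6, 1), None),
    ScdRow("billing", "A9", "ann@new.example", date(2018, 3, 1), None),
    ScdRow("web", "88", "bob@example.com", date(2019, 5, 1), None),
]
matches = [(("crm", "17"), ("billing", "A9"))]  # assumed output of a matching step

resolved = attach_unified_id(history, matches)
```

In this sketch the unified ID is just the smallest source key in each group, which keeps reruns stable. A production design would more likely persist a surrogate key in a crosswalk table so that later merges do not rewrite IDs already exposed downstream.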

Summary

Old systems. New demands. The same customer appears under three different names.
