Lesson | DataDriven

Testing Transforms (concepts: paDataQuality)

What They Want to Hear 'Three layers: unit tests (seconds, test individual transforms on mock data), integration tests (minutes, test end-to-end with a real database), and data diff tests (hours, compare production output before and after a change). Each layer catches different bugs. Unit tests catch logic errors. Integration tests catch environment issues. Data diffs catch subtle regressions that unit tests cannot anticipate.'
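
As a rough illustration of the first and third layers, the sketch below unit-tests a transform on mock data and then diffs two output snapshots by key. The transform `normalize_orders`, its column names, and the helper `diff_outputs` are hypothetical and exist only for this example; the integration layer is left out because it needs a real database.

```python
def normalize_orders(rows):
    """Hypothetical transform: drop rows with no order_id, clean emails, cents -> dollars."""
    out = []
    for row in rows:
        if row.get("order_id") is None:
            continue  # rows without a key cannot be joined downstream
        out.append({
            "order_id": row["order_id"],
            "email": row["email"].strip().lower(),
            "amount_usd": row["amount_cents"] / 100,
        })
    return out


def test_normalize_orders_on_mock_data():
    """Unit layer: runs in milliseconds against hand-written mock rows."""
    mock = [
        {"order_id": 1, "email": " A@Example.com ", "amount_cents": 1250},
        {"order_id": None, "email": "b@example.com", "amount_cents": 500},
    ]
    assert normalize_orders(mock) == [
        {"order_id": 1, "email": "a@example.com", "amount_usd": 12.5},
    ]


def diff_outputs(before, after, key):
    """Data diff layer: compare two output snapshots row by row on a key column."""
    before_by_key = {r[key]: r for r in before}
    after_by_key = {r[key]: r for r in after}
    shared = before_by_key.keys() & after_by_key.keys()
    return {
        "added": sorted(after_by_key.keys() - before_by_key.keys()),
        "removed": sorted(before_by_key.keys() - after_by_key.keys()),
        "changed": sorted(k for k in shared if before_by_key[k] != after_by_key[k]),
    }
```

The unit test only catches what its author thought to assert, while the diff surfaces any row that changed between two runs, which is why the two layers complement each other.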

Quality as a System (concepts: paDataQuality)

What They Want to Hear 'Six steps: detect, triage, contain, root cause, remediate, post-mortem. Detect: automated alerts fire. Triage: is this impacting consumers? What is the blast radius? Contain: quarantine bad data, serve last known good. Root cause: trace back from the symptom. Remediate: fix and reprocess. Post-mortem: what failed, why, and what changes will prevent recurrence.'
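
The contain step is the easiest one to show in code. The sketch below is one possible shape for it, assuming a hypothetical `Publisher` class and a caller-supplied row check: a batch that fails validation is quarantined, and consumers keep reading the last batch that passed.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Publisher:
    """Hypothetical publish gate: only validated batches become the serving version."""
    is_valid: Callable[[dict], bool]
    serving: Optional[str] = None             # id of the last known good batch
    quarantined: list = field(default_factory=list)

    def publish(self, batch_id, rows):
        # An empty batch is treated as suspicious rather than silently served.
        if rows and all(self.is_valid(r) for r in rows):
            self.serving = batch_id
            return True
        # Contain: set the bad batch aside and leave the serving pointer untouched.
        self.quarantined.append(batch_id)
        return False


publisher = Publisher(is_valid=lambda r: r["amount"] >= 0)
publisher.publish("2024-01-01", [{"amount": 5}])
publisher.publish("2024-01-02", [{"amount": -1}])  # fails the check
```

After the second call, `publisher.serving` is still `"2024-01-01"` and the failed batch id sits in `publisher.quarantined`, ready for root-cause work.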

Transform Cost (concepts: paEltVsEtl)

What They Want to Hear '60-70% of platform compute typically goes to transformation. The three biggest levers: switch full-refresh models to incremental (5-10x cheaper), use views instead of materialized tables where latency allows (no build or storage cost, since the work moves to query time), and sort/cluster data on write to reduce downstream scan costs. Incremental models are the single biggest cost optimization most teams can make.'
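
To make the incremental lever concrete, here is a minimal in-memory sketch that compares a full refresh with a watermark-based incremental run. The function names, the `updated_at` column, and the dict-as-table model are assumptions for illustration; in a real warehouse the watermark filter would be served by partition pruning rather than a Python list scan.

```python
def full_refresh(source, transform):
    """Rebuild the whole target: every source row is transformed on every run."""
    target = {r["id"]: transform(r) for r in source}
    return target, len(source)


def incremental_refresh(source, target, transform, watermark):
    """Transform only rows newer than the watermark and upsert them by id."""
    delta = [r for r in source if r["updated_at"] > watermark]
    for r in delta:
        target[r["id"]] = transform(r)  # upsert: insert new ids, overwrite existing ones
    new_watermark = max((r["updated_at"] for r in delta), default=watermark)
    return len(delta), new_watermark


def double(r):
    """Stand-in for an expensive transformation."""
    return {"id": r["id"], "value": r["value"] * 2}


source = [{"id": i, "updated_at": i, "value": i} for i in range(1, 11)]

# Target already holds rows 1..8 from earlier runs; only rows 9 and 10 are new.
target = {r["id"]: double(r) for r in source if r["updated_at"] <= 8}
processed, watermark = incremental_refresh(source, target, double, watermark=8)
```

Here the incremental run transforms 2 rows where a full refresh would transform all 10, and both end with the same target. The saving scales with how small the daily delta is relative to the full table.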

Dedup at Scale (concepts: paDeduplication)

What They Want to Hear 'At 1 billion rows, ROW_NUMBER works fine with proper partitioning. At 10 billion, MERGE/UPSERT is more efficient: stage the delta, merge against the target. At 100 billion+ or for fuzzy matching, MinHash LSH (locality-sensitive hashing) reduces the comparison space from O(n^2) to near-linear by grouping similar records into buckets.'
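
The bucketing idea behind MinHash LSH can be sketched in plain Python. This is a toy version under stated assumptions, not a production implementation: character 3-gram shingles, 32 hash functions simulated by seeding `hashlib.blake2b`, and 16 bands of 2 rows each are all illustrative choices, and the function names are made up for this example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Break a record into overlapping character k-grams, case-insensitively."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash(shingle_set, num_perm=32):
    """Signature: for each seeded hash function, keep the minimum hash over the set."""
    signature = []
    for seed in range(num_perm):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big"
            )
            for s in shingle_set
        ))
    return signature


def candidate_pairs(records, num_perm=32, bands=16):
    """Only records that share at least one band bucket are ever compared."""
    rows = num_perm // bands
    buckets = defaultdict(set)
    for rec_id, text in records.items():
        sig = minhash(shingles(text), num_perm)
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(rec_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = {
    1: "Acme Corp, 12 Main St",
    2: "ACME CORP, 12 MAIN ST",      # same record, different casing
    3: "Zenith Labs, 99 Ocean Ave",  # unrelated record
}
pairs = candidate_pairs(records)
```

Records 1 and 2 produce identical signatures and land in the same buckets, so they come out as a candidate pair, while record 3 shares no shingles with them and is never compared. Candidates still need an exact similarity check afterwards, because banding trades a few false positives and misses for skipping the all-pairs scan.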

Observability Platforms (concepts: paDataQuality)

What They Want to Hear 'Buy for small teams (Monte Carlo, Bigeye: observability as a service). Build for large platform teams (Great Expectations: open-source, customizable). The decision is cost of tool vs cost of engineering time. A $50K/year observability tool replaces 2-3 months of engineering effort to build something equivalent.'

The Transformation Layer
