Answer ETL, medallion, quality, dedup, and monitoring questions
ETL vs ELT. What They Want to Hear: 'ELT for batch. Load raw data into the warehouse first, then transform it there using dbt. This preserves the raw data for re-transformation if business logic changes, and modern warehouses have the compute to handle it.' That is the safe, confident answer. ETL is the older pattern, where data is transformed before it is loaded. The difference is the order of the last two steps: extract, transform, load (ETL) versus extract, load, transform (ELT).
Medallion architecture. What They Want to Hear: 'Three layers. Bronze is the raw safety net: an exact copy of source data, append-only. Silver is where data engineering happens: deduplication, type casting, standardization. Gold is where business logic lives: aggregations, KPIs, metrics. The key insight is that Bronze lets you re-derive everything downstream. If Silver gets corrupted, you rebuild it from Bronze.'
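One way to see the layering end to end, sketched with SQLite standing in for the warehouse (the table and column names are assumptions, not part of the answer above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Bronze: exact, append-only copy of the source payload (kept as raw text).
conn.execute("CREATE TABLE bronze_events (raw_amount TEXT, raw_country TEXT)")
conn.executemany("INSERT INTO bronze_events VALUES (?, ?)",
                 [("10.50", "us"), ("4.25", "US"), ("7.00", "de")])

# Silver: type casting and standardization, fully derivable from Bronze.
conn.execute("""
    CREATE TABLE silver_events AS
    SELECT CAST(raw_amount AS REAL) AS amount,
           UPPER(raw_country)       AS country
    FROM bronze_events
""")

# Gold: business-facing aggregate built from Silver.
gold = conn.execute("""
    SELECT country, SUM(amount) AS revenue
    FROM silver_events
    GROUP BY country
    ORDER BY country
""").fetchall()
print(gold)  # [('DE', 7.0), ('US', 14.75)]
```

Because Silver and Gold are plain `SELECT`s over the layer below, dropping and recreating them from Bronze is always possible, which is the rebuild property the answer emphasizes.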
Data quality. What They Want to Hear: 'Three checks at every quality gate: completeness (are all expected rows present?), accuracy (are values within expected ranges?), freshness (did the data arrive on time?). These run between medallion layers. If any check fails, the data is quarantined, not published. Consumers see the last known good data while we investigate.'
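The three checks can be sketched as simple gate functions; the tolerance, range bounds, and lag limit used here are illustrative assumptions:

```python
from datetime import datetime, timedelta

def completeness_check(actual_rows: int, expected_rows: int,
                       tolerance: float = 0.05) -> bool:
    """Are all expected rows present (within a tolerance)?"""
    return actual_rows >= expected_rows * (1 - tolerance)

def accuracy_check(values, lo: float, hi: float) -> bool:
    """Are all values within the expected range?"""
    return all(lo <= v <= hi for v in values)

def freshness_check(loaded_at: datetime, now: datetime,
                    max_lag: timedelta) -> bool:
    """Did the data arrive on time?"""
    return now - loaded_at <= max_lag

# A batch passes the gate only if every check passes; otherwise it is
# quarantined and the previous known-good data stays published.
now = datetime(2024, 1, 2, 6, 0)
ok = (completeness_check(actual_rows=9800, expected_rows=10_000)
      and accuracy_check([10.0, 12.5, 99.0], lo=0.0, hi=1000.0)
      and freshness_check(datetime(2024, 1, 2, 5, 0), now,
                          max_lag=timedelta(hours=2)))
print("publish" if ok else "quarantine")  # publish
```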
Deduplication. What They Want to Hear: 'Duplicates come from retries, replays, overlapping sources, and late data. My standard pattern is ROW_NUMBER() OVER (PARTITION BY primary_key ORDER BY updated_at DESC), take row 1. This keeps the freshest copy of each record.' Then write the SQL:
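A runnable sketch of that keep-latest pattern, using SQLite to stand in for the warehouse (the `orders_raw` table and its columns are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_raw (order_id INT, amount REAL, updated_at TEXT);
    INSERT INTO orders_raw VALUES
        (1, 10.0, '2024-01-01'),
        (1, 12.0, '2024-01-02'),  -- a retry produced a newer duplicate
        (2, 99.0, '2024-01-01');
""")

# Rank each record's copies newest-first within its primary key,
# then keep only rank 1: the freshest copy survives.
DEDUP_SQL = """
SELECT order_id, amount, updated_at
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY order_id
               ORDER BY updated_at DESC
           ) AS rn
    FROM orders_raw
)
WHERE rn = 1
ORDER BY order_id
"""
rows = conn.execute(DEDUP_SQL).fetchall()
print(rows)  # [(1, 12.0, '2024-01-02'), (2, 99.0, '2024-01-01')]
```

Order 1 appears twice in the raw table; only its `2024-01-02` copy survives the gate into Silver.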
Monitoring. What They Want to Hear: 'I monitor three things: freshness (did the pipeline run on time?), volume (did it produce the expected number of rows?), and content (are null rates and value distributions within normal ranges?). Thresholds are set from historical variance: alert when values exceed 2 standard deviations from the 7-day rolling average.'