What About Data Skew?
Concepts: paShuffleOptimization, paDistributedPrimitives, paSmallFiles
The skew question goes beyond salting. The interviewer wants to see custom partitioners, two-phase aggregation, pre-processing pipelines that eliminate skew before it reaches the join, and the judgment to know which approach fits which situation.

Adaptive Skew Handling (AQE)

Adaptive Query Execution (AQE) includes a skew join optimization that detects skewed partitions from shuffle statistics after the shuffle write and splits them automatically. It compares each partition's size to the median partition size: if a partition is larger than both skewedPartitionFactor × median and skewedPartitionThresholdInBytes (both settings live under spark.sql.adaptive.skewJoin), Spark splits it into smaller sub-partitions and replicates the matching partition from the other side of the join, so each sub-partition can be joined independently. The effect is similar to automatic salting, with no code changes required.

Two-Phase Aggregation

For skewed aggregations (not joins), two-phase aggregation is the standard fix. Phase 1: add a random salt, aggregate loc