Answer the Spark architecture question that appears in every technical screen
Topics covered: Spark Execution Model, Distributed Primitives, Shuffle Operations, Memory Management, Small File Problem
What They Want to Hear 'Spark splits work across a cluster. The driver is the coordinator: it plans the work, divides it into tasks, and sends those tasks to executors. Executors are the workers: each one processes a partition of the data in parallel. The key insight is that Spark is lazy. It builds a plan (the DAG) but does not execute anything until you call an action like .count() or .write().' That is the answer. Driver plans, executors execute, nothing happens until an action triggers it.
What They Want to Hear 'A transformation defines a new dataset from an existing one without executing anything. An action triggers execution and returns a result. Narrow transformations like filter and map process each partition independently. Wide transformations like groupBy and join require data to move between executors, which creates a shuffle.' That is the answer. Narrow = no data movement. Wide = shuffle. This distinction is the foundation of Spark performance.
What They Want to Hear 'A shuffle redistributes data across executors. It happens when Spark needs to group or join data by a key, and the matching rows are spread across different partitions. Shuffles are expensive because every executor must write its data to disk and send it over the network, and every receiving executor must read and merge it. The number one way to avoid unnecessary shuffles is broadcast joins: if one side of the join is small enough to fit in memory, broadcast it to all executors.' That is the answer.
What They Want to Hear 'Each executor gets a fixed amount of memory, split between storage (caching data) and execution (shuffles, joins, sorts). When execution memory runs out, Spark spills data to disk, which is much slower. When the disk fills up too, the job fails with an out-of-memory error. The fix depends on the cause: too few partitions means each one is too large, so repartition to create smaller chunks. Too much data cached means storage is crowding out execution, so unpersist unused caches.'
What They Want to Hear 'Too many small files kill read performance. Each file requires a separate metadata lookup, a separate file open, and a separate read request. Thousands of 1KB files are far slower to read than one 128MB file with the same data. The target file size is 128MB to 256MB. To fix small files, I use coalesce() to reduce the number of output partitions before writing, or run a compaction job that rewrites small files into larger ones.' That is the answer. Target size, the problem, and the fix.