Where the Driver Lives

The cluster manager is the layer that owns the machines and grants executors. You will run on one of three, and they are largely interchangeable from your code's point of view. What actually changes your debugging is the deploy mode: where the driver process physically runs. Client mode vs cluster mode, and why it bites you If your job runs fine in a notebook but mysteriously hangs or runs out of memory as a scheduled job, suspect the deploy mode. In client mode your driver is your laptop, with laptop memory and a flaky connection. A collect() that worked at your desk can OOM the driver when the same code runs in cluster mode against full production data.

About This Interactive Section

This section is part of the How a Spark Job Runs: Stages and Plans lesson on DataDriven, a free data engineering interview prep platform. Each section includes explanations, worked examples, and hands-on code challenges that execute in real time. SQL queries run against a live PostgreSQL database. Python runs in a sandboxed Docker container. Data modeling problems validate against interactive schema canvases. All content is framed around what data engineering interviewers actually test at companies like Meta, Google, Amazon, Netflix, Stripe, and Databricks.

How DataDriven Lessons Work

DataDriven combines four interview rounds (SQL, Python, Data Modeling, Pipeline Architecture) with adaptive difficulty and spaced repetition. Easy problems get harder as you improve. Weak concepts resurface until you master them. Your readiness score tracks progress across every topic interviewers test. Every lesson section ends with problems you solve by writing and running real code, not by picking multiple-choice answers.