The Cluster: Who Plans, Who Works
When you run a query against Postgres, one process does the work. Spark splits that single role into three. The driver is the process running your program: it holds your code, builds the execution plan, and decides what work needs doing. It is the only part that sees the whole job. The executors are separate processes, typically on other machines, that do the actual data crunching. The cluster manager is the layer that owns the pool of machines and hands executors to your job when it asks.
The mental shift from a single database
The hard part of this for a SQL person is that the driver does not touch your data. It never sees the rows. It builds a plan, ships that plan to the executors, and waits. If you accidentally pull all your data back to the driver (we will see how), you have collapsed a distr
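To make the split concrete, here is a toy model of the driver and executor roles, written with Python's standard library rather than Spark itself. Everything in it is an assumption made for illustration: the names `PARTITIONS`, `executor_task`, and `driver` are hypothetical, the thread pool only loosely stands in for the executors a cluster manager would hand out, and real executors are separate processes, not threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands in for one partition of rows
# held by an executor. The driver function below never loops over rows.
PARTITIONS = [
    [3, 8, 1],
    [7, 2],
    [5, 9, 4, 6],
]


def executor_task(partition):
    """Executor side: touch the rows and return only a small summary."""
    return sum(1 for row in partition if row > 4)


def driver(partitions, num_executors=2):
    """Driver side: plan the work, hand it out, combine the results."""
    # The "plan" here is simply one task per partition.
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_counts = list(pool.map(executor_task, partitions))
    # Only the per-partition counts come back, never the rows themselves.
    return partial_counts, sum(partial_counts)


partial_counts, total = driver(PARTITIONS)
print(partial_counts, total)
```

The point of the sketch is what crosses the boundary: the driver receives three small integers, while the rows stay with the workers that processed them.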
About This Interactive Section
This section is part of the How a Spark Job Runs lesson on DataDriven, a free data engineering interview prep platform.