Task Failure and Retry
In a single database, a failed query just fails. In a cluster of hundreds of machines, failure is routine: a node gets preempted, a network blip drops a connection, an executor runs out of memory. Spark is built to absorb that. A failed task is retried automatically: each task gets up to spark.task.maxFailures attempts (4 by default, which means up to three retries) before the stage, and then the job, is declared failed.
The footgun: retries plus side effects
Automatic retry is safe only when a recomputed partition produces the same result. The moment a task has a side effect that is not idempotent, for example writing to an external system row by row, a retry can double-write: the rows written before the failure are written again by the next attempt. This is the deep reason Spark pushes you toward idempotent, atomic writes: the engine assumes any task may run more than once. A task that is not
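To make the double-write concrete, here is a minimal pure-Python simulation of the retry behavior described above. It is a sketch, not Spark code: `run_with_retries`, `make_task`, `TaskFailed`, and the two in-memory sinks are hypothetical names invented for illustration, and the only thing borrowed from Spark is the default attempt limit of 4. The simulated task writes a partition row by row and crashes partway through its first attempt.

```python
from typing import Callable, Dict, List, Tuple

MAX_FAILURES = 4  # assumption: mirrors the default of spark.task.maxFailures


class TaskFailed(Exception):
    """Raised by the simulated task to mimic a lost executor."""


def run_with_retries(task: Callable[[int], None],
                     max_failures: int = MAX_FAILURES) -> int:
    """Run task(attempt) until it succeeds or has failed max_failures times.

    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_failures + 1):
        try:
            task(attempt)
            return attempt
        except TaskFailed:
            continue  # the scheduler simply launches another attempt
    raise RuntimeError(f"task failed {max_failures} times; job aborted")


# One partition of three keyed rows (made-up data for the sketch).
PARTITION: List[Tuple[int, str]] = [(1, "a"), (2, "b"), (3, "c")]


def make_task(write: Callable[[int, str], None]) -> Callable[[int], None]:
    """Build a task that writes row by row and dies after two rows on attempt 1."""
    def task(attempt: int) -> None:
        for i, (key, value) in enumerate(PARTITION):
            if attempt == 1 and i == 2:
                raise TaskFailed("executor lost mid-partition")
            write(key, value)
    return task


# Non-idempotent sink: every write appends a new row.
append_sink: List[Tuple[int, str]] = []
attempts_append = run_with_retries(
    make_task(lambda k, v: append_sink.append((k, v)))
)

# Idempotent sink: a keyed upsert, so rewriting a row changes nothing.
upsert_sink: Dict[int, str] = {}
attempts_upsert = run_with_retries(make_task(upsert_sink.__setitem__))

print(len(append_sink))  # 5: rows 1 and 2 were written twice
print(len(upsert_sink))  # 3: the retry overwrote rows 1 and 2 in place
```

Both tasks succeed on their second attempt, but only the keyed sink ends up with exactly one copy of each row. Keying writes on a stable identifier is one way to get idempotence. Writing to a temporary location and committing atomically once the task succeeds is another.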
About This Interactive Section
This section is part of the How a Spark Job Runs: Scheduler Internals lesson on DataDriven, a free data engineering interview prep platform.