Re-Execution: The Chain Runs Again Every Time
Here is a consequence of laziness that surprises almost everyone, and it is the bridge to the next big topic. A DataFrame remembers how to produce itself, not the data it produces. So when you call two actions on the same chain, Spark runs the entire chain twice, once for each action. It does not quietly remember the result from the first action and reuse it. It recomputes from the original source every single time you ask.
Say you build an expensive chain, a big join followed by an aggregation, and then you call count to see how many rows it has, and later call show to look at a few of them. Those are two actions, so Spark reads the source data and runs the whole expensive join and aggregation twice. If that chain took ten minutes, you just spent twenty. Nothing warned you, because each a
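The re-execution described above can be sketched with a toy model in plain Python. This is not the Spark API: LazyFrame, read_source, and stats are hypothetical names invented for illustration, and the "chain" is a single filter rather than a join and aggregation. The only point is that the object stores a recipe, so each action replays it from the source.

```python
# Toy model of a lazy chain. NOT Spark: every name here is hypothetical.
stats = {"source_reads": 0}


def read_source():
    """Stand-in for reading the original data; counts how often it runs."""
    stats["source_reads"] += 1
    return list(range(10))


class LazyFrame:
    """Holds a recipe (a source plus recorded steps), never the rows."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Transformation: extends the recipe and runs nothing.
        step = lambda rows: [r for r in rows if predicate(r)]
        return LazyFrame(self._source, self._steps + (step,))

    def _run(self):
        # Replays the whole recipe from the source; nothing is remembered.
        rows = self._source()
        for step in self._steps:
            rows = step(rows)
        return rows

    def count(self):
        # Action: triggers a full run.
        return len(self._run())

    def show(self, n=3):
        # Action: triggers another full run.
        for row in self._run()[:n]:
            print(row)


chain = LazyFrame(read_source).filter(lambda x: x % 2 == 0)
reads_after_build = stats["source_reads"]  # 0: building the chain ran nothing

n = chain.count()  # first action, first full run
chain.show(3)      # second action, second full run

print("rows:", n)                              # rows: 5
print("source reads:", stats["source_reads"])  # source reads: 2
```

In this sketch the source is read once per action, which mirrors the ten-minutes-becomes-twenty arithmetic in the text: two actions, two complete runs of the same recipe.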
About This Interactive Section
This section is part of the Lazy Until You Ask lesson on DataDriven, a free data engineering interview prep platform. Each section includes explanations, worked examples, and hands-on code challenges that execute in real time. SQL queries run against a live PostgreSQL database. Python runs in a sandboxed Docker container. Data modeling problems validate against interactive schema canvases. All content is framed around what data engineering interviewers actually test at companies like Meta, Google, Amazon, Netflix, Stripe, and Databricks.