When the Optimizer Guesses Wrong
For all its sophistication, the optimizer can still produce a bad plan. Knowing when and why matters, because the fix usually means giving it better information or an explicit hint rather than fighting it. Almost every optimizer mistake traces back to the same root: a decision made from an estimate that turned out to be wrong. The most common cause is stale or missing statistics, the issue the intermediate tier introduced. The optimizer chooses a join strategy from estimated sizes, so if a table has grown tenfold since it was last analyzed, Catalyst plans for the old size and may try to broadcast a side that is no longer small, overwhelming memory where a sort-merge join would have been the safer choice. Missing statistics can push the other way, leaving Catalyst with a sort-merge join when one side would have fit comfortably in a broadcast. Bad cardinality estimates, underestimating how many rows a filter or join will
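The stale-statistics failure described above can be illustrated with a toy model. This is a minimal sketch in plain Python, not Catalyst's actual code: `TableStats`, `choose_join`, and `plan_is_safe` are hypothetical names invented for illustration, and the only Spark-specific fact assumed is the documented 10 MB default of `spark.sql.autoBroadcastJoinThreshold`.

```python
# Toy model of a size-based join choice; not Spark's actual implementation.
from dataclasses import dataclass

# Spark's documented default for spark.sql.autoBroadcastJoinThreshold is 10 MB.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


@dataclass
class TableStats:
    """Hypothetical stand-in for the catalog statistics of one table."""
    estimated_bytes: int  # size recorded when the table was last analyzed
    actual_bytes: int     # size of the table today


def choose_join(smaller_side: TableStats) -> str:
    """Pick a strategy from the estimate alone; the planner never sees actual_bytes."""
    if smaller_side.estimated_bytes <= BROADCAST_THRESHOLD_BYTES:
        return "broadcast"
    return "sort-merge"


def plan_is_safe(smaller_side: TableStats) -> bool:
    """A broadcast is only safe if the real data also fits under the threshold."""
    strategy = choose_join(smaller_side)
    return strategy != "broadcast" or smaller_side.actual_bytes <= BROADCAST_THRESHOLD_BYTES


# Analyzed at 8 MB, now ten times larger: the planner still picks a broadcast.
stale = TableStats(estimated_bytes=8 * 1024 * 1024, actual_bytes=80 * 1024 * 1024)
# After statistics are refreshed, the estimate matches reality.
fresh = TableStats(estimated_bytes=80 * 1024 * 1024, actual_bytes=80 * 1024 * 1024)

print(choose_join(stale), plan_is_safe(stale))  # broadcast False
print(choose_join(fresh), plan_is_safe(fresh))  # sort-merge True
```

The point of the sketch is that the decision function reads only the estimate, so the plan is exactly as good as the statistics behind it. In Spark itself, the estimate is typically refreshed with `ANALYZE TABLE ... COMPUTE STATISTICS`, and the cutoff is controlled by `spark.sql.autoBroadcastJoinThreshold`.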
This section is part of the Tungsten: Performance as a Hardware Problem lesson on DataDriven.