Predict whether your code will scale or collapse
Topics covered: Why Speed Matters, O(1) -- Constant Time, O(n) -- Linear Time, O(n²) -- Quadratic Time, Reading Complexity at a Glance
Picture this: you build a deduplication step for a data pipeline. During development, it runs against ten thousand rows and finishes in three seconds. You ship it. Weeks later, the table grows to ten million rows and your pipeline does not just slow down -- it takes thirty-five days. Not thirty-five minutes. Thirty-five days. Nothing in the code changed. The data changed. And the code was never built to handle it. This is the problem that Big O notation solves. It gives you a way to predict, before you ship, how an algorithm's running time grows as its input grows.
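To make the disaster concrete, here is a minimal sketch of how a dedup step like that can hide quadratic work. The function names are hypothetical, not from the original pipeline; the point is that a membership check against a list is itself a scan, so putting it inside a loop multiplies the work.

```python
def dedup_quadratic(rows):
    # A list membership check scans every item already seen: O(n).
    # Running that check inside an O(n) loop makes the whole pass O(n^2).
    seen = []
    out = []
    for row in rows:
        if row not in seen:   # hidden scan over everything seen so far
            seen.append(row)
            out.append(row)
    return out

def dedup_linear(rows):
    # A set membership check is a hash lookup: O(1) on average,
    # so the whole pass stays O(n).
    seen = set()
    out = []
    for row in rows:
        if row not in seen:   # constant-time hash probe
            seen.add(row)
            out.append(row)
    return out
```

Both functions return the same result; only the growth rate differs. At ten thousand rows the difference is invisible, which is exactly why the bug ships.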
An O(1) operation takes the same amount of time whether your dataset has ten rows or ten billion. The "1" does not mean it takes one nanosecond or performs one step. It means the time is constant with respect to the input size. Think of it like looking up a word in a dictionary by page number: it does not matter if the dictionary has 200 pages or 200,000 pages, flipping to page 47 always takes the same effort. That is O(1). Why it matters: unlike a scan, a hash lookup computes exactly where your key lives, so the cost stays flat no matter how large the table grows.
An O(n) algorithm does work that grows in direct proportion to the input size. If processing one million rows takes one second, processing two million takes about two seconds. The relationship is a straight line on a graph, which is why it is called linear time. Linear algorithms are the workhorses of data engineering. Full table scans, aggregations, pandas vectorized operations, and single-pass transformations are all O(n). For many tasks, O(n) is the best you can possibly do, because you need to touch every row at least once to produce the answer.
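A minimal sketch of a linear single-pass aggregation (the function name and row shape are illustrative, not from a specific library): one loop, one unit of work per row, so total time is proportional to the row count.

```python
def total_amount(rows):
    # One pass over the data: each row is visited exactly once,
    # so doubling the row count roughly doubles the running time. O(n).
    total = 0.0
    for row in rows:
        total += row["amount"]
    return total
```

A pandas equivalent like `df["amount"].sum()` is also O(n); vectorization shrinks the constant factor per row dramatically, but the growth rate is still linear.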
Quadratic time is the pipeline killer. An O(n²) algorithm does work that grows with the square of the input size. If processing a thousand rows takes one second, ten thousand rows takes one hundred seconds, and one hundred thousand rows takes nearly three hours. Quadratic algorithms are the most common cause of real-world pipeline performance disasters, and they almost always come from the same pattern: doing an O(n) operation inside an O(n) loop. Why it matters: a nested loop join re-scans the entire inner table for every row of the outer table, so the total work is n × n comparisons.
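The steps above can be sketched side by side. This is a hedged illustration, not how any particular database implements joins: a naive nested-loop join that re-scans the inner table per outer row, next to a hash join that indexes the inner table once and then probes it in constant time per row.

```python
def nested_loop_join(left, right, key):
    # For every left row, scan every right row:
    # len(left) * len(right) comparisons -> O(n^2).
    out = []
    for l in left:
        for r in right:
            if l[key] == r[key]:
                out.append({**l, **r})
    return out

def hash_join(left, right, key):
    # Build a hash index over the inner table once: O(n).
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    # Probe the index once per outer row: O(1) average per probe.
    # Total: O(n) instead of O(n^2).
    out = []
    for l in left:
        for r in index.get(l[key], []):
            out.append({**l, **r})
    return out
```

Both functions produce the same matches; the hash join simply refuses to pay for the repeated scans. This is the same trade real query engines make when they choose a hash join over a nested-loop join.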
You do not need to run timing experiments every time you want to know if code will scale. Experienced data engineers glance at code and immediately identify its complexity class. This section teaches you the three simple rules that make that possible. Rule 1: sequential steps add. When steps run one after another (not nested), you add their complexities. Two sequential loops over n items contribute O(n) + O(n) = O(2n), which simplifies to O(n). It does not matter if you loop through the data twice or ten times: constant multipliers drop away, and the overall growth is still linear.
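Rule 1 in code, as a minimal sketch (the function is illustrative): two back-to-back passes over the same list are O(n) + O(n), which is still O(n), not O(n²), because neither loop is inside the other.

```python
def min_and_max(values):
    # Two sequential passes over n items: O(n) + O(n) = O(2n) -> O(n).
    # The loops run one after another, not nested, so work adds
    # instead of multiplying.
    lo = min(values)   # first full pass
    hi = max(values)   # second full pass
    return lo, hi
```

Contrast this with the nested-loop pattern from the quadratic section: nesting multiplies complexities, while sequencing merely adds them.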