Complexity: Intermediate

The beginner lesson gave you the essentials: O(1), O(n), and O(n²). Those three classes cover about 80% of what you will encounter in data engineering. This lesson fills in the remaining 20%. You will learn why database indexes are so fast (O(log n)), what makes ORDER BY and window functions expensive (O(n log n)), why the same algorithm can be fast or slow depending on your data, and how to measure where your code actually spends its time. By the end, you will be able to diagnose performance problems with real profiling tools, not just theory.

O(log n): Logarithmic Time

Daily Life

Interviews

Explain why indexes make queries fast

Logarithmic time is one of the most powerful complexity classes in computing. An O(log n) algorithm does not look at every element. Instead, it eliminates half of the remaining possibilities at each step. This halving strategy means that even enormous inputs require surprisingly few operations. Searching a sorted list of one billion elements takes at most 30 comparisons, because you only need to halve a billion 30 times to reach a single element.

Why it matters: a sorted index lets the database cut the search space in half at every step. Doubling your table size adds only one extra comparison, which is why indexed queries barely slow down as data grows.

The Power of Halving

The logarithm answers a simple question: "How many times must I divide n by 2 to reach 1?" Start with 1,000 elements and halve each step: you reach one element in about 10 steps (because 2¹⁰ = 1,024). Start with one billion elements: you need about 30 steps (because 2³⁰ is roughly one billion). Doubling the input adds only one extra step. This is the opposite of exponential growth, and it is why logarithmic algorithms handle massive datasets effortlessly.

Binary Search: The Classic Example

Binary search works on sorted data by checking the middle element, then discarding the half that cannot contain the target. Each comparison eliminates half the remaining candidates.

	def binary_search(sorted_list, target):
	"""Find target by halving. O(log n)."""
	left, right = 0, len(sorted_list) - 1
	steps = 0

	while left <= right:
	steps += 1
	mid = (left + right) // 2
	if sorted_list[mid] == target:
	return mid, steps
	elif sorted_list[mid] < target:
	left = mid + 1
	else:
	right = mid - 1
	return -1, steps

	# Watch how slowly the step count grows
	for size in [100, 10_000, 1_000_000, 100_000_000]:
	data = list(range(size))
	_, steps = binary_search(data, size - 1)
	print(f"n={size:>12,}: found in {steps:>2} steps")

>>>Output

n=         100: found in  7 steps

n=      10,000: found in 14 steps

n=   1,000,000: found in 20 steps

n= 100,000,000: found in 27 steps

The input grew by a factor of one million (from 100 to 100 million), but the steps only grew from 7 to 27. A linear search would need up to 100 million comparisons for the same task. Binary search needs 27.

1 billion elements

Binary search: 30 steps. Linear search: up to 1,000,000,000.

1 trillion elements

Binary search: 40 steps. Linear search: up to 1,000,000,000,000.

B-Tree Indexes: O(log n) in Every Database

Every relational database uses B-tree indexes, and B-tree lookups are O(log n). A B-tree is a balanced tree where each node can hold hundreds of keys and pointers to children. Unlike binary search, which splits in two, a B-tree node splits into hundreds of branches. This makes the tree extremely shallow. A B-tree index on a table with one billion rows typically has only 3 or 4 levels, because each level multiplies capacity by the branching factor (often 200 or more).

When you write WHERE user_id = 12345 and there is a B-tree index on user_id, the database traverses down the tree: start at the root, find which child range contains 12345, descend one level, repeat. Each descent eliminates thousands of possible matches. Three or four page reads later, you have your row. This is why adding an index to a frequently queried column can transform a 30-second query into a 5-millisecond query.

Partition pruning in data warehouses follows the same principle. If your table is partitioned by date into 365 daily partitions, a query filtering on a specific date reads only 1/365th of the data. If those partitions are further organized by a sort key, the database can skip directly to the relevant rows within the partition. This layered pruning is what makes columnar warehouses like Redshift and BigQuery fast on tables with billions of rows.

TIP

Python's bisect module provides production-quality binary search. Use bisect.bisect_left(sorted_list, target) instead of writing your own. It is implemented in C for speed and runs in O(log n).

O(n log n): Quasilinear Time

Daily Life

Interviews

Predict sorting and window function costs

O(n log n) grows only slightly faster than linear. The extra log n factor means doubling the input slightly more than doubles the work, but the growth is so gentle that O(n log n) algorithms handle billions of elements comfortably. For one million rows, n log n is about 20 million operations, only 20 times more than a single linear pass. This complexity class is the home of sorting, and sorting is everywhere in data engineering.

Why it matters: sorting requires touching every element, but divide-and-conquer keeps the total passes to log(n). Ten million rows takes about twenty-three passes, not ten million.

Sorting Is O(n log n)

Computer scientists have proven that any algorithm that sorts by comparing pairs of elements needs at least O(n log n) comparisons. This means Python's sorted() and SQL's ORDER BY are already as fast as sorting can possibly be. You cannot beat O(n log n) with comparison-based sorting. The good news: O(n log n) is very close to linear, so sorting even large datasets is practical.

Python's built-in sorting uses Timsort, a hybrid algorithm designed specifically for real-world data. Timsort is O(n log n) in the worst case and O(n) when data is already partially sorted. It detects naturally ordered sequences in your data and merges them efficiently. Every time you call sorted() in Python, you benefit from one of the most carefully optimized sorting implementations ever written.

	import time

	# Demonstrate O(n log n) scaling
	import random
	random.seed(42)

	print(" n \| Sort time \| Ratio")
	print("-" * 40)
	prev_time = None
	for n in [100_000, 200_000, 400_000, 800_000]:
	data = [random.randint(0, 10**9) for _ in range(n)]
	start = time.time()
	sorted(data)
	elapsed = time.time() - start
	ratio = f"{elapsed / prev_time:.2f}x" if prev_time else " -"
	print(f" {n:>9,} \| {elapsed:.4f}s \| {ratio}")
	prev_time = elapsed

	print("\nRatio is slightly above 2x (not exactly 2x)")
	print("because of the extra log n factor.")

>>>Output

  n         | Sort time | Ratio

----------------------------------------

    100,000 | 0.0321s   |   -

    200,000 | 0.0689s   | 2.15x

    400,000 | 0.1482s   | 2.15x

    800,000 | 0.3194s   | 2.16x

Ratio is slightly above 2x (not exactly 2x)

because of the extra log n factor.

Notice the ratio is consistently around 2.15x when input doubles, not exactly 2x (which would be pure O(n)) and not 4x (which would be O(n²)). The slight extra cost beyond 2x comes from the log n factor. For practical purposes, O(n log n) behaves almost like O(n).

Where O(n log n) Appears in SQL

Every time you use ORDER BY, a window function, or a sort-merge join, the database performs an O(n log n) sort under the hood. Understanding this helps you predict query costs.

ORDER BY

Sorts the entire result set. O(n log n). Free if data is already indexed on the sort column.

Window functions

RANK, ROW_NUMBER, LAG all require sorting within each partition. O(n log n) per partition.

Sort-merge join

Sort both tables on the join key, then merge. Efficient when inputs are pre-sorted.

DISTINCT

Can use sorting to group duplicates, then deduplicate. O(n log n) via sort, O(n) via hash.

When you write RANK() OVER (ORDER BY salary DESC) on a table with 50 million rows, the database must sort all 50 million rows by salary before computing ranks. That sort costs roughly 50,000,000 × 26 = 1.3 billion comparisons. This is fast enough for most analytics but can become a bottleneck when window functions are applied across very large partitions. Maintaining sort keys on your warehouse tables can eliminate these sort costs entirely.

Best, Worst, and Average Case

Daily Life

Interviews

Diagnose data skew and worst-case traps

So far, we have mostly discussed worst-case complexity, the maximum possible work for the most difficult input. But algorithms can behave very differently depending on the data they receive. A sorting algorithm might fly through data that is already nearly sorted but struggle with random data. Understanding best, worst, and average case analysis lets you reason about performance across real-world scenarios, not just theoretical maximums.

BestWorstAverage

Best

Optimistic

Fastest on ideal input data

Worst

Pessimistic

Slowest on adversarial input

Average

Expected

Typical speed on random inputs

When Each Case Matters

Worst case matters when failures are catastrophic or when you have SLAs to meet. If your pipeline must finish in under 4 hours every night, you need worst-case guarantees, not average-case hopes. Average case matters when you care about throughput over many runs. Best case is useful for detecting when optimizations for common patterns are worth adding.

CHOOSING YOUR ANALYSIS

Use worst case when you need hard guarantees: SLAs, real-time systems, security.
Use average case when inputs are unpredictable and you care about overall throughput.
Use best case to detect when optimizations for common patterns (like nearly sorted data) are worthwhile.

Data Skew: Worst Case in Real Life

The most important application of case analysis for data engineers is understanding data skew. Hash joins have a best case and a worst case that differ dramatically. In the best case, keys distribute evenly across hash buckets and the join completes in O(n + m). In the worst case, all keys hash to the same bucket (total collision), and the join degrades to O(n × m), identical to a nested loop join.

In practice, you rarely see total collision. But you do see data skew, where a small number of keys account for most of the rows. Consider a join between an orders table and a customers table. If one customer has 5 million orders while the average customer has 50, the hash join bucket for that customer becomes enormous. The join is technically still O(n + m), but in a distributed system like Spark, one executor gets stuck processing 5 million rows while every other executor finishes in seconds. This is called a straggler task, and it is one of the most common causes of slow Spark jobs.

	# Simulating data skew in a hash join
	from collections import Counter

	# Even distribution: every customer has ~10 orders
	even_orders = [f"customer_{i % 1000}" for i in range(10_000)]
	even_dist = Counter(even_orders)

	# Skewed distribution: one customer dominates
	skewed_orders = ["big_customer"] * 9_000
	skewed_orders += [f"customer_{i}" for i in range(1_000)]
	skewed_dist = Counter(skewed_orders)

	print("Even distribution:")
	print(f" Max orders per customer: {max(even_dist.values())}")
	print(f" Min orders per customer: {min(even_dist.values())}")
	print()
	print("Skewed distribution:")
	print(f" big_customer orders: {skewed_dist["big_customer"]}")
	print(f" Other customers avg: {sum(v for k,v in skewed_dist.items() if k != "big_customer") // 1000}")
	print()
	print("In Spark, the skewed case creates a straggler")
	print("task that takes 9000x longer than the average.")

>>>Output

Even distribution:

  Max orders per customer: 10

  Min orders per customer: 10

Skewed distribution:

  big_customer orders: 9000

  Other customers avg: 1

In Spark, the skewed case creates a straggler

task that takes 9000x longer than the average.

The fix for data skew is to "salt" the skewed key: split the large customer into multiple synthetic keys (like big_customer_1, big_customer_2, etc.), distribute the work evenly across executors, then combine results. This transforms a worst-case scenario into an average-case one.

Python's sorted() Handles All Cases

Python's Timsort is O(n log n) in the worst case (guaranteed), O(n log n) on average, and O(n) in the best case (already sorted data). This makes it safe for every situation. In contrast, the classic quicksort algorithm is O(n log n) on average but O(n²) in the worst case, which is why Python uses Timsort instead. When someone says "this sort is O(n log n)," always ask: best, worst, or average? Python's sorted() gives you O(n log n) in all cases.

TIP

In technical interviews, always state which case you are analyzing. Saying "this is O(n log n)" is incomplete. Saying "this is O(n log n) in the worst case" demonstrates deeper understanding and precision.

Space Complexity

Daily Life

Interviews

Balance memory and speed in pipelines

Time complexity tells you how long an algorithm takes. Space complexity tells you how much memory it uses. Both matter in practice. A data pipeline that is fast but uses 64 GB of RAM on a machine with 16 GB will crash before processing a single row. An algorithm that is memory-efficient but takes hours defeats the purpose of real-time analytics. Understanding space complexity helps you balance speed and memory to fit your system's resources.

O(1) Space vs O(n) Space

Space complexity measures the extra memory an algorithm allocates beyond the input itself. An algorithm that uses a fixed number of variables regardless of input size is O(1) space. An algorithm that creates a new list proportional to the input is O(n) space.

	def sum_list(data):
	"""O(1) space: only uses a single variable."""
	total = 0
	for x in data:
	total += x
	return total

	def double_list(data):
	"""O(n) space: creates a new list of same size."""
	return [x * 2 for x in data]

	data = [1, 2, 3, 4, 5]
	print("Sum:", sum_list(data), "uses O(1) extra space")
	print("Doubled:", double_list(data), "uses O(n) extra space")

>>>Output

Sum: 15 uses O(1) extra space

Doubled: [2, 4, 6, 8, 10] uses O(n) extra space

Generators: 4,000x Less Memory

Generators are the data engineer's secret weapon for memory efficiency. A list comprehension creates every result in memory at once. A generator expression computes each value only when you ask for it, using O(1) memory regardless of how many elements it produces.

	import sys

	# O(n) space: stores all 100k results in memory
	list_result = [x * x for x in range(100_000)]
	list_bytes = sys.getsizeof(list_result)

	# O(1) space: generates values on demand
	gen_result = (x * x for x in range(100_000))
	gen_bytes = sys.getsizeof(gen_result)

	print(f"List of 100k items: {list_bytes:,} bytes")
	print(f"Generator: {gen_bytes:,} bytes")
	print(f"\nList uses {list_bytes // gen_bytes}x more memory!")
	print("Both produce the same values.")
	print("The generator computes each one only when needed.")

>>>Output

List of 100k items: 824,456 bytes

Generator:          200 bytes

List uses 4122x more memory!

Both produce the same values.

The generator computes each one only when needed.

This difference becomes critical at scale. A Python script that reads a 20 GB CSV into a pandas DataFrame needs at least 20 GB of RAM (often 2-5x more due to pandas overhead). If your machine has 16 GB, the script crashes with a MemoryError before processing a single row. The fix is to process in chunks: read 100,000 rows at a time, transform them, write the output, and discard the chunk. This streaming approach uses O(chunk_size) memory instead of O(n).

Broadcast Joins: A Space Complexity Decision

In Spark, the broadcast join is a space complexity decision. When joining a large fact table (billions of rows) with a small dimension table (thousands of rows), Spark can copy the small table to every executor, avoiding an expensive shuffle of the large table. The broadcast join uses O(m) memory per executor, where m is the dimension table size. If m fits in executor memory, the join completes in O(n) time with no network overhead. If m is too large to broadcast, Spark falls back to a shuffle join that moves data across the network.

Knowing the space complexity of your dimension tables tells you instantly whether a broadcast join is viable. A 100 MB dimension table? Broadcast it. A 10 GB dimension table? That probably exceeds executor memory, so Spark will shuffle instead. This is a decision you make based on space complexity every time you write a Spark join.

•Save Time (Spend Space)

Cache results in a dictionary
Build hash tables for fast lookup
Broadcast small tables to every executor
Trade O(n) space for O(1) access time

•Save Space (Spend Time)

Use generators instead of lists
Process files in chunks
Compute values on the fly
Accept O(n) time for O(1) space

TIP

Watch for hidden allocations. List comprehensions create new lists. String concatenation in a loop creates new strings. Slicing with arr[1:] creates a copy. Each of these uses O(n) space that is easy to miss.

Profiling: Measuring What Matters

Daily Life

Interviews

Find real bottlenecks with profiling tools

Theory tells you the growth rate. Profiling tells you the actual bottleneck. An O(n) algorithm with a large constant factor can be slower in practice than an O(n²) algorithm with a tiny constant factor for inputs under 10,000. Cache behavior, memory allocation patterns, and interpreter overhead all create gaps between theoretical predictions and measured performance. Profiling bridges this gap.

timeit: Precise Microbenchmarks

Python's timeit module runs a code snippet many times and reports the average execution time. It automatically disables garbage collection to reduce noise. Always use timeit instead of raw time.time() calls for comparing algorithm performance.

	import timeit

	# Compare list vs set membership
	setup = "data_list = list(range(10000)); data_set = set(data_list)"

	list_time = timeit.timeit("9999 in data_list", setup=setup, number=10000)
	set_time = timeit.timeit("9999 in data_set", setup=setup, number=10000)

	print(f"List lookup (O(n)): {list_time:.4f}s for 10k lookups")
	print(f"Set lookup (O(1)): {set_time:.4f}s for 10k lookups")
	print(f"Set is {list_time / set_time:.0f}x faster")

>>>Output

List lookup (O(n)):  3.2145s for 10k lookups

Set lookup  (O(1)):  0.0012s for 10k lookups

Set is 2679x faster

cProfile: Finding the Real Bottleneck

While timeit measures specific snippets, cProfile measures an entire program and reports which functions consumed the most time. This is essential for real applications where the slow code is not always where you expect. The Pareto principle applies: typically 5% of your code accounts for 95% of the execution time.

	import cProfile
	import io
	import pstats

	def slow_search(data, targets):
	results = []
	for t in targets:
	for item in data: # O(n) scan per target!
	if item == t:
	results.append(item)
	break
	return results

	data = list(range(50000))
	targets = list(range(0, 50000, 100))

	profiler = cProfile.Profile()
	profiler.enable()
	slow_search(data, targets)
	profiler.disable()

	stream = io.StringIO()
	stats = pstats.Stats(profiler, stream=stream)
	stats.sort_stats("cumulative")
	stats.print_stats(3)
	print(stream.getvalue()[:300])

>>>Output

         3 function calls in 0.4521 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)

        1    0.452    0.452    0.452    0.452 <string>:1(slow_search)

        1    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}

Empirical Complexity Detection

You can determine the complexity class of any function without understanding its internals. Just measure its runtime at several input sizes and compute the growth ratio. If doubling the input doubles the time, it is O(n). If it quadruples the time, it is O(n²). If it increases by a small constant, it is O(log n).

	import time

	def mystery_function(n):
	"""What is the complexity? Measure to find out."""
	count = 0
	for i in range(n):
	for j in range(n):
	count += 1
	return count

	print(" n \| Time \| Ratio")
	print("-" * 35)
	prev_time = None
	for n in [500, 1000, 2000, 4000]:
	start = time.time()
	mystery_function(n)
	elapsed = time.time() - start
	ratio = f"{elapsed / prev_time:.2f}x" if prev_time else " -"
	print(f" {n:<6} \| {elapsed:.4f}s \| {ratio}")
	prev_time = elapsed

	print("\nRatio ~4x when n doubles = O(n^2)")

>>>Output

  n      | Time     | Ratio

-----------------------------------

  500    | 0.0156s  |   -

  1000   | 0.0623s  | 3.99x

  2000   | 0.2487s  | 3.99x

  4000   | 0.9941s  | 4.00x

Ratio ~4x when n doubles = O(n^2)

Memory Profiling with tracemalloc

Time profiling tells you where CPU is spent. Memory profiling tells you where RAM is consumed. Python's tracemalloc module, included in the standard library, tracks memory allocations and identifies which lines of code are responsible.

	import tracemalloc

	tracemalloc.start()

	big_list = [i ** 2 for i in range(100_000)]
	big_dict = {str(i): i ** 2 for i in range(100_000)}
	big_set = set(range(100_000))

	current, peak = tracemalloc.get_traced_memory()
	print(f"Current memory: {current / 1024 / 1024:.1f} MB")
	print(f"Peak memory: {peak / 1024 / 1024:.1f} MB")
	tracemalloc.stop()

>>>Output

Current memory: 16.3 MB

Peak memory:    16.5 MB

SQL Profiling: EXPLAIN ANALYZE

For SQL queries, the profiling equivalent is EXPLAIN ANALYZE. It not only shows the query plan but also executes the query and reports actual row counts and timing at each step. The most important things to look for: sequential scans on large tables (missing index?), nested loop joins between large tables (should be a hash join), and sort operations on large result sets (expensive ORDER BY or window function).

	EXPLAIN ANALYZE
	SELECT
	o.order_id,
	c.name,
	o.amount
	FROM orders o
	JOIN customers c
	ON o.customer_id = c.id
	WHERE o.amount > 1000
	ORDER BY o.amount DESC ;

✓Do

Profile before optimizing, not after guessing
Use timeit for microbenchmarks and cProfile for whole programs
Measure at multiple input sizes to detect growth rate
Check EXPLAIN ANALYZE before and after query changes

✗Don't

Optimize based on intuition without measurement
Assume theoretical complexity equals practical speed
Benchmark with unrealistically small inputs
Ignore constant factors when choosing between similar algorithms

❯❯❯PUTTING IT ALL TOGETHER

> Your Spark job joining a 500-million-row clickstream table with a 50,000-row product catalog is taking 3 hours instead of the expected 20 minutes. One executor is stuck at 99% while all others finished long ago.

The straggler task suggests data skew, a worst-case scenario for the hash join. One product (or null key) likely accounts for millions of clicks, creating a massively unbalanced hash bucket.

The product catalog at 50,000 rows is small enough to broadcast (O(m) memory per executor). Switching from a shuffle join to a broadcast join eliminates the skew problem entirely and reduces network I/O.

Space complexity analysis confirms the broadcast is viable: 50,000 rows at ~1 KB each = 50 MB per executor, well within the default 10 GB memory limit.

After the fix, EXPLAIN ANALYZE confirms the new plan uses a broadcast hash join instead of a sort-merge join, and the job completes in 12 minutes with no stragglers.

KEY TAKEAWAYS

O(log n) algorithms halve the problem each step, needing only ~30 operations for a billion elements

B-tree indexes give O(log n) lookups, which is why adding an index transforms slow queries into fast ones

O(n log n) is the cost of sorting, and it is close enough to linear to handle billions of rows

Every ORDER BY, window function, and sort-merge join triggers an O(n log n) sort

Data skew is worst-case behavior in real life, causing straggler tasks in Spark

Space complexity determines if your pipeline fits in memory: generators use O(1) space vs lists' O(n)

Profile with timeit, cProfile, and tracemalloc before optimizing anything

The doubling test detects complexity empirically: 2x ratio = O(n), 4x ratio = O(n²)

Logarithms, space complexity, and profiling

Category: Python
Difficulty: intermediate
Duration: 20 minutes
Challenges: 3 hands-on challenges

Topics covered: O(log n): Logarithmic Time, O(n log n): Quasilinear Time, Best, Worst, and Average Case, Space Complexity, Profiling: Measuring What Matters

Lesson Sections

O(log n): Logarithmic Time (concepts: pyBinarySearch)
Logarithmic time is one of the most powerful complexity classes in computing. An O(log n) algorithm does not look at every element. Instead, it eliminates half of the remaining possibilities at each step. This halving strategy means that even enormous inputs require surprisingly few operations. Searching a sorted list of one billion elements takes at most 30 comparisons, because you only need to halve a billion 30 times to reach a single element. Why it matters: a sorted index lets the database
O(n log n): Quasilinear Time (concepts: pyListSort)
Why it matters: sorting requires touching every element, but divide-and-conquer keeps the total passes to log(n). Ten million rows takes about twenty-three passes, not ten million. Sorting Is O(n log n) Python's built-in sorting uses Timsort, a hybrid algorithm designed specifically for real-world data. Timsort is O(n log n) in the worst case and O(n) when data is already partially sorted. It detects naturally ordered sequences in your data and merges them efficiently. Every time you call sorted
Best, Worst, and Average Case (concepts: pyFrequencyCount)
So far, we have mostly discussed worst-case complexity, the maximum possible work for the most difficult input. But algorithms can behave very differently depending on the data they receive. A sorting algorithm might fly through data that is already nearly sorted but struggle with random data. Understanding best, worst, and average case analysis lets you reason about performance across real-world scenarios, not just theoretical maximums. When Each Case Matters Worst case matters when failures ar
Space Complexity (concepts: pyGenerators)
Time complexity tells you how long an algorithm takes. Space complexity tells you how much memory it uses. Both matter in practice. A data pipeline that is fast but uses 64 GB of RAM on a machine with 16 GB will crash before processing a single row. An algorithm that is memory-efficient but takes hours defeats the purpose of real-time analytics. Understanding space complexity helps you balance speed and memory to fit your system's resources. O(1) Space vs O(n) Space Space complexity measures the
Profiling: Measuring What Matters (concepts: pyTesting)
Theory tells you the growth rate. Profiling tells you the actual bottleneck. An O(n) algorithm with a large constant factor can be slower in practice than an O(n²) algorithm with a tiny constant factor for inputs under 10,000. Cache behavior, memory allocation patterns, and interpreter overhead all create gaps between theoretical predictions and measured performance. Profiling bridges this gap. timeit: Precise Microbenchmarks cProfile: Finding the Real Bottleneck While timeit measures specific s