Data Structures: Advanced

Dijkstra's shortest-path algorithm, implemented with a min-heap, powers routing in Uber's driver dispatch system, Lyft's route optimization engine, and every GPS navigation device on earth, computing optimal routes through graphs with millions of nodes in milliseconds. The min-heap makes it possible by always surfacing the closest unvisited node in O(log n) time rather than scanning the entire frontier on every step. Amazon's warehouse robots use the same heap-based pathfinding to coordinate hundreds of robots navigating around each other in real time. The advanced data structures in this lesson are the ones behind those systems.

The collections Module

Daily Life
Interviews

Count, group, and queue with one import

The collections module provides specialized container datatypes that extend Python's built-in list, dict, tuple, and set. Each solves a specific problem more elegantly and efficiently than the general-purpose alternatives. Learning these tools means writing less code that runs faster.

Counter: Counting Made Easy

A Counter is a dictionary subclass designed for counting hashable objects. Instead of manually building a dict with counts, Counter does it in one line. It also provides methods for finding the most common elements, arithmetic between counters, and more.

1from collections import Counter
2
3# Count word frequencies in a log
4log_events = ["login", "page_view", "login", "purchase",
5 "page_view", "page_view", "logout", "login"]
6
7event_counts = Counter(log_events)
8print("All counts:", event_counts)
9print("Most common 2:", event_counts.most_common(2))
10print("Login count:", event_counts["login"])
11
12# Count characters in a string
13char_counts = Counter("mississippi")
14print("Letter counts:", char_counts)
>>>Output
All counts: Counter({'login': 3, 'page_view': 3, 'purchase': 1, 'logout': 1})
Most common 2: [('login', 3), ('page_view', 3)]
Login count: 3
Letter counts: Counter({'i': 4, 's': 4, 'p': 2, 'm': 1})

Counter is invaluable for log analysis, frequency distributions, and any scenario where you need to tally occurrences. It accepts any iterable and counts elements automatically. Accessing a missing key returns 0 instead of raising a KeyError, which is a subtle but important improvement over regular dicts.

1from collections import Counter
2
3# Counter arithmetic for comparing datasets
4morning = Counter({"errors": 12, "warnings": 45, "info": 200})
5afternoon = Counter({"errors": 8, "warnings": 30, "info": 180})
6
7# Combine counts from both periods
8total = morning + afternoon
9print("Total:", total)
10
11# Find difference (morning minus afternoon)
12diff = morning - afternoon
13print("Morning excess:", diff)
>>>Output
Total: Counter({'info': 380, 'warnings': 75, 'errors': 20})
Morning excess: Counter({'warnings': 15, 'info': 20, 'errors': 4})

defaultdict: Auto Defaults

A defaultdict automatically creates a default value when you access a missing key. Instead of checking if a key exists before using it, you just use it. This eliminates the setdefault() pattern entirely and makes grouping code much cleaner.

1from collections import defaultdict
2
3# Group records by category - no if-check needed
4events = [
5 ("user_1", "click"), ("user_2", "view"),
6 ("user_1", "purchase"), ("user_3", "click"),
7 ("user_2", "click"), ("user_1", "view"),
8]
9
10by_user = defaultdict(list)
11for user_id, action in events:
12 by_user[user_id].append(action)
13
14for user, actions in sorted(by_user.items()):
15 print(f"{user}: {actions}")
>>>Output
user_1: ['click', 'purchase', 'view']
user_2: ['view', 'click']
user_3: ['click']

The factory function you pass to defaultdict determines the default value type. Pass list for grouping, int for counting, set for unique collections, or any callable that returns the default value you want.

listintsetdict
list
Group items
Append to lists by key
int
Count things
Increment counters by key
set
Unique per key
Collect distinct values
dict
Nested groups
Build multi-level maps
Python Quiz

> Count character frequencies using a defaultdict. Pick the factory function that initializes missing keys to zero, and the function that counts how many unique characters exist.

from collections import defaultdict
counts = defaultdict(___)
for char in "hello":
    counts[char] += 1
print(counts["l"])
print(___(counts))
list
str
sum
len
int

deque: Double-Ended Queue

A deque (pronounced "deck") supports O(1) append and pop from both ends. Lists are fast at the right end but slow at the left because inserting at position 0 requires shifting every element. Deques solve this with a doubly-linked list implementation.

1from collections import deque
2
3# Sliding window of recent events
4recent_events = deque(maxlen=3)
5
6for event in ["login", "page_view", "search", "purchase", "logout"]:
7 recent_events.append(event)
8 print(f"Window: {list(recent_events)}")
>>>Output
Window: ['login']
Window: ['login', 'page_view']
Window: ['login', 'page_view', 'search']
Window: ['page_view', 'search', 'purchase']
Window: ['search', 'purchase', 'logout']

The maxlen parameter creates a bounded deque that automatically discards old items when full. This is perfect for sliding windows, recent activity logs, and any scenario where you only care about the last N items. No manual size management needed.

list
  • O(1) append to right end
  • O(n) insert at left end
  • O(1) random index access
  • Good for stack operations
deque
  • O(1) append to either end
  • O(1) pop from either end
  • O(n) random index access
  • Good for queue operations
Python Quiz

> A bounded deque keeps only the last N items. Pick the method that adds to the right end (dropping old items from the left), and the function that shows how many items remain.

from collections import deque
recent = deque(maxlen=3)
for x in [1, 2, 3, 4, 5]:
    recent.___(x)
print(list(recent))
print(___(recent))
sum
max
len
appendleft
append

namedtuple: Lightweight

A namedtuple creates an immutable record type with named fields. It uses the same memory as a regular tuple but lets you access fields by name instead of index. This dramatically improves code readability while keeping the memory efficiency of tuples.

1from collections import namedtuple
2
3# Define a record type
4LogEntry = namedtuple("LogEntry", ["timestamp", "level", "message"])
5
6# Create instances
7entry = LogEntry("2024-01-15T10:30:00", "ERROR", "Connection timeout")
8print(f"Level: {entry.level}")
9print(f"Message: {entry.message}")
10
11# Still works like a tuple
12print(f"As tuple: {tuple(entry)}")
13print(f"Unpacking: {entry[0]}, {entry[1]}")
14
15# Immutable - cannot modify
16# entry.level = "WARNING" # AttributeError!
>>>Output
Level: ERROR
Message: Connection timeout
As tuple: ('2024-01-15T10:30:00', 'ERROR', 'Connection timeout')
Unpacking: 2024-01-15T10:30:00, ERROR

Named tuples are ideal for representing fixed records like database rows, CSV records, or API response objects. They are hashable (so they can be dictionary keys), immutable (so they are thread-safe), and self-documenting (so the code is readable).

Custom Structures

Daily Life
Interviews

Build memory-efficient record types

When named tuples are too rigid and plain dicts are too loose, Python offers dataclasses and custom classes with __slots__. These give you mutable records with type hints, default values, comparison methods, and memory optimization - all with minimal boilerplate.

dataclasses: Modern Records

The @dataclass decorator automatically generates __init__, __repr__, and __eq__ methods from your type-annotated fields. Add frozen=True for immutability, or order=True for comparison operators.

1from dataclasses import dataclass, field
2
3@dataclass
4class PipelineMetric:
5 pipeline_name: str
6 records_processed: int
7 error_count: int = 0
8 tags: list = field(default_factory=list)
9
10# Auto-generated __init__
11metric = PipelineMetric("etl_daily", 50000, error_count=3)
12print(metric)
13
14# Auto-generated __eq__
15metric2 = PipelineMetric("etl_daily", 50000, error_count=3)
16print(f"Equal: {metric == metric2}")
17
18# Mutable - can update fields
19metric.records_processed += 1000
20print(f"Updated: {metric.records_processed}")
>>>Output
PipelineMetric(pipeline_name='etl_daily', records_processed=50000, error_count=3, tags=[])
Equal: True
Updated: 51000

Notice the field(default_factory=list) for the tags field. This avoids the mutable default argument bug from the intermediate lesson. Each instance gets its own fresh list instead of sharing one across all instances.

Frozen Dataclasses

Adding frozen=True makes a dataclass immutable and hashable, combining the safety of tuples with the readability of named fields. Frozen dataclasses are ideal for configuration objects, cache keys, and any data that should not change after creation.

1from dataclasses import dataclass
2
3@dataclass(frozen=True)
4class CacheKey:
5 table_name: str
6 query_hash: str
7 partition: int
8
9key1 = CacheKey("users", "abc123", 0)
10key2 = CacheKey("users", "abc123", 0)
11
12# Hashable - can use as dict key
13cache = {key1: "cached_result"}
14print(f"Cache hit: {cache[key2]}")
15
16# Immutable - assignment raises error
17# key1.partition = 1 # Error!
18print(f"Equal: {key1 == key2}")
19print(f"Hash: {hash(key1)}")
>>>Output
Cache hit: cached_result
Equal: True
Hash: 5765498321047685832
Fill in the Blank

> You need a DbConfig dataclass that can be used as a dictionary key. Pick the decorator option and the expression to verify it works.

from dataclasses import dataclass

@dataclass()
class DbConfig:
    host: str
    port: int

c = DbConfig("localhost", 5432)
print(type())

__slots__ for Memory

By default, Python objects store attributes in a __dict__ dictionary, which is flexible but memory-hungry. Declaring __slots__ tells Python to allocate a fixed amount of memory for each instance, eliminating the per-instance dict. For millions of objects, this saves gigabytes.

1import sys
2
3class RegularPoint:
4 def __init__(self, x, y):
5 self.x = x
6 self.y = y
7
8class SlottedPoint:
9 __slots__ = ("x", "y")
10 def __init__(self, x, y):
11 self.x = x
12 self.y = y
13
14regular = RegularPoint(1, 2)
15slotted = SlottedPoint(1, 2)
16
17print(f"Regular size: {sys.getsizeof(regular)} bytes")
18print(f"Slotted size: {sys.getsizeof(slotted)} bytes")
19print(f"Savings: {sys.getsizeof(regular) - sys.getsizeof(slotted)} bytes per object")
>>>Output
Regular size: 48 bytes
Slotted size: 48 bytes
Savings: 0 bytes per object

The real savings come from eliminating the __dict__ overhead. For classes with few attributes that you create millions of times, __slots__ can reduce memory by 40-50%. Modern dataclasses also support slots=True in Python 3.10 and later.

dataclass
dataclass
Auto-generated methods, type hints, defaults. Best for most record types.
namedtuple
namedtuple
Immutable, lightweight, hashable. Best for read-only records.
__slots__
__slots__
Memory-optimized classes. Best for millions of identical small objects.
TypedDict
TypedDict
Type hints for dicts without runtime overhead. Best for JSON-shaped data.

Performance Profiling

Daily Life
Interviews

Measure time and memory trade-offs

Choosing the right data structure requires understanding their actual performance characteristics. Theoretical Big-O complexity tells you the growth rate, but real-world performance depends on constant factors, cache behavior, and memory allocation patterns. Python provides tools like timeit and sys.getsizeof() to measure both time and memory so you can make informed decisions.

Timing Operations

The timeit module runs code snippets thousands of times and reports accurate timing. This eliminates noise from system load and gives reproducible measurements. Always benchmark with realistic data sizes - performance at 100 items may differ dramatically from performance at 1 million.

1import time
2
3# Measure membership testing: list vs set
4data = list(range(100000))
5data_set = set(data)
6target = 99999
7
8# List membership
9start = time.time()
10for _ in range(1000):
11 target in data
12list_time = time.time() - start
13
14# Set membership
15start = time.time()
16for _ in range(1000):
17 target in data_set
18set_time = time.time() - start
19
20print(f"List: {list_time:.4f}s")
21print(f"Set: {set_time:.4f}s")
22print(f"Set is {list_time / max(set_time, 0.0001):.0f}x faster")
>>>Output
List: 0.8234s
Set: 0.0001s
Set is 8234x faster

The difference is staggering. For 100,000 elements with worst-case lookup, sets are thousands of times faster. This is the practical consequence of O(n) versus O(1) that you saw in theory - now you can measure it yourself.

Memory Profiling

Memory usage is equally important, especially when processing large datasets. Each data structure has different overhead per element. Understanding these costs helps you choose structures that fit within your memory budget. Use sys.getsizeof() to measure actual sizes.

1import sys
2
3# Compare per-element overhead
4n = 10000
5data = list(range(n))
6
7structures = {
8 "list": list(data),
9 "tuple": tuple(data),
10 "set": set(data),
11 "frozenset": frozenset(data),
12 "dict": {i: i for i in data},
13}
14
15print(f"Memory for {n:,} integers:")
16for name, struct in structures.items():
17 size = sys.getsizeof(struct)
18 per_item = size / n
19 print(f" {name:>10}: {size:>8,} bytes ({per_item:.1f} bytes/item)")
>>>Output
Memory for 10,000 integers:
list: 85,176 bytes (8.5 bytes/item)
tuple: 80,040 bytes (8.0 bytes/item)
set: 524,504 bytes (52.5 bytes/item)
frozenset: 524,504 bytes (52.5 bytes/item)
dict: 368,728 bytes (36.9 bytes/item)
01
tuple
Most compact at ~8 bytes per item. No overhead for mutability or hashing.
02
list
Slightly more than tuple at ~8.5 bytes. Extra space reserved for growth.
03
dict
About 37 bytes per item. Hash table overhead for key-value mapping.
04
set
About 52 bytes per item. Highest overhead due to hash table for membership.
TIP
When memory is tight and you only need to read data, convert lists to tuples and sets to frozensets. The immutable variants use slightly less memory and signal intent.
Python Quiz

> You need to measure how much memory a data structure uses. Pick the correct module to access memory measurement, and the correct function to get the size in bytes.

import ___
data = [1, 2, 3, 4, 5]
print(sys.___(data))
print(type(data).__name__)
sizeof
getsizeof
time
sys
os
Profiling should always come before optimization. Measure first, then target the specific bottleneck. Optimizing without data often means spending time on code that is not actually the performance problem.

Remember that sets and dicts trade memory for speed. When memory is the constraint, tuples and frozensets are your allies. When speed is the constraint, hash-based structures like sets and dicts give you O(1) operations.

Understanding these performance characteristics helps you reason about which structure belongs in each part of a pipeline. A sliding window uses a deque, a lookup table uses a dict, and a deduplication pass uses a set.

Caching Strategies

Daily Life
Interviews

Cache results to skip repeated work

Caching is one of the most impactful performance techniques in data engineering. By storing the results of expensive computations or database queries, you avoid repeating work. Python provides built-in caching tools, and understanding how to build custom caches using data structures gives you fine-grained control over eviction policies, size limits, and expiration.

Using functools.lru_cache

The simplest caching tool is @lru_cache from functools. It automatically memoizes function results based on arguments. LRU stands for Least Recently Used - when the cache is full, the least recently accessed entry is evicted to make room for new ones.

1from functools import lru_cache
2
3@lru_cache(maxsize=128)
4def expensive_query(user_id):
5 """Simulates a slow database lookup."""
6 print(f" [DB query for user {user_id}]")
7 return {"id": user_id, "name": f"User_{user_id}"}
8
9# First call - executes the function
10result1 = expensive_query(42)
11print(f"Result: {result1}")
12
13# Second call - returns cached result instantly
14result2 = expensive_query(42)
15print(f"Cached: {result2}")
16
17# Cache statistics
18print(f"Cache info: {expensive_query.cache_info()}")
>>>Output
[DB query for user 42]
Result: {'id': 42, 'name': 'User_42'}
Cached: {'id': 42, 'name': 'User_42'}
Cache info: CacheInfo(hits=1, misses=1, maxsize=128, currsize=1)

Notice that the database query message only prints once. The second call returns the cached result without executing the function body. The cache_info() method shows hits, misses, and current size, letting you monitor cache effectiveness.

Building a Custom LRU Cache

For more control over caching behavior, you can build a custom LRU cache using OrderedDict. This approach lets you add TTL (time-to-live) expiration, custom eviction logic, and size-based limits that lru_cache does not support.

1from collections import OrderedDict
2
3class LRUCache:
4 def __init__(self, capacity):
5 self.cache = OrderedDict()
6 self.capacity = capacity
7
8 def get(self, key):
9 if key in self.cache:
10 self.cache.move_to_end(key)
11 return self.cache[key]
12 return None
13
14 def put(self, key, value):
15 if key in self.cache:
16 self.cache.move_to_end(key)
17 self.cache[key] = value
18 if len(self.cache) > self.capacity:
19 evicted = self.cache.popitem(last=False)
20 print(f" Evicted: {evicted[0]}")
21
22cache = LRUCache(capacity=3)
23for user_id in [1, 2, 3, 1, 4]:
24 cache.put(user_id, f"data_{user_id}")
25 print(f"Cache after {user_id}: {list(cache.cache.keys())}")
>>>Output
Cache after 1: [1]
Cache after 2: [1, 2]
Cache after 3: [1, 2, 3]
Cache after 1: [2, 3, 1]
Evicted: 2
Cache after 4: [3, 1, 4]

The move_to_end() method is the key to LRU behavior. Every access moves the item to the end (most recently used). When the cache exceeds capacity, popitem(last=False) removes the first item (least recently used). This gives you O(1) get, put, and eviction operations.

Debug Challenge

> This LRU cache reads key "a" but does not move it to the end of the OrderedDict. When a new key is inserted, "a" gets evicted even though it was just accessed.

Logic error: 'a' was evicted despite being the most recently accessed item

Do
  • Use lru_cache for pure functions
  • Set maxsize based on memory budget
  • Monitor cache_info() hit rates
  • Use frozen dataclasses as cache keys
Don't
  • Cache functions with side effects
  • Use unbounded caches (maxsize=None)
  • Cache mutable return values
  • Forget to invalidate stale data

Cache hit rate is the most important metric. A cache with a 90% hit rate saves nine out of ten expensive computations. Monitor it with cache_info() and adjust maxsize if hit rates are too low.

The LRU eviction policy works well for workloads with temporal locality - recently accessed items tend to be accessed again soon. For frequency-based workloads, consider LFU (Least Frequently Used) strategies instead.

Choosing for Scale

Daily Life
Interviews

Architect data layers for production

At scale, data structure choices become architectural decisions. A pipeline processing 10 million records per hour cannot afford O(n) lookups inside a loop. A service handling 10,000 concurrent requests needs thread-safe structures. This section covers the judgment calls that distinguish senior engineers: choosing data structures based on real-world constraints like throughput, concurrency, and memory pressure.

Concurrent Access Patterns

When multiple threads or processes access shared data, you need thread-safe structures or synchronization. Python's GIL protects against data corruption for simple operations, but compound operations like "check then update" still need explicit locking. The queue module provides thread-safe alternatives to lists for producer-consumer patterns.

1from queue import Queue
2from threading import Thread
3
4results = Queue()
5
6def worker(name, items):
7 for item in items:
8 result = item * 2
9 results.put((name, result))
10
11# Simulate concurrent processing
12t1 = Thread(target=worker, args=("A", [1, 2, 3]))
13t2 = Thread(target=worker, args=("B", [4, 5, 6]))
14
15t1.start()
16t2.start()
17t1.join()
18t2.join()
19
20while not results.empty():
21 worker_name, value = results.get()
22 print(f"Worker {worker_name}: {value}")
>>>Output
Worker A: 2
Worker A: 4
Worker A: 6
Worker B: 8
Worker B: 10
Worker B: 12

The Queue class handles all synchronization internally. Multiple threads can safely put() and get() without explicit locks. For priority-based processing, use PriorityQueue which always yields the smallest item first.

Data Structure at Scale

The table below summarizes when to reach for each specialized structure based on your system requirements. These are the patterns that appear in production systems processing millions of records.
Streaming data
Streaming data
Use deque(maxlen=N) for bounded buffers or sliding windows over live feeds.
Aggregation pipelines
Aggregation pipelines
Use defaultdict(list) or Counter for grouping and counting in a single pass.
Hot-path caching
Hot-path caching
Use OrderedDict for LRU caches or lru_cache for pure function memoization.
Concurrent processing
Concurrent processing
Use queue.Queue for thread-safe producer-consumer patterns between workers.
High-volume records
High-volume records
Use namedtuple or @dataclass(slots=True) to minimize per-record memory.
These patterns are not theoretical. They power some of the largest Python applications in the world.
Python Quiz

> A bounded buffer needs to hold only the last 5 events, automatically discarding older ones when new events arrive. Pick the correct collection type and the correct parameter to set the maximum size.

from collections import ___
buffer = deque(___=5)
for i in range(8):
    buffer.append(i)
print(list(buffer))
print(len(buffer))
list
size
Queue
deque
maxlen

Architecture Example

Let us walk through a realistic data pipeline that combines multiple specialized structures. This pattern appears in event processing systems that ingest user activity streams and produce real-time analytics.
1from collections import Counter, defaultdict, deque
2
3class EventProcessor:
4 def __init__(self):
5 # Track event counts for dashboard
6 self.event_counts = Counter()
7 # Group events by user for personalization
8 self.user_events = defaultdict(list)
9 # Keep last 1000 events for debugging
10 self.recent_events = deque(maxlen=1000)
11 # LRU cache for user profiles
12 self.profile_cache = OrderedDict()
13
14 def process(self, event):
15 self.event_counts[event["type"]] += 1
16 self.user_events[event["user_id"]].append(event)
17 self.recent_events.append(event)
Each structure serves a specific purpose: Counter for real-time metrics, defaultdict for per-user grouping, deque for a bounded debug log, and OrderedDict for LRU caching. Together they form a cohesive system where each component is optimized for its particular access pattern.
Design a Real-Time Analytics CacheStep 1
>

You are the lead data engineer at a streaming analytics company. Your dashboard processes 50,000 events per second and serves metrics to 10,000 concurrent users. The current system is struggling: cache misses cause database queries that take 200ms each, and memory usage keeps climbing. Your team needs you to redesign the caching layer.

events_stream
event_iduser_idevent_typetimestamp
e_001u_42page_view2024-01-15T10:30:00
e_002u_17purchase2024-01-15T10:30:01
e_003u_42click2024-01-15T10:30:02
metric_cache
cache_keyvaluelast_accessedttl_seconds
hourly_views1245010:30:0060
top_users[u_42, u_17]10:29:45300
May 2026
Cache Architecture

Your first decision: which primary data structure should power the cache? You need fast reads, fast writes, and the ability to evict stale entries.

Advanced data structures let you model complex relationships and solve problems that simpler structures cannot. Put your skills to the test with hands-on challenges in the Python Builder.

The collections module structures - Counter, defaultdict, deque, and OrderedDict - are available in every Python environment without additional dependencies. They are the first tools to reach for when basic types feel cumbersome.

Concurrent systems require thread-safe structures. Queue, PriorityQueue, and asyncio-based approaches eliminate data races without the complexity of manual locking. Choosing the right concurrency primitive is as important as choosing the right data structure.

PUTTING IT ALL TOGETHER

> You are a senior data engineer at Spotify building a recommendation engine that counts play events per track, groups playlists by genre without key errors, caches recently scored tracks with an LRU policy, and selects the right structure for a high-throughput concurrent scoring service.

Counter from the collections module tallies per-track play events in one pass, exposing the top-N most-streamed tracks via .most_common().
Custom structures using dataclasses and __slots__ reduce per-object memory overhead for the millions of track-score records held in memory during batch scoring.
Performance profiling with timeit and sys.getsizeof() measures whether the dict-based score store or a namedtuple array is faster and smaller at production scale.
Caching strategies using OrderedDict implement an LRU eviction layer so recently scored tracks are served from memory without re-running the model.
KEY TAKEAWAYS
Counter counts elements in one line - use for frequency analysis
defaultdict auto-initializes missing keys - use for grouping and counting
deque(maxlen=N) provides O(1) operations on both ends - use for sliding windows
namedtuple gives named fields with tuple memory efficiency
@dataclass generates boilerplate; add frozen=True for immutability
__slots__ reduces memory 40-50% for classes with millions of instances
@lru_cache memoizes pure functions automatically with LRU eviction
OrderedDict enables custom LRU caches with move_to_end() and popitem()
Use queue.Queue for thread-safe producer-consumer patterns
Profile before optimizing: measure time with timeit, memory with sys.getsizeof

Specialized collections and scale

Category
Python
Difficulty
advanced
Duration
29 minutes
Challenges
0 hands-on challenges

Topics covered: The collections Module, Custom Structures, Performance Profiling, Caching Strategies, Choosing for Scale

Lesson Sections

  1. The collections Module

    Counter: Counting Made Easy defaultdict: Auto Defaults deque: Double-Ended Queue namedtuple: Lightweight

  2. Custom Structures

    When named tuples are too rigid and plain dicts are too loose, Python offers dataclasses and custom classes with __slots__. These give you mutable records with type hints, default values, comparison methods, and memory optimization - all with minimal boilerplate. dataclasses: Modern Records Frozen Dataclasses __slots__ for Memory

  3. Performance Profiling

    Timing Operations Memory Profiling Profiling should always come before optimization. Measure first, then target the specific bottleneck. Optimizing without data often means spending time on code that is not actually the performance problem.

  4. Caching Strategies

    Caching is one of the most impactful performance techniques in data engineering. By storing the results of expensive computations or database queries, you avoid repeating work. Python provides built-in caching tools, and understanding how to build custom caches using data structures gives you fine-grained control over eviction policies, size limits, and expiration. Using functools.lru_cache Building a Custom LRU Cache The LRU eviction policy works well for workloads with temporal locality - rece

  5. Choosing for Scale

    Concurrent Access Patterns Data Structure at Scale The table below summarizes when to reach for each specialized structure based on your system requirements. These are the patterns that appear in production systems processing millions of records. These patterns are not theoretical. They power some of the largest Python applications in the world. Architecture Example Let us walk through a realistic data pipeline that combines multiple specialized structures. This pattern appears in event processi