Data Structures: Advanced
Dijkstra's shortest-path algorithm, implemented with a min-heap, powers routing in Uber's driver dispatch system, Lyft's route optimization engine, and every GPS navigation device on earth, computing optimal routes through graphs with millions of nodes in milliseconds. The min-heap makes it possible by always surfacing the closest unvisited node in O(log n) time rather than scanning the entire frontier on every step. Amazon's warehouse robots use the same heap-based pathfinding to coordinate hundreds of robots navigating around each other in real time. The advanced data structures in this lesson are the ones behind those systems.
The collections Module
Count, group, and queue with one import
The collections module provides specialized container datatypes that extend Python's built-in list, dict, tuple, and set. Each solves a specific problem more elegantly and efficiently than the general-purpose alternatives. Learning these tools means writing less code that runs faster.
Counter: Counting Made Easy
A Counter is a dictionary subclass designed for counting hashable objects. Instead of manually building a dict with counts, Counter does it in one line. It also provides methods for finding the most common elements, arithmetic between counters, and more.
Counter is invaluable for log analysis, frequency distributions, and any scenario where you need to tally occurrences. It accepts any iterable and counts elements automatically. Accessing a missing key returns 0 instead of raising a KeyError, which is a subtle but important improvement over regular dicts.
defaultdict: Auto Defaults
A defaultdict automatically creates a default value when you access a missing key. Instead of checking if a key exists before using it, you just use it. This eliminates the setdefault() pattern entirely and makes grouping code much cleaner.
The factory function you pass to defaultdict determines the default value type. Pass list for grouping, int for counting, set for unique collections, or any callable that returns the default value you want.
> Count character frequencies using a defaultdict. Pick the factory function that initializes missing keys to zero, and the function that counts how many unique characters exist.
from collections import defaultdict counts = defaultdict() for char in "hello": counts[char] += 1 print(counts["l"]) print((counts))
deque: Double-Ended Queue
A deque (pronounced "deck") supports O(1) append and pop from both ends. Lists are fast at the right end but slow at the left because inserting at position 0 requires shifting every element. Deques solve this with a doubly-linked list implementation.
The maxlen parameter creates a bounded deque that automatically discards old items when full. This is perfect for sliding windows, recent activity logs, and any scenario where you only care about the last N items. No manual size management needed.
- O(1) append to right end
- O(n) insert at left end
- O(1) random index access
- Good for stack operations
- O(1) append to either end
- O(1) pop from either end
- O(n) random index access
- Good for queue operations
> A bounded deque keeps only the last N items. Pick the method that adds to the right end (dropping old items from the left), and the function that shows how many items remain.
from collections import deque recent = deque(maxlen=3) for x in [1, 2, 3, 4, 5]: recent.(x) print(list(recent)) print((recent))
namedtuple: Lightweight
A namedtuple creates an immutable record type with named fields. It uses the same memory as a regular tuple but lets you access fields by name instead of index. This dramatically improves code readability while keeping the memory efficiency of tuples.
Named tuples are ideal for representing fixed records like database rows, CSV records, or API response objects. They are hashable (so they can be dictionary keys), immutable (so they are thread-safe), and self-documenting (so the code is readable).
Custom Structures
Build memory-efficient record types
dataclasses: Modern Records
The @dataclass decorator automatically generates __init__, __repr__, and __eq__ methods from your type-annotated fields. Add frozen=True for immutability, or order=True for comparison operators.
Notice the field(default_factory=list) for the tags field. This avoids the mutable default argument bug from the intermediate lesson. Each instance gets its own fresh list instead of sharing one across all instances.
Frozen Dataclasses
Adding frozen=True makes a dataclass immutable and hashable, combining the safety of tuples with the readability of named fields. Frozen dataclasses are ideal for configuration objects, cache keys, and any data that should not change after creation.
> You need a DbConfig dataclass that can be used as a dictionary key. Pick the decorator option and the expression to verify it works.
from dataclasses import dataclass @dataclass() class DbConfig: host: str port: int c = DbConfig("localhost", 5432) print(type())
__slots__ for Memory
By default, Python objects store attributes in a __dict__ dictionary, which is flexible but memory-hungry. Declaring __slots__ tells Python to allocate a fixed amount of memory for each instance, eliminating the per-instance dict. For millions of objects, this saves gigabytes.
The real savings come from eliminating the __dict__ overhead. For classes with few attributes that you create millions of times, __slots__ can reduce memory by 40-50%. Modern dataclasses also support slots=True in Python 3.10 and later.
Performance Profiling
Measure time and memory trade-offs
Choosing the right data structure requires understanding their actual performance characteristics. Theoretical Big-O complexity tells you the growth rate, but real-world performance depends on constant factors, cache behavior, and memory allocation patterns. Python provides tools like timeit and sys.getsizeof() to measure both time and memory so you can make informed decisions.
Timing Operations
The timeit module runs code snippets thousands of times and reports accurate timing. This eliminates noise from system load and gives reproducible measurements. Always benchmark with realistic data sizes - performance at 100 items may differ dramatically from performance at 1 million.
The difference is staggering. For 100,000 elements with worst-case lookup, sets are thousands of times faster. This is the practical consequence of O(n) versus O(1) that you saw in theory - now you can measure it yourself.
Memory Profiling
Memory usage is equally important, especially when processing large datasets. Each data structure has different overhead per element. Understanding these costs helps you choose structures that fit within your memory budget. Use sys.getsizeof() to measure actual sizes.
> You need to measure how much memory a data structure uses. Pick the correct module to access memory measurement, and the correct function to get the size in bytes.
import data = [1, 2, 3, 4, 5] print(sys.(data)) print(type(data).__name__)
Remember that sets and dicts trade memory for speed. When memory is the constraint, tuples and frozensets are your allies. When speed is the constraint, hash-based structures like sets and dicts give you O(1) operations.
Understanding these performance characteristics helps you reason about which structure belongs in each part of a pipeline. A sliding window uses a deque, a lookup table uses a dict, and a deduplication pass uses a set.
Caching Strategies
Cache results to skip repeated work
Using functools.lru_cache
The simplest caching tool is @lru_cache from functools. It automatically memoizes function results based on arguments. LRU stands for Least Recently Used - when the cache is full, the least recently accessed entry is evicted to make room for new ones.
Notice that the database query message only prints once. The second call returns the cached result without executing the function body. The cache_info() method shows hits, misses, and current size, letting you monitor cache effectiveness.
Building a Custom LRU Cache
For more control over caching behavior, you can build a custom LRU cache using OrderedDict. This approach lets you add TTL (time-to-live) expiration, custom eviction logic, and size-based limits that lru_cache does not support.
The move_to_end() method is the key to LRU behavior. Every access moves the item to the end (most recently used). When the cache exceeds capacity, popitem(last=False) removes the first item (least recently used). This gives you O(1) get, put, and eviction operations.
> This LRU cache reads key "a" but does not move it to the end of the OrderedDict. When a new key is inserted, "a" gets evicted even though it was just accessed.
Logic error: 'a' was evicted despite being the most recently accessed item
- Use lru_cache for pure functions
- Set maxsize based on memory budget
- Monitor cache_info() hit rates
- Use frozen dataclasses as cache keys
- Cache functions with side effects
- Use unbounded caches (maxsize=None)
- Cache mutable return values
- Forget to invalidate stale data
Cache hit rate is the most important metric. A cache with a 90% hit rate saves nine out of ten expensive computations. Monitor it with cache_info() and adjust maxsize if hit rates are too low.
Choosing for Scale
Architect data layers for production
At scale, data structure choices become architectural decisions. A pipeline processing 10 million records per hour cannot afford O(n) lookups inside a loop. A service handling 10,000 concurrent requests needs thread-safe structures. This section covers the judgment calls that distinguish senior engineers: choosing data structures based on real-world constraints like throughput, concurrency, and memory pressure.
Concurrent Access Patterns
When multiple threads or processes access shared data, you need thread-safe structures or synchronization. Python's GIL protects against data corruption for simple operations, but compound operations like "check then update" still need explicit locking. The queue module provides thread-safe alternatives to lists for producer-consumer patterns.
The Queue class handles all synchronization internally. Multiple threads can safely put() and get() without explicit locks. For priority-based processing, use PriorityQueue which always yields the smallest item first.
Data Structure at Scale
> A bounded buffer needs to hold only the last 5 events, automatically discarding older ones when new events arrive. Pick the correct collection type and the correct parameter to set the maximum size.
from collections import buffer = deque(=5) for i in range(8): buffer.append(i) print(list(buffer)) print(len(buffer))
Architecture Example
You are the lead data engineer at a streaming analytics company. Your dashboard processes 50,000 events per second and serves metrics to 10,000 concurrent users. The current system is struggling: cache misses cause database queries that take 200ms each, and memory usage keeps climbing. Your team needs you to redesign the caching layer.
| event_id | user_id | event_type | timestamp |
|---|---|---|---|
| e_001 | u_42 | page_view | 2024-01-15T10:30:00 |
| e_002 | u_17 | purchase | 2024-01-15T10:30:01 |
| e_003 | u_42 | click | 2024-01-15T10:30:02 |
| cache_key | value | last_accessed | ttl_seconds |
|---|---|---|---|
| hourly_views | 12450 | 10:30:00 | 60 |
| top_users | [u_42, u_17] | 10:29:45 | 300 |
Your first decision: which primary data structure should power the cache? You need fast reads, fast writes, and the ability to evict stale entries.
The collections module structures - Counter, defaultdict, deque, and OrderedDict - are available in every Python environment without additional dependencies. They are the first tools to reach for when basic types feel cumbersome.
Concurrent systems require thread-safe structures. Queue, PriorityQueue, and asyncio-based approaches eliminate data races without the complexity of manual locking. Choosing the right concurrency primitive is as important as choosing the right data structure.
> You are a senior data engineer at Spotify building a recommendation engine that counts play events per track, groups playlists by genre without key errors, caches recently scored tracks with an LRU policy, and selects the right structure for a high-throughput concurrent scoring service.
Counter counts elements in one line - use for frequency analysisdefaultdict auto-initializes missing keys - use for grouping and countingdeque(maxlen=N) provides O(1) operations on both ends - use for sliding windowsnamedtuple gives named fields with tuple memory efficiency@dataclass generates boilerplate; add frozen=True for immutability__slots__ reduces memory 40-50% for classes with millions of instances@lru_cache memoizes pure functions automatically with LRU evictionOrderedDict enables custom LRU caches with move_to_end() and popitem()queue.Queue for thread-safe producer-consumer patternsSpecialized collections and scale
- Category
- Python
- Difficulty
- advanced
- Duration
- 29 minutes
- Challenges
- 0 hands-on challenges
Topics covered: The collections Module, Custom Structures, Performance Profiling, Caching Strategies, Choosing for Scale
Lesson Sections
- The collections Module
Counter: Counting Made Easy defaultdict: Auto Defaults deque: Double-Ended Queue namedtuple: Lightweight
- Custom Structures
When named tuples are too rigid and plain dicts are too loose, Python offers dataclasses and custom classes with __slots__. These give you mutable records with type hints, default values, comparison methods, and memory optimization - all with minimal boilerplate. dataclasses: Modern Records Frozen Dataclasses __slots__ for Memory
- Performance Profiling
Timing Operations Memory Profiling Profiling should always come before optimization. Measure first, then target the specific bottleneck. Optimizing without data often means spending time on code that is not actually the performance problem.
- Caching Strategies
Caching is one of the most impactful performance techniques in data engineering. By storing the results of expensive computations or database queries, you avoid repeating work. Python provides built-in caching tools, and understanding how to build custom caches using data structures gives you fine-grained control over eviction policies, size limits, and expiration. Using functools.lru_cache Building a Custom LRU Cache The LRU eviction policy works well for workloads with temporal locality - rece
- Choosing for Scale
Concurrent Access Patterns Data Structure at Scale The table below summarizes when to reach for each specialized structure based on your system requirements. These are the patterns that appear in production systems processing millions of records. These patterns are not theoretical. They power some of the largest Python applications in the world. Architecture Example Let us walk through a realistic data pipeline that combines multiple specialized structures. This pattern appears in event processi