Data Structures: Intermediate

Wikipedia's internal link-traversal system uses a queue-based breadth-first search to systematically explore related articles, processing millions of page connections by always visiting the closest links before venturing further out. The same BFS queue pattern powers every shortest-path feature in modern software, from Uber's driver routing system to Twitter's trending topic propagation. Stacks and queues are the two structures that make graph traversal possible, and understanding how to implement them in Python is what separates a programmer who can solve algorithmic problems from one who cannot.

Nested Data Structures

Daily Life
Interviews

Navigate and reshape nested data

Nested data structures are collections that contain other collections as their elements. This nesting can occur in multiple patterns: a list of dictionaries represents a table of records, similar to rows in a database. A dictionary with list values groups related items by category. A dictionary containing other dictionaries models hierarchical relationships like organizational structures or configuration settings. Understanding these patterns is essential because they mirror the structure of JSON data, database query results, and configuration files that you encounter daily in data engineering.
The key insight is that Python allows any data structure to contain any other data structure. Lists can hold dictionaries, dictionaries can hold lists, and you can nest these combinations arbitrarily deep. This flexibility enables you to model complex real-world data accurately, but it also requires developing intuition about how to navigate and transform these structures efficiently.

Lists of Dicts: Table Data

The most common nested pattern in data engineering is a list of dictionaries, where each dictionary represents a record and the list represents a collection of records, similar to a database table. This structure matches how data arrives from REST APIs, how pandas DataFrames can be converted to native Python, and how many data processing pipelines represent intermediate results. Each dictionary in the list has the same keys, acting like column names, with values representing the data in each cell.
1users = [
2 {"id": 1, "name": "Alice", "role": "engineer", "salary": 95000},
3 {"id": 2, "name": "Bob", "role": "analyst", "salary": 75000},
4 {"id": 3, "name": "Charlie", "role": "engineer", "salary": 105000},
5 {"id": 4, "name": "Diana", "role": "manager", "salary": 120000},
6]
7
8# Access a specific record by index
9first_user = users[0]
10print("First user:", first_user["name"])
11
12# Access a specific field from a specific record
13third_user_role = users[2]["role"]
14print("Third user role:", third_user_role)
15
16# Iterate and access fields
17for user in users:
18 print(f"{user['name']} ({user['role']}): ${user['salary']:,}")
>>>Output
First user: Alice
Third user role: engineer
Alice (engineer): $95,000
Bob (analyst): $75,000
Charlie (engineer): $105,000
Diana (manager): $120,000

Access patterns work in two directions. The expression users[0] selects a row by index, returning a dictionary. The expression ["name"] selects a column by key from that dictionary. You can chain these operations: users[0]["name"] gives "Alice" directly. This two-step access pattern is fundamental to working with tabular data in Python.

When processing lists of dictionaries, you often need to extract a single field from all records, filter records based on conditions, or transform records in some way. These operations become natural once you understand the access patterns.
1users = [
2 {"id": 1, "name": "Alice", "role": "engineer", "salary": 95000},
3 {"id": 2, "name": "Bob", "role": "analyst", "salary": 75000},
4 {"id": 3, "name": "Charlie", "role": "engineer", "salary": 105000},
5]
6
7# Extract all names (like SELECT name FROM users)
8names = [user["name"] for user in users]
9print("All names:", names)
10
11# Filter by condition (like WHERE role = 'engineer')
12engineers = [user for user in users if user["role"] == "engineer"]
13print("Engineers:", [e["name"] for e in engineers])
14
15# Calculate aggregate (like SELECT AVG(salary))
16avg_salary = sum(user["salary"] for user in users) / len(users)
17print(f"Average salary: ${avg_salary:,.0f}")
>>>Output
All names: ['Alice', 'Bob', 'Charlie']
Engineers: ['Alice', 'Charlie']
Average salary: $91,667

Dicts with List Values

When you need to group items by a key, use a dictionary where each value is a list. This pattern is common for categorization, grouping database records by a field, and building inverted indexes. The dictionary provides O(1) lookup by group, while the list stores all members of that group. This is more efficient than scanning through a flat list every time you need items from a specific category.

1# Group users by their department
2departments = {
3 "engineering": ["Alice", "Charlie", "Eve", "George"],
4 "analytics": ["Bob", "Diana"],
5 "management": ["Frank"],
6 "design": ["Hannah", "Ivan"],
7}
8
9# Access a specific department instantly
10print("Engineering team:", departments["engineering"])
11print("Engineering size:", len(departments["engineering"]))
12
13# Check if someone is in a department
14if "Alice" in departments["engineering"]:
15 print("Alice is in engineering")
16
17# Iterate over all departments
18print("\nDepartment sizes:")
19for dept, members in departments.items():
20 print(f" {dept}: {len(members)} members")
>>>Output
Engineering team: ['Alice', 'Charlie', 'Eve', 'George']
Engineering size: 4
Alice is in engineering
 
Department sizes:
engineering: 4 members
analytics: 2 members
management: 1 members
design: 2 members

The .items() method yields key-value pairs as tuples, allowing you to iterate over both the group name and its members simultaneously. This pattern enables efficient lookups: finding all engineers is O(1) dictionary access instead of O(n) scanning through every record. For large datasets, this performance difference is significant.

TIP
Use this pattern when you frequently need to look up all items belonging to a category. Building the grouped dictionary once and reusing it is much faster than filtering a flat list repeatedly.

Nested Dicts: Hierarchical

Dictionaries can contain other dictionaries to represent hierarchical relationships. This pattern mirrors JSON API responses, configuration files, and any tree-like data structure. When working with APIs, you will constantly encounter nested dictionaries representing complex entities with sub-entities. Learning to navigate these structures confidently is essential for API integration work.
1# Typical API response structure
2api_response = {
3 "status": "success",
4 "metadata": {
5 "request_id": "abc123",
6 "timestamp": "2024-01-15T10:30:00Z"
7 },
8 "data": {
9 "user": {
10 "id": 42,
11 "profile": {
12 "name": "Alice Johnson",
13 "email": "alice@example.com",
14 "preferences": {
15 "theme": "dark",
16 "notifications": True
17 }
18 }
19 }
20 }
21}
22
23# Navigate the hierarchy with chained access
24user_name = api_response["data"]["user"]["profile"]["name"]
25print("User name:", user_name)
26
27theme = api_response["data"]["user"]["profile"]["preferences"]["theme"]
28print("Theme preference:", theme)
29
30# Access metadata
31request_id = api_response["metadata"]["request_id"]
32print("Request ID:", request_id)
>>>Output
User name: Alice Johnson
Theme preference: dark
Request ID: abc123

Direct chaining like response["data"]["user"]["profile"] works well when you know the structure exists. However, if any key in the chain is missing, Python raises a KeyError. For uncertain data structures, you need defensive access patterns.

1api_response = {
2 "status": "success",
3 "data": {"user": {"id": 42}}
4}
5
6# Safe navigation with chained get() calls
7# If any key is missing, returns the default instead of crashing
8profile = api_response.get("data", {}).get("user", {}).get("profile", {})
9name = profile.get("name", "Unknown")
10print("Name (safe):", name)
11
12# Check before accessing
13if "profile" in api_response.get("data", {}).get("user", {}):
14 print("Profile exists")
15else:
16 print("No profile found")
>>>Output
Name (safe): Unknown
No profile found

Chaining .get() calls with empty dict defaults {} prevents KeyError exceptions. If any level is missing, the chain returns an empty dict, and subsequent .get() calls safely return their defaults. This pattern is essential when processing API responses that may have optional fields.

Single-level safety
Single-level safety
Use d.get("key", default) to return a fallback if the key is missing
Chained navigation
Chained navigation
Chain .get("a", {}).get("b", val) to walk nested dicts without crashing
Empty dict fallback
Empty dict fallback
Use {} as the intermediate default so the next .get() has a dict to call
Final default matters
Final default matters
The last .get() in the chain should return your actual fallback value

Flat to Nested Structures

Often you receive flat data that needs to be transformed into a nested structure. This is common when processing database query results that need grouping, or when building aggregations from raw event data. The fundamental pattern involves iterating through the flat data and building up the nested structure incrementally, checking for key existence and initializing empty collections as needed.
1# Flat transaction data from a database
2transactions = [
3 {"user_id": 1, "amount": 100, "category": "food"},
4 {"user_id": 2, "amount": 50, "category": "transport"},
5 {"user_id": 1, "amount": 75, "category": "food"},
6 {"user_id": 1, "amount": 200, "category": "electronics"},
7 {"user_id": 2, "amount": 30, "category": "food"},
8 {"user_id": 1, "amount": 45, "category": "transport"},
9]
10
11# Group transactions by user_id
12by_user = {}
13for tx in transactions:
14 user_id = tx["user_id"]
15 if user_id not in by_user:
16 by_user[user_id] = []
17 by_user[user_id].append(tx)
18
19# Now we can efficiently access all transactions for a user
20print("User 1 transactions:", len(by_user[1]))
21print("User 1 total:", sum(tx["amount"] for tx in by_user[1]))
>>>Output
User 1 transactions: 4
User 1 total: 420

This pattern of checking for key existence and initializing an empty list is so common that Python provides a cleaner way to write it. The setdefault() method combines the check and initialization into a single operation.

1transactions = [
2 {"user_id": 1, "amount": 100, "category": "food"},
3 {"user_id": 2, "amount": 50, "category": "transport"},
4 {"user_id": 1, "amount": 75, "category": "food"},
5]
6
7# Cleaner grouping with setdefault()
8by_user = {}
9for tx in transactions:
10 by_user.setdefault(tx["user_id"], []).append(tx)
11
12print("Grouped by user:", {k: len(v) for k, v in by_user.items()})
13
14# Even cleaner: group by category within each user
15by_user_category = {}
16for tx in transactions:
17 user_data = by_user_category.setdefault(tx["user_id"], {})
18 user_data.setdefault(tx["category"], []).append(tx["amount"])
19
20print("User 1 by category:", by_user_category[1])
>>>Output
Grouped by user: {1: 2, 2: 1}
User 1 by category: {'food': [100, 75]}

The .setdefault(key, default) method returns the value if the key exists, or sets it to the default and returns that default if the key is missing. This allows you to chain .append() directly, making the grouping operation a single line. This is more Pythonic than the explicit if-check pattern.

Python Quiz

> Group items and count total occurrences. Pick the dict method that creates missing keys automatically, and the accessor that returns all the stored lists.

groups = {}
for item in ["a", "b", "a", "c", "b"]:
    groups.___(item, []).append(1)
print(len(groups))
total = sum(
    len(v) for v in groups.___()
)
print(total)
get
keys
setdefault
values
update
The list-of-dicts pattern mirrors database table rows exactly. When you receive query results from a database driver, each row is a dictionary and the full result set is a list of those dictionaries.
Navigating nested structures fluently is one of the skills that separates data engineers who work productively with APIs from those who struggle. Practice the chained .get() pattern until it becomes instinctive.

Dict and List Comprehensions

Daily Life
Interviews

Build lists and dicts in one expression

Comprehensions are concise expressions for creating lists, dictionaries, and sets from existing iterables. They combine iteration, transformation, and optional filtering into a single readable line. Beyond being syntactic sugar, comprehensions execute faster than equivalent loops because Python optimizes them internally. They are also considered more Pythonic, expressing intent clearly without the boilerplate of explicit loop construction.
Comprehensions shine when you need to transform data from one shape to another. Extracting specific fields, applying calculations, filtering by conditions, or reshaping structures are all natural uses for comprehensions. However, they should remain readable. If a comprehension becomes too complex, a regular loop is often clearer.

List Comprehension Basics

A list comprehension has the form [expression for item in iterable]. It creates a new list by evaluating the expression once for each item in the iterable. The expression can be any valid Python expression: a simple variable reference, a calculation, a method call, or even a function application.

1# Traditional loop approach
2squares_loop = []
3for x in range(6):
4 squares_loop.append(x ** 2)
5print("Loop result:", squares_loop)
6
7# List comprehension approach
8squares_comp = [x ** 2 for x in range(6)]
9print("Comprehension:", squares_comp)
10
11# More concise and expressive
>>>Output
Loop result: [0, 1, 4, 9, 16, 25]
Comprehension: [0, 1, 4, 9, 16, 25]
The comprehension version is more concise and expresses intent directly: create a list where each element is x squared. The loop version requires understanding the initialization, the iteration, and the accumulation pattern. With comprehensions, the intent is immediately clear.
1# Practical examples with different expressions
2
3names = ["alice", "bob", "charlie", "diana"]
4
5# Transform: uppercase all names
6upper_names = [name.upper() for name in names]
7print("Uppercase:", upper_names)
8
9# Transform: get lengths
10lengths = [len(name) for name in names]
11print("Lengths:", lengths)
12
13# Transform: format strings
14formatted = [f"User: {name.title()}" for name in names]
15print("Formatted:", formatted)
16
17# Transform: extract first character
18initials = [name[0].upper() for name in names]
19print("Initials:", initials)
>>>Output
Uppercase: ['ALICE', 'BOB', 'CHARLIE', 'DIANA']
Lengths: [5, 3, 7, 5]
Formatted: ['User: Alice', 'User: Bob', 'User: Charlie', 'User: Diana']
Initials: ['A', 'B', 'C', 'D']

Filtering with Conditions

Add an if clause to filter which items are included in the result: [expression for item in iterable if condition]. Only items where the condition evaluates to True are processed and included. The condition is evaluated before the expression, so you can safely access properties that might not exist on filtered-out items.

1numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
2
3# Filter: only even numbers
4evens = [n for n in numbers if n % 2 == 0]
5print("Evens:", evens)
6
7# Filter and transform: square the even numbers
8even_squares = [n ** 2 for n in numbers if n % 2 == 0]
9print("Even squares:", even_squares)
10
11# Filter: numbers greater than 5
12large = [n for n in numbers if n > 5]
13print("Greater than 5:", large)
14
15# Multiple conditions with 'and'
16middle = [n for n in numbers if n > 3 and n < 8]
17print("Between 3 and 8:", middle)
>>>Output
Evens: [2, 4, 6, 8, 10]
Even squares: [4, 16, 36, 64, 100]
Greater than 5: [6, 7, 8, 9, 10]
Between 3 and 8: [4, 5, 6, 7]
The filtering happens before the transformation. First Python checks if n is even, and only if True does it compute n squared and add it to the result. This combines the functionality of filter() and map() into a single readable expression.
1# Filtering records from a list of dictionaries
2users = [
3 {"name": "Alice", "active": True, "age": 28, "role": "engineer"},
4 {"name": "Bob", "active": False, "age": 35, "role": "analyst"},
5 {"name": "Charlie", "active": True, "age": 22, "role": "engineer"},
6 {"name": "Diana", "active": True, "age": 31, "role": "manager"},
7]
8
9# Get names of active users
10active_names = [u["name"] for u in users if u["active"]]
11print("Active users:", active_names)
12
13# Get active engineers over 25
14senior_engineers = [
15 u["name"] for u in users
16 if u["active"] and u["role"] == "engineer" and u["age"] > 25
17]
18print("Senior active engineers:", senior_engineers)
>>>Output
Active users: ['Alice', 'Charlie', 'Diana']
Senior active engineers: ['Alice']
Fill in the Blank

> You have a list of user dictionaries with name, active status, and role fields. Pick the extraction expression and filter condition to get only active engineers' names.

users = [
    {{"name": "Alice", "active": True, "role": "engineer"}},
    {{"name": "Bob", "active": False, "role": "analyst"}},
    {{"name": "Charlie", "active": True, "role": "engineer"}},
]
result = [ for u in users if ]
print(result)

Dictionary Comprehensions

Dictionary comprehensions use curly braces with a key-value pair: {key_expr: value_expr for item in iterable}. They are essential for transforming dictionaries, filtering dictionary entries, building lookups from lists, and inverting key-value relationships.

1# Create a dict from parallel lists
2names = ["alice", "bob", "charlie"]
3scores = [85, 92, 78]
4
5# zip() pairs elements from both lists
6score_dict = {name: score for name, score in zip(names, scores)}
7print("Scores:", score_dict)
8
9# Transform values: double all scores
10doubled = {name: score * 2 for name, score in score_dict.items()}
11print("Doubled:", doubled)
12
13# Transform keys: uppercase names
14upper_keys = {name.upper(): score for name, score in score_dict.items()}
15print("Upper keys:", upper_keys)
16
17# Filter: only passing scores (>= 80)
18passing = {name: score for name, score in score_dict.items() if score >= 80}
19print("Passing:", passing)
>>>Output
Scores: {'alice': 85, 'bob': 92, 'charlie': 78}
Doubled: {'alice': 170, 'bob': 184, 'charlie': 156}
Upper keys: {'ALICE': 85, 'BOB': 92, 'CHARLIE': 78}
Passing: {'alice': 85, 'bob': 92}

The zip() function pairs elements from two iterables, creating tuples that you can unpack in the comprehension. Combined with a dict comprehension, it creates dictionaries from parallel lists in one line. This is a very common pattern for data transformation.

Inverting Dictionaries

A common operation is inverting a dictionary: swapping keys and values to create a reverse lookup. This is useful when you have a mapping in one direction but need to look up in the other direction. For example, you might have user IDs mapped to usernames, but need to find a user ID given a username.
1# Country code to name mapping
2codes_to_names = {
3 "US": "United States",
4 "UK": "United Kingdom",
5 "CA": "Canada",
6 "AU": "Australia"
7}
8
9# Invert: create name to code lookup
10names_to_codes = {name: code for code, name in codes_to_names.items()}
11print("Inverted:", names_to_codes)
12
13# Now we can look up codes by name
14print("Canada's code:", names_to_codes["Canada"])
15print("Australia's code:", names_to_codes["Australia"])
>>>Output
Inverted: {'United States': 'US', 'United Kingdom': 'UK', 'Canada': 'CA', 'Australia': 'AU'}
Canada's code: CA
Australia's code: AU

Inversion only works cleanly when values are unique. If multiple keys share the same value, later entries overwrite earlier ones. When values are not unique, you would need to group keys into lists using the setdefault pattern or defaultdict.

Loop Approach
  • More lines of code
  • Explicit step-by-step logic
  • Easier to debug complex logic
  • Better for multi-step operations
Comprehension
  • Single expression
  • Declarative style
  • Faster execution
  • Better for simple transforms

Nested Comprehensions

Comprehensions can iterate over multiple sequences or flatten nested structures. The order of for clauses matches nested loops: outer loops come first, reading left to right. While powerful, nested comprehensions can become hard to read. Use them for simple cases like flattening or creating coordinate pairs, but prefer explicit loops for complex logic.
1# Flatten a list of lists
2matrix = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
3flat = [num for row in matrix for num in row]
4print("Flattened:", flat)
5
6# Create coordinate pairs (Cartesian product)
7rows = [0, 1, 2]
8cols = ["a", "b"]
9coords = [(r, c) for r in rows for c in cols]
10print("Coordinates:", coords)
11
12# Filter within nested comprehension
13# Get all even numbers from a matrix
14matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
15evens = [num for row in matrix for num in row if num % 2 == 0]
16print("Even numbers:", evens)
>>>Output
Flattened: [1, 2, 3, 4, 5, 6, 7, 8, 9]
Coordinates: [(0, 'a'), (0, 'b'), (1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]
Even numbers: [2, 4, 6, 8]
TIP
If a comprehension becomes hard to read or requires comments to explain, convert it to a regular loop. Code readability trumps conciseness. A good rule of thumb: if it does not fit on one line or needs more than one condition, consider a loop instead.

Set Comprehensions

Set comprehensions use curly braces like dictionaries, but with single values instead of key-value pairs: {expression for item in iterable}. They automatically deduplicate results, making them perfect for extracting unique values from collections.

1# Extract unique first letters from words
2words = ["apple", "apricot", "banana", "blueberry", "cherry", "coconut"]
3first_letters = {word[0] for word in words}
4print("Unique first letters:", first_letters)
5
6# Extract unique departments from employee records
7employees = [
8 {"name": "Alice", "dept": "Engineering"},
9 {"name": "Bob", "dept": "Analytics"},
10 {"name": "Charlie", "dept": "Engineering"},
11 {"name": "Diana", "dept": "Analytics"},
12 {"name": "Eve", "dept": "Design"},
13]
14departments = {emp["dept"] for emp in employees}
15print("Unique departments:", departments)
16
17# Count: we have 5 employees but only 3 unique departments
18print(f"Employees: {len(employees)}, Departments: {len(departments)}")
>>>Output
Unique first letters: {'a', 'b', 'c'}
Unique departments: {'Engineering', 'Analytics', 'Design'}
Employees: 5, Departments: 3
Set comprehensions combine iteration, transformation, and deduplication in one expression. They are particularly useful when you need to answer questions like "what are all the unique values of this field?" or "what categories exist in this dataset?"

Set Operations

Daily Life
Interviews

Compare datasets with set math

Sets support mathematical operations that are invaluable for data comparison and analysis tasks. Union combines elements from multiple sets. Intersection finds elements common to all sets. Difference finds elements unique to one set. Symmetric difference finds elements in either set but not both. These operations execute in near-constant O(n) time regardless of set size, making them dramatically more efficient than nested loops for comparison tasks.

In data engineering, set operations are essential for data reconciliation, deduplication, access control analysis, and finding differences between datasets. Understanding these operations lets you answer questions like "which users have access to both systems?" or "which records exist in the source but not the destination?" with simple, efficient code.

Union - Combining Sets

Union returns all unique elements from both sets combined. Duplicates are automatically removed since sets only store unique values. Use the | operator or the .union() method. The method form accepts any iterable, not just sets.

1# Users with access to different systems
2system_a_users = {"alice", "bob", "charlie", "diana"}
3system_b_users = {"bob", "diana", "eve", "frank"}
4
5# All users with access to either system (or both)
6all_users = system_a_users | system_b_users
7print("All users:", sorted(all_users))
8print("Total unique users:", len(all_users))
9
10# Same result with method (can accept any iterable)
11all_users_method = system_a_users.union(system_b_users)
12print("Union method result:", sorted(all_users_method))
13
14# Union multiple sets at once
15system_c_users = {"george", "alice"}
16all_three = system_a_users | system_b_users | system_c_users
17print("All three systems:", sorted(all_three))
>>>Output
All users: ['alice', 'bob', 'charlie', 'diana', 'eve', 'frank']
Total unique users: 6
Union method result: ['alice', 'bob', 'charlie', 'diana', 'eve', 'frank']
All three systems: ['alice', 'bob', 'charlie', 'diana', 'eve', 'frank', 'george']
Notice that bob and diana appear in both original sets but only once in the union. Sets automatically handle deduplication, making union perfect for merging user lists, combining tags or categories, or aggregating items from multiple sources.

Finding Common Elements

Intersection returns only elements present in all sets. Use the & operator or the .intersection() method. This operation is symmetric: A & B equals B & A.

1# Find users with access to both systems
2system_a = {"alice", "bob", "charlie", "diana"}
3system_b = {"bob", "diana", "eve", "frank"}
4
5both_systems = system_a & system_b
6print("Access to both:", both_systems)
7
8# Practical example: skill matching for job applications
9job_requires = {"python", "sql", "spark", "airflow", "aws"}
10candidate_has = {"python", "sql", "java", "docker", "kubernetes", "aws"}
11
12matching_skills = job_requires & candidate_has
13missing_skills = job_requires - candidate_has
14
15print("Matching skills:", matching_skills)
16print("Missing skills:", missing_skills)
17print(f"Match rate: {len(matching_skills)}/{len(job_requires)} = {len(matching_skills)/len(job_requires):.0%}")
>>>Output
Access to both: {'bob', 'diana'}
Matching skills: {'python', 'sql', 'aws'}
Missing skills: {'spark', 'airflow'}
Match rate: 3/5 = 60%

Intersection is invaluable for access control analysis, skill matching, finding common customers between segments, and identifying overlapping records between datasets. The O(1) lookup time of sets makes these operations efficient even for large datasets.

Difference: Unique Elements

Difference returns elements in the first set that are not in the second. Use the - operator or the .difference() method. Unlike union and intersection, difference is not symmetric: A - B is different from B - A.

1# Find users unique to each system
2system_a = {"alice", "bob", "charlie"}
3system_b = {"bob", "diana", "eve"}
4
5only_in_a = system_a - system_b
6only_in_b = system_b - system_a
7
8print("Only in system A:", only_in_a)
9print("Only in system B:", only_in_b)
10
11# Practical: validate required fields in data
12required_fields = {"id", "name", "email", "created_at", "status"}
13provided_fields = {"id", "name", "phone", "address"}
14
15missing = required_fields - provided_fields
16extra = provided_fields - required_fields
17
18print("\nMissing required fields:", missing)
19print("Extra fields provided:", extra)
>>>Output
Only in system A: {'alice', 'charlie'}
Only in system B: {'diana', 'eve'}
 
Missing required fields: {'email', 'created_at', 'status'}
Extra fields provided: {'phone', 'address'}
Difference is essential for validation tasks, detecting changes between versions, finding gaps in data coverage, and identifying what needs to be added or removed during synchronization. The non-symmetric nature means you must think carefully about which set comes first.

Symmetric Diff and Subsets

Symmetric difference returns elements in either set but not both, using the ^ operator. Subset and superset checking use <= and >= operators to test containment relationships.

1# Symmetric difference: what changed between configs?
2old_features = {"dark_mode", "notifications", "sync"}
3new_features = {"dark_mode", "analytics", "export"}
4
5changed = old_features ^ new_features
6print("Changed features:", changed)
7
8# This equals: (old | new) - (old & new)
9also_changed = (old_features | new_features) - (old_features & new_features)
10print("Same result:", also_changed)
11
12# Subset checking: does user have required permissions?
13required_permissions = {"read", "write"}
14user_permissions = {"read", "write", "delete", "admin"}
15
16has_required = required_permissions <= user_permissions
17print(f"\nUser has required permissions: {has_required}")
18
19# Is this a subset? (all elements of A are in B)
20print(f"required is subset of user: {required_permissions <= user_permissions}")
21print(f"user is superset of required: {user_permissions >= required_permissions}")
>>>Output
Changed features: {'notifications', 'sync', 'analytics', 'export'}
Same result: {'notifications', 'sync', 'analytics', 'export'}
 
User has required permissions: True
required is subset of user: True
user is superset of required: True
a | ba & ba - ba ^ ba <= b
a | b
Union
All unique from both sets
a & b
Intersection
Only elements in both
a - b
Difference
In a but not in b only
a ^ b
Symmetric diff
In one but not in both
a <= b
Subset check
True if all of a is in b

Data Reconciliation

Set operations shine in data reconciliation tasks where you need to compare records between systems, find discrepancies, and ensure data consistency. This is a common requirement when synchronizing databases, validating ETL pipelines, or auditing data migrations.
1# Record IDs from source database and data warehouse
2source_ids = {101, 102, 103, 104, 105, 106, 107, 108}
3warehouse_ids = {102, 103, 105, 107, 109, 110, 111}
4
5# Complete reconciliation analysis
6in_sync = source_ids & warehouse_ids
7missing_from_warehouse = source_ids - warehouse_ids
8orphaned_in_warehouse = warehouse_ids - source_ids
9total_discrepancies = source_ids ^ warehouse_ids
10
11print(f"Records in sync: {len(in_sync)} - {sorted(in_sync)}")
12print(f"Missing from warehouse: {len(missing_from_warehouse)} - {sorted(missing_from_warehouse)}")
13print(f"Orphaned in warehouse: {len(orphaned_in_warehouse)} - {sorted(orphaned_in_warehouse)}")
14print(f"Total discrepancies: {len(total_discrepancies)}")
15
16# Calculate sync percentage
17sync_rate = len(in_sync) / len(source_ids)
18print(f"\nSync rate: {sync_rate:.1%}")
>>>Output
Records in sync: 4 - [102, 103, 105, 107]
Missing from warehouse: 4 - [101, 104, 106, 108]
Orphaned in warehouse: 3 - [109, 110, 111]
Total discrepancies: 7
 
Sync rate: 50.0%
Debug Challenge

> This code tries to find source records missing from the warehouse, but the set difference operands are reversed. It shows warehouse-only records instead.

Logic error: result shows {6} but should show records in source not in warehouse

Set operations are invaluable for data reconciliation. Comparing source and destination record IDs with a single - operation is far cleaner than writing a loop that checks each source ID against the destination.
The symmetric difference operator ^ gives you everything that changed between two snapshots of a dataset. If yesterday's IDs are A and today's are B, then A ^ B shows exactly what was added or removed.
Subset checking with <= lets you verify that required columns exist in a query result or that required permissions are present in a user's role set, all without writing any loop logic.

Sorting and Filtering

Daily Life
Interviews

Sort and filter records like SQL

Sorting and filtering are fundamental data operations that are often combined to answer analytical questions. Python provides flexible tools for both: the sorted() function with custom keys enables sophisticated ordering, while comprehensions and the filter() function provide powerful selection capabilities. Mastering the combination of these operations enables you to write complex data queries that rival SQL in expressiveness.

The general pattern for data queries is: filter to select relevant records, sort to order them appropriately, then optionally slice to limit results. This mirrors the SELECT...WHERE...ORDER BY...LIMIT pattern in SQL and appears constantly in data processing code.

Sorting with Custom Keys

The sorted() function accepts a key parameter that specifies how to extract a comparison value from each element. The key function is called once per element, and elements are sorted based on the returned values. This enables sorting complex objects by any attribute or computed property.

1employees = [
2 {"name": "Charlie", "age": 35, "salary": 75000, "dept": "Engineering"},
3 {"name": "Alice", "age": 28, "salary": 95000, "dept": "Engineering"},
4 {"name": "Bob", "age": 42, "salary": 85000, "dept": "Analytics"},
5 {"name": "Diana", "age": 31, "salary": 72000, "dept": "Design"},
6]
7
8# Sort by age (ascending by default)
9by_age = sorted(employees, key=lambda e: e["age"])
10print("By age:", [e["name"] for e in by_age])
11
12# Sort by salary (descending)
13by_salary = sorted(employees, key=lambda e: e["salary"], reverse=True)
14print("By salary (desc):", [e["name"] for e in by_salary])
15
16# Sort alphabetically by name
17by_name = sorted(employees, key=lambda e: e["name"])
18print("By name:", [e["name"] for e in by_name])
>>>Output
By age: ['Alice', 'Diana', 'Charlie', 'Bob']
By salary (desc): ['Alice', 'Bob', 'Charlie', 'Diana']
By name: ['Alice', 'Bob', 'Charlie', 'Diana']

The key function is called once per element to extract the sort value. The reverse=True parameter sorts in descending order. Lambda functions are commonly used for simple key extraction, but you can use any callable.

Multi-Level Sorting

To sort by multiple criteria, return a tuple from the key function. Python compares tuples element by element, creating a natural multi-level sort. The first element is the primary sort key, the second is the secondary sort key used to break ties, and so on.

1records = [
2 {"dept": "Engineering", "name": "Zara", "years": 3},
3 {"dept": "Analytics", "name": "Alice", "years": 5},
4 {"dept": "Engineering", "name": "Bob", "years": 7},
5 {"dept": "Analytics", "name": "Charlie", "years": 5},
6 {"dept": "Engineering", "name": "Diana", "years": 3},
7]
8
9# Negate years for descending within ascending sort
10sorted_records = sorted(
11 records,
12 key=lambda r: (r["dept"], -r["years"], r["name"])
13)
14
15print("Sorted by dept, then years (desc), then name:")
16for r in sorted_records:
17 print(f" {r['dept']}: {r['name']} ({r['years']} years)")
>>>Output
Sorted by dept, then years (desc), then name:
Analytics: Alice (5 years)
Analytics: Charlie (5 years)
Engineering: Bob (7 years)
Engineering: Diana (3 years)
Engineering: Zara (3 years)

The tuple (r["dept"], -r["years"], r["name"]) sorts first by department alphabetically, then by years in descending order (negation inverts numeric sorting), then by name to break any remaining ties. This gives you fine-grained control over sort order.

Python Quiz

> Sort fruit names by their length so the shortest comes first. Pick the function that returns a new sorted list and the key function that measures string length.

words = ["banana", "fig", "apple", "kiwi"]
result = ___(words, key=___)
print(result[0])
print(result[-1])
len
list
str
sorted
reversed

Filter, Sort, and Slice

Real data queries often combine filtering, sorting, and limiting results. The pattern is: filter with a comprehension to select relevant records, sort with sorted() to order them, then slice with [:n] to limit the result count. This pipeline approach mirrors SQL queries and is natural to read.
1orders = [
2 {"id": 1, "customer": "alice", "amount": 150, "status": "done"},
3 {"id": 2, "customer": "bob", "amount": 50, "status": "pending"},
4 {"id": 3, "customer": "alice", "amount": 300, "status": "done"},
5 {"id": 4, "customer": "bob", "amount": 200, "status": "done"},
6]
7
8# Filter completed, sort by amount descending, take top 2
9top_done = sorted(
10 [o for o in orders if o["status"] == "done"],
11 key=lambda o: o["amount"],
12 reverse=True
13)[:2]
14
15for o in top_done:
16 print(f"Order {o['id']}: {o['customer']} - ${o['amount']}")
17
18# Alice's orders
19alice = [o for o in orders if o["customer"] == "alice"]
20print(f"Alice total: ${sum(o['amount'] for o in alice)}")
>>>Output
Order 3: alice - $300
Order 4: bob - $200
Alice total: $450
TIP
For very large datasets where you only need the top N items, consider using heapq.nlargest() or heapq.nsmallest() instead of sorting the entire list. These functions are more efficient when N is much smaller than the list size.

itertools.groupby Grouping

The itertools.groupby() function groups consecutive elements that share the same key value. Important: the data must be sorted by the grouping key first, because groupby only groups consecutive matches. This enables SQL-like GROUP BY operations on sorted data.

1from itertools import groupby
2
3sales = [
4 {"region": "West", "product": "A", "amount": 100},
5 {"region": "East", "product": "B", "amount": 200},
6 {"region": "West", "product": "B", "amount": 150},
7 {"region": "East", "product": "A", "amount": 300},
8 {"region": "West", "product": "A", "amount": 80},
9]
10
11# Must sort by grouping key first!
12sales_sorted = sorted(sales, key=lambda s: s["region"])
13
14# Group and aggregate by region
15print("Sales by region:")
16for region, group in groupby(sales_sorted, key=lambda s: s["region"]):
17 # Must convert iterator to list
18 items = list(group)
19 total = sum(s["amount"] for s in items)
20 print(f" {region}: ${total} ({len(items)} sales)")
>>>Output
Sales by region:
East: $500 (2 sales)
West: $330 (3 sales)

The groupby() function yields (key, group_iterator) pairs. The group is an iterator, not a list, so you must convert it with list(group) if you need to iterate it multiple times or access its length. This pattern enables powerful aggregations similar to SQL GROUP BY.

Data Structure Selection

Daily Life
Interviews

Choose structures by access pattern

Choosing the right data structure is one of the most important decisions in programming. The wrong choice can make code slow, memory-hungry, or unnecessarily complex. Understanding the strengths and trade-offs of each structure helps you make informed decisions that balance readability, performance, and memory usage for your specific use case.
The key insight is that different data structures optimize for different operations. Lists are great for ordered, indexed access. Dictionaries excel at key-based lookup. Sets provide fast membership testing. Choosing well means understanding which operations your code performs most frequently and selecting the structure that makes those operations efficient.

Lists vs Sets: Membership

When checking if an item exists in a collection, sets are dramatically faster than lists. Lists scan sequentially from the beginning, making membership testing O(n). Sets use hash tables for near-instant O(1) lookups. For small collections the difference is negligible, but for thousands of items, sets can be hundreds of times faster.

1# Simulating a blocklist of user IDs
2blocked_list = list(range(10000))
3blocked_set = set(blocked_list)
4
5# Both work, but performance differs dramatically
6user_id = 9999
7
8# List: potentially scans all 10,000 items
9in_list = user_id in blocked_list
10print(f"Found in list: {in_list}")
11
12# Set: instant hash table lookup
13in_set = user_id in blocked_set
14print(f"Found in set: {in_set}")
15
16# For 10,000 items:
17# - List lookup: up to 10,000 comparisons (O(n))
18# - Set lookup: ~1 hash + comparison (O(1))
>>>Output
Found in list: True
Found in set: True
Both return True, but the computational work is vastly different. The set lookup computes a hash and does one comparison. The list lookup potentially compares against every element. Always convert lists to sets when you need repeated membership testing.
Use Lists When
  • Order of elements matters
  • Duplicates are meaningful
  • Index-based access is common
  • Sequential iteration needed
Use Sets When
  • Only unique values matter
  • Membership testing is frequent
  • Set operations needed
  • Order is irrelevant

dict vs list Lookup

When you need to find records by a specific field, dictionaries provide O(1) lookup while lists require O(n) scanning. If you frequently look up records by ID, name, or any other unique key, building a dictionary indexed by that key transforms slow linear searches into instant hash lookups.

1# User records as a list
2users_list = [
3 {"id": 1, "name": "Alice", "email": "alice@example.com"},
4 {"id": 2, "name": "Bob", "email": "bob@example.com"},
5 {"id": 3, "name": "Charlie", "email": "charlie@example.com"},
6]
7
8# Finding user by ID in list requires scanning
9def find_user_in_list(user_id):
10 for user in users_list:
11 if user["id"] == user_id:
12 return user
13 return None
14
15# Build a dictionary indexed by ID
16users_by_id = {user["id"]: user for user in users_list}
17
18# Now lookup is instant
19print("List lookup:", find_user_in_list(2)["name"])
20print("Dict lookup:", users_by_id[2]["name"])
21
22# Can also index by other fields
23users_by_email = {user["email"]: user for user in users_list}
24print("By email:", users_by_email["charlie@example.com"]["name"])
>>>Output
List lookup: Bob
Dict lookup: Bob
By email: Charlie

Building the dictionary is O(n), but each subsequent lookup is O(1). If you look up records more than once, the preprocessing cost is repaid. For APIs or data processing that repeatedly access records by key, always build lookup dictionaries.

Tuples vs Lists: Immutable

Tuples are immutable sequences. Use them when you need a fixed record that should not be modified, like coordinates, RGB colors, or database rows. Tuples use slightly less memory than lists and, critically, can be used as dictionary keys since they are hashable.

1# Tuples as lightweight immutable records
2point = (10, 20)
3rgb_color = (255, 128, 0)
4
5# Tuples can be dictionary keys (lists cannot!)
6location_names = {
7 (40.7128, -74.0060): "New York City",
8 (51.5074, -0.1278): "London",
9 (35.6762, 139.6503): "Tokyo",
10}
11
12coords = (40.7128, -74.0060)
13print(f"Location: {location_names[coords]}")
14
15# Named tuples provide field names for clarity
16from collections import namedtuple
17
18User = namedtuple("User", ["id", "name", "email"])
19user = User(1, "Alice", "alice@example.com")
20
21print(f"User: {user.name} ({user.email})")
22print(f"User ID: {user.id}")
>>>Output
Location: New York City
User: Alice (alice@example.com)
User ID: 1

Named tuples from collections.namedtuple provide the benefits of tuples (immutability, hashability) with the readability of named fields. For modern Python, consider @dataclass(frozen=True) which offers similar benefits with additional features like default values and type hints.

Choosing by Operation Type

Your data structure choice should be driven by which operations you perform most frequently. Each structure is optimized for different access patterns, and choosing well can make the difference between code that runs in seconds versus hours on large datasets.
01
list
Append/pop from end in O(1) amortized time. Best for ordered sequences.
02
collections.deque
Append/pop from both ends in O(1). Use when you need a double-ended queue.
03
set
O(1) membership testing. Convert lists to sets for repeated "in" checks.
04
dict
O(1) key-value lookup. Preserves insertion order since Python 3.7.
05
heapq
Priority queue operations. Use when you always need the smallest or largest item.
06
Counter
Automatic counting from collections module. Tallies occurrences in one line.

Memory Considerations

Data structure choice affects memory usage significantly. Sets and dictionaries have overhead for their hash tables. Lists are more compact but slower for lookups. For very large datasets, consider generators that process items one at a time instead of loading everything into memory.
1import sys
2
3# Compare memory usage for 1000 integers
4data = list(range(1000))
5
6as_list = list(data)
7as_tuple = tuple(data)
8as_set = set(data)
9
10print(f"List: {sys.getsizeof(as_list):,} bytes")
11print(f"Tuple: {sys.getsizeof(as_tuple):,} bytes")
12print(f"Set: {sys.getsizeof(as_set):,} bytes")
13
14# Sets use ~4x more memory but provide O(1) lookup
>>>Output
List: 8,856 bytes
Tuple: 8,048 bytes
Set: 32,984 bytes

Sets use significantly more memory due to their hash table structure, but this overhead enables O(1) membership testing. The trade-off is worthwhile when lookup speed matters more than memory. Tuples are slightly smaller than lists because they do not allocate extra space for potential growth.

Do
  • Profile before optimizing
  • Start with the simplest structure
  • Convert to specialized types when needed
  • Document why you chose each structure
Don't
  • Prematurely optimize
  • Use lists for frequent membership tests
  • Forget memory for large datasets
  • Assume one structure fits all cases
Python Quiz

> You have a list of user IDs and need to quickly check if a given ID exists. Pick the correct type to convert the list into for O(1) lookups, and the correct operator to test membership.

ids = [101, 202, 303, 404]
lookup = ___(ids)
print(202 ___ lookup)
print(len(lookup))
tuple
is
in
set
dict

Common Mistakes

Even experienced developers make mistakes with data structures. These pitfalls are common enough that interviewers specifically test for them. Recognizing and avoiding these patterns helps you write more robust, predictable code.

Modifying During Iteration

Modifying a collection while iterating over it causes unpredictable behavior. Items may be skipped or processed twice because the iteration index gets out of sync with the changing collection size. Always create a copy or build a new collection instead.
1numbers = [1, 2, 3, 4, 5, 6]
2# This would skip elements:
3# for n in numbers:
4# if n % 2 == 0:
5# numbers.remove(n) # DON'T DO THIS
6
7# RIGHT: Build a new list with comprehension
8numbers = [1, 2, 3, 4, 5, 6]
9odds_only = [n for n in numbers if n % 2 != 0]
10print("New list (odds):", odds_only)
11
12# RIGHT: Iterate over a copy to modify
13numbers = [1, 2, 3, 4, 5, 6]
14# [:] creates a shallow copy
15for n in numbers[:]:
16 if n % 2 == 0:
17 numbers.remove(n)
18print("Modified original:", numbers)
>>>Output
New list (odds): [1, 3, 5]
Modified original: [1, 3, 5]

Mutable Default Arguments

Default argument values are created once when the function is defined, not each time the function is called. If the default is mutable (like a list or dict), modifications persist across function calls, leading to surprising bugs.

1# WRONG: Mutable default argument
2def add_item_wrong(item, items=[]):
3 items.append(item)
4 return items
5
6print(add_item_wrong("a"))
7# Prints ['a', 'b'] - where did 'a' come from?
8print(add_item_wrong("b"))
9
10# RIGHT: Use None and create inside function
11def add_item_right(item, items=None):
12 if items is None:
13 items = []
14 items.append(item)
15 return items
16
17print(add_item_right("x"))
18print(add_item_right("y"))
>>>Output
['a']
['a', 'b']
['x']
['y']

Always use None as the default for mutable arguments and create the actual mutable object inside the function body. This is one of Python's most infamous gotchas and a favorite interview question.

Shallow vs Deep Copy

Assignment creates a reference, not a copy. Shallow copy (slice or .copy()) duplicates the outer structure but shares nested objects. Deep copy duplicates everything. Confusing these leads to mysterious bugs where modifying one variable affects another.

1import copy
2
3original = [[1, 2], [3, 4]]
4
5# Reference: same object
6ref = original
7ref[0][0] = 99
8print("Reference:", original)
9
10# Shallow: same inner
11original = [[1, 2], [3, 4]]
12shallow = original[:]
13shallow[0][0] = 88
14print("Shallow:", original)
15
16# Deep copy: fully independent
17original = [[1, 2], [3, 4]]
18deep = copy.deepcopy(original)
19deep[0][0] = 77
20print("Deep:", original)
>>>Output
Reference: [[99, 2], [3, 4]]
Shallow: [[88, 2], [3, 4]]
Deep: [[1, 2], [3, 4]]
Reference: b = a
Reference: b = a
Both variables point to the same object. All changes are shared.
Shallow: b = a[:] or .copy()
Shallow: b = a[:] or .copy()
New outer container, but nested objects are still shared references.
Deep: copy.deepcopy(a)
Deep: copy.deepcopy(a)
Fully independent clone of the entire structure including all nested data.

Use copy.deepcopy() when you need a completely independent copy of nested structures. This is especially important when working with data you receive from elsewhere and do not want to accidentally modify, or when passing data to functions that might mutate it.

Choosing the right data structure can make the difference between elegant and unwieldy code. Put these techniques to the test with hands-on challenges in the Python Builder.
PUTTING IT ALL TOGETHER

> You are a data engineer at Indeed building a pipeline that parses nested API responses for job postings, transforms salary fields with comprehensions, reconciles active versus expired listing IDs using set operations, and sorts filtered results by relevance score.

Nested data structures mirror the API response format where each job posting is a dict containing a list of required skills and a nested dict of salary ranges.
dict and list comprehensions flatten and transform the nested posting data into a clean list of records in a single readable expression.
set operations compute the difference between active listing IDs and expired ones, producing only the postings that should be surfaced to candidates.
Sorting and filtering chain together to rank the reconciled postings by descending relevance score and remove any below the quality threshold.
KEY TAKEAWAYS
Lists of dicts model tables; dicts with list values enable grouping
.setdefault() and defaultdict simplify building nested structures
Chain .get() with empty dict defaults for safe nested access
Comprehensions are faster and more readable than equivalent loops
Dict comprehension syntax: {k: v for k, v in items}
Set operations: | union, & intersection, - difference, ^ symmetric diff
Multi-key sort uses tuples: key=lambda x: (x["a"], -x["b"])
Use sets for O(1) membership testing; dicts for O(1) key lookup
Never modify a collection while iterating over it
Use None as default for mutable function arguments
Use copy.deepcopy() for fully independent copies of nested data

Complex data manipulation patterns

Category
Python
Difficulty
intermediate
Duration
48 minutes
Challenges
0 hands-on challenges

Topics covered: Nested Data Structures, Dict and List Comprehensions, Set Operations, Sorting and Filtering, Data Structure Selection

Lesson Sections

  1. Nested Data Structures

    Nested data structures are collections that contain other collections as their elements. This nesting can occur in multiple patterns: a list of dictionaries represents a table of records, similar to rows in a database. A dictionary with list values groups related items by category. A dictionary containing other dictionaries models hierarchical relationships like organizational structures or configuration settings. Understanding these patterns is essential because they mirror the structure of JSO

  2. Dict and List Comprehensions

    Comprehensions are concise expressions for creating lists, dictionaries, and sets from existing iterables. They combine iteration, transformation, and optional filtering into a single readable line. Beyond being syntactic sugar, comprehensions execute faster than equivalent loops because Python optimizes them internally. They are also considered more Pythonic, expressing intent clearly without the boilerplate of explicit loop construction. Comprehensions shine when you need to transform data fro

  3. Set Operations

    In data engineering, set operations are essential for data reconciliation, deduplication, access control analysis, and finding differences between datasets. Understanding these operations lets you answer questions like "which users have access to both systems?" or "which records exist in the source but not the destination?" with simple, efficient code. Union - Combining Sets Notice that bob and diana appear in both original sets but only once in the union. Sets automatically handle deduplication

  4. Sorting and Filtering

    Sorting and filtering are fundamental data operations that are often combined to answer analytical questions. Python provides flexible tools for both: the sorted() function with custom keys enables sophisticated ordering, while comprehensions and the filter() function provide powerful selection capabilities. Mastering the combination of these operations enables you to write complex data queries that rival SQL in expressiveness. Sorting with Custom Keys Multi-Level Sorting Filter, Sort, and Slice

  5. Data Structure Selection

    Choosing the right data structure is one of the most important decisions in programming. The wrong choice can make code slow, memory-hungry, or unnecessarily complex. Understanding the strengths and trade-offs of each structure helps you make informed decisions that balance readability, performance, and memory usage for your specific use case. The key insight is that different data structures optimize for different operations. Lists are great for ordered, indexed access. Dictionaries excel at ke