Data Structures: Intermediate

Wikipedia's internal link-traversal system uses a queue-based breadth-first search to systematically explore related articles, processing millions of page connections by always visiting the closest links before venturing further out. The same BFS queue pattern powers every shortest-path feature in modern software, from Uber's driver routing system to Twitter's trending topic propagation. Stacks and queues are the two structures that make graph traversal possible, and understanding how to implement them in Python is what separates a programmer who can solve algorithmic problems from one who cannot.

Nested Data Structures

Navigate and reshape nested data

Nested data structures are collections that contain other collections as their elements. This nesting can occur in multiple patterns: a list of dictionaries represents a table of records, similar to rows in a database. A dictionary with list values groups related items by category. A dictionary containing other dictionaries models hierarchical relationships like organizational structures or configuration settings. Understanding these patterns is essential because they mirror the structure of JSON data, database query results, and configuration files that you encounter daily in data engineering.

The key insight is that Python allows any data structure to contain any other data structure. Lists can hold dictionaries, dictionaries can hold lists, and you can nest these combinations arbitrarily deep. This flexibility enables you to model complex real-world data accurately, but it also requires developing intuition about how to navigate and transform these structures efficiently.

Lists of Dicts: Table Data

The most common nested pattern in data engineering is a list of dictionaries, where each dictionary represents a record and the list represents a collection of records, similar to a database table. This structure matches how data arrives from REST APIs, how pandas DataFrames can be converted to native Python, and how many data processing pipelines represent intermediate results. Each dictionary in the list has the same keys, acting like column names, with values representing the data in each cell.

	users = [
	    {"id": 1, "name": "Alice", "role": "engineer", "salary": 95000},
	    {"id": 2, "name": "Bob", "role": "analyst", "salary": 75000},
	    {"id": 3, "name": "Charlie", "role": "engineer", "salary": 105000},
	    {"id": 4, "name": "Diana", "role": "manager", "salary": 120000},
	]

	# Access a specific record by index
	first_user = users[0]
	print("First user:", first_user["name"])

	# Access a specific field from a specific record
	third_user_role = users[2]["role"]
	print("Third user role:", third_user_role)

	# Iterate and access fields
	for user in users:
	    print(f"{user['name']} ({user['role']}): ${user['salary']:,}")

>>>Output

First user: Alice

Third user role: engineer

Alice (engineer): $95,000

Bob (analyst): $75,000

Charlie (engineer): $105,000

Diana (manager): $120,000

Access works in two steps. The expression users[0] selects a row by index and returns a dictionary. Indexing that dictionary with ["name"] then selects a field (the column) by key. You can chain the two operations: users[0]["name"] gives "Alice" directly. This two-step access pattern is fundamental to working with tabular data in Python.

When processing lists of dictionaries, you often need to extract a single field from all records, filter records based on conditions, or transform records in some way. These operations become natural once you understand the access patterns.

	users = [
	    {"id": 1, "name": "Alice", "role": "engineer", "salary": 95000},
	    {"id": 2, "name": "Bob", "role": "analyst", "salary": 75000},
	    {"id": 3, "name": "Charlie", "role": "engineer", "salary": 105000},
	]

	# Extract all names (like SELECT name FROM users)
	names = [user["name"] for user in users]
	print("All names:", names)

	# Filter by condition (like WHERE role = 'engineer')
	engineers = [user for user in users if user["role"] == "engineer"]
	print("Engineers:", [e["name"] for e in engineers])

	# Calculate aggregate (like SELECT AVG(salary))
	avg_salary = sum(user["salary"] for user in users) / len(users)
	print(f"Average salary: ${avg_salary:,.0f}")

>>>Output

All names: ['Alice', 'Bob', 'Charlie']

Engineers: ['Alice', 'Charlie']

Average salary: $91,667

Dicts with List Values

When you need to group items by a key, use a dictionary where each value is a list. This pattern is common for categorization, grouping database records by a field, and building inverted indexes. The dictionary provides O(1) lookup by group, while the list stores all members of that group. This is more efficient than scanning through a flat list every time you need items from a specific category.

	# Group users by their department
	departments = {
	    "engineering": ["Alice", "Charlie", "Eve", "George"],
	    "analytics": ["Bob", "Diana"],
	    "management": ["Frank"],
	    "design": ["Hannah", "Ivan"],
	}

	# Access a specific department instantly
	print("Engineering team:", departments["engineering"])
	print("Engineering size:", len(departments["engineering"]))

	# Check if someone is in a department
	if "Alice" in departments["engineering"]:
	    print("Alice is in engineering")

	# Iterate over all departments
	print("\nDepartment sizes:")
	for dept, members in departments.items():
	    print(f"  {dept}: {len(members)} members")

>>>Output

Engineering team: ['Alice', 'Charlie', 'Eve', 'George']

Engineering size: 4

Alice is in engineering

Department sizes:

  engineering: 4 members

  analytics: 2 members

  management: 1 members

  design: 2 members

The .items() method yields key-value pairs as tuples, allowing you to iterate over both the group name and its members simultaneously. This pattern enables efficient lookups: finding all engineers is O(1) dictionary access instead of O(n) scanning through every record. For large datasets, this performance difference is significant.

TIP

Use this pattern when you frequently need to look up all items belonging to a category. Building the grouped dictionary once and reusing it is much faster than filtering a flat list repeatedly.
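The inverted index mentioned above follows the same dict-of-lists shape. The sketch below is illustrative only: the documents dictionary and its contents are made up for the example, and real indexes usually normalize and tokenize text more carefully.

```python
# Hypothetical documents, invented for this illustration
documents = {
    "doc1": "python makes data pipelines simple",
    "doc2": "data pipelines move data",
    "doc3": "python sets are fast",
}

# Map each word to the list of document IDs that contain it
inverted_index = {}
for doc_id, text in documents.items():
    for word in set(text.split()):  # set() so a repeated word lists the doc once
        if word not in inverted_index:
            inverted_index[word] = []
        inverted_index[word].append(doc_id)

print("python:", inverted_index["python"])  # ['doc1', 'doc3']
print("data:", inverted_index["data"])      # ['doc1', 'doc2']
```

Once the index is built, finding every document that mentions a word is a single dictionary lookup rather than a scan over all the text.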

Nested Dicts: Hierarchical

Dictionaries can contain other dictionaries to represent hierarchical relationships. This pattern mirrors JSON API responses, configuration files, and any tree-like data structure. When working with APIs, you will constantly encounter nested dictionaries representing complex entities with sub-entities. Learning to navigate these structures confidently is essential for API integration work.

	# Typical API response structure
	api_response = {
	    "status": "success",
	    "metadata": {
	        "request_id": "abc123",
	        "timestamp": "2024-01-15T10:30:00Z"
	    },
	    "data": {
	        "user": {
	            "id": 42,
	            "profile": {
	                "name": "Alice Johnson",
	                "email": "alice@example.com",
	                "preferences": {
	                    "theme": "dark",
	                    "notifications": True
	                }
	            }
	        }
	    }
	}

	# Navigate the hierarchy with chained access
	user_name = api_response["data"]["user"]["profile"]["name"]
	print("User name:", user_name)

	theme = api_response["data"]["user"]["profile"]["preferences"]["theme"]
	print("Theme preference:", theme)

	# Access metadata
	request_id = api_response["metadata"]["request_id"]
	print("Request ID:", request_id)

>>>Output

User name: Alice Johnson

Theme preference: dark

Request ID: abc123

Direct chaining like response["data"]["user"]["profile"] works well when you know the structure exists. However, if any key in the chain is missing, Python raises a KeyError. For uncertain data structures, you need defensive access patterns.

	api_response = {
	    "status": "success",
	    "data": {"user": {"id": 42}}
	}

	# Safe navigation with chained get() calls
	# If any key is missing, returns the default instead of crashing
	profile = api_response.get("data", {}).get("user", {}).get("profile", {})
	name = profile.get("name", "Unknown")
	print("Name (safe):", name)

	# Check before accessing
	if "profile" in api_response.get("data", {}).get("user", {}):
	    print("Profile exists")
	else:
	    print("No profile found")

>>>Output

Name (safe): Unknown

No profile found

Chaining .get() calls with empty dict defaults {} prevents KeyError exceptions. If any level is missing, the chain returns an empty dict, and subsequent .get() calls safely return their defaults. This pattern is essential when processing API responses that may have optional fields.

Single-level safety

Use d.get("key", default) to return a fallback if the key is missing

Chained navigation

Chain .get("a", {}).get("b", val) to walk nested dicts without crashing

Empty dict fallback

Use {} as the intermediate default so the next .get() has a dict to call

Final default matters

The last .get() in the chain should return your actual fallback value
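One case the chained .get() pattern does not cover is a key that exists but holds None (a JSON null): .get("profile", {}) then returns None, and the next .get() raises AttributeError. A small helper can handle both cases. The function name deep_get and the sample response below are hypothetical, written only for this sketch.

```python
def deep_get(data, keys, default=None):
    """Walk nested dicts along keys; return default if any step is missing."""
    current = data
    for key in keys:
        # Stop as soon as we hit a non-dict (including None) or a missing key
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical response where "profile" is present but null
api_response = {"status": "success", "data": {"user": {"id": 42, "profile": None}}}

user_id = deep_get(api_response, ["data", "user", "id"])
name = deep_get(api_response, ["data", "user", "profile", "name"], "Unknown")
print("User ID:", user_id)  # 42
print("Name:", name)        # Unknown
```

The helper trades the one-line chain for a loop, which may be worth it when the same navigation is repeated in many places.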

Flat to Nested Structures

Often you receive flat data that needs to be transformed into a nested structure. This is common when processing database query results that need grouping, or when building aggregations from raw event data. The fundamental pattern involves iterating through the flat data and building up the nested structure incrementally, checking for key existence and initializing empty collections as needed.

	# Flat transaction data from a database
	transactions = [
	    {"user_id": 1, "amount": 100, "category": "food"},
	    {"user_id": 2, "amount": 50, "category": "transport"},
	    {"user_id": 1, "amount": 75, "category": "food"},
	    {"user_id": 1, "amount": 200, "category": "electronics"},
	    {"user_id": 2, "amount": 30, "category": "food"},
	    {"user_id": 1, "amount": 45, "category": "transport"},
	]

	# Group transactions by user_id
	by_user = {}
	for tx in transactions:
	    user_id = tx["user_id"]
	    if user_id not in by_user:
	        by_user[user_id] = []
	    by_user[user_id].append(tx)

	# Now we can efficiently access all transactions for a user
	print("User 1 transactions:", len(by_user[1]))
	print("User 1 total:", sum(tx["amount"] for tx in by_user[1]))

>>>Output

User 1 transactions: 4

User 1 total: 420

This pattern of checking for key existence and initializing an empty list is so common that Python provides a cleaner way to write it. The setdefault() method combines the check and initialization into a single operation.

	transactions = [
	    {"user_id": 1, "amount": 100, "category": "food"},
	    {"user_id": 2, "amount": 50, "category": "transport"},
	    {"user_id": 1, "amount": 75, "category": "food"},
	]

	# Cleaner grouping with setdefault()
	by_user = {}
	for tx in transactions:
	    by_user.setdefault(tx["user_id"], []).append(tx)

	print("Grouped by user:", {k: len(v) for k, v in by_user.items()})

	# Even cleaner: group by category within each user
	by_user_category = {}
	for tx in transactions:
	    user_data = by_user_category.setdefault(tx["user_id"], {})
	    user_data.setdefault(tx["category"], []).append(tx["amount"])

	print("User 1 by category:", by_user_category[1])

>>>Output

Grouped by user: {1: 2, 2: 1}

User 1 by category: {'food': [100, 75]}

The .setdefault(key, default) method returns the value if the key exists, or sets it to the default and returns that default if the key is missing. This allows you to chain .append() directly, making the grouping operation a single line. This is more Pythonic than the explicit if-check pattern.
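The standard library's collections.defaultdict is another common way to write the same grouping. The sketch below reuses the shape of the transaction records from this section; the variable names are illustrative.

```python
from collections import defaultdict

transactions = [
    {"user_id": 1, "amount": 100, "category": "food"},
    {"user_id": 2, "amount": 50, "category": "transport"},
    {"user_id": 1, "amount": 75, "category": "food"},
]

# defaultdict(list) creates an empty list the first time a key is accessed
amounts_by_user = defaultdict(list)
for tx in transactions:
    amounts_by_user[tx["user_id"]].append(tx["amount"])
print("Amounts by user:", dict(amounts_by_user))  # {1: [100, 75], 2: [50]}

# defaultdict(int) starts each missing key at 0, handy for running totals
total_by_category = defaultdict(int)
for tx in transactions:
    total_by_category[tx["category"]] += tx["amount"]
print("Totals:", dict(total_by_category))  # {'food': 175, 'transport': 50}
```

One difference to keep in mind: merely reading a missing key from a defaultdict inserts it, whereas setdefault() only creates keys where you call it explicitly.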

Python Quiz

> Group items and count total occurrences. Pick the dict method that creates missing keys automatically, and the accessor that returns all the stored lists.

groups = {}
for item in ["a", "b", "a", "c", "b"]:
    groups.___(item, []).append(1)
print(len(groups))
total = sum(
    len(v) for v in groups.___()
)
print(total)

setdefault

get

update

values

keys

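The list-of-dicts pattern mirrors database table rows closely. Many database drivers return rows as tuples by default, but most can be configured (for example with a dictionary-style cursor or row factory) so that each row behaves like a dictionary and the full result set is a list of those dictionaries.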
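As one concrete illustration, the standard-library sqlite3 module can produce a list of dictionaries from a query. This is a minimal sketch against an in-memory database; the table and its rows are invented for the example, and other drivers expose this differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows become addressable by column name

conn.execute("CREATE TABLE users (id INTEGER, name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "Alice", "engineer"), (2, "Bob", "analyst")],
)

# Convert each sqlite3.Row into a plain dict to get a list of dicts
cursor = conn.execute("SELECT id, name, role FROM users ORDER BY id")
rows = [dict(row) for row in cursor]
conn.close()

print(rows)
print("Names:", [row["name"] for row in rows])  # ['Alice', 'Bob']
```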

Navigating nested structures fluently is one of the skills that separates data engineers who work productively with APIs from those who struggle. Practice the chained .get() pattern until it becomes instinctive.

Dict and List Comprehensions

Build lists and dicts in one expression

Comprehensions are concise expressions for creating lists, dictionaries, and sets from existing iterables. They combine iteration, transformation, and optional filtering into a single readable line. Beyond being syntactic sugar, comprehensions usually run somewhat faster in CPython than an equivalent loop that calls append, because they avoid the repeated method lookup and call on every iteration. They are also considered more Pythonic, expressing intent clearly without the boilerplate of explicit loop construction.

Comprehensions shine when you need to transform data from one shape to another. Extracting specific fields, applying calculations, filtering by conditions, or reshaping structures are all natural uses for comprehensions. However, they should remain readable. If a comprehension becomes too complex, a regular loop is often clearer.

List Comprehension Basics

A list comprehension has the form [expression for item in iterable]. It creates a new list by evaluating the expression once for each item in the iterable. The expression can be any valid Python expression: a simple variable reference, a calculation, a method call, or even a function application.

	# Traditional loop approach
	squares_loop = []
	for x in range(6):
	    squares_loop.append(x ** 2)
	print("Loop result:", squares_loop)

	# List comprehension approach
	squares_comp = [x ** 2 for x in range(6)]
	print("Comprehension:", squares_comp)

	# More concise and expressive

>>>Output

Loop result: [0, 1, 4, 9, 16, 25]

Comprehension: [0, 1, 4, 9, 16, 25]

The comprehension version is more concise and expresses intent directly: create a list where each element is x squared. The loop version requires understanding the initialization, the iteration, and the accumulation pattern. With comprehensions, the intent is immediately clear.

	# Practical examples with different expressions

	names = ["alice", "bob", "charlie", "diana"]

	# Transform: uppercase all names
	upper_names = [name.upper() for name in names]
	print("Uppercase:", upper_names)

	# Transform: get lengths
	lengths = [len(name) for name in names]
	print("Lengths:", lengths)

	# Transform: format strings
	formatted = [f"User: {name.title()}" for name in names]
	print("Formatted:", formatted)

	# Transform: extract first character
	initials = [name[0].upper() for name in names]
	print("Initials:", initials)

>>>Output

Uppercase: ['ALICE', 'BOB', 'CHARLIE', 'DIANA']

Lengths: [5, 3, 7, 5]

Formatted: ['User: Alice', 'User: Bob', 'User: Charlie', 'User: Diana']

Initials: ['A', 'B', 'C', 'D']

Filtering with Conditions

Add an if clause to filter which items are included in the result: [expression for item in iterable if condition]. Only items where the condition evaluates to True are processed and included. The condition is evaluated before the expression, so you can safely access properties that might not exist on filtered-out items.

	numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

	# Filter: only even numbers
	evens = [n for n in numbers if n % 2 == 0]
	print("Evens:", evens)

	# Filter and transform: square the even numbers
	even_squares = [n ** 2 for n in numbers if n % 2 == 0]
	print("Even squares:", even_squares)

	# Filter: numbers greater than 5
	large = [n for n in numbers if n > 5]
	print("Greater than 5:", large)

	# Multiple conditions with 'and'
	middle = [n for n in numbers if n > 3 and n < 8]
	print("Between 3 and 8:", middle)

>>>Output

Evens: [2, 4, 6, 8, 10]

Even squares: [4, 16, 36, 64, 100]

Greater than 5: [6, 7, 8, 9, 10]

Between 3 and 8: [4, 5, 6, 7]

The filtering happens before the transformation. First Python checks if n is even, and only if True does it compute n squared and add it to the result. This combines the functionality of filter() and map() into a single readable expression.
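To make the comparison with filter() and map() concrete, the sketch below writes the same even-squares computation both ways. The variable names are illustrative.

```python
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Comprehension: filter and transform in one expression
squares_comp = [n ** 2 for n in numbers if n % 2 == 0]

# Equivalent with built-ins: filter() runs first, then map() transforms
squares_builtin = list(
    map(lambda n: n ** 2, filter(lambda n: n % 2 == 0, numbers))
)

print(squares_comp)                     # [4, 16, 36, 64, 100]
print(squares_comp == squares_builtin)  # True
```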

	# Filtering records from a list of dictionaries
	users = [
	    {"name": "Alice", "active": True, "age": 28, "role": "engineer"},
	    {"name": "Bob", "active": False, "age": 35, "role": "analyst"},
	    {"name": "Charlie", "active": True, "age": 22, "role": "engineer"},
	    {"name": "Diana", "active": True, "age": 31, "role": "manager"},
	]

	# Get names of active users
	active_names = [u["name"] for u in users if u["active"]]
	print("Active users:", active_names)

	# Get active engineers over 25
	senior_engineers = [
	    u["name"] for u in users
	    if u["active"] and u["role"] == "engineer" and u["age"] > 25
	]
	print("Senior active engineers:", senior_engineers)

>>>Output

Active users: ['Alice', 'Charlie', 'Diana']

Senior active engineers: ['Alice']

Fill in the Blank

> You have a list of user dictionaries with name, active status, and role fields. Pick the extraction expression and filter condition to get only active engineers' names.

users = [
    {"name": "Alice", "active": True, "role": "engineer"},
    {"name": "Bob", "active": False, "role": "analyst"},
    {"name": "Charlie", "active": True, "role": "engineer"},
]
result = [___ for u in users if ___]
print(result)

Dictionary Comprehensions

Dictionary comprehensions use curly braces with a key-value pair: {key_expr: value_expr for item in iterable}. They are essential for transforming dictionaries, filtering dictionary entries, building lookups from lists, and inverting key-value relationships.

	# Create a dict from parallel lists
	names = ["alice", "bob", "charlie"]
	scores = [85, 92, 78]

	# zip() pairs elements from both lists
	score_dict = {name: score for name, score in zip(names, scores)}
	print("Scores:", score_dict)

	# Transform values: double all scores
	doubled = {name: score * 2 for name, score in score_dict.items()}
	print("Doubled:", doubled)

	# Transform keys: uppercase names
	upper_keys = {name.upper(): score for name, score in score_dict.items()}
	print("Upper keys:", upper_keys)

	# Filter: only passing scores (>= 80)
	passing = {name: score for name, score in score_dict.items() if score >= 80}
	print("Passing:", passing)

>>>Output

Scores: {'alice': 85, 'bob': 92, 'charlie': 78}

Doubled: {'alice': 170, 'bob': 184, 'charlie': 156}

Upper keys: {'ALICE': 85, 'BOB': 92, 'CHARLIE': 78}

Passing: {'alice': 85, 'bob': 92}

The zip() function pairs elements from two iterables, creating tuples that you can unpack in the comprehension. Combined with a dict comprehension, it creates dictionaries from parallel lists in one line. This is a very common pattern for data transformation.

Inverting Dictionaries

A common operation is inverting a dictionary: swapping keys and values to create a reverse lookup. This is useful when you have a mapping in one direction but need to look up in the other direction. For example, you might have user IDs mapped to usernames, but need to find a user ID given a username.

	# Country code to name mapping
	codes_to_names = {
	    "US": "United States",
	    "UK": "United Kingdom",
	    "CA": "Canada",
	    "AU": "Australia"
	}

	# Invert: create name to code lookup
	names_to_codes = {name: code for code, name in codes_to_names.items()}
	print("Inverted:", names_to_codes)

	# Now we can look up codes by name
	print("Canada's code:", names_to_codes["Canada"])
	print("Australia's code:", names_to_codes["Australia"])

>>>Output

Inverted: {'United States': 'US', 'United Kingdom': 'UK', 'Canada': 'CA', 'Australia': 'AU'}

Canada's code: CA

Australia's code: AU

Inversion only works cleanly when values are unique. If multiple keys share the same value, later entries overwrite earlier ones. When values are not unique, you would need to group keys into lists using the setdefault pattern or defaultdict.
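The sketch below shows both behaviors on a small mapping with a repeated value. The user_roles data is made up for the example.

```python
# Hypothetical mapping where two users share the same role
user_roles = {"alice": "engineer", "bob": "analyst", "charlie": "engineer"}

# Naive inversion: the later "engineer" entry overwrites the earlier one
naive = {role: name for name, role in user_roles.items()}
print("Naive:", naive)  # {'engineer': 'charlie', 'analyst': 'bob'}

# Grouped inversion: collect every key that shares a value
grouped = {}
for name, role in user_roles.items():
    grouped.setdefault(role, []).append(name)
print("Grouped:", grouped)  # {'engineer': ['alice', 'charlie'], 'analyst': ['bob']}
```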

•Loop Approach

More lines of code
Explicit step-by-step logic
Easier to debug complex logic
Better for multi-step operations

•Comprehension

Single expression
Declarative style
Faster execution
Better for simple transforms

Nested Comprehensions

Comprehensions can iterate over multiple sequences or flatten nested structures. The order of for clauses matches nested loops: outer loops come first, reading left to right. While powerful, nested comprehensions can become hard to read. Use them for simple cases like flattening or creating coordinate pairs, but prefer explicit loops for complex logic.

	# Flatten a list of lists
	matrix = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
	flat = [num for row in matrix for num in row]
	print("Flattened:", flat)

	# Create coordinate pairs (Cartesian product)
	rows = [0, 1, 2]
	cols = ["a", "b"]
	coords = [(r, c) for r in rows for c in cols]
	print("Coordinates:", coords)

	# Filter within nested comprehension
	# Get all even numbers from a matrix
	matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
	evens = [num for row in matrix for num in row if num % 2 == 0]
	print("Even numbers:", evens)

>>>Output

Flattened: [1, 2, 3, 4, 5, 6, 7, 8, 9]

Coordinates: [(0, 'a'), (0, 'b'), (1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]

Even numbers: [2, 4, 6, 8]

TIP

If a comprehension becomes hard to read or requires comments to explain, convert it to a regular loop. Code readability trumps conciseness. A good rule of thumb: if it does not fit on one line or needs more than one condition, consider a loop instead.
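For reference, here is how a nested comprehension maps onto explicit loops: the for clauses appear in the same order, and the if clause becomes the innermost check. This is a small sketch with illustrative variable names.

```python
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Comprehension form: clauses read left to right, outer loop first
evens_comp = [num for row in matrix for num in row if num % 2 == 0]

# Equivalent explicit loops, nested in the same order
evens_loop = []
for row in matrix:
    for num in row:
        if num % 2 == 0:
            evens_loop.append(num)

print(evens_comp)                # [2, 4, 6, 8]
print(evens_comp == evens_loop)  # True
```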

Set Comprehensions

Set comprehensions use curly braces like dictionaries, but with single values instead of key-value pairs: {expression for item in iterable}. They automatically deduplicate results, making them perfect for extracting unique values from collections.

	# Extract unique first letters from words
	words = ["apple", "apricot", "banana", "blueberry", "cherry", "coconut"]
	first_letters = {word[0] for word in words}
	print("Unique first letters:", first_letters)

	# Extract unique departments from employee records
	employees = [
	    {"name": "Alice", "dept": "Engineering"},
	    {"name": "Bob", "dept": "Analytics"},
	    {"name": "Charlie", "dept": "Engineering"},
	    {"name": "Diana", "dept": "Analytics"},
	    {"name": "Eve", "dept": "Design"},
	]
	departments = {emp["dept"] for emp in employees}
	print("Unique departments:", departments)

	# Count: we have 5 employees but only 3 unique departments
	print(f"Employees: {len(employees)}, Departments: {len(departments)}")

>>>Output

Unique first letters: {'a', 'b', 'c'}

Unique departments: {'Engineering', 'Analytics', 'Design'}

Employees: 5, Departments: 3

Set comprehensions combine iteration, transformation, and deduplication in one expression. They are particularly useful when you need to answer questions like "what are all the unique values of this field?" or "what categories exist in this dataset?" Because sets are unordered, the order in which elements print can differ from the output shown above.
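When the order of the unique values matters, two common options are sorting the set or deduplicating with dict.fromkeys(), which keeps first-seen order because dictionaries preserve insertion order (Python 3.7+). A minimal sketch, reusing the employee records from the example above:

```python
employees = [
    {"name": "Alice", "dept": "Engineering"},
    {"name": "Bob", "dept": "Analytics"},
    {"name": "Charlie", "dept": "Engineering"},
    {"name": "Diana", "dept": "Analytics"},
    {"name": "Eve", "dept": "Design"},
]

# Option 1: sort the set for a stable, alphabetical result
sorted_depts = sorted({emp["dept"] for emp in employees})
print(sorted_depts)  # ['Analytics', 'Design', 'Engineering']

# Option 2: dict keys are unique and keep first-seen order
first_seen_depts = list(dict.fromkeys(emp["dept"] for emp in employees))
print(first_seen_depts)  # ['Engineering', 'Analytics', 'Design']
```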

Set Operations

Compare datasets with set math

Sets support mathematical operations that are invaluable for data comparison and analysis tasks. Union combines elements from multiple sets. Intersection finds elements common to all sets. Difference finds elements unique to one set. Symmetric difference finds elements in either set but not both. Because a membership check on a set takes O(1) time on average, each of these operations runs in time roughly proportional to the sizes of the sets involved, which makes them dramatically more efficient than nested loops for comparison tasks.

In data engineering, set operations are essential for data reconciliation, deduplication, access control analysis, and finding differences between datasets. Understanding these operations lets you answer questions like "which users have access to both systems?" or "which records exist in the source but not the destination?" with simple, efficient code.
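To see what the set version replaces, the sketch below finds common users first with nested loops over lists and then with a set intersection. The user lists are invented for the example.

```python
source_users = ["alice", "bob", "charlie", "diana"]
target_users = ["bob", "diana", "eve", "frank"]

# Nested loops: every source user is compared against every target user
common_loop = []
for user in source_users:
    for other in target_users:
        if user == other:
            common_loop.append(user)
            break

# Set intersection: one expression, with hash-based membership checks
common_set = set(source_users) & set(target_users)

print(common_loop)         # ['bob', 'diana']
print(sorted(common_set))  # ['bob', 'diana']
```

The loop version does work proportional to the product of the two list lengths, while the set version does work roughly proportional to their sum.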

Union - Combining Sets

Union returns all unique elements from both sets combined. Duplicates are automatically removed since sets only store unique values. Use the | operator or the .union() method. The method form accepts any iterable, not just sets.

	# Users with access to different systems
	system_a_users = {"alice", "bob", "charlie", "diana"}
	system_b_users = {"bob", "diana", "eve", "frank"}

	# All users with access to either system (or both)
	all_users = system_a_users | system_b_users
	print("All users:", sorted(all_users))
	print("Total unique users:", len(all_users))

	# Same result with method (can accept any iterable)
	all_users_method = system_a_users.union(system_b_users)
	print("Union method result:", sorted(all_users_method))

	# Union multiple sets at once
	system_c_users = {"george", "alice"}
	all_three = system_a_users | system_b_users | system_c_users
	print("All three systems:", sorted(all_three))

>>>Output

All users: ['alice', 'bob', 'charlie', 'diana', 'eve', 'frank']

Total unique users: 6

Union method result: ['alice', 'bob', 'charlie', 'diana', 'eve', 'frank']

All three systems: ['alice', 'bob', 'charlie', 'diana', 'eve', 'frank', 'george']

Notice that bob and diana appear in both original sets but only once in the union. Sets automatically handle deduplication, making union perfect for merging user lists, combining tags or categories, or aggregating items from multiple sources.
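The difference between the operator and the method shows up when one side is not a set. In this sketch, new_signups is a hypothetical list used only for illustration.

```python
system_a_users = {"alice", "bob"}
new_signups = ["bob", "eve"]  # a list, not a set

# The method form accepts any iterable
merged = system_a_users.union(new_signups)
print(sorted(merged))  # ['alice', 'bob', 'eve']

# The operator form requires both operands to be sets
try:
    system_a_users | new_signups
    operator_worked = True
except TypeError:
    operator_worked = False
print("Operator accepts a list:", operator_worked)  # False
```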

Finding Common Elements

Intersection returns only elements present in all sets. Use the & operator or the .intersection() method. This operation is symmetric: A & B equals B & A.

	# Find users with access to both systems
	system_a = {"alice", "bob", "charlie", "diana"}
	system_b = {"bob", "diana", "eve", "frank"}

	both_systems = system_a & system_b
	print("Access to both:", both_systems)

	# Practical example: skill matching for job applications
	job_requires = {"python", "sql", "spark", "airflow", "aws"}
	candidate_has = {"python", "sql", "java", "docker", "kubernetes", "aws"}

	matching_skills = job_requires & candidate_has
	missing_skills = job_requires - candidate_has

	print("Matching skills:", matching_skills)
	print("Missing skills:", missing_skills)
	print(f"Match rate: {len(matching_skills)}/{len(job_requires)} = {len(matching_skills)/len(job_requires):.0%}")

>>>Output

Access to both: {'bob', 'diana'}

Matching skills: {'python', 'sql', 'aws'}

Missing skills: {'spark', 'airflow'}

Match rate: 3/5 = 60%

Intersection is invaluable for access control analysis, skill matching, finding common customers between segments, and identifying overlapping records between datasets. Because set membership checks take O(1) time on average, these operations stay efficient even for large datasets.
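Intersection also extends to more than two sets. One way to intersect a whole list of sets is to unpack it into set.intersection(); the team skill sets below are made up for the example, and the list must contain at least one set.

```python
# Hypothetical skill sets for three teams
team_skills = [
    {"python", "sql", "spark"},
    {"python", "sql", "airflow"},
    {"python", "sql", "aws"},
]

# Unpack the list so every set takes part in the intersection
shared_skills = set.intersection(*team_skills)
print(sorted(shared_skills))  # ['python', 'sql']
```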

Difference: Unique Elements

Difference returns elements in the first set that are not in the second. Use the - operator or the .difference() method. Unlike union and intersection, difference is not symmetric: A - B is different from B - A.

	# Find users unique to each system
	system_a = {"alice", "bob", "charlie"}
	system_b = {"bob", "diana", "eve"}

	only_in_a = system_a - system_b
	only_in_b = system_b - system_a

	print("Only in system A:", only_in_a)
	print("Only in system B:", only_in_b)

	# Practical: validate required fields in data
	required_fields = {"id", "name", "email", "created_at", "status"}
	provided_fields = {"id", "name", "phone", "address"}

	missing = required_fields - provided_fields
	extra = provided_fields - required_fields

	print("\nMissing required fields:", missing)
	print("Extra fields provided:", extra)

>>>Output

Only in system A: {'alice', 'charlie'}

Only in system B: {'diana', 'eve'}

Missing required fields: {'email', 'created_at', 'status'}

Extra fields provided: {'phone', 'address'}

Difference is essential for validation tasks, detecting changes between versions, finding gaps in data coverage, and identifying what needs to be added or removed during synchronization. The non-symmetric nature means you must think carefully about which set comes first.

Symmetric Diff and Subsets

Symmetric difference returns elements in either set but not both, using the ^ operator. Subset and superset checking use <= and >= operators to test containment relationships.

	# Symmetric difference: what changed between configs?
	old_features = {"dark_mode", "notifications", "sync"}
	new_features = {"dark_mode", "analytics", "export"}

	changed = old_features ^ new_features
	print("Changed features:", changed)

	# This equals: (old | new) - (old & new)
	also_changed = (old_features | new_features) - (old_features & new_features)
	print("Same result:", also_changed)

	# Subset checking: does user have required permissions?
	required_permissions = {"read", "write"}
	user_permissions = {"read", "write", "delete", "admin"}

	has_required = required_permissions <= user_permissions
	print(f"\nUser has required permissions: {has_required}")

	# Is this a subset? (all elements of A are in B)
	print(f"required is subset of user: {required_permissions <= user_permissions}")
	print(f"user is superset of required: {user_permissions >= required_permissions}")

>>>Output

Changed features: {'notifications', 'sync', 'analytics', 'export'}

Same result: {'notifications', 'sync', 'analytics', 'export'}

User has required permissions: True

required is subset of user: True

user is superset of required: True

a | b

Union

All unique from both sets

a & b

Intersection

Only elements in both

a - b

Difference

In a but not in b

a ^ b

Symmetric diff

In exactly one of the two sets

a <= b

Subset check

True if all of a is in b
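Two related checks round out the comparison operators: < tests for a proper subset (contained and not equal), and .isdisjoint() tests that two sets share nothing. A short sketch with illustrative permission sets:

```python
required = {"read", "write"}
user = {"read", "write", "delete"}

print(required <= user)      # True: every required permission is present
print(required < user)       # True: subset, and user has extra permissions
print(required <= required)  # True: a set is a subset of itself
print(required < required)   # False: not a proper subset of itself

# isdisjoint() is True when the sets have no elements in common
forbidden = {"admin", "sudo"}
no_forbidden = user.isdisjoint(forbidden)
print("No forbidden permissions:", no_forbidden)  # True
```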

Data Reconciliation

Set operations shine in data reconciliation tasks where you need to compare records between systems, find discrepancies, and ensure data consistency. This is a common requirement when synchronizing databases, validating ETL pipelines, or auditing data migrations.

	# Record IDs from source database and data warehouse
	source_ids = {101, 102, 103, 104, 105, 106, 107, 108}
	warehouse_ids = {102, 103, 105, 107, 109, 110, 111}

	# Complete reconciliation analysis
	in_sync = source_ids & warehouse_ids
	missing_from_warehouse = source_ids - warehouse_ids
	orphaned_in_warehouse = warehouse_ids - source_ids
	total_discrepancies = source_ids ^ warehouse_ids

	print(f"Records in sync: {len(in_sync)} - {sorted(in_sync)}")
	print(f"Missing from warehouse: {len(missing_from_warehouse)} - {sorted(missing_from_warehouse)}")
	print(f"Orphaned in warehouse: {len(orphaned_in_warehouse)} - {sorted(orphaned_in_warehouse)}")
	print(f"Total discrepancies: {len(total_discrepancies)}")

	# Calculate sync percentage
	sync_rate = len(in_sync) / len(source_ids)
	print(f"\nSync rate: {sync_rate:.1%}")

>>>Output

Records in sync: 4 - [102, 103, 105, 107]

Missing from warehouse: 4 - [101, 104, 106, 108]

Orphaned in warehouse: 3 - [109, 110, 111]

Total discrepancies: 7

Sync rate: 50.0%

Debug Challenge

> This code tries to find source records missing from the warehouse, but the set difference operands are reversed. It shows warehouse-only records instead.

Logic error: result shows {6} but should show records in source not in warehouse

source = {1, 2, 3, 4, 5}
warehouse = {2, 4, 6}
missing = warehouse - source
print("Missing:", missing)

Set operations are invaluable for data reconciliation. Comparing source and destination record IDs with a single - operation is far cleaner than writing a loop that checks each source ID against the destination.

The symmetric difference operator ^ gives you everything that changed between two snapshots of a dataset. If yesterday's IDs are A and today's are B, then A ^ B shows exactly what was added or removed.

Subset checking with <= lets you verify that required columns exist in a query result or that required permissions are present in a user's role set, all without writing any loop logic.
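As a sketch of the column check described above, the snippet below validates one record against a set of required columns. The column names and the record are hypothetical.

```python
required_columns = {"id", "name", "email"}

# Hypothetical query result row, represented as a dict
row = {"id": 1, "name": "Alice", "phone": "555-0100"}
present_columns = set(row)  # iterating a dict yields its keys

has_all_columns = required_columns <= present_columns
missing_columns = required_columns - present_columns

print("All required columns present:", has_all_columns)  # False
print("Missing:", sorted(missing_columns))               # ['email']
```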

Sorting and Filtering

Sort and filter records like SQL

Sorting and filtering are fundamental data operations that are often combined to answer analytical questions. Python provides flexible tools for both: the sorted() function with custom keys enables sophisticated ordering, while comprehensions and the filter() function provide powerful selection capabilities. Mastering the combination of these operations enables you to write complex data queries that rival SQL in expressiveness.

The general pattern for data queries is: filter to select relevant records, sort to order them appropriately, then optionally slice to limit results. This mirrors the SELECT...WHERE...ORDER BY...LIMIT pattern in SQL and appears constantly in data processing code.

Sorting with Custom Keys

The sorted() function accepts a key parameter that specifies how to extract a comparison value from each element. The key function is called once per element, and elements are sorted based on the returned values. This enables sorting complex objects by any attribute or computed property.

	employees = [
	{"name": "Charlie", "age": 35, "salary": 75000, "dept": "Engineering"},
	{"name": "Alice", "age": 28, "salary": 95000, "dept": "Engineering"},
	{"name": "Bob", "age": 42, "salary": 85000, "dept": "Analytics"},
	{"name": "Diana", "age": 31, "salary": 72000, "dept": "Design"},
	]

	# Sort by age (ascending by default)
	by_age = sorted(employees, key=lambda e: e["age"])
	print("By age:", [e["name"] for e in by_age])

	# Sort by salary (descending)
	by_salary = sorted(employees, key=lambda e: e["salary"], reverse=True)
	print("By salary (desc):", [e["name"] for e in by_salary])

	# Sort alphabetically by name
	by_name = sorted(employees, key=lambda e: e["name"])
	print("By name:", [e["name"] for e in by_name])

>>>Output

By age: ['Alice', 'Diana', 'Charlie', 'Bob']

By salary (desc): ['Alice', 'Bob', 'Charlie', 'Diana']

By name: ['Alice', 'Bob', 'Charlie', 'Diana']

The key function is called once per element to extract the sort value. The reverse=True parameter sorts in descending order. Lambda functions are commonly used for simple key extraction, but you can use any callable.
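To illustrate "any callable", here is a small sketch that swaps the lambda for operator.itemgetter and for the built-in method str.lower. The sample data is made up for this example.

```python
from operator import itemgetter

employees = [
    {"name": "Charlie", "age": 35},
    {"name": "Alice", "age": 28},
    {"name": "Bob", "age": 42},
]

# itemgetter("age") builds a callable equivalent to lambda e: e["age"]
by_age = sorted(employees, key=itemgetter("age"))
by_age_names = [e["name"] for e in by_age]
print(by_age_names)

# Built-in methods work as keys too: str.lower gives a case-insensitive sort
names = ["bob", "Alice", "charlie", "Diana"]
case_sensitive = sorted(names)
case_insensitive = sorted(names, key=str.lower)
print(case_sensitive)
print(case_insensitive)
```

Without a key, uppercase letters sort before lowercase ones, which is why the plain sort puts "Diana" ahead of "bob".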

Multi-Level Sorting

To sort by multiple criteria, return a tuple from the key function. Python compares tuples element by element, creating a natural multi-level sort. The first element is the primary sort key, the second is the secondary sort key used to break ties, and so on.

	records = [
	{"dept": "Engineering", "name": "Zara", "years": 3},
	{"dept": "Analytics", "name": "Alice", "years": 5},
	{"dept": "Engineering", "name": "Bob", "years": 7},
	{"dept": "Analytics", "name": "Charlie", "years": 5},
	{"dept": "Engineering", "name": "Diana", "years": 3},
	]

	# Negate years for descending within ascending sort
	sorted_records = sorted(
	    records,
	    key=lambda r: (r["dept"], -r["years"], r["name"])
	)

	print("Sorted by dept, then years (desc), then name:")
	for r in sorted_records:
	    print(f" {r['dept']}: {r['name']} ({r['years']} years)")

>>>Output

Sorted by dept, then years (desc), then name:

  Analytics: Alice (5 years)

  Analytics: Charlie (5 years)

  Engineering: Bob (7 years)

  Engineering: Diana (3 years)

  Engineering: Zara (3 years)

The tuple (r["dept"], -r["years"], r["name"]) sorts first by department alphabetically, then by years in descending order (negation inverts numeric sorting), then by name to break any remaining ties. This gives you fine-grained control over sort order.
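Negation only works for numeric fields. For a descending sort on a string field, one possible approach (a sketch, not the only option) relies on the fact that Python's sort is stable: sort by the secondary key first, then by the primary key. The records below are illustrative.

```python
records = [
    {"dept": "Engineering", "name": "Zara"},
    {"dept": "Analytics", "name": "Alice"},
    {"dept": "Engineering", "name": "Bob"},
    {"dept": "Analytics", "name": "Charlie"},
]

# Pass 1: secondary key, name descending
by_name_desc = sorted(records, key=lambda r: r["name"], reverse=True)

# Pass 2: primary key, dept ascending. Because the sort is stable,
# records with the same dept keep the name order from pass 1.
result = sorted(by_name_desc, key=lambda r: r["dept"])

ordered = [(r["dept"], r["name"]) for r in result]
print(ordered)
```

The passes run from the least significant key to the most significant one, which is the reverse of the order in which the keys appear in a tuple key.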

Python Quiz

> Sort fruit names by their length so the shortest comes first. Pick the function that returns a new sorted list and the key function that measures string length.

words = ["banana", "fig", "apple", "kiwi"]
result = ___(words, key=___)
print(result[0])
print(result[-1])

Options: sorted, list, reversed, len, str

Filter, Sort, and Slice

Real data queries often combine filtering, sorting, and limiting results. The pattern is: filter with a comprehension to select relevant records, sort with sorted() to order them, then slice with [:n] to limit the result count. This pipeline approach mirrors SQL queries and is natural to read.

	orders = [
	{"id": 1, "customer": "alice", "amount": 150, "status": "done"},
	{"id": 2, "customer": "bob", "amount": 50, "status": "pending"},
	{"id": 3, "customer": "alice", "amount": 300, "status": "done"},
	{"id": 4, "customer": "bob", "amount": 200, "status": "done"},
	]

	# Filter completed, sort by amount descending, take top 2
	top_done = sorted(
	    [o for o in orders if o["status"] == "done"],
	    key=lambda o: o["amount"],
	    reverse=True
	)[:2]

	for o in top_done:
	    print(f"Order {o['id']}: {o['customer']} - ${o['amount']}")

	# Alice's orders
	alice = [o for o in orders if o["customer"] == "alice"]
	print(f"Alice total: ${sum(o['amount'] for o in alice)}")

>>>Output

Order 3: alice - $300

Order 4: bob - $200

Alice total: $450

TIP

For very large datasets where you only need the top N items, consider using heapq.nlargest() or heapq.nsmallest() instead of sorting the entire list. These functions are more efficient when N is much smaller than the list size.
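A minimal sketch of the tip, using a small made-up order list. heapq.nlargest() and heapq.nsmallest() accept the same kind of key function as sorted().

```python
import heapq

orders = [
    {"id": 1, "amount": 150},
    {"id": 2, "amount": 50},
    {"id": 3, "amount": 300},
    {"id": 4, "amount": 200},
]

# Top 2 orders by amount, without sorting the whole list
top_two = heapq.nlargest(2, orders, key=lambda o: o["amount"])
top_ids = [o["id"] for o in top_two]
print("Top two:", top_ids)

# Smallest single order by amount
smallest = heapq.nsmallest(1, orders, key=lambda o: o["amount"])
smallest_id = smallest[0]["id"]
print("Smallest:", smallest_id)
```

Both functions return their results already ordered (largest first for nlargest, smallest first for nsmallest), so no extra sort is needed afterwards.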

itertools.groupby Grouping

The itertools.groupby() function groups consecutive elements that share the same key value. Important: the data must be sorted by the grouping key first, because groupby only groups consecutive matches. This enables SQL-like GROUP BY operations on sorted data.

	from itertools import groupby

	sales = [
	{"region": "West", "product": "A", "amount": 100},
	{"region": "East", "product": "B", "amount": 200},
	{"region": "West", "product": "B", "amount": 150},
	{"region": "East", "product": "A", "amount": 300},
	{"region": "West", "product": "A", "amount": 80},
	]

	# Must sort by grouping key first!
	sales_sorted = sorted(sales, key=lambda s: s["region"])

	# Group and aggregate by region
	print("Sales by region:")
	for region, group in groupby(sales_sorted, key=lambda s: s["region"]):
	    # Must convert iterator to list
	    items = list(group)
	    total = sum(s["amount"] for s in items)
	    print(f" {region}: ${total} ({len(items)} sales)")

>>>Output

Sales by region:

  East: $500 (2 sales)

  West: $330 (3 sales)

The groupby() function yields (key, group_iterator) pairs. The group is an iterator, not a list, so you must convert it with list(group) if you need to iterate it multiple times or access its length. This pattern enables powerful aggregations similar to SQL GROUP BY.
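To see why the sort step matters, this sketch runs groupby() on the same illustrative region values with and without sorting first.

```python
from itertools import groupby

regions = ["West", "East", "West", "East", "West"]

# Without sorting: each run of consecutive equal values becomes its own group
unsorted_groups = [(key, len(list(group))) for key, group in groupby(regions)]
print(unsorted_groups)

# With sorting: equal values are adjacent, so each key appears once
sorted_groups = [(key, len(list(group))) for key, group in groupby(sorted(regions))]
print(sorted_groups)
```

On unsorted input the same key shows up several times, once per consecutive run, which silently produces fragmented aggregates rather than an error.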

Data Structure Selection

Daily Life

Interviews

Choose structures by access pattern

Choosing the right data structure is one of the most important decisions in programming. The wrong choice can make code slow, memory-hungry, or unnecessarily complex. Understanding the strengths and trade-offs of each structure helps you make informed decisions that balance readability, performance, and memory usage for your specific use case.

The key insight is that different data structures optimize for different operations. Lists are great for ordered, indexed access. Dictionaries excel at key-based lookup. Sets provide fast membership testing. Choosing well means understanding which operations your code performs most frequently and selecting the structure that makes those operations efficient.

Lists vs Sets: Membership

When checking if an item exists in a collection, sets are dramatically faster than lists. Lists scan sequentially from the beginning, making membership testing O(n). Sets use hash tables, which give O(1) lookups on average. For small collections the difference is negligible, but for thousands of items, sets can be hundreds of times faster.

	# Simulating a blocklist of user IDs
	blocked_list = list(range(10000))
	blocked_set = set(blocked_list)

	# Both work, but performance differs dramatically
	user_id = 9999

	# List: potentially scans all 10,000 items
	in_list = user_id in blocked_list
	print(f"Found in list: {in_list}")

	# Set: instant hash table lookup
	in_set = user_id in blocked_set
	print(f"Found in set: {in_set}")

	# For 10,000 items:
	# - List lookup: up to 10,000 comparisons (O(n))
	# - Set lookup: ~1 hash + comparison (O(1))

>>>Output

Found in list: True

Found in set: True

Both return True, but the computational work is vastly different. The set lookup computes a hash and typically does one comparison. The list lookup potentially compares against every element. When you need repeated membership testing and the elements are hashable, convert the list to a set once and test against the set.
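One way to observe the difference is to time both lookups with the standard timeit module. This is a rough sketch: the repetition count is an arbitrary choice, and absolute timings depend on the machine and Python version.

```python
import timeit

blocked_list = list(range(10000))
blocked_set = set(blocked_list)
user_id = 9999  # worst case for the list: the last element

# Time 200 membership tests against each structure
list_time = timeit.timeit(lambda: user_id in blocked_list, number=200)
set_time = timeit.timeit(lambda: user_id in blocked_set, number=200)

print(f"List: {list_time:.6f}s for 200 lookups")
print(f"Set:  {set_time:.6f}s for 200 lookups")
```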

•Use Lists When

Order of elements matters
Duplicates are meaningful
Index-based access is common
Sequential iteration needed

•Use Sets When

Only unique values matter
Membership testing is frequent
Set operations needed
Order is irrelevant

dict vs list Lookup

When you need to find records by a specific field, dictionaries provide O(1) lookup while lists require O(n) scanning. If you frequently look up records by ID, name, or any other unique key, building a dictionary indexed by that key transforms slow linear searches into instant hash lookups.

	# User records as a list
	users_list = [
	{"id": 1, "name": "Alice", "email": "alice@example.com"},
	{"id": 2, "name": "Bob", "email": "bob@example.com"},
	{"id": 3, "name": "Charlie", "email": "charlie@example.com"},
	]

	# Finding user by ID in list requires scanning
	def find_user_in_list(user_id):
	    for user in users_list:
	        if user["id"] == user_id:
	            return user
	    return None

	# Build a dictionary indexed by ID
	users_by_id = {user["id"]: user for user in users_list}

	# Now lookup is instant
	print("List lookup:", find_user_in_list(2)["name"])
	print("Dict lookup:", users_by_id[2]["name"])

	# Can also index by other fields
	users_by_email = {user["email"]: user for user in users_list}
	print("By email:", users_by_email["charlie@example.com"]["name"])

>>>Output

List lookup: Bob

Dict lookup: Bob

By email: Charlie

Building the dictionary is O(n), but each subsequent lookup is O(1). If you look up records more than once, the preprocessing cost is repaid. For APIs or data processing that repeatedly access records by key, always build lookup dictionaries.
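The index only works as described when the key is unique. As a sketch with made-up records, the snippet below shows what happens when it is not: a dict comprehension keeps only the last record per key, while grouping into lists keeps all of them.

```python
from collections import defaultdict

users = [
    {"id": 1, "name": "Alice", "dept": "Engineering"},
    {"id": 2, "name": "Bob", "dept": "Analytics"},
    {"id": 3, "name": "Charlie", "dept": "Engineering"},
]

# Non-unique key: later records silently overwrite earlier ones
last_by_dept = {u["dept"]: u["name"] for u in users}
print(last_by_dept)

# Grouping into lists keeps every record for each key
names_by_dept = defaultdict(list)
for u in users:
    names_by_dept[u["dept"]].append(u["name"])
print(dict(names_by_dept))
```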

Tuples vs Lists: Immutable

Tuples are immutable sequences. Use them when you need a fixed record that should not be modified, like coordinates, RGB colors, or database rows. Tuples use slightly less memory than lists and, critically, can be used as dictionary keys because they are hashable, provided every element inside the tuple is itself hashable.

	# Tuples as lightweight immutable records
	point = (10, 20)
	rgb_color = (255, 128, 0)

	# Tuples can be dictionary keys (lists cannot!)
	location_names = {
	(40.7128, -74.0060): "New York City",
	(51.5074, -0.1278): "London",
	(35.6762, 139.6503): "Tokyo",
	}

	coords = (40.7128, -74.0060)
	print(f"Location: {location_names[coords]}")

	# Named tuples provide field names for clarity
	from collections import namedtuple

	User = namedtuple("User", ["id", "name", "email"])
	user = User(1, "Alice", "alice@example.com")

	print(f"User: {user.name} ({user.email})")
	print(f"User ID: {user.id}")

>>>Output

Location: New York City

User: Alice (alice@example.com)

User ID: 1

Named tuples from collections.namedtuple provide the benefits of tuples (immutability, hashability) with the readability of named fields. For modern Python, consider @dataclass(frozen=True) which offers similar benefits with additional features like default values and type hints.
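A minimal sketch of the frozen dataclass alternative. The User fields mirror the named tuple above; the default email value is a hypothetical addition for illustration.

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class User:
    id: int
    name: str
    email: str = "unknown@example.com"  # hypothetical default value

user = User(1, "Alice")

# Assigning to a field of a frozen dataclass raises FrozenInstanceError
try:
    user.name = "Bob"
    mutated = True
except FrozenInstanceError:
    mutated = False
print("Mutated:", mutated)

# Frozen dataclasses (with the default eq=True) are hashable,
# so instances can serve as dictionary keys
last_login = {user: "2024-01-01"}
print(last_login[User(1, "Alice")])
```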

Choosing by Operation Type

Your data structure choice should be driven by which operations you perform most frequently. Each structure is optimized for different access patterns, and choosing well can make the difference between code that runs in seconds versus hours on large datasets.

list

Append/pop from end in O(1) amortized time. Best for ordered sequences.

collections.deque

Append/pop from both ends in O(1). Use when you need a double-ended queue.

set

O(1) membership testing. Convert lists to sets for repeated "in" checks.

dict

O(1) key-value lookup. Preserves insertion order since Python 3.7.

heapq

Priority queue operations on a min-heap. Use when you repeatedly need the smallest item; for the largest, push negated values or use heapq.nlargest().

Counter

Automatic counting from collections module. Tallies occurrences in one line.
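The three specialized structures in this list can be sketched in a few lines each. The sample values are illustrative.

```python
import heapq
from collections import Counter, deque

# deque: O(1) appends and pops at both ends
queue = deque(["a", "b", "c"])
queue.appendleft("start")
queue.append("end")
first = queue.popleft()
last = queue.pop()
print(first, last, list(queue))

# heapq: the smallest item is always popped first
tasks = [(3, "low"), (1, "urgent"), (2, "normal")]
heapq.heapify(tasks)
next_task = heapq.heappop(tasks)
print(next_task)

# Counter: tally occurrences in one line
counts = Counter(["error", "info", "error", "warn", "error"])
most_common = counts.most_common(1)
print(most_common)
```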

Memory Considerations

Data structure choice affects memory usage significantly. Sets and dictionaries have overhead for their hash tables. Lists are more compact but slower for lookups. For very large datasets, consider generators that process items one at a time instead of loading everything into memory.

	import sys

	# Compare memory usage for 1000 integers
	data = list(range(1000))

	as_list = list(data)
	as_tuple = tuple(data)
	as_set = set(data)

	print(f"List: {sys.getsizeof(as_list):,} bytes")
	print(f"Tuple: {sys.getsizeof(as_tuple):,} bytes")
	print(f"Set: {sys.getsizeof(as_set):,} bytes")

	# Sets use ~4x more memory but provide O(1) lookup

>>>Output

List:  8,856 bytes

Tuple: 8,048 bytes

Set:   32,984 bytes

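Sets use significantly more memory due to their hash table structure, but this overhead enables O(1) membership testing. The trade-off is worthwhile when lookup speed matters more than memory. Tuples are slightly smaller than lists because they do not allocate extra space for potential growth. Note that sys.getsizeof() reports only the size of the container itself, not the integer objects it references, and the exact byte counts vary by Python version and platform.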
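The generator suggestion from the start of this section can be sketched as follows. The size of 100,000 items is an arbitrary choice for illustration, and the reported byte counts will differ between Python versions.

```python
import sys

# A list materializes every value up front
squares_list = [n * n for n in range(100_000)]

# A generator expression produces values one at a time, on demand
squares_gen = (n * n for n in range(100_000))

list_size = sys.getsizeof(squares_list)
gen_size = sys.getsizeof(squares_gen)
print(f"List container: {list_size:,} bytes")
print(f"Generator object: {gen_size:,} bytes")

total_from_list = sum(squares_list)
total_from_gen = sum(squares_gen)  # consumes the generator
print(total_from_list == total_from_gen)
```

The trade-off is that a generator can be consumed only once and supports neither indexing nor len(), so it suits single-pass aggregation rather than repeated access.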

✓Do

Profile before optimizing
Start with the simplest structure
Convert to specialized types when needed
Document why you chose each structure

✗Don't

Prematurely optimize
Use lists for frequent membership tests
Forget memory for large datasets
Assume one structure fits all cases

Python Quiz

> You have a list of user IDs and need to quickly check if a given ID exists. Pick the correct type to convert the list into for O(1) lookups, and the correct operator to test membership.

ids = [101, 202, 303, 404]
lookup = ___(ids)
print(202 ___ lookup)
print(len(lookup))

Options: set, tuple, dict

Common Mistakes

Even experienced developers make mistakes with data structures. These pitfalls are common enough that interviewers specifically test for them. Recognizing and avoiding these patterns helps you write more robust, predictable code.

Modifying During Iteration

Modifying a collection while iterating over it causes skipped items or outright errors. With a list, removing an element shifts the remaining elements left, so the iterator's index skips whichever element moved into the vacated slot. With a dict or set, changing the size mid-loop raises RuntimeError. Iterate over a copy or build a new collection instead.

	numbers = [1, 2, 3, 4, 5, 6]
	# This would skip elements:
	# for n in numbers:
	#     if n % 2 == 0:
	#         numbers.remove(n)  # DON'T DO THIS

	# RIGHT: Build a new list with comprehension
	numbers = [1, 2, 3, 4, 5, 6]
	odds_only = [n for n in numbers if n % 2 != 0]
	print("New list (odds):", odds_only)

	# RIGHT: Iterate over a copy to modify
	numbers = [1, 2, 3, 4, 5, 6]
	# [:] creates a shallow copy
	for n in numbers[:]:
	    if n % 2 == 0:
	        numbers.remove(n)
	print("Modified original:", numbers)

>>>Output

New list (odds): [1, 3, 5]

Modified original: [1, 3, 5]
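The same rule applies to dictionaries, where the failure is louder. In this sketch (the stock dict is made up), deleting keys during iteration raises RuntimeError, while iterating over a list of the keys works.

```python
stock = {"apple": 0, "banana": 5, "cherry": 0}

# WRONG: deleting while iterating over the dict itself
try:
    for item in stock:
        if stock[item] == 0:
            del stock[item]
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)
print("Error:", error_message)

# RIGHT: iterate over a snapshot of the keys
stock = {"apple": 0, "banana": 5, "cherry": 0}
for item in list(stock):
    if stock[item] == 0:
        del stock[item]
print("Cleaned:", stock)
```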

Mutable Default Arguments

Default argument values are created once when the function is defined, not each time the function is called. If the default is mutable (like a list or dict), modifications persist across function calls, leading to surprising bugs.

	# WRONG: Mutable default argument
	def add_item_wrong(item, items=[]):
	    items.append(item)
	    return items

	print(add_item_wrong("a"))
	# Prints ['a', 'b'] - where did 'a' come from?
	print(add_item_wrong("b"))

	# RIGHT: Use None and create inside function
	def add_item_right(item, items=None):
	    if items is None:
	        items = []
	    items.append(item)
	    return items

	print(add_item_right("x"))
	print(add_item_right("y"))

>>>Output

['a']

['a', 'b']

['x']

['y']

Always use None as the default for mutable arguments and create the actual mutable object inside the function body. This is one of Python's most infamous gotchas and a favorite interview question.
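One way to see the shared default directly is to inspect the function's __defaults__ attribute, which holds the default values created at definition time. A short sketch:

```python
def add_item_wrong(item, items=[]):
    items.append(item)
    return items

# The default list lives on the function object itself
before = list(add_item_wrong.__defaults__[0])
add_item_wrong("a")
add_item_wrong("b")
after = list(add_item_wrong.__defaults__[0])
print("Before:", before)
print("After:", after)

# Every call without an explicit list returns that same object
same_object = add_item_wrong("c") is add_item_wrong.__defaults__[0]
print("Same object:", same_object)
```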

Shallow vs Deep Copy

Assignment creates a reference, not a copy. Shallow copy (slice or .copy()) duplicates the outer structure but shares nested objects. Deep copy duplicates everything. Confusing these leads to mysterious bugs where modifying one variable affects another.

	import copy

	original = [[1, 2], [3, 4]]

	# Reference: same object
	ref = original
	ref[0][0] = 99
	print("Reference:", original)

	# Shallow: same inner
	original = [[1, 2], [3, 4]]
	shallow = original[:]
	shallow[0][0] = 88
	print("Shallow:", original)

	# Deep copy: fully independent
	original = [[1, 2], [3, 4]]
	deep = copy.deepcopy(original)
	deep[0][0] = 77
	print("Deep:", original)

>>>Output

Reference: [[99, 2], [3, 4]]

Shallow: [[88, 2], [3, 4]]

Deep: [[1, 2], [3, 4]]

Reference: b = a

Both variables point to the same object. All changes are shared.

Shallow: b = a[:] or .copy()

New outer container, but nested objects are still shared references.

Deep: copy.deepcopy(a)

Fully independent clone of the entire structure including all nested data.

Use copy.deepcopy() when you need a completely independent copy of nested structures. This is especially important when working with data you receive from elsewhere and do not want to accidentally modify, or when passing data to functions that might mutate it.
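The same distinction holds for dictionaries. This sketch uses a made-up config dict and the is operator to show which copies share their nested objects.

```python
import copy

config = {"tags": ["a", "b"], "limits": {"max": 10}}

shallow = config.copy()
deep = copy.deepcopy(config)

# Identity checks reveal what is shared
shares_inner = shallow["tags"] is config["tags"]
deep_shares_inner = deep["tags"] is config["tags"]
print("Shallow shares inner list:", shares_inner)
print("Deep shares inner list:", deep_shares_inner)

shallow["tags"].append("c")   # visible through config as well
deep["limits"]["max"] = 99    # does not touch config
print(config)
```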

Choosing the right data structure can make the difference between elegant and unwieldy code. Put these techniques to the test with hands-on challenges in the Python Builder.

❯❯❯PUTTING IT ALL TOGETHER

> You are a data engineer at Indeed building a pipeline that parses nested API responses for job postings, transforms salary fields with comprehensions, reconciles active versus expired listing IDs using set operations, and sorts filtered results by relevance score.

Nested data structures mirror the API response format where each job posting is a dict containing a list of required skills and a nested dict of salary ranges.

dict and list comprehensions flatten and transform the nested posting data into a clean list of records in a single readable expression.

set operations compute the difference between active listing IDs and expired ones, producing only the postings that should be surfaced to candidates.

Sorting and filtering chain together to rank the reconciled postings by descending relevance score and remove any below the quality threshold.
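The four steps above could be sketched roughly as follows. Every field name, ID, score, and the 0.5 threshold are hypothetical values chosen for illustration, not a real API schema.

```python
# Hypothetical API response; all field names and values are illustrative
postings = [
    {"id": 1, "title": "Data Engineer", "skills": ["python", "sql"],
     "salary": {"min": 100000, "max": 140000}, "score": 0.92},
    {"id": 2, "title": "Analyst", "skills": ["sql"],
     "salary": {"min": 70000, "max": 90000}, "score": 0.40},
    {"id": 3, "title": "ML Engineer", "skills": ["python"],
     "salary": {"min": 120000, "max": 160000}, "score": 0.85},
    {"id": 4, "title": "DBA", "skills": ["sql"],
     "salary": {"min": 90000, "max": 110000}, "score": 0.77},
]
expired_ids = {4}
min_score = 0.5  # assumed quality threshold

# Comprehension: flatten the nested salary dict into flat records
records = [
    {"id": p["id"], "title": p["title"],
     "salary_mid": (p["salary"]["min"] + p["salary"]["max"]) // 2,
     "score": p["score"]}
    for p in postings
]

# Set difference: IDs that are listed but not expired
active_ids = {r["id"] for r in records} - expired_ids

# Filter by activity and threshold, then sort by score descending
ranked = sorted(
    (r for r in records if r["id"] in active_ids and r["score"] >= min_score),
    key=lambda r: r["score"],
    reverse=True,
)
ranked_ids = [r["id"] for r in ranked]
print(ranked_ids)
```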

KEY TAKEAWAYS

Lists of dicts model tables; dicts with list values enable grouping

.setdefault() and defaultdict simplify building nested structures

Chain .get() with empty dict defaults for safe nested access

Comprehensions are usually faster and more readable than equivalent loops

Dict comprehension syntax: {k: v for k, v in items}

Set operations: | union, & intersection, - difference, ^ symmetric diff

Multi-key sort uses tuples: key=lambda x: (x["a"], -x["b"])

Use sets for O(1) membership testing; dicts for O(1) key lookup

Never modify a collection while iterating over it

Use None as default for mutable function arguments

Use copy.deepcopy() for fully independent copies of nested data

Complex data manipulation patterns

Category: Python
Difficulty: intermediate
Duration: 48 minutes
Challenges: 0 hands-on challenges

Topics covered: Nested Data Structures, Dict and List Comprehensions, Set Operations, Sorting and Filtering, Data Structure Selection
