Collections: Intermediate

Slack's real-time messaging infrastructure uses deques to implement sliding-window rate limiters that efficiently add new messages to one end while dropping expired ones from the other, handling thousands of messages per second without memory growth. A deque makes appending and removing from both ends an O(1) operation, whereas a plain list requires O(n) time to remove from the front - a difference that compounds catastrophically at scale. The intermediate collections patterns in this lesson, including deque, defaultdict, and their real-world applications, are what separate efficient production code from scripts that collapse under load.

Using sorted() with key

Daily Life
Interviews

Sort any data by custom criteria

The sorted() function returns a new sorted list from any iterable. While basic sorting works on simple values, the real power comes from the key parameter. The key function transforms each element into a comparison value, letting you sort by any criteria you can express in code.

This is one of Python's most powerful features for data processing. Without key functions, sorting a list of dictionaries by a specific field would require writing a custom comparison function or manually extracting values. With key, you express the sort criteria in a single line that Python handles efficiently.
The key parameter shifts sorting from an imperative style to a declarative one. Instead of writing code that compares elements pair by pair, you declare which value to use for comparison. Python performs the comparisons and the reordering itself. This separation of concerns makes your code more readable and removes the hand-written comparison logic where bugs tend to hide.

In SQL, you use ORDER BY to sort query results. The key parameter in sorted() serves the same purpose: it defines the sorting expression. When you write sorted(users, key=lambda u: u["age"]), you are essentially writing ORDER BY age in Python. This conceptual connection helps when you move between SQL queries and Python data processing.

Basic sorted() Usage

The sorted() function always returns a new list, leaving the original unchanged:

numbers = [3, 1, 4, 1, 5, 9, 2, 6]
words = ["banana", "apple", "cherry", "date"]

# sorted() returns a new list
sorted_numbers = sorted(numbers)
sorted_words = sorted(words)

print("Original numbers:", numbers)
print("Sorted numbers:", sorted_numbers)
print()
print("Original words:", words)
print("Sorted words:", sorted_words)
>>>Output
Original numbers: [3, 1, 4, 1, 5, 9, 2, 6]
Sorted numbers: [1, 1, 2, 3, 4, 5, 6, 9]
 
Original words: ['banana', 'apple', 'cherry', 'date']
Sorted words: ['apple', 'banana', 'cherry', 'date']
Notice that sorted() works on any iterable and always produces a list. The original collection is never modified. This non-mutating behavior makes your code safer and easier to reason about, especially when the same data is used in multiple places.
Leaving the input untouched is a deliberate design choice in Python. When you pass a list to sorted(), you get a new list object: appending to, removing from, or reordering one list does not affect the other. The copy is shallow, though. Both lists refer to the same element objects, so mutating a shared element (such as a dictionary) is visible through both. This contrasts with the list.sort() method, which modifies the list in place and returns None. Understanding when to use sorted() versus .sort() is essential for writing correct Python code.
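
As a minimal sketch of the difference between sorted() and list.sort(), the snippet below uses made-up sample data (the names scores, rows, and by_id are illustrative only). It also shows that the new list shares its element objects with the original.

```python
scores = [88, 73, 95]

# sorted() builds a new list and leaves the input alone.
ranked = sorted(scores)
print(scores)   # [88, 73, 95]
print(ranked)   # [73, 88, 95]

# list.sort() reorders the list in place and returns None.
returned = scores.sort()
print(scores)    # [73, 88, 95]
print(returned)  # None

# The list returned by sorted() is a shallow copy: elements are shared.
rows = [{"id": 2}, {"id": 1}]
by_id = sorted(rows, key=lambda r: r["id"])
by_id[0]["id"] = 99   # mutates the dict that rows[1] also refers to
print(rows)           # [{'id': 2}, {'id': 99}]
```

A reasonable rule of thumb: use sorted() when you still need the original order, and .sort() when you own the list and want to avoid the extra copy.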

The key Parameter

The key parameter accepts a function that transforms each element into a comparison value. Python sorts based on these transformed values:

words = ["banana", "Apple", "cherry", "Date"]

print("Default sort:", sorted(words))

# Sort case-insensitively using str.lower as key
print("Case-insensitive:", sorted(words, key=str.lower))

# Sort by word length
print("By length:", sorted(words, key=len))
>>>Output
Default sort: ['Apple', 'Date', 'banana', 'cherry']
Case-insensitive: ['Apple', 'banana', 'cherry', 'Date']
By length: ['Date', 'Apple', 'banana', 'cherry']
The key function is called once for each element. Python uses the returned values for comparison but keeps the original elements in the result. This is why the output contains the original strings with their original casing, even when sorted case-insensitively.
This design is elegant and efficient. The key function extracts or computes a comparison value, but the sorted result contains the original objects. You never need to transform your data just for sorting, which means you can sort complex objects by computed properties without modifying them. The key values are cached during the sort, so each element's key function is called exactly once regardless of how many comparisons are needed.
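
One way to see the "called once per element" behavior is to count the calls yourself. The sketch below assumes a small hypothetical helper class, CountingKey, that is not part of Python; it lowercases each word and records how many times it ran.

```python
class CountingKey:
    """Hypothetical helper: lowercases a word and counts how often it runs."""

    def __init__(self):
        self.calls = 0

    def __call__(self, word):
        self.calls += 1
        return word.lower()


words = ["banana", "Apple", "cherry", "Date", "apple", "Banana"]
key = CountingKey()
result = sorted(words, key=key)

# The result holds the original strings, not the lowercased keys.
print(result)     # ['Apple', 'apple', 'banana', 'Banana', 'cherry', 'Date']
print(key.calls)  # 6, one call per element
```

Words that differ only in case have equal keys, so they stay in their original relative order.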

Sorting Complex Objects

The key parameter becomes essential when sorting dictionaries, tuples, or objects. You can sort by any attribute or computed value. This is where sorted() truly shines compared to basic sorting: you can order complex data structures by any property without modifying them or creating comparison methods.

users = [
    {"name": "Alice", "age": 30, "score": 85},
    {"name": "Bob", "age": 25, "score": 92},
    {"name": "Charlie", "age": 35, "score": 78},
]

# Sort by age
by_age = sorted(users, key=lambda u: u["age"])
print("By age:")
for u in by_age:
    print(" " + u["name"] + ": " + str(u["age"]))

print()

# Sort by score
by_score = sorted(users, key=lambda u: u["score"])
print("By score:")
for u in by_score:
    print(" " + u["name"] + ": " + str(u["score"]))
>>>Output
By age:
 Bob: 25
 Alice: 30
 Charlie: 35

By score:
 Charlie: 78
 Alice: 85
 Bob: 92

The lambda creates an anonymous function inline. The expression lambda u: u["age"] means "given a user u, return their age." This extracted value is what Python uses for comparison.

TIP
For simple attribute access, you can also use operator.itemgetter() or operator.attrgetter() from the operator module. These are slightly faster than lambda for large datasets.
Lambda functions are the most common way to define key functions because they are inline and readable. However, for very large datasets or tight loops, the operator module provides optimized alternatives. itemgetter("age") does the same thing as lambda u: u["age"], but in CPython it runs slightly faster because it is implemented in C, so no Python-level function body has to execute for each element.
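
A short sketch of the operator-module alternatives follows. The sample data and the Point record type are made up for illustration; itemgetter and attrgetter are the standard-library functions named in the tip.

```python
from collections import namedtuple
from operator import attrgetter, itemgetter

users = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
    {"name": "Charlie", "age": 35},
]

# itemgetter("age") builds a callable equivalent to lambda u: u["age"].
by_age = sorted(users, key=itemgetter("age"))
print([u["name"] for u in by_age])  # ['Bob', 'Alice', 'Charlie']

# With several arguments it returns a tuple, which gives a multi-level key.
pairs = [("b", 2), ("a", 2), ("c", 1)]
by_second_then_first = sorted(pairs, key=itemgetter(1, 0))
print(by_second_then_first)  # [('c', 1), ('a', 2), ('b', 2)]

# attrgetter does the same for attributes instead of keys or indexes.
Point = namedtuple("Point", ["x", "y"])  # hypothetical record type
points = [Point(3, 1), Point(1, 2), Point(2, 0)]
by_y = sorted(points, key=attrgetter("y"))
print([p.y for p in by_y])  # [0, 1, 2]
```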

Sorting Tuples

Tuples are commonly used in data processing as lightweight records. Sorting by specific positions is straightforward. When data comes from CSV files, database queries, or API responses, it often arrives as tuples or lists that you need to sort by specific columns. The key function lets you specify which position (column) to use as the sort criterion.

# Format: (name, department, salary)
employees = [
    ("Alice", "Engineering", 95000),
    ("Bob", "Sales", 75000),
    ("Charlie", "Engineering", 105000),
    ("Diana", "Sales", 82000),
]

# Sort by salary (index 2)
by_salary = sorted(employees, key=lambda e: e[2])
print("By salary:")
for name, dept, sal in by_salary:
    print(" " + name + ": " + str(sal))

print()

# Sort by department, then by salary within department
by_dept_salary = sorted(employees, key=lambda e: (e[1], e[2]))
print("By department, then salary:")
for name, dept, sal in by_dept_salary:
    print(" " + dept + " - " + name + ": " + str(sal))
>>>Output
By salary:
 Bob: 75000
 Diana: 82000
 Alice: 95000
 Charlie: 105000

By department, then salary:
 Engineering - Alice: 95000
 Engineering - Charlie: 105000
 Sales - Bob: 75000
 Sales - Diana: 82000
When the key function returns a tuple, Python sorts by the first element, then uses subsequent elements as tiebreakers. This multi-level sorting is extremely powerful for organizing data hierarchically.

This tuple-based sorting mirrors ORDER BY with multiple columns in SQL. When you write sorted(employees, key=lambda e: (e[1], e[2])), you are expressing ORDER BY department, salary. The first element of the tuple is the primary sort key; subsequent elements only matter when earlier elements are equal. You can have as many tiebreaker levels as needed.

Performance matters when sorting large datasets. Python's Timsort algorithm is highly optimized and runs in O(n log n) time in the worst case. The key function is called exactly once per element (O(n) calls in total), and the results are cached for the duration of the sort. An expensive key is therefore paid for n times, not once per comparison. However, if your key function itself does O(n) work, computing the keys alone costs O(n²), which dominates the sort, so keep key functions simple.

  • len: Sort strings or collections by their length
  • str.lower: Case-insensitive sorting by converting to lowercase for comparison
  • abs: Sort numbers by absolute value, ignoring sign
  • lambda x: x[n]: Sort tuples by a specific positional element
  • lambda d: d["field"]: Sort dictionaries by a named field value
  • lambda x: (x.a, x.b): Multi-level sort with tuple keys for tiebreaking

Fill in the Blank

> You have a list of fruit names and want to sort them from shortest to longest. Pick the key function that measures word length.

words = ["banana", "fig", "cherry", "kiwi"]
result = sorted(words, key=___)
print(result)
The key parameter separates the comparison logic from the data itself. Python calls the key function once per element, caches the result, and uses it for all comparisons during the sort.
For multi-level sorting, return a tuple from the key function. Python compares tuples element by element, so the first element is the primary key and subsequent elements serve as tiebreakers.
sorted() always returns a new list and leaves the original unchanged. Because the input is never mutated, it is safe to sort the same data multiple times by different criteria without side effects.

Using sorted() in Reverse

Daily Life
Interviews

Rank items and find top N results

The reverse parameter controls whether sorted() returns results in ascending (default) or descending order. Setting reverse=True flips the sort order completely without requiring you to reverse the resulting list afterward.

You could also sort normally and then reverse the result, but the reverse parameter is cleaner, and the two approaches are not identical. With reverse=True the sort stays stable: elements that compare equal keep their original relative order. Sorting and then reversing flips the order of those ties as well.
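
A small sketch of how reverse=True treats ties, compared with sorting first and reversing afterward. The results list is made-up sample data in which two records share a score.

```python
# (name, score) pairs; Alice and Bob tie on score.
results = [("Alice", 90), ("Bob", 90), ("Cara", 75)]

# reverse=True keeps tied elements in their original relative order.
descending = sorted(results, key=lambda r: r[1], reverse=True)

# Sorting ascending and then reversing also reverses the ties.
sort_then_flip = sorted(results, key=lambda r: r[1])[::-1]

print(descending)      # [('Alice', 90), ('Bob', 90), ('Cara', 75)]
print(sort_then_flip)  # [('Bob', 90), ('Alice', 90), ('Cara', 75)]
```

The difference only shows up when keys are equal, which is exactly the case that matters for multi-pass sorting.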

Basic Reverse Sorting

scores = [85, 92, 78, 95, 88, 73]

# Ascending (default)
print("Ascending:", sorted(scores))

# Descending
print("Descending:", sorted(scores, reverse=True))

# Get top 3 scores
top_3 = sorted(scores, reverse=True)[:3]
print("Top 3:", top_3)
>>>Output
Ascending: [73, 78, 85, 88, 92, 95]
Descending: [95, 92, 88, 85, 78, 73]
Top 3: [95, 92, 88]
Getting the top N items from a collection is a common operation. Sorting in descending order and slicing the first N elements is simple and readable. For very large collections where you only need a few top items, consider heapq.nlargest() for better performance.
This pattern appears constantly in data analysis. Finding the top 10 customers by revenue, the bottom 5 products by sales, or the most recent 100 transactions all use descending sorts with slicing. The combination of reverse=True and list slicing is a fundamental tool in your data processing toolkit that you will use almost daily.
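
For comparison, here is a minimal sketch of the heapq alternative mentioned above, using the same scores plus a made-up products list. heapq.nlargest() and heapq.nsmallest() are standard-library functions and accept a key argument like sorted().

```python
import heapq

scores = [85, 92, 78, 95, 88, 73]

# Both approaches give the same top three values.
top_3_sorted = sorted(scores, reverse=True)[:3]
top_3_heap = heapq.nlargest(3, scores)
print(top_3_sorted)  # [95, 92, 88]
print(top_3_heap)    # [95, 92, 88]

# nsmallest with a key: the two cheapest products.
products = [
    {"name": "Laptop", "price": 999},
    {"name": "Phone", "price": 699},
    {"name": "Watch", "price": 299},
]
cheapest = heapq.nsmallest(2, products, key=lambda p: p["price"])
cheapest_names = [p["name"] for p in cheapest]
print(cheapest_names)  # ['Watch', 'Phone']
```

For a list this small the choice makes no practical difference; the heap-based version pays off when the collection is large and N is small.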

Combining key and reverse

The key and reverse parameters work together seamlessly. The key determines what to compare; reverse determines the order. You can use both simultaneously for powerful sorting combinations:

products = [
    {"name": "Laptop", "price": 999, "rating": 4.5},
    {"name": "Phone", "price": 699, "rating": 4.8},
    {"name": "Tablet", "price": 449, "rating": 4.2},
    {"name": "Watch", "price": 299, "rating": 4.6},
]

# Most expensive first
by_price_desc = sorted(products, key=lambda p: p["price"], reverse=True)
print("Most expensive:")
for p in by_price_desc:
    print(" " + p["name"] + ": " + str(p["price"]))

print()

# Highest rated first
by_rating_desc = sorted(products, key=lambda p: p["rating"], reverse=True)
print("Highest rated:")
for p in by_rating_desc:
    print(" " + p["name"] + ": " + str(p["rating"]) + " stars")
>>>Output
Most expensive:
 Laptop: 999
 Phone: 699
 Tablet: 449
 Watch: 299

Highest rated:
 Phone: 4.8 stars
 Watch: 4.6 stars
 Laptop: 4.5 stars
 Tablet: 4.2 stars

Mixed Ascending/Descending

Sometimes you need to sort by multiple criteria with different directions. One common technique is to negate numeric values in the key:

# Format: (name, department, salary)
employees = [
    ("Alice", "Engineering", 95000),
    ("Bob", "Engineering", 105000),
    ("Charlie", "Sales", 82000),
    ("Diana", "Sales", 75000),
]

# Sort by department ascending, salary descending
by_dept_salary = sorted(employees, key=lambda e: (e[1], -e[2]))

print("By dept (asc), then salary (desc):")
for name, dept, sal in by_dept_salary:
    print(" " + dept + " - " + name + ": " + str(sal))
>>>Output
By dept (asc), then salary (desc):
 Engineering - Bob: 105000
 Engineering - Alice: 95000
 Sales - Charlie: 82000
 Sales - Diana: 75000
By negating the salary in the key tuple, higher salaries become smaller numbers and sort first. This technique only works for numeric values.
For mixed-direction sorting with strings or other non-numeric types, you sort in multiple passes and rely on Python's stable sort. A stable sort preserves the relative order of elements with equal keys. By sorting on the lowest-priority key first and the highest-priority key last, each pass with its own reverse setting, you achieve the correct multi-criteria ordering.
Understanding when to use reverse=True versus negation in the key is an important skill. Use reverse=True when you want the entire result in descending order. Use negation within a tuple key when you need mixed ascending and descending on different criteria. In interviews, being able to articulate this distinction demonstrates deep understanding of Python sorting mechanics.
reverse=True
  • Reverses entire sort order
  • Works with any data type
  • Applies one direction to every criterion
  • Clean and readable
Negate in key
  • Reverses one criterion
  • Numbers only
  • Enables mixed directions
  • Use for multi-level sorts
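
When the descending criterion is a string, negation is not available. The sketch below shows the multi-pass approach on the same employee tuples, assuming the goal is department descending and then name ascending; the variable names are illustrative.

```python
# Format: (name, department, salary)
employees = [
    ("Alice", "Engineering", 95000),
    ("Bob", "Engineering", 105000),
    ("Charlie", "Sales", 82000),
    ("Diana", "Sales", 75000),
]

# Goal: department descending (a string, so it cannot be negated),
# then name ascending within each department.

# Pass 1: sort by the lowest-priority key first.
by_name = sorted(employees, key=lambda e: e[0])

# Pass 2: sort by the highest-priority key last.
# Stability keeps the name order from pass 1 for equal departments.
by_dept_desc = sorted(by_name, key=lambda e: e[1], reverse=True)

for name, dept, sal in by_dept_desc:
    print(dept, name)
# Sales Charlie
# Sales Diana
# Engineering Alice
# Engineering Bob
```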

Timsort is particularly efficient when data has "runs" of already-sorted elements. Real-world data often has this structure, whether from timestamps, alphabetical lists, or numeric sequences. The algorithm detects these runs and merges them efficiently, so on nearly sorted input the cost approaches O(n) rather than the O(n log n) worst case. The same algorithm was adopted by Java 7 for sorting object arrays and later by the V8 JavaScript engine for its standard array sort.

Debug Challenge

> This code tries to grab the top 3 scores by slicing the result of .sort(), but .sort() returns None because it modifies the list in place.

TypeError: 'NoneType' object is not subscriptable

The key distinction between sorted() and .sort() is their return value. sorted() always returns a new list; .sort() modifies in place and returns None. Assigning .sort() to a variable is a common bug.
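
The challenge code itself is not shown here, so the following is a hypothetical reconstruction of the bug as described, followed by two possible fixes. The scores list is sample data.

```python
scores = [85, 92, 78, 95, 88, 73]

# Buggy pattern: .sort() returns None, so slicing its result raises TypeError.
try:
    top_3 = scores.sort(reverse=True)[:3]
except TypeError as err:
    print("Bug:", err)  # Bug: 'NoneType' object is not subscriptable

# Fix 1: sorted() returns the new list, so it can be sliced directly.
top_3 = sorted(scores, reverse=True)[:3]
print(top_3)  # [95, 92, 88]

# Fix 2: sort in place first, then slice the list itself.
scores.sort(reverse=True)
print(scores[:3])  # [95, 92, 88]
```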

When you need the top or bottom N elements from a large collection, combining sorted() with slicing is simple and readable. When N is small relative to a very large collection, heapq.nlargest() or heapq.nsmallest() offers better performance because it avoids sorting everything.

The reverse parameter works together with key seamlessly. Python applies the key to determine comparison values, then orders the elements from the largest value to the smallest; elements with equal keys keep their original relative order.

Using map() for Transforms

Daily Life
Interviews

Transform every element without a loop

The map() function applies a transformation function to every element in an iterable, returning a map object that yields the results lazily. This is one of Python's core functional programming tools, enabling concise data transformations without explicit loops.

In data engineering, map() is used constantly. Converting a column of strings to integers, extracting fields from records, formatting values for output, all of these are map operations. Understanding map() leads to cleaner code and prepares you for distributed processing frameworks like PySpark where map operations are fundamental.

The map() function embodies the principle that transformations should be separate from iteration. When you use a for loop to transform data, you mix the mechanics of iteration with the logic of transformation. With map(), you cleanly express the transformation once and let Python handle the iteration. This separation makes code easier to understand, test, and parallelize.

Map operations have a mathematical foundation in category theory, but you do not need to understand the theory to use them effectively. What matters is recognizing the pattern: when you want to apply the same operation to every element of a collection, map() is usually cleaner than a loop. The pattern is so universal that SQL (SELECT expression), spreadsheets (formulas applied to columns), and big data tools (Spark's map) all have equivalent concepts.

Basic map() Usage

numbers = [1, 2, 3, 4, 5]

# Square each number
squared = map(lambda x: x ** 2, numbers)
print("Squared:", list(squared))

# Convert strings to integers
str_numbers = ["10", "20", "30", "40"]
int_numbers = map(int, str_numbers)
print("As integers:", list(int_numbers))

# Get lengths of words
words = ["Python", "is", "awesome"]
lengths = map(len, words)
print("Lengths:", list(lengths))
>>>Output
Squared: [1, 4, 9, 16, 25]
As integers: [10, 20, 30, 40]
Lengths: [6, 2, 7]

Notice that map() returns a map object, not a list. We wrap it in list() to see all the results. This lazy evaluation means map() is memory-efficient for large datasets since it processes one element at a time.
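
A quick sketch of what lazy evaluation means in practice: values are produced only on request, and a map object can be consumed only once. The variable names are illustrative.

```python
numbers = [1, 2, 3]
squared = map(lambda x: x ** 2, numbers)

first = next(squared)     # computed only now, on request
rest = list(squared)      # consumes the remaining items
leftover = list(squared)  # the iterator is already exhausted

print(first)     # 1
print(rest)      # [4, 9]
print(leftover)  # []
```

If you need to read the results more than once, store them in a list first.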

map() with Builtins

Many built-in functions work directly with map() without needing lambda:

# Common type conversions
prices = ["19.99", "29.99", "9.99", "49.99"]
float_prices = list(map(float, prices))
print("Floats:", float_prices)

# Uppercase all strings
names = ["alice", "bob", "charlie"]
upper_names = list(map(str.upper, names))
print("Upper:", upper_names)

# Strip whitespace
messy = [" hello ", " world ", " python"]
clean = list(map(str.strip, messy))
print("Clean:", clean)

# Absolute values
values = [-5, 3, -8, 2, -1]
absolutes = list(map(abs, values))
print("Absolute:", absolutes)
>>>Output
Floats: [19.99, 29.99, 9.99, 49.99]
Upper: ['ALICE', 'BOB', 'CHARLIE']
Clean: ['hello', 'world', 'python']
Absolute: [5, 3, 8, 2, 1]
TIP
When the transformation is a single built-in function like int, float, str.upper, or len, pass it directly to map() without lambda. This is cleaner and slightly faster.
This direct function passing demonstrates that functions are first-class objects in Python. You can pass them as arguments, store them in variables, and return them from other functions. When you write map(int, strings), you pass the int function object itself, not the result of calling int. Python then calls int on each string internally. This functional style is central to Python's design.

Transforming Complex Data

In data engineering, you often transform records by extracting or computing specific fields. The lambda function specifies exactly what transformation to apply, whether it is simple field access, string manipulation, or complex computation involving multiple fields:

users = [
    {"name": "Alice", "email": "alice@example.com", "age": 30},
    {"name": "Bob", "email": "bob@test.org", "age": 25},
    {"name": "Charlie", "email": "charlie@demo.net", "age": 35},
]

# Extract just names
names = list(map(lambda u: u["name"], users))
print("Names:", names)

# Extract email domains
domains = list(map(lambda u: u["email"].split("@")[1], users))
print("Domains:", domains)

# Create formatted strings
formatted = list(map(lambda u: u["name"] + " (" + str(u["age"]) + ")", users))
print("Formatted:", formatted)
>>>Output
Names: ['Alice', 'Bob', 'Charlie']
Domains: ['example.com', 'test.org', 'demo.net']
Formatted: ['Alice (30)', 'Bob (25)', 'Charlie (35)']
These patterns are essential for ETL pipelines. Extracting fields from records, computing derived values, and reformatting data for downstream systems are all map operations. The declarative style makes the transformation intent clear at a glance.
Notice how each transformation is a single expression. Extracting names is u["name"]. Extracting domains is u["email"].split("@")[1]. Formatting is string concatenation. Each transformation is self-contained and testable. You could test the lambda in isolation before using it in map(). This modularity makes debugging easier because you can verify each transformation step independently.

map() Across Multiple Lists

The map() function can take multiple iterables. The function receives one element from each iterable:

# Add corresponding elements
list1 = [1, 2, 3, 4]
list2 = [10, 20, 30, 40]
sums = list(map(lambda a, b: a + b, list1, list2))
print("Sums:", sums)

# Combine first and last names
first_names = ["Alice", "Bob", "Charlie"]
last_names = ["Smith", "Jones", "Brown"]
full_names = list(map(lambda f, l: f + " " + l, first_names, last_names))
print("Full names:", full_names)

# Calculate prices with tax
prices = [100, 200, 150]
tax_rates = [0.08, 0.10, 0.07]
# round() trims floating-point noise (200 * 1.10 evaluates to 220.00000000000003)
totals = list(map(lambda p, t: round(p * (1 + t), 2), prices, tax_rates))
print("With tax:", totals)
>>>Output
Sums: [11, 22, 33, 44]
Full names: ['Alice Smith', 'Bob Jones', 'Charlie Brown']
With tax: [108.0, 220.0, 160.5]
When map() receives multiple iterables, it stops when the shortest one is exhausted. This behavior is the same as zip(), which we will cover later in this lesson.
Using map() with multiple iterables is particularly useful when you have parallel arrays of data that need to be combined. These element-wise operations are natural fits for multi-argument map().
Knowing when to reach for map() versus a list comprehension will keep your code clean.
Do
  • Use map(int, strings) when passing an existing function directly
  • Use list comprehensions for complex expressions or filtering
  • Convert map() to list() when you need indexing or reuse
Don't
  • Wrap a simple built-in in lambda: map(lambda x: int(x), items)
  • Use map() when the transform also needs filtering logic
  • Forget that map() returns an iterator, not a list
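
The guidelines above can be sketched with a small made-up list of numeric strings: map() for a bare function, and a list comprehension once filtering enters the picture.

```python
strings = ["10", "-3", "25", "0"]

# Existing function, no filtering: map() reads cleanly.
as_ints = list(map(int, strings))
print(as_ints)  # [10, -3, 25, 0]

# The same transform written as a list comprehension.
as_ints_comp = [int(s) for s in strings]
print(as_ints_comp)  # [10, -3, 25, 0]

# Transform plus filter: one comprehension instead of map() around filter().
positive_doubled = [int(s) * 2 for s in strings if int(s) > 0]
print(positive_doubled)  # [20, 50]
```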
Python Quiz

> String prices need to become numbers for a total. Pick the type conversion that preserves decimals, and the function that adds all values together.

prices = ["19.99", "29.99", "9.99"]
amounts = list(map(___, prices))
total = ___(amounts)
print(round(total, 2))
Options: int, float, str, len, sum

map() is most readable when the transformation is a named function. Passing int, float, or str.upper directly is cleaner than wrapping them in a lambda.

map() returns a lazy iterator, meaning it only computes each result when requested. This makes it memory-efficient for large collections: as long as you consume the results one at a time rather than wrapping them in list(), the entire transformed dataset is never held in memory at once.

The functional style of map() scales naturally to distributed computing frameworks like PySpark, where map operations run across multiple machines using the same mental model.

Using filter() for Selection

Daily Life
Interviews

Select matching records from any list

The filter() function selects the elements of an iterable for which a predicate function returns a truthy value. Like map(), it returns an iterator that yields results lazily. This is Python's built-in way to extract subsets of data matching specific criteria.

Filtering is fundamental to data processing. Every query that includes a WHERE clause is a filter operation. Every data validation step that removes invalid records is filtering. Understanding filter() helps you think declaratively about data selection and prepares you for SQL and data pipeline tools that use the same concepts.

The predicate function you pass to filter() is a decision maker. For each element, it answers yes or no: should this element be included in the result? This binary decision mirrors the boolean logic of SQL WHERE clauses. In fact, when you write filter(lambda x: x > 5, numbers), you are expressing WHERE x > 5 in Python. The conceptual model is identical across languages and tools.

Filter operations are lazy in Python 3. The filter object only computes results as you iterate through it. Each element is tested and yielded one at a time, so when the source is itself lazy (a file object or a generator, for example) you can filter a massive dataset without loading everything into memory at once. This lazy evaluation is crucial for processing data that does not fit in memory, a common situation in data engineering.
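
To make the laziness concrete, the sketch below filters an endless source. itertools.count and itertools.islice are standard-library tools chosen here for illustration; the lesson itself does not prescribe them.

```python
from itertools import count, islice

# count(1) yields 1, 2, 3, ... without end, so it could never be stored in a list.
multiples_of_7 = filter(lambda n: n % 7 == 0, count(1))

# islice pulls only the first five matches; nothing beyond them is computed.
first_five = list(islice(multiples_of_7, 5))
print(first_five)  # [7, 14, 21, 28, 35]
```

Wrapping the filter object directly in list() here would never finish, which is a useful reminder that materializing is a separate decision from filtering.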

Basic filter() Usage

numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Keep only even numbers
evens = list(filter(lambda x: x % 2 == 0, numbers))
print("Evens:", evens)

# Keep numbers greater than 5
large = list(filter(lambda x: x > 5, numbers))
print("Greater than 5:", large)

# Keep positive numbers from mixed list
mixed = [-3, 5, -1, 8, -2, 4, 0]
positive = list(filter(lambda x: x > 0, mixed))
print("Positive:", positive)
>>>Output
Evens: [2, 4, 6, 8, 10]
Greater than 5: [6, 7, 8, 9, 10]
Positive: [5, 8, 4]
The predicate function must return a truthy or falsy value. Elements for which the predicate returns a truthy value are kept; others are discarded. Like map(), filter() returns an iterator, so we wrap it in list() to see all results.
Python's truthy and falsy evaluation makes predicates flexible. Your function does not need to return exactly True or False. Returning a non-empty string, a non-zero number, or any object evaluates as truthy. Returning None, zero, an empty string, or an empty collection evaluates as falsy. This flexibility lets you write concise predicates without explicit boolean conversion.

Filtering with None

Passing None as the function removes falsy values (empty strings, zero, None, empty lists, etc.):

# Remove falsy values
mixed = [0, 1, "", "hello", None, [], [1, 2], False, True]
truthy = list(filter(None, mixed))
print("Truthy values:", truthy)

# Clean up a list of strings (remove empty ones)
strings = ["apple", "", "banana", "", "", "cherry"]
non_empty = list(filter(None, strings))
print("Non-empty:", non_empty)

# Remove zeros from numbers
values = [0, 5, 0, 3, 0, 8, 0]
non_zero = list(filter(None, values))
print("Non-zero:", non_zero)
>>>Output
Truthy values: [1, 'hello', [1, 2], True]
Non-empty: ['apple', 'banana', 'cherry']
Non-zero: [5, 3, 8]
This shorthand is useful for cleaning data. After splitting a string or parsing a file, you often end up with empty strings or None values that filter(None, ...) removes efficiently.
Be careful with filter(None, ...) when zeros or empty strings might be valid data. If you are processing numeric data where zero is meaningful, filter(None, ...) will incorrectly remove those zeros. In such cases, write an explicit predicate like lambda x: x is not None to remove only None values while keeping zeros and empty strings.
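
A minimal sketch of that pitfall, using a made-up list of sensor-style readings in which zero is a legitimate value:

```python
readings = [0, 12.5, None, 0.0, None, 7]

# filter(None, ...) drops every falsy value, including the valid zeros.
truthy_only = list(filter(None, readings))
print(truthy_only)  # [12.5, 7]

# An explicit predicate removes only the missing values.
not_missing = list(filter(lambda x: x is not None, readings))
print(not_missing)  # [0, 12.5, 0.0, 7]
```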

Filtering Complex Objects

Real-world filtering typically involves records with multiple fields. Data from databases, APIs, and files usually arrives as dictionaries or objects with named attributes. Filtering these complex structures requires predicates that access specific fields and combine conditions with boolean operators. The lambda syntax keeps these predicates concise while remaining readable.

orders = [
    {"id": 1, "customer": "Alice", "total": 150, "status": "completed"},
    {"id": 2, "customer": "Bob", "total": 75, "status": "pending"},
    {"id": 3, "customer": "Charlie", "total": 200, "status": "completed"},
    {"id": 4, "customer": "Diana", "total": 50, "status": "cancelled"},
    {"id": 5, "customer": "Eve", "total": 300, "status": "completed"},
]

# Filter completed orders
completed = list(filter(lambda o: o["status"] == "completed", orders))
print("Completed orders:")
for o in completed:
    print(" Order " + str(o["id"]) + ": " + str(o["total"]))

print()

# Filter high-value completed orders
high_value = list(filter(
    lambda o: o["status"] == "completed" and o["total"] >= 150,
    orders
))
print("High-value completed:")
for o in high_value:
    print(" Order " + str(o["id"]) + ": " + str(o["total"]))
>>>Output
Completed orders:
 Order 1: 150
 Order 3: 200
 Order 5: 300

High-value completed:
 Order 1: 150
 Order 3: 200
 Order 5: 300

Complex predicates can combine multiple conditions with and/or. This mirrors WHERE clauses in SQL. The lambda expresses the condition declaratively, making the intent clear without loop mechanics.

When your filter predicate becomes complex with many conditions, consider extracting it into a named function. The filter call becomes filter(is_high_value, orders), which reads almost like English.
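
One way such a named predicate might look is sketched below. The function body, the minimum parameter, and the sample orders are assumptions made for illustration; only the name is_high_value comes from the text above.

```python
orders = [
    {"id": 1, "total": 150, "status": "completed"},
    {"id": 2, "total": 75, "status": "pending"},
    {"id": 3, "total": 90, "status": "completed"},
    {"id": 5, "total": 300, "status": "completed"},
]


def is_high_value(order, minimum=150):
    """Return True for completed orders at or above the minimum total."""
    return order["status"] == "completed" and order["total"] >= minimum


high_value = list(filter(is_high_value, orders))
high_value_ids = [o["id"] for o in high_value]
print(high_value_ids)  # [1, 5]

# A named predicate can be checked on its own, without filter().
small_order = {"id": 9, "total": 10, "status": "completed"}
print(is_high_value(small_order))  # False
```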
Fill in the Blank

> You have a mixed list of positive and negative integers and need to filter it down to only the positive even ones. Pick the lambda that combines both conditions.

nums = [-2, 3, 4, -5, 6, 7, 8]
result = list(filter(___, nums))
print(result)

Chaining filter() and map()

A common pattern is to filter data first, then transform the results. This filter-then-map pattern appears everywhere in data processing: select the records you want, then extract or compute the values you need from those records. The two-step approach keeps each operation focused on a single responsibility, making the code easier to understand, test, and modify.

users = [
    {"name": "Alice", "age": 30, "active": True},
    {"name": "Bob", "age": 17, "active": True},
    {"name": "Charlie", "age": 25, "active": False},
    {"name": "Diana", "age": 19, "active": True},
    {"name": "Eve", "age": 16, "active": True},
]

# Get names of active adult users
active_users = filter(lambda u: u["active"], users)
adult_users = filter(lambda u: u["age"] >= 18, active_users)
names = map(lambda u: u["name"], adult_users)
result = list(names)
print("Active adults:", result)

# Alternative: chain in one expression
result2 = list(map(
    lambda u: u["name"],
    filter(lambda u: u["active"] and u["age"] >= 18, users)
))
print("Same result:", result2)
>>>Output
Active adults: ['Alice', 'Diana']
Same result: ['Alice', 'Diana']
Because filter() and map() return iterators, you can chain them without creating intermediate lists. The data flows through the pipeline, processed element by element. This is memory-efficient for large datasets.
This chaining pattern is the foundation of data pipelines. Each step takes the previous step's output as input. Filter, then transform, then filter again, then aggregate. The data flows through without materializing intermediate results. In big data frameworks like PySpark and Apache Beam, this exact pattern is how you build complex data transformations that process terabytes efficiently.
The order of filter and map matters for performance. Filtering first reduces the number of elements that map needs to transform. Database query optimizers apply the same idea under the name "predicate pushdown": filters are moved as close to the data source as possible.
01 Source data: Start with the raw collection of records
02 Filter first: Remove unwanted elements early to reduce work
03 Transform next: Apply map() only to the elements that survived filtering
04 Materialize last: Convert to list() only when you need the final result

filter()
  • Named predicate: best with existing functions
  • Lazy evaluation: processes one at a time
[... if]
  • Inline filter: more Pythonic for simple use
  • Eager creation: builds full list in memory
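
The trade-off between filter() and an inline if can be sketched side by side. The sample users are made up; the generator expression at the end is a third option, not covered above, that combines comprehension syntax with lazy evaluation.

```python
users = [
    {"name": "Alice", "age": 30, "active": True},
    {"name": "Bob", "age": 17, "active": True},
    {"name": "Charlie", "age": 25, "active": False},
    {"name": "Diana", "age": 19, "active": True},
]

# Lazy: filter() and map() yield matches one at a time.
lazy_names = map(
    lambda u: u["name"],
    filter(lambda u: u["active"] and u["age"] >= 18, users),
)

# Eager: the list comprehension builds the whole list immediately.
eager_names = [u["name"] for u in users if u["active"] and u["age"] >= 18]

# Lazy with comprehension syntax: a generator expression.
gen_names = (u["name"] for u in users if u["active"] and u["age"] >= 18)

lazy_result = list(lazy_names)
gen_result = list(gen_names)
print(lazy_result)  # ['Alice', 'Diana']
print(eager_names)  # ['Alice', 'Diana']
print(gen_result)   # ['Alice', 'Diana']
```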

Using zip() for Combining

Daily Life
Interviews

Merge parallel lists into paired records

The zip() function combines multiple iterables element-by-element, creating tuples of corresponding items. It's like a zipper that interleaves teeth from both sides. This is essential for working with parallel data structures where related information is stored in separate sequences.

In data engineering, zip() appears when merging columns, pairing keys with values, iterating over multiple lists simultaneously, or transposing data structures. It's also the foundation for creating dictionaries from separate key and value lists, a common data transformation pattern.
The name "zip" comes from the analogy to a physical zipper, which joins two rows of teeth by meshing each tooth with its counterpart on the other side. In the same way, zip() matches up the elements that sit at the same position in each input sequence. The result is a series of tuples: pairs for two inputs, triples for three, and so on.
Understanding zip() is essential for working with columnar data. When you have separate lists for names, ages, and cities, zip() lets you iterate over them together as if they were rows in a table. This row-oriented view is often more natural for processing records, even when the original data came in column-oriented format.

Basic zip() Usage

names = ["Alice", "Bob", "Charlie"]
ages = [30, 25, 35]
cities = ["Seattle", "Portland", "Denver"]

# Zip two lists
pairs = list(zip(names, ages))
print("Name-age pairs:", pairs)

# Zip three lists
combined = list(zip(names, ages, cities))
print("All three:", combined)

# Iterate over zipped data
print()
print("Formatted:")
for name, age, city in zip(names, ages, cities):
    print(f"  {name}, {age}, from {city}")
>>>Output
Name-age pairs: [('Alice', 30), ('Bob', 25), ('Charlie', 35)]
All three: [('Alice', 30, 'Seattle'), ('Bob', 25, 'Portland'), ('Charlie', 35, 'Denver')]
 
Formatted:
  Alice, 30, from Seattle
  Bob, 25, from Portland
  Charlie, 35, from Denver
Each call to zip() creates an iterator of tuples. The first tuple contains the first element from each input, the second tuple contains the second elements, and so on. This parallel iteration is cleaner than managing multiple index variables.

Without zip(), you would need to iterate using indices: for i in range(len(names)). This approach is harder to read and error-prone: if another list is shorter than names, indexing it raises an IndexError. The zip() version directly expresses the intent: iterate over these collections together, taking one element from each at a time.
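
To make the comparison concrete, here is a small sketch with deliberately mismatched sample lists (hypothetical data). The index-based loop fails partway through, while zip() simply stops at the shorter input.

```python
names = ["Alice", "Bob", "Charlie"]
ages = [30, 25]  # one value short, on purpose

# Index-based loop: raises IndexError once i reaches 2.
seen = []
try:
    for i in range(len(names)):
        seen.append((names[i], ages[i]))
except IndexError as exc:
    print("Index loop failed:", exc)

# zip() version: no indices to manage, stops at the shorter list.
paired = list(zip(names, ages))
print(paired)  # [('Alice', 30), ('Bob', 25)]
```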

The unpacking in the for loop, for name, age, city in zip(...), is called tuple unpacking. Each iteration, zip() yields a tuple containing one element from each input. The for statement unpacks that tuple directly into named variables. This combination of zip() and tuple unpacking is idiomatic Python that you will see in professional codebases everywhere.

Handling Unequal Lengths

By default, zip() stops when the shortest iterable is exhausted. Extra elements in longer iterables are silently ignored. This behavior is often surprising to beginners and can cause subtle bugs if you do not expect it:

from itertools import zip_longest

names = ["Alice", "Bob", "Charlie", "Diana"]
scores = [95, 87, 92]

# Default: stops at shortest
result = list(zip(names, scores))
print("Default zip:", result)
print("Diana is missing!")

# Use itertools.zip_longest to include all
result_all = list(zip_longest(names, scores, fillvalue=0))
print()
print("With zip_longest:", result_all)

# Custom fill value
result_na = list(zip_longest(names, scores, fillvalue="N/A"))
print("With N/A fill:", result_na)
>>>Output
Default zip: [('Alice', 95), ('Bob', 87), ('Charlie', 92)]
Diana is missing!
 
With zip_longest: [('Alice', 95), ('Bob', 87), ('Charlie', 92), ('Diana', 0)]
With N/A fill: [('Alice', 95), ('Bob', 87), ('Charlie', 92), ('Diana', 'N/A')]

The itertools.zip_longest() function continues until the longest iterable is exhausted, filling missing values with the specified fillvalue. This is crucial when you need to preserve all data even when sources have different lengths.

Choosing between zip() and zip_longest() depends on your data quality assumptions. If you expect perfectly aligned data and want errors when they do not match, use zip() with strict=True. If missing values are acceptable and you want to fill them with defaults, use zip_longest(). If you do not care about extra data and want to silently truncate, use plain zip(). Each choice reflects a different assumption about your data.
TIP
In Python 3.10+, you can pass strict=True to zip() to raise a ValueError if the lengths differ. This catches data alignment bugs early instead of silently dropping data.

Dicts from zip()

One of the most common uses of zip() is creating dictionaries from parallel lists of keys and values:
# Column headers and row data
headers = ["name", "age", "city", "salary"]
row = ["Alice", 30, "Seattle", 95000]

# Create dict from parallel lists
record = dict(zip(headers, row))
print("Record:", record)
print("Name:", record["name"])
print("Salary:", record["salary"])

# Process multiple rows
rows = [
    ["Alice", 30, "Seattle", 95000],
    ["Bob", 25, "Portland", 85000],
    ["Charlie", 35, "Denver", 105000],
]

records = [dict(zip(headers, row)) for row in rows]
print()
print("All records:")
for r in records:
    print(f"  {r['name']}: ${r['salary']}")
>>>Output
Record: {'name': 'Alice', 'age': 30, 'city': 'Seattle', 'salary': 95000}
Name: Alice
Salary: 95000
 
All records:
  Alice: $95000
  Bob: $85000
  Charlie: $105000
This pattern is essential for CSV processing and working with columnar data. When you read a file with headers in the first row and data in subsequent rows, zip() with dict() converts each row to a named record instantly.
The dict(zip(keys, values)) idiom is so common in Python that it has become a standard pattern. You will see it in code that parses configuration files, processes API responses, and transforms database query results. It is more concise than building the dictionary manually with a loop, and it clearly expresses the intent: create a mapping from these keys to these values.
When working with pandas DataFrames, you often convert between row-oriented and column-oriented data. The zip() pattern helps you understand what these conversions do under the hood.
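
As a sketch of the CSV case, the snippet below parses an in-memory string with the standard csv module. The sample text is made up, and io.StringIO stands in for a real file. Note that csv.reader yields every field as a string, so numeric columns still need converting.

```python
import csv
import io

# Made-up CSV content; io.StringIO stands in for an open file.
raw = io.StringIO("name,age,city\nAlice,30,Seattle\nBob,25,Portland\n")

reader = csv.reader(raw)
headers = next(reader)  # first row holds the column names

# Pair each data row with the headers to build named records.
records = [dict(zip(headers, row)) for row in reader]
print(records[0])  # {'name': 'Alice', 'age': '30', 'city': 'Seattle'}
```

The standard library's csv.DictReader performs the same header-to-row pairing for you; writing it out with zip() shows what that convenience does.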
Debug Challenge

> This code zips two parallel lists together and tries to access a key, but the zip object is an iterator, not a dictionary.

TypeError: 'zip' object is not subscriptable
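
The following is only a hypothetical reconstruction of the kind of bug described, with invented variable names. Subscripting a zip object fails; wrapping it in dict() first gives a mapping that supports key lookup.

```python
keys = ["name", "role"]
values = ["Alice", "Engineer"]

zipped = zip(keys, values)
try:
    zipped["name"]  # a zip object is an iterator, not a mapping
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)

# Fix: build a dictionary from the pairs before looking up a key.
record = dict(zip(keys, values))
print(record["name"])  # Alice
```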

Unzipping with zip(*)

You can "unzip" a list of tuples back into separate lists using zip(*pairs). The * unpacks the list, passing each tuple as a separate argument. This reverses the zip operation, separating combined data back into its component sequences:

# Data as tuples
data = [("Alice", 95), ("Bob", 87), ("Charlie", 92)]

# Unzip into separate lists
names, scores = zip(*data)
print("Names:", names)
print("Scores:", scores)

# Note: results are tuples, convert if needed
names_list = list(names)
scores_list = list(scores)
print()
print("As lists:", names_list, scores_list)

# Useful for transposing
matrix = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
transposed = list(zip(*matrix))
print()
print("Original rows:", matrix)
print("Transposed:", transposed)
>>>Output
Names: ('Alice', 'Bob', 'Charlie')
Scores: (95, 87, 92)
 
As lists: ['Alice', 'Bob', 'Charlie'] [95, 87, 92]
 
Original rows: [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
Transposed: [(1, 4, 7), (2, 5, 8), (3, 6, 9)]
The unzip operation is essentially transposing: rows become columns and columns become rows. This is useful when you need to switch between row-oriented and column-oriented data formats, a common transformation in data processing.
Matrix transposition appears throughout data science: converting features from row format to column format for machine learning libraries, pivoting data for visualization, or preparing data for database insertion. The zip(*matrix) idiom is a concise way to transpose a list of equal-length rows in Python, with no explicit loops and no index manipulation. Keep in mind that each resulting row is a tuple, and that rows of unequal length are truncated to the shortest one.
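
One practical use of the transposed view is column-wise aggregation. The sketch below uses made-up rows. It also shows a caveat worth keeping in mind: because zip() stops at the shortest input, ragged rows are truncated during transposition.

```python
rows = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

# Each tuple produced by zip(*rows) is one column of the original data.
column_totals = [sum(col) for col in zip(*rows)]
print(column_totals)  # [12, 15, 18]

# Caveat: rows of unequal length are cut to the shortest row.
ragged = [(1, 2, 3), (4, 5)]
ragged_transposed = list(zip(*ragged))
print(ragged_transposed)  # [(1, 4), (2, 5)]
```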
zip() Default
  • Stops at shortest input
  • Safe, no errors
  • May silently lose data
  • Use when lengths match
zip_longest()
  • Continues to longest input
  • Fills missing with default
  • Preserves all data
  • Use for unequal lengths

Common Mistakes

These are the most frequent errors when working with collection functions. Each mistake appears regularly in code reviews and interviews. Understanding why they occur helps you write correct code from the start.
Technical interviews often include questions designed to expose these common mistakes. An interviewer might ask you to filter a list, then ask what happens if you try to use the result twice. Or they might give you a sorting problem and see whether you accidentally assign the result of .sort() to a variable. Knowing these pitfalls in advance lets you avoid them under pressure.

Mistake 1: Spent Iterators

numbers = [1, 2, 3, 4, 5]
squared = map(lambda x: x ** 2, numbers)

# First use: works fine
print("First iteration:", list(squared))

# Second use: empty! Iterator exhausted
print("Second iteration:", list(squared))

# FIX: Convert to list immediately if reusing
squared_list = list(map(lambda x: x ** 2, numbers))
print("First use:", squared_list)
print("Second use:", squared_list)
>>>Output
First iteration: [1, 4, 9, 16, 25]
Second iteration: []
First use: [1, 4, 9, 16, 25]
Second use: [1, 4, 9, 16, 25]
Map, filter, and zip return iterators that can only be consumed once. If you need to use the result multiple times, convert to a list immediately. This is a common source of bugs where code works the first time but fails on subsequent uses.
This behavior exists for memory efficiency. An iterator only stores the current position and the logic to compute the next element, not all the results at once. For large datasets, this saves enormous amounts of memory. But it means you must plan ahead: if you need the results more than once, pay the memory cost upfront by converting to a list. If you only need one pass, keep it as an iterator.
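
The trade-off can be sketched as follows. The helper name squares() is hypothetical; it simply builds a fresh iterator on each call, which is one way to take several passes without holding every result in memory.

```python
numbers = [1, 2, 3, 4, 5]

def squares():
    """Return a new lazy iterator over the squared numbers."""
    return map(lambda x: x ** 2, numbers)

# One pass: an iterator is enough.
total = sum(squares())

# A second pass needs a fresh iterator (or a list made up front).
largest = max(squares())

# Reusing a single iterator leaves nothing for the second pass.
once = squares()
first_pass = list(once)
second_pass = list(once)
print(total, largest, first_pass, second_pass)
```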

Mistake 2: Convert First

numbers = [1, 2, 3]
doubled = map(lambda x: x * 2, numbers)

# WRONG: Printing the map object
print("Wrong:", doubled)

# WRONG: Trying to index
# print(doubled[0])  # TypeError!

# CORRECT: Convert first
doubled_list = list(map(lambda x: x * 2, numbers))
print("Correct:", doubled_list)
print("First element:", doubled_list[0])
>>>Output
Wrong: <map object at 0x...>
Correct: [2, 4, 6]
First element: 2
Map, filter, and zip objects are iterators, not lists. You cannot index them or see their contents directly. Always convert to list() if you need random access or want to print the actual values.
The confusing output from printing a map object is a common first encounter with Python's iterator protocol. The output shows the object type and memory address, not the contents. This is because iterators do not compute their results until requested. Printing triggers Python to ask the object for a string representation, but the map object does not evaluate and format all its results just for display.
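
When you only want to inspect a few values, you do not have to materialize the whole result. A minimal sketch using the built-in next() and itertools.islice; note that peeking consumes the elements it reads.

```python
from itertools import islice

numbers = [1, 2, 3, 4, 5]
doubled = map(lambda x: x * 2, numbers)

first = next(doubled)                # pulls exactly one value
next_two = list(islice(doubled, 2))  # pulls the following two
remaining = list(doubled)            # whatever is left

print(first, next_two, remaining)  # 2 [4, 6] [8, 10]
```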

Mistake 3: Sort Mutation

numbers = [3, 1, 4, 1, 5]

# sorted() returns a NEW list and leaves the original untouched
result = sorted(numbers)
print("Original:", numbers)
print("Sorted copy:", result)

# WRONG: .sort() sorts in place and returns None, so result2 is None
result2 = numbers.sort()
print()
print("After .sort():", numbers)
print("Return value:", result2)
>>>Output
Original: [3, 1, 4, 1, 5]
Sorted copy: [1, 1, 3, 4, 5]
 
After .sort(): [1, 1, 3, 4, 5]
Return value: None

Remember: sorted() is a function that returns a new list; .sort() is a method that modifies in place and returns None. A common bug is writing x = mylist.sort() which makes x equal to None.

This distinction between sorted() and .sort() reflects Python's design philosophy. Methods that modify objects in place return None to make clear that the modification happened to the original object. If they returned the object, you might think you got a copy. The None return value forces you to acknowledge the in-place mutation. This same pattern appears with list.append(), list.extend(), and dict.update().
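
A quick way to confirm this convention, using throwaway sample values (the variable names are illustrative only):

```python
items = [3, 1, 2]
settings = {"debug": False}

# Each in-place method mutates its object and returns None.
results = [
    items.append(4),
    items.extend([6, 5]),
    items.sort(),
    settings.update({"debug": True}),
]

print(results)   # [None, None, None, None]
print(items)     # [1, 2, 3, 4, 5, 6]
print(settings)  # {'debug': True}
```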

Mistake 4: zip() Data Loss

from itertools import zip_longest

names = ["Alice", "Bob", "Charlie", "Diana", "Eve"]
scores = [95, 87, 92]

# zip() silently drops Diana and Eve
paired = dict(zip(names, scores))
print("Paired:", paired)
print("Missing: Diana and Eve!")

# FIX: Use zip_longest or strict mode (Python 3.10+)
complete = list(zip_longest(names, scores, fillvalue="MISSING"))
print()
print("All data:", complete)
>>>Output
Paired: {'Alice': 95, 'Bob': 87, 'Charlie': 92}
Missing: Diana and Eve!
 
All data: [('Alice', 95), ('Bob', 87), ('Charlie', 92), ('Diana', 'MISSING'), ('Eve', 'MISSING')]
The default zip() behavior silently drops data when inputs have different lengths. This can hide data quality issues. Use zip_longest() when preserving all data matters, or strict=True in Python 3.10+ to catch mismatches.
Silent data loss is particularly dangerous in data pipelines. If your ETL job silently drops records because of a length mismatch, you might not notice until someone wonders why the downstream reports are missing data. Adding assertions or using strict=True turns these silent failures into loud errors that you can investigate and fix before they cause business problems.
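
One way to make the failure loud on any Python 3 version is an explicit length check before zipping. The helper name zip_checked is hypothetical, and this sketch assumes both inputs are sequences that support len(). On Python 3.10+, zip(..., strict=True) covers the same need and also works with arbitrary iterables.

```python
def zip_checked(keys, values):
    """Pair two sequences, raising ValueError if their lengths differ."""
    if len(keys) != len(values):
        raise ValueError(
            f"length mismatch: {len(keys)} keys vs {len(values)} values"
        )
    return list(zip(keys, values))

names = ["Alice", "Bob", "Charlie", "Diana", "Eve"]
scores = [95, 87, 92]

try:
    zip_checked(names, scores)
except ValueError as exc:
    message = str(exc)
    print("Pipeline stopped:", message)
```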
TIP
When zip() produces fewer results than expected, check that your input iterables have equal lengths. Length mismatches often indicate data problems upstream.
Python Quiz

> Combine separate lists of column names and row values into a dictionary record. Pick the function that pairs parallel elements and the constructor that creates a key-value mapping.

keys = ["name", "age", "role"]
vals = ["Alice", 30, "Engineer"]
record = ___(___(keys, vals))
print(record["role"])
Options: tuple, zip, map, list, dict

zip() is a fundamental tool for working with parallel data. Whenever you have related information stored in separate sequences, zip() lets you iterate over them together as if they were rows in a table.

Choosing the right collection type can dramatically improve both code clarity and performance.
PUTTING IT ALL TOGETHER

> You are a data engineer at Stripe building a daily reconciliation report that ranks payment processors by total volume, normalizes transaction amounts, removes failed entries, and pairs each processor with its currency symbol from a reference list.

sorted() with a key function ranks the processors list by descending transaction volume without mutating the original data.
sorted() with reverse=True reorders the ranked list so the highest-volume processor appears first in the generated report.
map() applies a normalization function to every transaction amount in the collection, producing a transformed sequence without an explicit loop.
filter() removes failed-status entries from the transaction list so only settled payments feed into the volume totals.
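
A compact sketch of how those pieces might fit together. All data here is invented for illustration: the processor names, amounts in cents, statuses, and the reference list of currency symbols are hypothetical, and converting cents to whole units is only one possible reading of "normalizes".

```python
# Invented sample data: (processor, amount_in_cents, status).
transactions = [
    ("alpha", 125000, "settled"),
    ("beta", 98000, "failed"),
    ("beta", 210000, "settled"),
    ("alpha", 40000, "settled"),
]
processors = ["alpha", "beta"]
symbols = ["$", "R$"]  # hypothetical reference list, aligned with processors

# filter(): keep settled payments only.
settled = filter(lambda t: t[2] == "settled", transactions)

# map(): normalize cents to whole currency units.
normalized = map(lambda t: (t[0], t[1] / 100), settled)

# Aggregate volume per processor (this loop consumes the lazy pipeline).
totals = {}
for name, amount in normalized:
    totals[name] = totals.get(name, 0) + amount

# zip(): pair each processor with its currency symbol.
symbol_for = dict(zip(processors, symbols))

# sorted() with key and reverse=True: highest volume first.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
report = [f"{name}: {symbol_for[name]}{volume:.2f}" for name, volume in ranked]
print(report)  # ['beta: R$2100.00', 'alpha: $1650.00']
```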
KEY TAKEAWAYS
  • sorted(items, key=func) returns a new sorted list using func to extract comparison values
  • sorted(items, reverse=True) sorts in descending order; combine with key for custom descending sorts
  • For multi-level sorts with mixed directions, negate numeric values in the key tuple
  • map(func, items) applies func to each element, returning an iterator of transformed values
  • filter(predicate, items) keeps elements for which predicate returns a truthy value; filter(None, items) removes falsy values
  • zip(a, b, c) groups corresponding elements into tuples; stops at the shortest input by default
  • Use itertools.zip_longest() to preserve all data when inputs have unequal lengths
  • dict(zip(keys, values)) creates a dictionary from parallel key and value lists
  • map(), filter(), and zip() return iterators; convert to list() if reusing or indexing
  • Remember: sorted() returns a new list; .sort() modifies in place and returns None
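
The mixed-direction takeaway is the least obvious of these, so here is a minimal sketch with invented records: scores sort descending via negation while names break ties in ascending order.

```python
players = [
    {"name": "Bob", "score": 87},
    {"name": "Alice", "score": 95},
    {"name": "Diana", "score": 87},
    {"name": "Charlie", "score": 95},
]

# Negating the score sorts it high-to-low; name stays A-to-Z for ties.
ranked = sorted(players, key=lambda p: (-p["score"], p["name"]))
ranked_names = [p["name"] for p in ranked]
print(ranked_names)  # ['Alice', 'Charlie', 'Bob', 'Diana']
```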

Transform, filter, and combine data

Category: Python
Difficulty: intermediate
Duration: 45 minutes
Challenges: 0 hands-on challenges

Topics covered: Using sorted() with key, Using sorted() in Reverse, Using map() for Transforms, Using filter() for Selection, Using zip() for Combining
