Data Structures: Beginner

Friendster collapsed under its own success because it stored friend relationships in arrays and performed full scans to check connections, making the site grind to a halt as its user base grew. Redis chose hash maps as its core data structure, achieving constant-time key lookups that scale to millions of queries per second without a performance cliff. The data structure you choose in the first week of a project determines whether your application thrives or crashes under growth. This lesson teaches you how to match the right data structure to the right problem before it becomes a production crisis.

Lists: Ordered Collections


Store and retrieve ordered data

Lists are Python's workhorse data structure. They hold items in a specific order, allow duplicates, and can grow or shrink as needed. When you receive a batch of records from an API, process rows from a CSV file, or collect results from a database query, you typically work with lists. Lists are by far the most commonly used data structure in Python.
What makes lists so versatile is their flexibility. They can hold any type of data - numbers, strings, other lists, dictionaries, or custom objects. You can mix types within the same list, though in practice keeping types consistent makes code easier to understand and maintain.

Create a list using square brackets []. Items are separated by commas. The order you specify is the order they are stored, and that order is preserved throughout the life of the list.

# A list of transaction amounts
transactions = [150.00, 89.99, 42.50, 200.00, 89.99]
print("Transactions:", transactions)
print("Count:", len(transactions))

print("First transaction:", transactions[0])
print("Last transaction:", transactions[-1])
>>>Output
Transactions: [150.0, 89.99, 42.5, 200.0, 89.99]
Count: 5
First transaction: 150.0
Last transaction: 89.99
Indexing is how you access individual elements. The first element is at index 0, the second at index 1, and so on. Python also supports negative indexing: -1 refers to the last element, -2 to the second-to-last, and so on. This is incredibly useful when you need to access elements from the end of a list without knowing its length.
Lists are mutable, meaning you can modify them after creation. You can append new items, insert at specific positions, remove items, and change existing values. This flexibility makes lists ideal for building up results incrementally, which is a common pattern in data processing.
# Building a list of processed records
processed_ids = []

# Simulate processing incoming data
for record_id in [101, 102, 103]:
    processed_ids.append(record_id)
    print(f"Processed record {record_id}")

print("All processed:", processed_ids)
>>>Output
Processed record 101
Processed record 102
Processed record 103
All processed: [101, 102, 103]

The append method adds an item to the end of the list. This is one of the most common operations you will perform. Lists grow dynamically - you do not need to specify a size upfront, and Python handles the memory management for you. This makes lists perfect for situations where you do not know in advance how many items you will have.

# Slicing: extract portions of a list
metrics = [10, 20, 30, 40, 50, 60, 70]

# Get first three elements
print("First three:", metrics[0:3])

# Get elements from index 2 up to, but not including, index 5
print("Middle:", metrics[2:5])

# Get last three elements
print("Last three:", metrics[-3:])

# Skip every other element
print("Every other:", metrics[::2])
>>>Output
First three: [10, 20, 30]
Middle: [30, 40, 50]
Last three: [50, 60, 70]
Every other: [10, 30, 50, 70]

Slicing is a powerful feature that lets you extract portions of a list using the syntax list[start:stop:step]. The start index is included, but the stop index is excluded. If you omit start, it defaults to the beginning. If you omit stop, it goes to the end. The optional step parameter lets you skip elements.
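
The defaults described above can be checked with a short sketch. The readings list is made up for illustration. The sketch also shows two things the syntax implies: a negative step walks the list backwards, and a slice always returns a new list rather than a view of the original.

```python
# Hypothetical sample data, not taken from the lesson
readings = [5, 10, 15, 20, 25, 30]

first_two = readings[:2]        # start omitted: begins at index 0
from_third = readings[2:]       # stop omitted: runs to the end
reversed_copy = readings[::-1]  # negative step: walks backwards
full_copy = readings[:]         # both omitted: a shallow copy

# Slices are new lists, so changing the copy leaves the original alone
full_copy[0] = 999

print(first_two)
print(from_third)
print(reversed_copy)
print(readings[0], full_copy[0])
```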

Fill in the Blank

> You have a list of five metrics [10, 20, 30, 40, 50] and need to extract the middle three values. Pick the slice that captures exactly those elements.

metrics = [10, 20, 30, 40, 50]
print(metrics[___])
Lists are the go-to choice whenever your data has a natural ordering. These are the scenarios where a list is usually the best fit.
  • Order matters: first in, first out processing where sequence is meaningful
  • Duplicates allowed: track every occurrence, like repeated transactions or events
  • Dynamic size: items will be added or removed as your program runs
  • Position-based access: retrieve the 1st, 5th, or last item by numeric index instantly
  • Incremental building: append results one at a time inside a processing loop

Essential List Methods

Lists come with a rich set of built-in methods for manipulation. Understanding these methods lets you write cleaner, more efficient code. Here are the most commonly used methods that you will encounter in almost every Python project.
scores = [85, 92, 78, 95, 88]

# Add elements
scores.append(90)
scores.insert(0, 100)
print("After adding:", scores)

# Remove elements
scores.remove(78)
last = scores.pop()
print("After removing:", scores)
print("Popped value:", last)

# Find and count
print("Index of 95:", scores.index(95))
print("Count of 92:", scores.count(92))
>>>Output
After adding: [100, 85, 92, 78, 95, 88, 90]
After removing: [100, 85, 92, 95, 88]
Popped value: 90
Index of 95: 3
Count of 92: 1
Python Quiz

> Add an element to the end of a list, then remove and capture the last element. Pick the method that grows the list and the one that shrinks it while returning the removed value.

data = [10, 20, 30, 40]
data.___(50)
last = data.___()
print(last)
print(len(data))
append
extend
pop
insert
remove

List Performance Overview

Understanding performance helps you write efficient code. Lists excel at accessing items by index and appending to the end. Index access takes O(1) constant time, and append is O(1) on average (amortized), so both stay fast regardless of list size. However, searching for a specific value requires checking each item one by one, which becomes slow for large lists.

Inserting or removing from the beginning or middle of a list is slow because all subsequent elements must be shifted. If you frequently need to add or remove from both ends, consider using a deque from the collections module instead.
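
As a minimal sketch of the deque alternative mentioned above, the example below adds and removes items at both ends. The task names are placeholders chosen for illustration.

```python
from collections import deque

# Hypothetical task names used only for illustration
tasks = deque(["validate", "transform", "load"])

tasks.appendleft("extract")  # add at the front without shifting elements
tasks.append("report")       # add at the back, as with a list

first = tasks.popleft()      # remove from the front
last = tasks.pop()           # remove from the back

print(first, last)
print(list(tasks))
```

A deque gives up fast access to the middle in exchange for fast operations at both ends, so it suits queues better than general-purpose storage.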

Fast Operations
  • Access by index: list[0]
  • Append to end: list.append(x)
  • Pop from end: list.pop()
  • Get length: len(list)
Slow Operations
  • Search: x in list
  • Insert at start: list.insert(0, x)
  • Remove by value: list.remove(x)
  • Insert in middle
TIP
If you frequently need to check whether an item exists in a collection, consider using a set instead of a list. Sets are optimized for membership testing and can check membership in constant time.
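CPython implements lists as dynamic arrays, which explains why some operations are fast and others are not: indexing and appending touch a single slot, while inserting or removing near the front shifts every element after it.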

Tuples: Immutable Sequences


Lock down data that must not change

Tuples look similar to lists but have one critical difference: they cannot be changed after creation. Once you create a tuple, you cannot add, remove, or modify its elements. This immutability is not a limitation - it is a feature that makes your code safer and more predictable.
Think about data that should never change: database connection parameters, geographic coordinates, RGB color values, or API response codes. If you accidentally modify such data, bugs can be extremely difficult to track down. Tuples prevent this entire category of bugs by making modification impossible.

Create a tuple using parentheses () or just commas. Access elements the same way as lists, using square bracket indexing. You can iterate over tuples, slice them, and use all the read-only operations that work on lists.

# Database connection settings (should never change)
db_config = ("prod-db.company.com", 5432, "analytics")
print("Host:", db_config[0])
print("Port:", db_config[1])
print("Database:", db_config[2])

# Geographic coordinates
location = (37.7749, -122.4194)
print(f"Latitude: {location[0]}, Longitude: {location[1]}")
>>>Output
Host: prod-db.company.com
Port: 5432
Database: analytics
Latitude: 37.7749, Longitude: -122.4194
Notice that we still use square brackets to access tuple elements, just like lists. The difference is only in what operations are allowed. Reading is fine; writing is forbidden. Attempting to modify a tuple raises a TypeError, and Python enforces this at runtime.
coordinates = (10, 20)
# Uncommenting the next line raises:
# TypeError: 'tuple' object does not support item assignment
# coordinates[0] = 15

# If you need to "change" a tuple, create a new one
new_coords = (15, coordinates[1])
If you find yourself needing to "modify" a tuple, the solution is to create a new tuple with the desired values. This pattern is common in functional programming and ensures that any code holding a reference to the original tuple is not affected by your changes.

Tuples Instead of Lists?

Immutability provides several important benefits that make tuples worth using even though lists are more flexible. Understanding these benefits helps you make informed decisions about which structure to use.
First, immutability prevents bugs. If data should not change, making it a tuple ensures it cannot change accidentally - not by your code, not by library code, not by anyone. This is a form of defensive programming that catches errors at runtime rather than letting them silently corrupt your data.
Second, tuples are hashable (as long as every element is hashable too), which means they can be used as dictionary keys. Lists cannot be dictionary keys because they are mutable - if a key could change after being stored, its hash would change and the dictionary could no longer find it. This makes tuples the natural choice for composite keys such as (latitude, longitude) pairs.
Third, tuples are slightly more memory-efficient and faster to create than lists. For small sequences that you create many times, this can add up. Python can also optimize certain operations on tuples because it knows they will not change.
# Tuples as dictionary keys (lists cannot do this)
# Map city coordinates to population
city_populations = {
    (40.7128, -74.0060): 8_336_817,
    (34.0522, -118.2437): 3_979_576,
    (37.7749, -122.4194): 873_965,
}

sf_coords = (37.7749, -122.4194)
print(f"SF population: {city_populations[sf_coords]:,}")
>>>Output
SF population: 873,965
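
The third benefit, a smaller memory footprint, can be checked with sys.getsizeof(). The sketch below is a rough check that assumes CPython. Exact byte counts vary by Python version and platform, so it prints the sizes without promising specific numbers. The sample values are illustrative.

```python
import sys

# The same three values stored both ways (illustrative data)
as_list = [37.7749, -122.4194, 16.0]
as_tuple = (37.7749, -122.4194, 16.0)

# getsizeof reports the container's own size, not the elements inside it
list_size = sys.getsizeof(as_list)
tuple_size = sys.getsizeof(as_tuple)

print("list bytes:", list_size)
print("tuple bytes:", tuple_size)
print("tuple is smaller:", tuple_size < list_size)
```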
Use Lists When
  • Items will be added/removed
  • Order may change (sorting)
  • Building results incrementally
Use Tuples When
  • Data should never change
  • Need dictionary keys
  • Returning multiple values

Tuple Unpacking

One of the most elegant features of tuples is unpacking: assigning tuple elements to multiple variables in a single statement. This makes code cleaner and more readable, especially when functions return multiple values. Unpacking is so useful that Python developers use it constantly.
The number of variables on the left must match the number of elements in the tuple. Python also supports extended unpacking with the star operator, letting you capture multiple elements into a list while unpacking the rest into individual variables.
# Unpacking a tuple into separate variables
user_record = ("alice_42", "alice@example.com", 28)
username, email, age = user_record

print(f"User: {username}")
print(f"Email: {email}")
print(f"Age: {age}")

def get_min_max(numbers):
    return (min(numbers), max(numbers))

data = [45, 23, 67, 12, 89, 34]
minimum, maximum = get_min_max(data)
print(f"Range: {minimum} to {maximum}")
>>>Output
User: alice_42
Email: alice@example.com
Age: 28
Range: 12 to 89

Tuple unpacking is especially common when iterating over dictionary items or when working with functions that return multiple values. The enumerate function, for example, returns tuples of (index, value) that you typically unpack in a for loop.

# Unpacking in loops - very common pattern
scores = [85, 92, 78, 95]
for index, score in enumerate(scores):
    print(f"Position {index}: {score}")

# Extended unpacking with *
first, *middle, last = [1, 2, 3, 4, 5]
print(f"First: {first}, Middle: {middle}, Last: {last}")
>>>Output
Position 0: 85
Position 1: 92
Position 2: 78
Position 3: 95
First: 1, Middle: [2, 3, 4], Last: 5
Debug Challenge

> This code tries to change the first element of a tuple, but tuples are immutable and do not support item assignment.

TypeError: 'tuple' object does not support item assignment

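Immutability also makes tuples safe to share across threads without locking. Because no code can modify a tuple after creation, multiple threads can read the same tuple simultaneously without risk of data corruption. This guarantee covers the tuple itself; a mutable object stored inside a tuple, such as a list, can still be changed.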

When designing functions that return multiple values, tuples are the idiomatic choice. Built-ins such as divmod() and methods such as str.partition() return tuples that you unpack at the call site, and your own functions can do the same, as get_min_max() does above.
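
A short sketch of that idiom follows. divmod() is a real built-in that returns a (quotient, remainder) tuple. The summarize() function is a hypothetical helper written only for this example.

```python
def summarize(values):
    """Hypothetical helper: return the smallest, largest, and mean value."""
    # The commas build a tuple; the parentheses are optional here
    return min(values), max(values), sum(values) / len(values)

# divmod returns a (quotient, remainder) tuple
minutes, seconds = divmod(135, 60)
print(f"{minutes}m {seconds}s")

# Unpack the returned tuple at the call site
low, high, mean = summarize([4, 8, 6])
print(low, high, mean)
```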

Dicts: Key-Value Storage


Look up any value by key instantly

Dictionaries are one of Python's most powerful and frequently used data structures. They store key-value pairs, allowing you to look up values by their keys instantly. Think of a dictionary like a real dictionary: you look up a word (key) to find its definition (value). The difference is that Python dictionaries can use almost any immutable type as a key, not just strings.
In data engineering, dictionaries are absolutely everywhere. JSON objects from APIs parse into dictionaries. Configuration files parse into dictionaries. Database rows are often represented as dictionaries. Caches use dictionaries. Environment variables are exposed through the dictionary-like os.environ. Mastering dictionaries is essential for any Python developer.

Create a dictionary using curly braces {} with key-value pairs separated by colons. Keys must be hashable, which in practice means immutable values such as strings, numbers, or tuples of those, while values can be anything - including other dictionaries, lists, or custom objects.

# User profile from an API response
user = {
    "user_id": "u_12345",
    "name": "Sarah Chen",
    "email": "sarah@example.com",
    "is_premium": True,
    "login_count": 47
}

print("Name:", user["name"])
print("Premium status:", user["is_premium"])
print("Total keys:", len(user))
>>>Output
Name: Sarah Chen
Premium status: True
Total keys: 5
Notice how the dictionary uses descriptive string keys like "user_id" and "name" instead of numeric indices. This makes your code self-documenting - you can tell exactly what each value represents just by looking at its key. Compare user["email"] to user[2] - the dictionary version is much clearer.
The magic of dictionaries is constant-time lookup on average. Whether your dictionary has 10 items or 10 million, finding a value by its key takes roughly the same amount of time. This is fundamentally different from lists, where searching requires checking each item one by one. This performance characteristic makes dictionaries ideal for building lookup tables and caches.
# Building a lookup table for fast access
# Map product IDs to prices
price_lookup = {
    "SKU001": 29.99,
    "SKU002": 49.99,
    "SKU003": 19.99,
    "SKU004": 99.99,
}

# Instant lookup - no searching required
product_id = "SKU003"
if product_id in price_lookup:
    print(f"Price for {product_id}: {price_lookup[product_id]}")
>>>Output
Price for SKU003: 19.99
This lookup table pattern is extremely common. Instead of searching through a list of products to find the price for SKU003, we go directly to it using the key. In a production system processing millions of lookups, this difference between constant time and linear search time is the difference between a responsive application and a slow one.
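
In practice the records often arrive as a list first, so the lookup table has to be built from it. The sketch below shows one way to do that with a dictionary comprehension. The product records and field names are hypothetical.

```python
# Hypothetical product records, as they might arrive from a query
products = [
    {"sku": "SKU001", "name": "Mouse", "price": 29.99},
    {"sku": "SKU002", "name": "Keyboard", "price": 49.99},
    {"sku": "SKU003", "name": "Cable", "price": 19.99},
]

# One pass over the list builds the index
by_sku = {product["sku"]: product for product in products}

# Every later lookup goes straight to the record
print(by_sku["SKU003"]["price"])
print("SKU999" in by_sku)
```

Building the index costs one pass over the data, which pays off as soon as you perform more than a handful of lookups.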

Iterating Over Dictionaries

You often need to loop through dictionary contents. Python provides several ways to iterate: over just keys, just values, or both key-value pairs together. The items() method returns both, which is usually what you want.
metrics = {"cpu": 45.2, "memory": 72.8, "disk": 58.1}

# Iterate over keys (default behavior)
print("Metrics tracked:")
for metric_name in metrics:
    print(f" - {metric_name}")

# Iterate over both keys and values
print("\nCurrent values:")
for name, value in metrics.items():
    print(f" {name}: {value}%")
>>>Output
Metrics tracked:
 - cpu
 - memory
 - disk

Current values:
 cpu: 45.2%
 memory: 72.8%
 disk: 58.1%
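
The loops above cover keys and key-value pairs. For the values alone, values() works the same way. A minimal sketch reusing the same metrics:

```python
metrics = {"cpu": 45.2, "memory": 72.8, "disk": 58.1}

# values() yields only the values
highest = max(metrics.values())
print("Highest reading:", highest)

# A key function finds the key that owns the largest value
busiest = max(metrics, key=metrics.get)
print("Busiest resource:", busiest)
```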

Modifying Dictionaries

Dictionaries are mutable. You can add new key-value pairs, update existing values, and remove entries. Adding a new key or updating an existing one uses the same syntax: assignment with square brackets. If the key exists, the value is updated; if not, a new entry is created.
metrics = {"requests": 1000, "errors": 5}
print("Initial:", metrics)

# Add new key
metrics["latency_ms"] = 45.2
print("After adding:", metrics)

# Update existing key
metrics["requests"] = 1050
print("After update:", metrics)

# Remove a key
del metrics["errors"]
print("After delete:", metrics)
>>>Output
Initial: {'requests': 1000, 'errors': 5}
After adding: {'requests': 1000, 'errors': 5, 'latency_ms': 45.2}
After update: {'requests': 1050, 'errors': 5, 'latency_ms': 45.2}
After delete: {'requests': 1050, 'latency_ms': 45.2}
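
Two other standard dictionary methods are worth knowing alongside del. pop() removes a key and hands back its value, and update() changes several keys at once. A brief sketch, continuing from the final state above:

```python
metrics = {"requests": 1050, "latency_ms": 45.2}

# pop() removes a key and returns its value
latency = metrics.pop("latency_ms")

# With a default, pop() does not raise if the key is absent
errors = metrics.pop("errors", 0)

# update() adds or overwrites several keys at once
metrics.update({"requests": 1100, "cache_hits": 300})

print(latency, errors)
print(metrics)
```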

Safe Key Access

Accessing a key that does not exist raises a KeyError. To handle missing keys gracefully, use the get() method, which returns None (or a default value you specify) instead of raising an error.

config = {"host": "localhost", "port": 8080}

# Risky: raises KeyError if key missing
# timeout = config["timeout"]  # KeyError!

# Safe: returns None if key missing
timeout = config.get("timeout")
print("Timeout:", timeout)

# Safe with default value
timeout = config.get("timeout", 30)
print("Timeout with default:", timeout)
>>>Output
Timeout: None
Timeout with default: 30
TIP
Always use .get() when a key might not exist. It prevents crashes and makes your code more robust against unexpected data.

Nested Dictionaries

Dictionary values can be any type, including other dictionaries. This creates nested structures, which are extremely common when working with JSON data from APIs. Accessing nested values requires chaining bracket notation or using multiple get() calls for safety.
# Nested dictionary representing API response
user_data = {
    "profile": {
        "name": "Alice Chen",
        "settings": {"theme": "dark", "notifications": True}
    },
    "stats": {"posts": 42, "followers": 1250}
}

# Access nested values
print("Name:", user_data["profile"]["name"])
print("Theme:", user_data["profile"]["settings"]["theme"])

# Safe nested access
language = user_data.get("profile", {}).get("settings", {}).get("language", "en")
print("Language:", language)
16print("Language:", language)
>>>Output
Name: Alice Chen
Theme: dark
Language: en

The chained get() pattern is verbose but safe. Each get() returns an empty dictionary if the key is missing, allowing the chain to continue without raising an error. The final get() returns the default value if the entire path does not exist.
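
When the same nested lookup appears in many places, the chain can be wrapped in a small function. The get_nested() helper below is hypothetical, not part of Python. It is a sketch that also covers one case the chain does not: an intermediate value that exists but is not a dictionary.

```python
def get_nested(data, *keys, default=None):
    """Hypothetical helper: follow keys through nested dicts, or return default."""
    current = data
    for key in keys:
        # Stop as soon as the path leaves dictionary territory
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

user_data = {"profile": {"name": "Alice Chen", "settings": {"theme": "dark"}}}

print(get_nested(user_data, "profile", "settings", "theme"))
print(get_nested(user_data, "profile", "settings", "language", default="en"))
print(get_nested(user_data, "profile", "name", "first", default="n/a"))
```

The last call returns the default because "name" holds a string. A chained get() would raise an AttributeError at that point, since strings have no get() method.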

Fill in the Blank

> You have a user dictionary with a nested "profile" containing "name". Pick the access method and fallback that safely navigates the nested structure.

user = {"profile": {"name": "Alice"}}
result = user.___("profile", ___).___("name", "unknown")
print(result)
Dictionaries are the right tool whenever you need to associate keys with values. These are the most common patterns.
  • LOOKUP (ID-based access): find users by ID instantly
  • STRUCT (named fields): organized data with keys
  • COUNT (tally items): count word occurrences
  • CACHE (store results): reuse computed answers
  • JSON (parse configs): read structured API data
Dictionaries shine when your data needs meaningful labels. They make code self-documenting: user["email"] communicates intent far more clearly than user[1], especially when reviewing code written months ago.
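
Two of the patterns listed above, counting and caching, fit in a few lines each. The sketch below uses made-up log words, and square() is a hypothetical stand-in for an expensive computation.

```python
words = ["error", "info", "error", "warning", "info", "error"]

# COUNT: get() supplies 0 the first time a word is seen
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
print(counts)

# CACHE: store each computed answer so repeated requests skip the work
cache = {}

def square(n):
    """Hypothetical stand-in for an expensive computation."""
    if n not in cache:
        cache[n] = n * n
    return cache[n]

print(square(12), square(12))
print(cache)
```

For counting specifically, the standard library's collections.Counter does the same job; the manual version is shown here because it exposes how the dictionary is used.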

Sets: Unique Collections


Eliminate duplicates and test membership

Sets are unordered collections of unique elements. When you add a duplicate to a set, it simply ignores it - no error, no warning, just silent deduplication. This makes sets perfect for eliminating duplicates, tracking unique visitors, and performing mathematical set operations like unions and intersections.
Unlike lists and tuples, sets do not maintain any particular order. The elements are stored based on their hash values, which optimizes for fast operations rather than sequence. This trade-off is worth it because sets excel at two critical operations: adding elements and checking if an element exists.

Create a set using curly braces {} with elements (not key-value pairs) or the set() function. Important gotcha: {} alone creates an empty dictionary, not an empty set. Use set() for an empty set.

# Track unique visitors
visitor_ids = {"user_1", "user_2", "user_3", "user_1", "user_2"}
print("Unique visitors:", visitor_ids)
print("Count:", len(visitor_ids))

# Convert list with duplicates to unique set
page_views = ["home", "products", "home", "checkout", "products", "home"]
unique_pages = set(page_views)
print("Unique pages visited:", unique_pages)
>>>Output
Unique visitors: {'user_3', 'user_1', 'user_2'}
Count: 3
Unique pages visited: {'checkout', 'products', 'home'}
Notice that the order of elements in a set is not guaranteed. Sets prioritize fast operations over maintaining insertion order. When we created the set with duplicate user IDs, the duplicates were automatically removed, leaving only three unique values. This automatic deduplication is incredibly useful when processing data.

If you need both uniqueness and order, use dict.fromkeys(). It builds a dictionary whose keys are your items and whose values are all None, and because dictionaries maintain insertion order (guaranteed since Python 3.7), the keys keep first-occurrence order.
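
A minimal sketch of that order-preserving approach, reusing the page_views data from the example above and assuming Python 3.7 or later:

```python
page_views = ["home", "products", "home", "checkout", "products", "home"]

# fromkeys() keeps the first occurrence of each key, in insertion order
unique_in_order = list(dict.fromkeys(page_views))
print(unique_in_order)
```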

Fast Membership Testing

The killer feature of sets is O(1) average-case membership testing. Checking if an element exists in a set takes roughly the same time whether the set has 100 elements or 100 million. This makes sets absolutely essential for any operation involving "is this item in my collection?" - a question that comes up constantly in data processing.

Consider a scenario where you need to filter a million records, keeping only those that match a list of 10,000 valid IDs. With a list, each of the million records requires checking up to 10,000 IDs - that is potentially 10 billion comparisons. With a set, each record requires exactly one lookup, giving you one million operations total. The difference can be hours versus seconds.
# Blacklist of blocked IP addresses
blocked_ips = {"192.168.1.100", "10.0.0.50", "172.16.0.1"}

# Check incoming requests
incoming_ip = "192.168.1.100"

if incoming_ip in blocked_ips:
    print(f"BLOCKED: {incoming_ip}")
else:
    print(f"ALLOWED: {incoming_ip}")

# For comparison, the same test against a list scans items one by one:
# if incoming_ip in blocked_list:
>>>Output
BLOCKED: 192.168.1.100
Set Membership
  • O(1) constant time
  • Same speed for any size
  • Uses hash table internally
List Membership
  • O(n) linear time
  • Slower as list grows
  • Checks each item sequentially
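
The record-filtering scenario described earlier in this section can be sketched at a small scale. The IDs and records below are generated placeholders; the shape of the code stays the same when the inputs grow to millions of rows.

```python
# Hypothetical data: IDs that are allowed, and incoming records to filter
valid_ids = set(range(0, 10_000, 2))  # the 5,000 even IDs below 10,000
records = [{"id": n} for n in range(20)]

# Each membership test is a single hash lookup, not a scan
kept = [record for record in records if record["id"] in valid_ids]

print(len(kept))
print([record["id"] for record in kept])
```

If the valid IDs arrive as a list, converting them once with set() before the loop is usually enough to get this behavior.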

Set Operations

Sets support mathematical operations that are incredibly useful for data analysis. These operations answer common questions: Which items are in either collection? Which are in both? Which are in one but not the other? Python makes these operations fast and intuitive.

Union combines two sets, giving you all unique elements from both. Intersection finds elements that appear in both sets. Difference finds elements that are in the first set but not the second. Each operation runs in time roughly proportional to the size of the sets involved, making them efficient for large datasets.

# Users who viewed each product category
electronics_viewers = {"alice", "bob", "charlie", "diana"}
clothing_viewers = {"bob", "diana", "eve", "frank"}

# Union: users who viewed either category
all_viewers = electronics_viewers | clothing_viewers
print("All viewers:", all_viewers)

# Intersection: users who viewed both categories
both_categories = electronics_viewers & clothing_viewers
print("Viewed both:", both_categories)

# Difference: only electronics, not clothing
electronics_only = electronics_viewers - clothing_viewers
print("Electronics only:", electronics_only)
>>>Output
All viewers: {'alice', 'frank', 'charlie', 'eve', 'diana', 'bob'}
Viewed both: {'diana', 'bob'}
Electronics only: {'alice', 'charlie'}
These operations are perfect for analyzing user behavior, comparing data sets, or finding overlaps between categories. In data engineering, you might use intersection to find customers who appear in both yesterday's and today's data, or difference to find new customers who did not exist yesterday.
Python Quiz

> Find how many viewers visited both product categories and how many visited at least one. Pick the set operator for each question.

a = {"alice", "bob", "charlie"}
b = {"bob", "charlie", "diana"}
overlap = a ___ b
combined = a ___ b
print(len(overlap))
print(len(combined))
&
^
-
|
Set operations are often the most readable way to express data comparisons. Instead of nested loops that check membership, a single operator like & or - communicates the intent immediately and executes much faster.
The difference operator (-) is not symmetric. A - B and B - A give different results. Always think about which set should be the "reference" set and which one you are subtracting from it.
TIP
When creating an empty set, always use set() not {}. Python interprets {} as an empty dictionary, which will cause an AttributeError when you try to call set methods like .add().
Debug Challenge

> This code uses {} to create an empty set, but Python interprets {} as an empty dictionary. Calling .add() on a dict raises an AttributeError.

AttributeError: 'dict' object has no attribute 'add'

Set Operation Symbols
  • | Union - elements in either set
  • & Intersection - elements in both sets
  • - Difference - elements in first but not second
  • ^ Symmetric difference - elements in either but not both
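
The sketch below illustrates the two points that are easiest to get wrong: difference depends on operand order, and symmetric difference does not. The customer names are invented for the example.

```python
yesterday = {"alice", "bob", "charlie"}
today = {"bob", "charlie", "diana"}

new_customers = today - yesterday   # in today only
lost_customers = yesterday - today  # in yesterday only
changed = yesterday ^ today         # in exactly one of the two

print(new_customers)
print(lost_customers)
print(sorted(changed))  # sorted() gives a stable order for display
```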
Sets are the right choice whenever your data has no meaningful order and you care about uniqueness. Converting a list to a set is a single operation that deduplicates every element in one pass.

frozensets are the immutable equivalent of sets. They support the same non-mutating operations as sets (membership tests, union, intersection, difference) but cannot be modified after creation, making them hashable and usable as dictionary keys or as elements of other sets.
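
One place a frozenset key is handy is when the key is an unordered pair. The route costs below are hypothetical; the point of the sketch is that both directions of a route resolve to the same entry.

```python
# Hypothetical example: shipping cost between two cities, in either direction
route_cost = {
    frozenset({"NYC", "LA"}): 120,
    frozenset({"NYC", "SF"}): 135,
}

# Element order does not matter, so both directions hit the same key
cost = route_cost[frozenset({"LA", "NYC"})]
print(cost)

pair = frozenset({"NYC", "SF"})
print(hasattr(pair, "add"))  # frozensets have no mutating methods
```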

Choosing the Right Type


Match any problem to its best structure

Selecting the right data structure is one of the most important skills in programming. The choice affects code clarity, performance, and correctness. A well-chosen data structure makes code simpler and faster. A poorly chosen one leads to complex workarounds and performance problems.
The good news is that choosing becomes intuitive with practice. After working with these four structures for a while, you will instinctively know which one fits each situation. Until then, use a systematic decision framework to guide your choices.

The Decision Framework

Start by asking these questions about your data: Does order matter? Are duplicates allowed? Do you need to look up items by a key? Should the data be modifiable after creation? Your answers point directly to the right structure.
Think about what operations you will perform most frequently. If you primarily access by index, use a list. If you primarily access by key, use a dictionary. If you primarily check membership, use a set. If you need the data to be immutable, use a tuple.
  • list: ordered and changeable - use when sequence matters and items may be added or removed
  • tuple: ordered and frozen - use when data must never change after creation
  • dict: key-value pairs - use when you need to look up values by a unique identifier
  • set: unique elements only - use when duplicates are unwanted and you need fast membership checks

Real-World Scenarios

Let us apply this framework to common data engineering scenarios. In each case, the data requirements point clearly to one structure. With practice, you will recognize these patterns instantly and choose the right structure without conscious thought.
# Scenario 1: Processing a queue of tasks
# Need: ordered, will remove items as processed
task_queue = ["validate", "transform", "load"]

# Scenario 2: Database connection parameters
# Need: unchangeable configuration
db_config = ("prod-server", 5432, "main_db")

# Scenario 3: User session data
# Need: lookup by session ID
sessions = {
    "sess_abc": {"user": "alice", "expires": 3600},
    "sess_xyz": {"user": "bob", "expires": 1800}
}

# Scenario 4: Track processed record IDs
# Need: unique values, fast membership checks
processed_ids = {1001, 1002, 1003}

print("Task queue:", task_queue)
print("DB config:", db_config)
print("Sessions:", len(sessions), "active")
print("Processed:", len(processed_ids), "records")
>>>Output
Task queue: ['validate', 'transform', 'load']
DB config: ('prod-server', 5432, 'main_db')
Sessions: 2 active
Processed: 3 records

Data Pipeline Patterns

In data engineering, you often combine multiple structures. Here is a pattern you will see frequently: using a set for deduplication while building a list of results.
# Processing events while tracking unique users
events = [
    {"user_id": "u1", "action": "click"},
    {"user_id": "u2", "action": "view"},
    {"user_id": "u1", "action": "purchase"},
    {"user_id": "u3", "action": "click"},
    {"user_id": "u2", "action": "click"},
]

# Fast duplicate checking
seen_users = set()
# Preserve order of first occurrence
unique_events = []

for event in events:
    user_id = event["user_id"]
    if user_id not in seen_users:
        seen_users.add(user_id)
        unique_events.append(event)

print(f"Total events: {len(events)}")
print(f"Unique users: {len(seen_users)}")
print(f"First events: {unique_events}")
>>>Output
Total events: 5
Unique users: 3
First events: [{'user_id': 'u1', 'action': 'click'}, {'user_id': 'u2', 'action': 'view'}, {'user_id': 'u3', 'action': 'click'}]
This set-plus-list pattern appears everywhere in data processing code. You will use it to deduplicate user events, filter records to unique entries, or ensure that each item is processed exactly once. The key insight is that sets are perfect for tracking what you have seen, while lists are perfect for storing the results in order.
TIP
This pattern combines the fast membership testing of sets with the ordered collection of lists. It is O(n) overall, instead of the O(n²) you would get from checking uniqueness with only a list.

Common Mistakes

Even experienced developers make these mistakes. Learning to recognize and avoid them will save you hours of debugging. Some cause immediate errors, while others silently produce incorrect results or slow performance.
Do
  • Use set() for membership tests
  • Use .get() for safe dict access
  • Build new lists with comprehensions
  • Use tuples for fixed data
Don't
  • Search large lists with "in"
  • Access dict keys with [] blindly
  • Modify a list while looping it
  • Use {} for an empty set

Lists for Membership Tests

One of the most common performance mistakes is using a list when you need frequent membership tests. For small lists with a few dozen items, this is fine - the performance difference is negligible. But for thousands or millions of items, the difference becomes dramatic and can make your code unusably slow.
The symptom is code that works correctly but takes far longer than it should. If you find yourself waiting minutes for something that should take seconds, check whether you are doing membership tests against a list. Converting to a set is often all it takes to fix the problem.
Wrong
  • blocked = ["ip1", "ip2", ...]
  • if ip in blocked: # Slow!
  • Checks every item
Correct
  • blocked = {"ip1", "ip2", ...}
  • if ip in blocked: # Fast!
  • Instant hash lookup

Empty Dict vs Empty Set

A common gotcha: {} creates an empty dictionary, not an empty set. To create an empty set, use set().

# WRONG - this creates a dict, not a set!
wrong = {}
print("Type of {}:", type(wrong))

# CORRECT - use set() for empty set
correct = set()
print("Type of set():", type(correct))

# Non-empty sets use curly braces fine
also_correct = {1, 2, 3}
print("Type of {1,2,3}:", type(also_correct))
>>>Output
Type of {}: <class 'dict'>
Type of set(): <class 'set'>
Type of {1,2,3}: <class 'set'>
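
In practice this mistake usually surfaces the first time you call a set method on the value. A small sketch, using a hypothetical `seen` variable of the kind you would use for deduplication:

```python
seen = {}  # intended as an empty set, but this is actually a dict
try:
    seen.add("u1")  # dicts have no .add() method
except AttributeError as error:
    print("Bug surfaced:", error)

seen = set()  # the correct way to start with an empty set
seen.add("u1")
print(seen)
```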

Modifying Tuples

Forgetting that tuples are immutable leads to a TypeError. If you need to "change" a tuple, you must create a new one with the desired values.
# WRONG - tuples cannot be modified
coords = (10, 20)
# coords[0] = 15  # TypeError!

# CORRECT - create a new tuple
coords = (15, coords[1])
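
If you want to see the error without stopping the script, a sketch like this catches it and then builds the replacement tuple. The variable name `moved` is just for illustration.

```python
coords = (10, 20)

try:
    coords[0] = 15  # item assignment is not allowed on tuples
except TypeError as error:
    print("Cannot modify tuple:", error)

# Unpack the old tuple and build a new one with the changed value
x, y = coords
moved = (15, y)
print(moved)
```

The original `coords` tuple is left untouched; `moved` is a separate object.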

KeyError on Missing Keys

Accessing a dictionary key that does not exist raises a KeyError, which crashes your program if nothing catches it. This is one of the most common runtime errors in Python. When processing external data like JSON from APIs or user input, you can never be certain all expected keys are present.

The solution is defensive coding. Always use the .get() method when a key might be missing, or check for key existence with the in operator before accessing. The .get() method is usually cleaner because it returns a default value in a single expression.

data = {"name": "Alice", "age": 30}

# WRONG - crashes if key missing
# data["city"]  # KeyError

# CORRECT - .get() returns a default
city = data.get("city", "Unknown")
print("City:", city)

# ALSO CORRECT - check first
if "city" in data:
    print(data["city"])
else:
    print("City not provided")
>>>Output
City: Unknown
City not provided
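
The same idea applies when looping over records parsed from JSON. The sketch below uses a made-up payload and field names purely for illustration. It shows .get() with an explicit default, and without one, in which case it returns None.

```python
import json

# Made-up payload for illustration; the second record has no "city" key
payload = '[{"name": "Alice", "city": "Lisbon"}, {"name": "Bob"}]'
records = json.loads(payload)

cities = []
for record in records:
    # .get() returns the default instead of raising KeyError
    cities.append(record.get("city", "Unknown"))

print(cities)

# With no default argument, .get() returns None for a missing key
print(records[1].get("city"))
```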

Mutating During Iteration

Modifying a list while looping over it produces surprising results. When you remove an item, everything after it shifts left, so the loop skips the element that slid into the freed position; inserting during the loop can cause elements to be visited twice. This is a notorious source of subtle bugs.
# WRONG - modifying while iterating
# (happens to give [1, 3, 5] here, but only because the skipped
# elements, 3 and 5, are odd; other inputs leave evens behind)
numbers = [1, 2, 3, 4, 5]
for n in numbers:
    if n % 2 == 0:
        numbers.remove(n)

# CORRECT - create a new list
numbers = [1, 2, 3, 4, 5]
numbers = [n for n in numbers if n % 2 != 0]

# ALSO CORRECT - iterate over a copy
numbers = [1, 2, 3, 4, 5]
# [:] creates a shallow copy
for n in numbers[:]:
    if n % 2 == 0:
        numbers.remove(n)

The safest approach is to build a new list using a list comprehension, which is also more readable and often faster. If you must modify in place, iterate over a copy of the list using the slice notation [:] to create a shallow copy.
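
The skipping is easiest to see when even numbers sit next to each other. In this small sketch (the input values are chosen for illustration), every number is even, so a correct filter should leave nothing behind.

```python
# Illustrative input: all even, so the goal is an empty list
mutated = [2, 4, 6, 8]
for n in mutated:
    if n % 2 == 0:
        mutated.remove(n)  # shifts later items left by one position

print("Mutated in place:", mutated)

# The comprehension builds a new list and never skips anything
source = [2, 4, 6, 8]
filtered = [n for n in source if n % 2 != 0]
print("Comprehension:", filtered)
```

The in-place loop leaves [4, 8] behind: after 2 is removed, 4 slides into index 0 and the loop moves on to index 1, which now holds 6. The comprehension returns an empty list as intended.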

Fill in the Blank

> You have a list [1, 2, 3, 4, 5] and need to remove the even numbers. Pick the approach that filters safely without mutating the list during iteration.

numbers = [1, 2, 3, 4, 5]
result = 
print(result)

Shallow vs Deep Copy

When you assign a list to a new variable, both variables point to the same list in memory. Modifying one affects the other. This is called aliasing, and it surprises many developers. If you want a truly independent copy, you need to explicitly create one.
# GOTCHA - assignment creates an alias, not a copy
original = [1, 2, 3]
alias = original
alias.append(4)
print("Original after alias modification:", original)

# CORRECT - create an actual copy
# (named "copied" so it does not clash with the copy module)
original = [1, 2, 3]
copied = original[:]
copied.append(4)
print("Original after copy modification:", original)
print("Copy:", copied)
>>>Output
Original after alias modification: [1, 2, 3, 4]
Original after copy modification: [1, 2, 3]
Copy: [1, 2, 3, 4]
TIP
For nested structures like lists of lists, you need a deep copy using copy.deepcopy() from the copy module. Shallow copies only copy the outer container, not the inner objects.
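
A short sketch of the difference, using a made-up nested list. The slice copies only the outer list, so both copies of the outer list still share the same inner lists.

```python
import copy

# Illustrative nested data: a list of lists
grid = [[1, 2], [3, 4]]

shallow = grid[:]            # new outer list, same inner lists
deep = copy.deepcopy(grid)   # new outer list and new inner lists

grid[0].append(99)           # mutate an inner list of the original

print("Shallow:", shallow)
print("Deep:", deep)
```

The shallow copy shows the appended 99 because its first element is the very same inner list as the original's. The deep copy is unaffected.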
Understanding data structures is essential for organizing information effectively in your programs. Put these fundamentals to the test with hands-on challenges in the Python Builder.
PUTTING IT ALL TOGETHER

> You are a junior data engineer at Airbnb writing a pipeline that ingests a nightly batch of booking events, deduplicates listing IDs, maps each host to their revenue, and stores immutable rate-tier boundaries that must never change mid-run.

list holds the ordered sequence of incoming booking events because processing order matters and duplicate bookings for the same listing are valid.
dict maps each host ID to their accumulated revenue so per-host lookups remain O(1) regardless of how many hosts are in the batch.
set deduplicates listing IDs seen in the run so already-processed listings are skipped without an O(n) membership scan of the events list.
tuple stores the immutable rate-tier boundaries so no downstream transformation step can accidentally mutate the pricing thresholds mid-run.
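
A minimal sketch of how the four structures could fit together in such a pipeline. The event fields, IDs, amounts, and tier boundaries are all hypothetical and do not reflect any real Airbnb schema.

```python
# Hypothetical nightly batch; field names and values are invented
events = [
    {"listing_id": "L1", "host_id": "H1", "amount": 120.0},
    {"listing_id": "L2", "host_id": "H2", "amount": 80.0},
    {"listing_id": "L1", "host_id": "H1", "amount": 200.0},
]

# tuple: rate-tier boundaries that must not change mid-run (assumed values)
RATE_TIERS = (100.0, 250.0, 500.0)

revenue_by_host = {}    # dict: host ID -> accumulated revenue
seen_listings = set()   # set: listing IDs already encountered
unique_listings = []    # list: listing IDs in first-seen order

for event in events:    # list: processed in arrival order
    host = event["host_id"]
    revenue_by_host[host] = revenue_by_host.get(host, 0.0) + event["amount"]

    listing = event["listing_id"]
    if listing not in seen_listings:   # O(1) membership check
        seen_listings.add(listing)
        unique_listings.append(listing)

# An accidental write to the tiers fails loudly instead of corrupting them
try:
    RATE_TIERS[0] = 0.0
except TypeError:
    print("Rate tiers are read-only")

print(revenue_by_host)
print(unique_listings)
```

Both bookings for listing L1 count toward the host's revenue, while the listing ID itself is recorded only once.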
KEY TAKEAWAYS
Lists are ordered, mutable, and allow duplicates - use for sequences
Tuples are ordered, immutable - use for fixed data and dict keys
Dicts map keys to values - use for lookups and structured data
Sets store unique elements - use for deduplication and fast membership
Use set instead of list for "in" checks on large collections
Use .get() for safe dictionary access when keys might be missing
Empty set is set(), not {} (which creates a dict)
Combine structures strategically: sets for deduplication, lists for order
Choose based on: order needed? duplicates allowed? key lookups? mutability?

The right container for your data

Category: Python
Difficulty: Beginner
Duration: 42 minutes
Challenges: 0 hands-on challenges

Topics covered: Lists: Ordered Collections, Tuples: Immutable Sequences, Dicts: Key-Value Storage, Sets: Unique Collections, Choosing the Right Type
