Data Structures: Beginner

Friendster collapsed under its own success because it stored friend relationships in arrays and performed full scans to check connections, making the site grind to a halt as its user base grew. Redis chose hash maps as its core data structure, achieving constant-time key lookups that scale to millions of queries per second without a performance cliff. The data structure you choose in the first week of a project determines whether your application thrives or crashes under growth. This lesson teaches you how to match the right data structure to the right problem before it becomes a production crisis.

Lists: Ordered Collections

Store and retrieve ordered data

Lists are Python's workhorse data structure. They hold items in a specific order, allow duplicates, and can grow or shrink as needed. When you receive a batch of records from an API, process rows from a CSV file, or collect results from a database query, you typically work with lists. Lists are by far the most commonly used data structure in Python.

What makes lists so versatile is their flexibility. They can hold any type of data - numbers, strings, other lists, dictionaries, or custom objects. You can mix types within the same list, though in practice keeping types consistent makes code easier to understand and maintain.

Create a list using square brackets []. Items are separated by commas. The order you specify is the order they are stored, and that order is preserved throughout the life of the list.

	# A list of transaction amounts
	transactions = [150.00, 89.99, 42.50, 200.00, 89.99]
	print("Transactions:", transactions)
	print("Count:", len(transactions))

	print("First transaction:", transactions[0])
	print("Last transaction:", transactions[-1])

>>>Output

Transactions: [150.0, 89.99, 42.5, 200.0, 89.99]

Count: 5

First transaction: 150.0

Last transaction: 89.99

Indexing is how you access individual elements. The first element is at index 0, the second at index 1, and so on. Python also supports negative indexing: -1 refers to the last element, -2 to the second-to-last, and so on. This is incredibly useful when you need to access elements from the end of a list without knowing its length.

Lists are mutable, meaning you can modify them after creation. You can append new items, insert at specific positions, remove items, and change existing values. This flexibility makes lists ideal for building up results incrementally, which is a common pattern in data processing.

	# Building a list of processed records
	processed_ids = []

	# Simulate processing incoming data
	for record_id in [101, 102, 103]:
	    processed_ids.append(record_id)
	    print(f"Processed record {record_id}")

	print("All processed:", processed_ids)

>>>Output

Processed record 101

Processed record 102

Processed record 103

All processed: [101, 102, 103]

The append method adds an item to the end of the list. This is one of the most common operations you will perform. Lists grow dynamically - you do not need to specify a size upfront, and Python handles the memory management for you. This makes lists perfect for situations where you do not know in advance how many items you will have.

	# Slicing: extract portions of a list
	metrics = [10, 20, 30, 40, 50, 60, 70]

	# Get first three elements
	print("First three:", metrics[0:3])

	# Get elements from index 2 up to (but not including) index 5
	print("Middle:", metrics[2:5])

	# Get last three elements
	print("Last three:", metrics[-3:])

	# Skip every other element
	print("Every other:", metrics[::2])

>>>Output

First three: [10, 20, 30]

Middle: [30, 40, 50]

Last three: [50, 60, 70]

Every other: [10, 30, 50, 70]

Slicing is a powerful feature that lets you extract portions of a list using the syntax list[start:stop:step]. The start index is included, but the stop index is excluded. If you omit start, it defaults to the beginning. If you omit stop, it goes to the end. The optional step parameter lets you skip elements.

Fill in the Blank

> You have a list of five metrics [10, 20, 30, 40, 50] and need to extract the middle three values. Pick the slice that captures exactly those elements.

metrics = [10, 20, 30, 40, 50]
print(metrics[___:___])

Lists are the go-to choice whenever your data has a natural ordering. Here are the scenarios where a list is usually the best fit.

Order matters - First in, first out processing where sequence is meaningful
Duplicates allowed - Track every occurrence, like repeated transactions or events
Dynamic size - Items will be added or removed as your program runs
Position-based access - Retrieve the 1st, 5th, or last item by numeric index instantly
Incremental building - Append results one at a time inside a processing loop

Essential List Methods

Lists come with a rich set of built-in methods for manipulation. Understanding these methods lets you write cleaner, more efficient code. Here are the most commonly used methods that you will encounter in almost every Python project.

	scores = [85, 92, 78, 95, 88]

	# Add elements
	scores.append(90)
	scores.insert(0, 100)
	print("After adding:", scores)

	# Remove elements
	scores.remove(78)
	last = scores.pop()
	print("After removing:", scores)
	print("Popped value:", last)

	# Find and count
	print("Index of 95:", scores.index(95))
	print("Count of 92:", scores.count(92))

>>>Output

After adding: [100, 85, 92, 78, 95, 88, 90]

After removing: [100, 85, 92, 95, 88]

Popped value: 90

Index of 95: 3

Count of 92: 1

Python Quiz

> Add an element to the end of a list, then remove and capture the last element. Pick the method that grows the list and the one that shrinks it while returning the removed value.

data = [10, 20, 30, 40]
data.___(50)
last = data.___()
print(last)
print(len(data))

append

insert

extend

pop

remove

List Performance Overview

Understanding performance helps you write efficient code. Lists excel at accessing items by index and appending to the end - both operations happen in O(1) constant time, meaning they are equally fast regardless of list size. However, searching for a specific value requires checking each item one by one, which becomes slow for large lists.

Inserting or removing from the beginning or middle of a list is slow because all subsequent elements must be shifted. If you frequently need to add or remove from both ends, consider using a deque from the collections module instead.
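As a minimal sketch of the deque suggestion above (the task names are made up for illustration), collections.deque can add and remove items at both ends without shifting the remaining elements:

```python
from collections import deque

# Hypothetical task names, used only for illustration
tasks = deque(["validate", "transform", "load"])

tasks.appendleft("extract")   # add to the front without shifting other items
tasks.append("report")        # add to the back, same as a list

first = tasks.popleft()       # remove from the front
last = tasks.pop()            # remove from the back

print("First:", first)
print("Last:", last)
print("Remaining:", list(tasks))
```

A deque gives up fast access to middle elements by index, so it suits queue-like workloads rather than general-purpose storage.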

•Fast Operations

Access by index: list[0]
Append to end: list.append(x)
Pop from end: list.pop()
Get length: len(list)

•Slow Operations

Search: x in list
Insert at start: list.insert(0, x)
Remove by value: list.remove(x)
Insert in middle

TIP

If you frequently need to check whether an item exists in a collection, consider using a set instead of a list. Sets are optimized for membership testing and can check membership in constant time.

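The implementation of Python lists explains why some operations are fast and others are not. In CPython, the standard interpreter, a list is a dynamic array: a contiguous block of references with some spare capacity at the end. Indexing and appending are fast because they touch one slot, while inserting or removing near the front is slow because every later element must shift.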
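One way to observe this behavior, assuming the CPython interpreter (other implementations may report sizes differently), is to watch the memory size of a list while appending to it. The size grows in occasional jumps rather than on every append, which suggests that spare capacity is reserved ahead of time:

```python
import sys

numbers = []
sizes_seen = []   # distinct memory sizes observed while appending

for value in range(1000):
    numbers.append(value)
    size = sys.getsizeof(numbers)
    # Record the size only when it differs from the previous one
    if not sizes_seen or size != sizes_seen[-1]:
        sizes_seen.append(size)

print("Appends performed:", len(numbers))
print("Distinct sizes observed:", len(sizes_seen))
```

The exact numbers depend on the Python version, so treat them as an illustration rather than a guarantee.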

Tuples: Immutable Sequences

Lock down data that must not change

Tuples look similar to lists but have one critical difference: they cannot be changed after creation. Once you create a tuple, you cannot add, remove, or modify its elements. This immutability is not a limitation - it is a feature that makes your code safer and more predictable.

Think about data that should never change: database connection parameters, geographic coordinates, RGB color values, or API response codes. If you accidentally modify such data, bugs can be extremely difficult to track down. Tuples prevent this entire category of bugs by making modification impossible.

Create a tuple using parentheses () or just commas. Access elements the same way as lists, using square bracket indexing. You can iterate over tuples, slice them, and use all the read-only operations that work on lists.

	# Database connection settings (should never change)
	db_config = ("prod-db.company.com", 5432, "analytics")
	print("Host:", db_config[0])
	print("Port:", db_config[1])
	print("Database:", db_config[2])

	# Geographic coordinates
	location = (37.7749, -122.4194)
	print(f"Latitude: {location[0]}, Longitude: {location[1]}")

>>>Output

Host: prod-db.company.com

Port: 5432

Database: analytics

Latitude: 37.7749, Longitude: -122.4194

Notice that we still use square brackets to access tuple elements, just like lists. The difference is only in what operations are allowed. Reading is fine; writing is forbidden. Attempting to modify a tuple raises a TypeError, and Python enforces this at runtime.

	coordinates = (10, 20)
	# Uncommenting the next line raises:
	# TypeError: 'tuple' object does not support item assignment
	# coordinates[0] = 15

	# If you need to "change" a tuple, create a new one
	new_coords = (15, coordinates[1])

If you find yourself needing to "modify" a tuple, the solution is to create a new tuple with the desired values. This pattern is common in functional programming and ensures that any code holding a reference to the original tuple is not affected by your changes.

Tuples Instead of Lists?

Immutability provides several important benefits that make tuples worth using even though lists are more flexible. Understanding these benefits helps you make informed decisions about which structure to use.

First, immutability prevents bugs. If data should not change, making it a tuple ensures it cannot change accidentally - not by your code, not by library code, not by anyone. This is a form of defensive programming that catches errors at runtime rather than letting them silently corrupt your data.

Second, tuples are hashable as long as every element they contain is hashable, which means they can be used as dictionary keys. Lists cannot be dictionary keys because they are mutable - if you could change a list after using it as a key, the dictionary could no longer find the entry. This makes tuples essential for composite keys such as coordinate pairs.

Third, tuples are slightly more memory-efficient and faster to create than lists. For small sequences that you create many times, this can add up. Python can also optimize certain operations on tuples because it knows they will not change.
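The memory difference can be checked with sys.getsizeof. This is a rough sketch that assumes CPython; the exact byte counts vary by version and platform, so no specific numbers are claimed here:

```python
import sys

as_list = [10, 20, 30]
as_tuple = (10, 20, 30)

list_size = sys.getsizeof(as_list)
tuple_size = sys.getsizeof(as_tuple)

print("List bytes:", list_size)
print("Tuple bytes:", tuple_size)
print("Tuple is smaller:", tuple_size < list_size)
```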

	# Tuples as dictionary keys (lists cannot do this)
	# Map city coordinates to population
	city_populations = {
	    (40.7128, -74.0060): 8_336_817,
	    (34.0522, -118.2437): 3_979_576,
	    (37.7749, -122.4194): 873_965,
	}

	sf_coords = (37.7749, -122.4194)
	print(f"SF population: {city_populations[sf_coords]:,}")

>>>Output

SF population: 873,965
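For contrast, here is a small sketch of what happens when a list is used as a key. The lookup dictionary is a hypothetical name for illustration, and the error text is captured rather than quoted because its wording differs between Python versions:

```python
lookup = {}

# A tuple key works because tuples of hashable values are hashable
lookup[(37.7749, -122.4194)] = "San Francisco"

# A list key fails because lists are mutable and therefore unhashable
try:
    lookup[[37.7749, -122.4194]] = "San Francisco"
    list_key_error = None
except TypeError as error:
    list_key_error = str(error)

print("Keys stored:", len(lookup))
print("List key error:", list_key_error)
```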

•Use Lists When

Items will be added/removed
Order may change (sorting)
Building results incrementally

•Use Tuples When

Data should never change
Need dictionary keys
Returning multiple values

Tuple Unpacking

One of the most elegant features of tuples is unpacking: assigning tuple elements to multiple variables in a single statement. This makes code cleaner and more readable, especially when functions return multiple values. Unpacking is so useful that Python developers use it constantly.

The number of variables on the left must match the number of elements in the tuple. Python also supports extended unpacking with the star operator, letting you capture multiple elements into a list while unpacking the rest into individual variables.

	# Unpacking a tuple into separate variables
	user_record = ("alice_42", "alice@example.com", 28)
	username, email, age = user_record

	print(f"User: {username}")
	print(f"Email: {email}")
	print(f"Age: {age}")

	def get_min_max(numbers):
	    return (min(numbers), max(numbers))

	data = [45, 23, 67, 12, 89, 34]
	minimum, maximum = get_min_max(data)
	print(f"Range: {minimum} to {maximum}")

>>>Output

User: alice_42

Email: alice@example.com

Age: 28

Range: 12 to 89
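The matching rule for unpacking can be seen by deliberately providing too few names. A minimal sketch, reusing the same record; the error text is captured rather than quoted because its wording may vary by Python version:

```python
user_record = ("alice_42", "alice@example.com", 28)

# Two names for three elements: Python raises ValueError instead of guessing
try:
    username, email = user_record
    mismatch_error = None
except ValueError as error:
    mismatch_error = str(error)

print("Error:", mismatch_error)

# An underscore is a conventional name for a position you do not need
username, _, age = user_record
print(username, age)
```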

Tuple unpacking is especially common when iterating over dictionary items or when working with functions that return multiple values. The enumerate function, for example, returns tuples of (index, value) that you typically unpack in a for loop.

	# Unpacking in loops - very common pattern
	scores = [85, 92, 78, 95]
	for index, score in enumerate(scores):
	    print(f"Position {index}: {score}")

	# Extended unpacking with *
	first, *middle, last = [1, 2, 3, 4, 5]
	print(f"First: {first}, Middle: {middle}, Last: {last}")

>>>Output

Position 0: 85

Position 1: 92

Position 2: 78

Position 3: 95

First: 1, Middle: [2, 3, 4], Last: 5

Debug Challenge

> This code tries to change the first element of a tuple, but tuples are immutable and do not support item assignment.

TypeError: 'tuple' object does not support item assignment

point = (3, 7)
point[0] = 5
print(point)

Immutability also makes tuples safe to share across threads without locking. Because no code can modify a tuple after creation, multiple threads can read the same tuple simultaneously without risk of data corruption.

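When designing functions that return multiple values, tuples are the idiomatic choice. The built-in divmod() returns a (quotient, remainder) tuple, enumerate() and zip() produce a tuple on each iteration, and many standard library functions follow the same convention so that you can unpack the result at the call site.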
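A short sketch of the pattern: divmod() is a built-in that returns a two-element tuple, and summarize() is a hypothetical helper, written here only for illustration, that packs three results into one tuple.

```python
def summarize(numbers):
    """Return (minimum, maximum, average) as a single tuple."""
    return min(numbers), max(numbers), sum(numbers) / len(numbers)

# divmod() returns a (quotient, remainder) tuple
minutes, seconds = divmod(125, 60)
print(f"{minutes} min {seconds} s")

# Unpack the three results at the call site
low, high, average = summarize([45, 23, 67, 12, 89, 34])
print(low, high, average)
```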

Dicts: Key-Value Storage

Look up any value by key instantly

Dictionaries are one of Python's most powerful and frequently used data structures. They store key-value pairs, allowing you to look up values by their keys instantly. Think of a dictionary like a real dictionary: you look up a word (key) to find its definition (value). The difference is that Python dictionaries can use almost any immutable type as a key, not just strings.

In data engineering, dictionaries are absolutely everywhere. JSON responses from APIs are dictionaries. Configuration files parse into dictionaries. Database rows are often represented as dictionaries. Caches use dictionaries. Environment variables are accessed through dictionaries. Mastering dictionaries is essential for any Python developer.

Create a dictionary using curly braces {} with key-value pairs separated by colons. Keys must be hashable, which in practice means immutable types such as strings, numbers, or tuples of immutable values. Values can be anything - including other dictionaries, lists, or custom objects.

	# User profile from an API response
	user = {
	    "user_id": "u_12345",
	    "name": "Sarah Chen",
	    "email": "sarah@example.com",
	    "is_premium": True,
	    "login_count": 47
	}

	print("Name:", user["name"])
	print("Premium status:", user["is_premium"])
	print("Total keys:", len(user))

>>>Output

Name: Sarah Chen

Premium status: True

Total keys: 5

Notice how the dictionary uses descriptive string keys like "user_id" and "name" instead of numeric indices. This makes your code self-documenting - you can tell exactly what each value represents just by looking at its key. Compare user["email"] to user[2] - the dictionary version is much clearer.

The magic of dictionaries is constant-time lookup. Whether your dictionary has 10 items or 10 million, finding a value by its key takes the same amount of time. This is fundamentally different from lists, where searching requires checking each item one by one. This performance characteristic makes dictionaries ideal for building lookup tables and caches.

	# Building a lookup table for fast access
	# Map product IDs to prices
	price_lookup = {
	    "SKU001": 29.99,
	    "SKU002": 49.99,
	    "SKU003": 19.99,
	    "SKU004": 99.99,
	}

	# Instant lookup - no searching required
	product_id = "SKU003"
	if product_id in price_lookup:
	    print(f"Price for {product_id}: {price_lookup[product_id]}")

>>>Output

Price for SKU003: 19.99

This lookup table pattern is extremely common. Instead of searching through a list of products to find the price for SKU003, we go directly to it using the key. In a production system processing millions of lookups, this difference between constant time and linear search time is the difference between a responsive application and a slow one.
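To make the contrast concrete, here is a rough sketch of both approaches side by side. The catalog list and the find_price_in_list helper are hypothetical names introduced for illustration:

```python
# Hypothetical product catalog stored as a list of (sku, price) pairs
catalog = [
    ("SKU001", 29.99),
    ("SKU002", 49.99),
    ("SKU003", 19.99),
    ("SKU004", 99.99),
]

def find_price_in_list(pairs, sku):
    """Linear search: checks pairs one by one until the SKU matches."""
    for candidate, price in pairs:
        if candidate == sku:
            return price
    return None

# Build the lookup table once, then reuse it for every lookup
price_lookup = dict(catalog)

print("List search:", find_price_in_list(catalog, "SKU003"))
print("Dict lookup:", price_lookup.get("SKU003"))
```

Both calls return the same price. The difference is that the list search does more work as the catalog grows, while the dictionary lookup does not.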

Iterating Over Dictionaries

You often need to loop through dictionary contents. Python provides several ways to iterate: over just keys, just values, or both key-value pairs together. The items() method returns both, which is usually what you want.

	metrics = {"cpu": 45.2, "memory": 72.8, "disk": 58.1}

	# Iterate over keys (default behavior)
	print("Metrics tracked:")
	for metric_name in metrics:
	    print(f"  - {metric_name}")

	# Iterate over both keys and values
	print("\nCurrent values:")
	for name, value in metrics.items():
	    print(f"  {name}: {value}%")

>>>Output

Metrics tracked:

  - cpu

  - memory

  - disk

Current values:

  cpu: 45.2%

  memory: 72.8%

  disk: 58.1%

Modifying Dictionaries

Dictionaries are mutable. You can add new key-value pairs, update existing values, and remove entries. Adding a new key or updating an existing one uses the same syntax: assignment with square brackets. If the key exists, the value is updated; if not, a new entry is created.

	metrics = {"requests": 1000, "errors": 5}
	print("Initial:", metrics)

	# Add new key
	metrics["latency_ms"] = 45.2
	print("After adding:", metrics)

	# Update existing key
	metrics["requests"] = 1050
	print("After update:", metrics)

	# Remove a key
	del metrics["errors"]
	print("After delete:", metrics)

>>>Output

Initial: {'requests': 1000, 'errors': 5}

After adding: {'requests': 1000, 'errors': 5, 'latency_ms': 45.2}

After update: {'requests': 1050, 'errors': 5, 'latency_ms': 45.2}

After delete: {'requests': 1050, 'latency_ms': 45.2}

Safe Key Access

Accessing a key that does not exist raises a KeyError. To handle missing keys gracefully, use the get() method, which returns None (or a default value you specify) instead of raising an error.

	config = {"host": "localhost", "port": 8080}

	# Risky: raises KeyError if key missing
	# timeout = config["timeout"] # KeyError!

	# Safe: returns None if key missing
	timeout = config.get("timeout")
	print("Timeout:", timeout)

	# Safe with default value
	timeout = config.get("timeout", 30)
	print("Timeout with default:", timeout)

>>>Output

Timeout: None

Timeout with default: 30

TIP

Always use .get() when a key might not exist. It prevents crashes and makes your code more robust against unexpected data.

Nested Dictionaries

Dictionary values can be any type, including other dictionaries. This creates nested structures, which are extremely common when working with JSON data from APIs. Accessing nested values requires chaining bracket notation or using multiple get() calls for safety.

	# Nested dictionary representing API response
	user_data = {
	    "profile": {
	        "name": "Alice Chen",
	        "settings": {"theme": "dark", "notifications": True}
	    },
	    "stats": {"posts": 42, "followers": 1250}
	}

	# Access nested values
	print("Name:", user_data["profile"]["name"])
	print("Theme:", user_data["profile"]["settings"]["theme"])

	# Safe nested access
	language = user_data.get("profile", {}).get("settings", {}).get("language", "en")
	print("Language:", language)

>>>Output

Name: Alice Chen

Theme: dark

Language: en

The chained get() pattern is verbose but safe. Each intermediate get() falls back to an empty dictionary if its key is missing, allowing the chain to continue without raising an error. The final get() returns the default value if any part of the path does not exist.
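When the same nested lookups appear in many places, one option is to wrap the chain in a small helper. The get_nested function below is a hypothetical sketch, not a built-in or standard library function:

```python
def get_nested(data, keys, default=None):
    """Follow a sequence of keys through nested dicts; return default on any miss."""
    current = data
    for key in keys:
        # Stop as soon as the path leaves dictionary territory or a key is absent
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

user_data = {
    "profile": {"name": "Alice Chen", "settings": {"theme": "dark"}},
}

print(get_nested(user_data, ["profile", "settings", "theme"]))
print(get_nested(user_data, ["profile", "settings", "language"], "en"))
print(get_nested(user_data, ["stats", "posts"], 0))
```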

Fill in the Blank

> You have a user dictionary with a nested "profile" containing "name". Pick the access method and fallback that safely navigates the nested structure.

user = {"profile": {"name": "Alice"}}
result = user.___("profile", ___).___("name", "unknown")
print(result)

Dictionaries are the right tool whenever you need to associate keys with values. These are the most common patterns.

LOOKUP - ID-based access: find users by ID instantly
STRUCT - Named fields: organized data with keys
COUNT - Tally items: count word occurrences
CACHE - Store results: reuse computed answers
JSON - Parse configs: read structured API data

Dictionaries shine when your data needs meaningful labels. They make code self-documenting: user["email"] communicates intent far more clearly than user[1], especially when reviewing code written months ago.
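Two of these patterns, counting and caching, can be sketched in a few lines. The word list and the cached_square helper are made up for illustration, and the squaring stands in for any expensive computation:

```python
# COUNT: tally occurrences, using .get() so missing words start at 0
words = ["error", "ok", "error", "timeout", "ok", "error"]
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
print("Counts:", counts)

# CACHE: remember results so repeated inputs are computed only once
cache = {}

def cached_square(n):
    """Return n squared, computing it only the first time n is seen."""
    if n not in cache:
        cache[n] = n * n   # stand-in for an expensive computation
    return cache[n]

results = [cached_square(n) for n in [4, 4, 5, 4]]
print("Results:", results)
print("Distinct inputs computed:", len(cache))
```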

Sets: Unique Collections

Eliminate duplicates and test membership

Sets are unordered collections of unique elements. When you add a duplicate to a set, it simply ignores it - no error, no warning, just silent deduplication. This makes sets perfect for eliminating duplicates, tracking unique visitors, and performing mathematical set operations like unions and intersections.

Unlike lists and tuples, sets do not maintain any particular order. The elements are stored based on their hash values, which optimizes for fast operations rather than sequence. This trade-off is worth it because sets excel at two critical operations: adding elements and checking if an element exists.

Create a set using curly braces {} with elements (not key-value pairs) or the set() function. Important gotcha: {} alone creates an empty dictionary, not an empty set. Use set() for an empty set.

	# Track unique visitors
	visitor_ids = {"user_1", "user_2", "user_3", "user_1", "user_2"}
	print("Unique visitors:", visitor_ids)
	print("Count:", len(visitor_ids))

	# Convert list with duplicates to unique set
	page_views = ["home", "products", "home", "checkout", "products", "home"]
	unique_pages = set(page_views)
	print("Unique pages visited:", unique_pages)

>>>Output

Unique visitors: {'user_3', 'user_1', 'user_2'}

Count: 3

Unique pages visited: {'checkout', 'products', 'home'}

Notice that the order of elements in a set is not guaranteed. Sets prioritize fast operations over maintaining insertion order. When we created the set with duplicate user IDs, the duplicates were automatically removed, leaving only three unique values. This automatic deduplication is incredibly useful when processing data.

If you need both uniqueness and order, use a dictionary whose values are all None. The dict.fromkeys() method builds one directly from any iterable, and because dictionaries preserve insertion order as of Python 3.7, the keys come out in first-occurrence order.
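A minimal sketch of order-preserving deduplication, reusing the page view data from the earlier example:

```python
page_views = ["home", "products", "home", "checkout", "products", "home"]

# dict.fromkeys() keeps one key per distinct item, in first-seen order
unique_in_order = list(dict.fromkeys(page_views))

print(unique_in_order)
```

>>>Output

['home', 'products', 'checkout']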

Fast Membership Testing

The killer feature of sets is O(1) membership testing. Checking if an element exists in a set takes the same time whether the set has 100 elements or 100 million. This makes sets absolutely essential for any operation involving "is this item in my collection?" - a question that comes up constantly in data processing.

Consider a scenario where you need to filter a million records, keeping only those that match a list of 10,000 valid IDs. With a list, each of the million records requires checking up to 10,000 IDs - that is potentially 10 billion comparisons. With a set, each record requires exactly one lookup, giving you one million operations total. The difference can be hours versus seconds.

	# Blacklist of blocked IP addresses
	blocked_ips = {"192.168.1.100", "10.0.0.50", "172.16.0.1"}

	# Check incoming requests
	incoming_ip = "192.168.1.100"

	if incoming_ip in blocked_ips:
	    print(f"BLOCKED: {incoming_ip}")
	else:
	    print(f"ALLOWED: {incoming_ip}")

	# The same check against a list would scan the elements one by one:
	# if incoming_ip in blocked_list:

>>>Output

BLOCKED: 192.168.1.100

•Set Membership

O(1) constant time
Same speed for any size
Uses hash table internally

•List Membership

O(n) linear time
Slower as list grows
Checks each item sequentially
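The filtering scenario can be tried at a smaller scale. This sketch uses made-up ID ranges and time.perf_counter for rough timing. The timings vary by machine, so only the equality of the two results is meant to be relied on:

```python
import time

valid_ids_list = list(range(0, 4000, 2))   # 2,000 even IDs (made-up data)
valid_ids_set = set(valid_ids_list)
records = list(range(5000))                # 5,000 record IDs to filter

# Membership test against a list: scans elements until a match is found
start = time.perf_counter()
kept_with_list = [r for r in records if r in valid_ids_list]
list_seconds = time.perf_counter() - start

# Membership test against a set: one hash lookup per record
start = time.perf_counter()
kept_with_set = [r for r in records if r in valid_ids_set]
set_seconds = time.perf_counter() - start

print("Same result:", kept_with_list == kept_with_set)
print(f"List: {list_seconds:.4f}s, Set: {set_seconds:.4f}s")
```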

Set Operations

Sets support mathematical operations that are incredibly useful for data analysis. These operations answer common questions: Which items are in either collection? Which are in both? Which are in one but not the other? Python makes these operations fast and intuitive.

Union combines two sets, giving you all unique elements from both. Intersection finds elements that appear in both sets. Difference finds elements that are in the first set but not the second. These operations run in O(n) time, making them efficient for large datasets.

	# Users who viewed each product category
	electronics_viewers = {"alice", "bob", "charlie", "diana"}
	clothing_viewers = {"bob", "diana", "eve", "frank"}

	# Union: users who viewed either category
	all_viewers = electronics_viewers | clothing_viewers
	print("All viewers:", all_viewers)

	# Intersection: users who viewed both categories
	both_categories = electronics_viewers & clothing_viewers
	print("Viewed both:", both_categories)

	# Difference: only electronics, not clothing
	electronics_only = electronics_viewers - clothing_viewers
	print("Electronics only:", electronics_only)

>>>Output

All viewers: {'alice', 'frank', 'charlie', 'eve', 'diana', 'bob'}

Viewed both: {'diana', 'bob'}

Electronics only: {'alice', 'charlie'}

These operations are perfect for analyzing user behavior, comparing data sets, or finding overlaps between categories. In data engineering, you might use intersection to find customers who appear in both yesterday's and today's data, or difference to find new customers who did not exist yesterday.
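The yesterday-versus-today comparison might look like the following sketch. The customer IDs are hypothetical, and sorted() is used only to make the printed order predictable:

```python
# Hypothetical customer IDs from two daily snapshots
yesterday = {"c1", "c2", "c3", "c4"}
today = {"c3", "c4", "c5"}

returning = yesterday & today       # present on both days
new_customers = today - yesterday   # present today only
churned = yesterday - today         # present yesterday only
changed = yesterday ^ today         # present on exactly one of the days

print("Returning:", sorted(returning))
print("New:", sorted(new_customers))
print("Churned:", sorted(churned))
print("Changed:", sorted(changed))
```

>>>Output

Returning: ['c3', 'c4']

New: ['c5']

Churned: ['c1', 'c2']

Changed: ['c1', 'c2', 'c5']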

Python Quiz

> Find how many viewers visited both product categories and how many visited at least one. Pick the set operator for each question.

a = {"alice", "bob", "charlie"}
b = {"bob", "charlie", "diana"}
overlap = a ___ b
combined = a ___ b
print(len(overlap))
print(len(combined))

Set operations are often the most readable way to express data comparisons. Instead of nested loops that check membership, a single operator like & or - communicates the intent immediately and executes much faster.

The difference operator (-) is not symmetric. A - B and B - A give different results. Always think about which set should be the "reference" set and which one you are subtracting from it.

TIP

When creating an empty set, always use set() not {}. Python interprets {} as an empty dictionary, which will cause an AttributeError when you try to call set methods like .add().

Debug Challenge

> This code uses {} to create an empty set, but Python interprets {} as an empty dictionary. Calling .add() on a dict raises an AttributeError.

AttributeError: 'dict' object has no attribute 'add'

unique_ids = {}
unique_ids.add("user_1")
print(type(unique_ids))

Set Operation Symbols

| Union - elements in either set
& Intersection - elements in both sets
- Difference - elements in first but not second
^ Symmetric difference - elements in either but not both

Sets are the right choice whenever your data has no meaningful order and you care about uniqueness. Converting a list to a set with set(items) deduplicates every element in a single pass.

frozensets are the immutable equivalent of sets. They support the same read-only operations as sets (membership tests, union, intersection, difference) but cannot be modified after creation, which makes them hashable and usable as dictionary keys or as elements of other sets.
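As an illustration, a frozenset can serve as a dictionary key when the key is an unordered group of values. The shipping rates below are invented for the example:

```python
# Hypothetical shipping rates keyed by an unordered pair of regions
shipping_rates = {
    frozenset({"US", "CA"}): 12.50,
    frozenset({"US", "MX"}): 15.00,
}

# Order does not matter: {"CA", "US"} equals {"US", "CA"}
rate = shipping_rates[frozenset({"CA", "US"})]
print("Rate:", rate)

# Mutating methods such as add() do not exist on a frozenset
regions = frozenset({"US", "CA"})
print("Has add():", hasattr(regions, "add"))
```

>>>Output

Rate: 12.5

Has add(): False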

Choosing the Right Type

Match any problem to its best structure

Selecting the right data structure is one of the most important skills in programming. The choice affects code clarity, performance, and correctness. A well-chosen data structure makes code simpler and faster. A poorly chosen one leads to complex workarounds and performance problems.

The good news is that choosing becomes intuitive with practice. After working with these four structures for a while, you will instinctively know which one fits each situation. Until then, use a systematic decision framework to guide your choices.

The Decision Framework

Start by asking these questions about your data: Does order matter? Are duplicates allowed? Do you need to look up items by a key? Should the data be modifiable after creation? Your answers point directly to the right structure.

Think about what operations you will perform most frequently. If you primarily access by index, use a list. If you primarily access by key, use a dictionary. If you primarily check membership, use a set. If you need the data to be immutable, use a tuple.

list: Ordered and changeable - use when sequence matters and items may be added or removed
tuple: Ordered and frozen - use when data must never change after creation
dict: Key-value pairs - use when you need to look up values by a unique identifier
set: Unique elements only - use when duplicates are unwanted and you need fast membership checks

Real-World Scenarios

Let us apply this framework to common data engineering scenarios. In each case, the data requirements point clearly to one structure. With practice, you will recognize these patterns instantly and choose the right structure without conscious thought.

	# Scenario 1: Processing a queue of tasks
	# Need: ordered, will remove items as processed
	task_queue = ["validate", "transform", "load"]

	# Scenario 2: Database connection parameters
	# Need: unchangeable configuration
	db_config = ("prod-server", 5432, "main_db")

	# Scenario 3: User session data
	# Need: lookup by session ID
	sessions = {
	    "sess_abc": {"user": "alice", "expires": 3600},
	    "sess_xyz": {"user": "bob", "expires": 1800}
	}

	# Scenario 4: Track processed record IDs
	# Need: unique values, fast membership checks
	processed_ids = {1001, 1002, 1003}

	print("Task queue:", task_queue)
	print("DB config:", db_config)
	print("Sessions:", len(sessions), "active")
	print("Processed:", len(processed_ids), "records")

>>>Output

Task queue: ['validate', 'transform', 'load']

DB config: ('prod-server', 5432, 'main_db')

Sessions: 2 active

Processed: 3 records

Data Pipeline Patterns

In data engineering, you often combine multiple structures. Here is a pattern you will see frequently: using a set for deduplication while building a list of results.

	# Processing events while tracking unique users
	events = [
	    {"user_id": "u1", "action": "click"},
	    {"user_id": "u2", "action": "view"},
	    {"user_id": "u1", "action": "purchase"},
	    {"user_id": "u3", "action": "click"},
	    {"user_id": "u2", "action": "click"},
	]

	# Fast duplicate checking
	seen_users = set()
	# Preserve order of first occurrence
	unique_events = []

	for event in events:
	    user_id = event["user_id"]
	    if user_id not in seen_users:
	        seen_users.add(user_id)
	        unique_events.append(event)

	print(f"Total events: {len(events)}")
	print(f"Unique users: {len(seen_users)}")
	print(f"First events: {unique_events}")

>>>Output

Total events: 5

Unique users: 3

First events: [{'user_id': 'u1', 'action': 'click'}, {'user_id': 'u2', 'action': 'view'}, {'user_id': 'u3', 'action': 'click'}]

This set-plus-list pattern appears everywhere in data processing code. You will use it to deduplicate user events, filter records to unique entries, or ensure that each item is processed exactly once. The key insight is that sets are perfect for tracking what you have seen, while lists are perfect for storing the results in order.

TIP

This pattern combines the fast membership testing of sets with the ordered collection of lists. It is O(n) overall, instead of the O(n²) you would get from checking uniqueness with only a list.

Common Mistakes

Even experienced developers make these mistakes. Learning to recognize and avoid them will save you hours of debugging. Some cause immediate errors, while others silently produce incorrect results or slow performance.

✓Do

Use set() for membership tests
Use .get() for safe dict access
Build new lists with comprehensions
Use tuples for fixed data

✗Don't

Search large lists with "in"
Access dict keys with [] blindly
Modify a list while looping it
Use {} for an empty set
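The "Modify a list while looping it" item is easy to underestimate because it raises no error. A small sketch with made-up numbers shows how removing items during iteration silently skips elements, and how a comprehension avoids the problem:

```python
values = [2, 4, 6, 8, 1]

# Risky: each removal shifts the remaining items left, so the loop skips some
buggy = list(values)
for value in buggy:
    if value % 2 == 0:
        buggy.remove(value)
print("Modified in place:", buggy)

# Safer: build a new list with a comprehension
odds_only = [value for value in values if value % 2 != 0]
print("Comprehension:", odds_only)
```

>>>Output

Modified in place: [4, 8, 1]

Comprehension: [1]

The in-place loop was meant to remove every even number, yet 4 and 8 survive because the loop never visits them.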

Lists for Membership Tests

One of the most common performance mistakes is using a list when you need frequent membership tests. For small lists with a few dozen items, this is fine - the performance difference is negligible. But for thousands or millions of items, the difference becomes dramatic and can make your code unusably slow.

The symptom is code that works correctly but takes far longer than it should. If you find yourself waiting minutes for something that should take seconds, check whether you are doing membership tests against a list. Converting to a set is often all it takes to fix the problem.

•Wrong

blocked = ["ip1", "ip2", ...]
if ip in blocked: # Slow!
Checks every item

•Correct

blocked = {"ip1", "ip2", ...}
if ip in blocked: # Fast!
Instant hash lookup
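To see the gap on your own machine, you can time both lookups with the standard timeit module. This is a rough sketch, not a rigorous benchmark: the 50,000 identifiers are made up, and the exact numbers will vary by machine and Python version.

```python
import timeit

# Hypothetical blocklist: 50,000 made-up identifiers
blocked_list = [f"ip{i}" for i in range(50_000)]
blocked_set = set(blocked_list)  # one-time O(n) conversion

target = "ip49999"  # worst case for the list: the last item

# Run each membership test 100 times and report the total time
list_seconds = timeit.timeit(lambda: target in blocked_list, number=100)
set_seconds = timeit.timeit(lambda: target in blocked_set, number=100)

print(f"list: {list_seconds:.6f}s for 100 lookups")
print(f"set:  {set_seconds:.6f}s for 100 lookups")
```

The conversion to a set costs one pass over the data, so it pays off when you run many membership tests against the same collection.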

Empty Dict vs Empty Set

A common gotcha: {} creates an empty dictionary, not an empty set. To create an empty set, use set().

	# WRONG - this creates a dict, not a set!
	wrong = {}
	print("Type of {}:", type(wrong))

	# CORRECT - use set() for empty set
	correct = set()
	print("Type of set():", type(correct))

	# Non-empty sets use curly braces fine
	also_correct = {1, 2, 3}
	print("Type of {1,2,3}:", type(also_correct))

>>>Output

Type of {}: <class 'dict'>

Type of set(): <class 'set'>

Type of {1,2,3}: <class 'set'>

Modifying Tuples

Forgetting that tuples are immutable leads to TypeError. If you need to "change" a tuple, you must create a new one with the desired values.

	# WRONG - tuples cannot be modified
	coords = (10, 20)
	# coords[0] = 15  # TypeError: 'tuple' object does not support item assignment

	# CORRECT - create a new tuple
	coords = (15, coords[1])

KeyError on Missing Keys

Accessing a dictionary key that does not exist crashes your program with a KeyError. This is one of the most common runtime errors in Python. When processing external data like JSON from APIs or user input, you can never be certain all expected keys are present.

The solution is defensive coding. Always use the .get() method when a key might be missing, or check for key existence with the in operator before accessing. The .get() method is usually cleaner because it returns a default value in a single expression.

	data = {"name": "Alice", "age": 30}

	# WRONG - crashes if key missing
	# data["city"] # KeyError

	# CORRECT - use .get() with a default
	city = data.get("city", "Unknown")
	print("City:", city)

	# ALSO CORRECT - check first
	if "city" in data:
	    print(data["city"])
	else:
	    print("City not provided")

>>>Output

City: Unknown

City not provided
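The same defensive habit applies when you loop over records from an external source. The sketch below uses a made-up JSON string to stand in for an API response in which one record lacks the "city" key.

```python
import json

# Made-up payload standing in for an API response; the second record has no "city"
payload = '[{"name": "Alice", "city": "Paris"}, {"name": "Bob"}]'
records = json.loads(payload)

# .get() supplies a default instead of raising KeyError
cities = [record.get("city", "Unknown") for record in records]
print(cities)
```

If you call .get() without a second argument, it returns None for a missing key, which is useful when you want to handle the absence explicitly later.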

Mutating During Iteration

Modifying a list while looping over it leads to surprising behavior. Elements get skipped (or, if you insert items, processed more than once) because the loop's internal position no longer lines up with the shifted list. This is a notorious source of subtle bugs.

	# WRONG - modifying while iterating
	numbers = [1, 2, 3, 4, 5]
	for n in numbers:
	    if n % 2 == 0:
	        numbers.remove(n)

	# CORRECT - create a new list
	numbers = [1, 2, 3, 4, 5]
	numbers = [n for n in numbers if n % 2 != 0]

	# ALSO CORRECT - iterate over a copy
	numbers = [1, 2, 3, 4, 5]
	# [:] creates a shallow copy
	for n in numbers[:]:
	    if n % 2 == 0:
	        numbers.remove(n)

The safest approach is to build a new list using a list comprehension, which is also more readable and often faster. If you must modify in place, iterate over a copy of the list using the slice notation [:] to create a shallow copy.
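With the input [1, 2, 3, 4, 5], the in-place loop happens to leave [1, 3, 5], because no two even numbers sit next to each other. The bug becomes visible as soon as the items to remove are adjacent. The sketch below uses a different, made-up input to show the skipping:

```python
# Adjacent even numbers expose the skipping problem
leftover = [2, 4, 6, 8]
for n in leftover:
    if n % 2 == 0:
        leftover.remove(n)  # shifts later items left; the loop then skips one
print("After in-place removal:", leftover)

# The comprehension examines every element exactly once
numbers = [2, 4, 6, 8]
filtered = [n for n in numbers if n % 2 != 0]
print("After comprehension:", filtered)
```

The in-place loop leaves [4, 8] behind: after 2 is removed, 4 slides into the position the loop has already visited, so it is never checked. The comprehension correctly returns an empty list.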

Fill in the Blank

> You have a list [1, 2, 3, 4, 5] and need to remove the even numbers. Pick the approach that filters safely without mutating the list during iteration.

numbers = [1, 2, 3, 4, 5]
result = 
print(result)

Shallow vs Deep Copy

When you assign a list to a new variable, both variables point to the same list in memory. Modifying one affects the other. This is called aliasing, and it surprises many developers. If you want a truly independent copy, you need to explicitly create one.

	# GOTCHA - assignment creates an alias, not a copy
	original = [1, 2, 3]
	alias = original
	alias.append(4)
	print("Original after alias modification:", original)

	# CORRECT - create an actual copy
	original = [1, 2, 3]
	copied = original[:]
	copied.append(4)
	print("Original after copy modification:", original)
	print("Copy:", copied)

>>>Output

Original after alias modification: [1, 2, 3, 4]

Original after copy modification: [1, 2, 3]

Copy: [1, 2, 3, 4]

TIP

For nested structures like lists of lists, you need a deep copy using copy.deepcopy() from the copy module. Shallow copies only copy the outer container, not the inner objects.
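A short sketch of the difference, using a made-up nested list:

```python
import copy

grid = [[1, 2], [3, 4]]

shallow = grid[:]            # new outer list, same inner lists
deep = copy.deepcopy(grid)   # new outer list and new inner lists

grid[0].append(99)           # mutate an inner list of the original

print("Original:", grid)
print("Shallow: ", shallow)
print("Deep:    ", deep)
```

The shallow copy shows the appended 99 because its first element is the very same inner list as the original's. The deep copy is unaffected.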

Understanding data structures is essential for organizing information effectively in your programs. Put these fundamentals to the test with hands-on challenges in the Python Builder.

❯❯❯PUTTING IT ALL TOGETHER

> You are a junior data engineer at Airbnb writing a pipeline that ingests a nightly batch of booking events, deduplicates listing IDs, maps each host to their revenue, and stores immutable rate-tier boundaries that must never change mid-run.

list holds the ordered sequence of incoming booking events because processing order matters and duplicate bookings for the same listing are valid.

dict maps each host ID to their accumulated revenue so per-host lookups remain O(1) regardless of how many hosts are in the batch.

set deduplicates listing IDs seen in the run so already-processed listings are skipped without an O(n) membership scan of the events list.

tuple stores the immutable rate-tier boundaries so no downstream transformation step can accidentally mutate the pricing thresholds mid-run.
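The four roles above could be sketched in a few lines. Everything here is an assumption made for illustration: the field names, the sample events, the tier boundaries, and the tier_index helper are invented and do not describe any real pipeline.

```python
# All names and values below are invented for illustration
RATE_TIERS = (100.0, 250.0, 500.0)   # tuple: fixed tier boundaries

booking_events = [                    # list: ordered, duplicates allowed
    {"listing_id": "L1", "host_id": "H1", "amount": 120.0},
    {"listing_id": "L2", "host_id": "H2", "amount": 300.0},
    {"listing_id": "L1", "host_id": "H1", "amount": 80.0},
]

revenue_by_host = {}                  # dict: host ID -> accumulated revenue
seen_listings = set()                 # set: unique listing IDs in this run

for event in booking_events:
    host = event["host_id"]
    revenue_by_host[host] = revenue_by_host.get(host, 0.0) + event["amount"]
    seen_listings.add(event["listing_id"])


def tier_index(amount):
    """Count how many tier boundaries the amount has reached."""
    return sum(amount >= boundary for boundary in RATE_TIERS)


host_tiers = {host: tier_index(total) for host, total in revenue_by_host.items()}

print(revenue_by_host)
print(sorted(seen_listings))
print(host_tiers)
```

Each structure does the one job it is best at: the list keeps events in arrival order, the dict accumulates per-host totals, the set collapses repeated listing IDs, and the tuple cannot be altered by accident.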

KEY TAKEAWAYS

Lists are ordered, mutable, and allow duplicates - use for sequences

Tuples are ordered, immutable - use for fixed data and dict keys

Dicts map keys to values - use for lookups and structured data

Sets store unique elements - use for deduplication and fast membership

Use set instead of list for "in" checks on large collections

Use .get() for safe dictionary access when keys might be missing

Empty set is set(), not {} (which creates a dict)

Combine structures strategically: sets for deduplication, lists for order

Choose based on: order needed? duplicates allowed? key lookups? mutability?

The right container for your data

Category: Python
Difficulty: beginner
Duration: 42 minutes
Challenges: 0 hands-on challenges

Topics covered: Lists: Ordered Collections, Tuples: Immutable Sequences, Dicts: Key-Value Storage, Sets: Unique Collections, Choosing the Right Type
