Functions: Advanced

Celery, the distributed task queue used by Shopify and GitHub, uses closures and higher-order functions to let developers define retry logic, rate limiting, and error handling as function wrappers that apply transparently to any task. Shopify's backend processes over 70 million orders per day by passing tasks through chains of higher-order functions without any of the individual business logic functions knowing about retry or rate limiting at all. The closures and function composition patterns in this lesson are what make that architecture possible.

Lambda Functions

Write inline functions for transforms

Sometimes you need a simple function for a single purpose, and defining it with def feels like overkill. Python provides lambda functions for exactly this situation. A lambda is an anonymous function defined in a single expression - no name, no multi-line body, just input and output.

The term "lambda" comes from lambda calculus, a mathematical system for expressing computation developed in the 1930s. In practice, lambdas are simply a concise way to write small functions inline, especially useful when passing functions to other functions. You will see lambdas everywhere in professional Python codebases.

Lambda functions are particularly common in data engineering workflows. When you use pandas to transform columns, filter rows, or apply custom logic, you often need a small function for just that one operation. Writing a full function definition with def, a name, and a return statement feels excessive for something as simple as "multiply by two" or "extract the first character." Lambdas solve this problem elegantly.

Lambda Syntax

A lambda has the form lambda arguments: expression. Unlike regular functions, lambdas automatically return the result of their single expression - no return keyword needed:

	def square(x):
	    return x * x

	# Equivalent lambda expression
	square_lambda = lambda x: x * x

	# Both work exactly the same
	print("Regular:", square(5))
	print("Lambda:", square_lambda(5))

	# Lambda with multiple arguments
	add = lambda a, b: a + b
	print("Add 3 + 7:", add(3, 7))

	# Lambda with no arguments
	get_pi = lambda: 3.14159
	print("Pi:", get_pi())

>>>Output

Regular: 25

Lambda: 25

Add 3 + 7: 10

Pi: 3.14159

Notice how the lambda version is more compact - two lines become one. But this compactness comes with a limitation: lambdas can only contain a single expression. They cannot have multiple statements, loops, or complex control flow. If you need any of those, you must use a regular function definition.

The key distinction is between expressions and statements. An expression evaluates to a value: 2 + 2, x * y, name.upper(). A statement performs an action: if/else blocks, for loops, variable assignments with =. Lambdas can only contain expressions. This constraint keeps them simple and predictable - you always know a lambda will return a single computed value.
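
To make the expression/statement distinction concrete, here is a small sketch. The names classify and shout are illustrative only. A conditional expression (a if cond else b) is allowed in a lambda because it evaluates to a value, while a statement such as return is rejected before the code even runs:

```python
# A conditional *expression* evaluates to a value, so a lambda accepts it.
classify = lambda n: "even" if n % 2 == 0 else "odd"

# Method calls and string concatenation are expressions too.
shout = lambda s: s.upper() + "!"

print(classify(4))
print(classify(7))
print(shout("hi"))

# A statement inside a lambda is a syntax error, caught at compile time.
try:
    compile("lambda x: return x", "<example>", "eval")
except SyntaxError:
    print("SyntaxError: 'return' is a statement, not an expression")
```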

Lambdas with Built-ins

Lambdas shine brightest when passed to built-in functions like sorted(), filter(), and map(). In sorted(), the key parameter accepts a function that extracts a comparison value from each element:

	# Sort users by age (second element)
	users = [("Alice", 30), ("Bob", 25), ("Charlie", 35)]
	by_age = sorted(users, key=lambda user: user[1])
	print("By age:", by_age)

	# Sort strings by length
	words = ["python", "is", "awesome"]
	by_length = sorted(words, key=lambda w: len(w))
	print("By length:", by_length)

	# Sort by absolute value
	numbers = [-5, 2, -1, 4, -3]
	by_abs = sorted(numbers, key=lambda x: abs(x))
	print("By absolute:", by_abs)

	# Reverse sort by second element
	data = [("A", 100), ("B", 50), ("C", 75)]
	descending = sorted(data, key=lambda x: x[1], reverse=True)
	print("Descending:", descending)

>>>Output

By age: [('Bob', 25), ('Alice', 30), ('Charlie', 35)]

By length: ['is', 'python', 'awesome']

By absolute: [-1, 2, -3, 4, -5]

Descending: [('A', 100), ('C', 75), ('B', 50)]

Without the lambda, Python would compare tuples element by element. The lambda lets you control exactly what gets compared - age, length, absolute value, or any computed property. This pattern is so common that you will use it almost every time you sort anything more complex than a simple list of numbers or strings.

The key insight is that the key parameter expects a function that takes one element and returns a comparable value. Python calls this function once for each element, then sorts based on the returned values. The lambda is the perfect tool for defining this extraction logic inline.
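
A minimal sketch to confirm that behavior: the hypothetical length_key function below records every element it is asked about, so you can see that it runs once per element rather than once per comparison.

```python
calls = []  # records every element the key function sees

def length_key(word):
    """Key function: log the call, then return the comparison value."""
    calls.append(word)
    return len(word)

words = ["python", "is", "awesome", "and", "fun"]
result = sorted(words, key=length_key)

print("Sorted:", result)
print("Key calls:", len(calls))  # one call per element
```

Because sorted() is stable, "and" stays ahead of "fun" even though both have length 3.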

Lambdas for Transformation

Data engineers frequently use lambdas with map() and filter() to transform collections. These patterns translate directly to pandas and Spark operations:

	# Transform: extract values from records
	records = [{"id": 1, "value": 100}, {"id": 2, "value": 200}]
	values = list(map(lambda r: r["value"], records))
	print("Values:", values)

	# Filter: keep only positive numbers
	numbers = [10, -5, 20, -3, 15, -8]
	positive = list(filter(lambda x: x > 0, numbers))
	print("Positive:", positive)

	# Combine: filter then transform
	prices = [10.50, 25.00, 5.99, 50.00, 15.00]
	# Get prices over 15, then apply 10% discount
	discounted = list(map(
	    lambda p: round(p * 0.9, 2),
	    filter(lambda p: p > 15, prices)
	))
	print("Discounted (>15):", discounted)

>>>Output

Values: [100, 200]

Positive: [10, 20, 15]

Discounted (>15): [22.5, 45.0]

TIP

List comprehensions often replace map() and filter() in Python. But lambdas are still essential when working with pandas .apply() or Spark .map() operations.
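
As a sketch of the comprehension alternative, the filter-then-discount example can be written as a single comprehension. It reuses the same prices list and should produce the same result as the map()/filter() version:

```python
prices = [10.50, 25.00, 5.99, 50.00, 15.00]

# map() + filter() version, as in the example above
with_lambdas = list(map(lambda p: round(p * 0.9, 2),
                        filter(lambda p: p > 15, prices)))

# Equivalent comprehension: the "if" filters, the leading expression transforms
with_comprehension = [round(p * 0.9, 2) for p in prices if p > 15]

print(with_comprehension)
print("Same result:", with_lambdas == with_comprehension)
```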

Understanding map() and filter() with lambdas prepares you for data frameworks that use the same concepts at scale. In Apache Spark, you write transformations like rdd.map(lambda x: x * 2) that get distributed across a cluster. In pandas, you write df["column"].apply(lambda x: x.upper()). The syntax is nearly identical - master it once, use it everywhere.

When to Use Lambdas

Lambdas are powerful but can hurt readability if overused. Here's how to decide:

•Use Lambdas

Simple one-line operations
Immediate, one-time use
Sorting keys and callbacks
Quick transformations in .apply()
Obvious logic that needs no name

•Use Named Functions

Complex logic or multiple steps
Reused in multiple places
Needs documentation or testing
Debugging is important
Others need to understand it

Try using different lambda expressions as the sorting key. See how each one changes the sort order of the data.

Fill in the Blank

> You have a list of (name, age) tuples and want to sort them. Pick a lambda key to control whether they are ordered by name, age, or name length.

data = [("Bob", 30), ("Alice", 25), ("Eve", 35)]
result = sorted(data, key=lambda x: ___)
print(result)

Common Lambda Pitfall

Creating lambdas in a loop is a classic Python gotcha. The lambda captures the variable itself, not its current value, and only looks the variable up when the lambda is called:

	# BROKEN: All lambdas use the final value of i
	funcs = []
	for i in range(3):
	    # Captures reference to i, not value
	    funcs.append(lambda: i)

	# All return 2 (the final value of i)
	print("Broken:", [f() for f in funcs])

	# FIXED: Use default argument to capture current value
	funcs_fixed = []
	for i in range(3):
	    # Captures current value of i
	    funcs_fixed.append(lambda i=i: i)

	print("Fixed:", [f() for f in funcs_fixed])

>>>Output

Broken: [2, 2, 2]

Fixed: [0, 1, 2]

Using i=i as a default argument works because default values are evaluated once, at the moment the lambda is created, so each lambda stores the value i had at that point. This gotcha is a common interview question.
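
If the i=i default feels too implicit, a function factory or functools.partial gives the same early binding. This is a sketch of those alternatives; make_getter and identity are hypothetical helper names.

```python
from functools import partial

def make_getter(value):
    """Factory: each call creates a new scope holding its own value."""
    return lambda: value

via_factory = [make_getter(i) for i in range(3)]

def identity(x):
    return x

# partial() stores the argument at creation time
via_partial = [partial(identity, i) for i in range(3)]

print("Factory:", [f() for f in via_factory])
print("Partial:", [f() for f in via_partial])
```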

Functions as Objects

Pass and store functions as values

In Python, functions are "first-class citizens" - they are objects like integers, strings, or lists. You can store them in variables, pass them to other functions, return them from functions, and store them in data structures. This concept might seem abstract at first, but it unlocks powerful patterns that are fundamental to Python programming.

Understanding first-class functions is crucial for callbacks in async code, strategy patterns in pipeline design, and decorator patterns used throughout Python frameworks. When you use Flask to define a route with @app.route, when you register event handlers in a GUI, when you configure pandas aggregations - all of these rely on treating functions as values.

Most importantly for data engineers, this concept is the foundation of functional programming patterns. Writing code that transforms data by composing functions - rather than mutating state - leads to pipelines that are easier to test, debug, and parallelize. Understanding functions as objects is the first step toward this style.
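
As a rough illustration of composing functions instead of mutating state, the sketch below chains small transformations into one pipeline. The compose helper and the step names are assumptions made for this example, not part of the standard library.

```python
def compose(*steps):
    """Return a function that applies each step in order, left to right."""
    def pipeline(value):
        for step in steps:
            value = step(value)
        return value
    return pipeline

def replace_spaces(text):
    return text.replace(" ", "_")

# Functions are passed around as ordinary values
clean_column_name = compose(str.strip, str.lower, replace_spaces)

print(clean_column_name("  Order Total "))
```

Each step can be tested on its own, and changing the pipeline means changing the list of steps rather than editing a loop body.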

Functions Are Values

A function name without parentheses refers to the function object itself. greet is the function, while greet() calls it:

	def greet(name):
	    return "Hello, " + name

	# Assign function to variable
	say_hello = greet

	# Both names now refer to the same function
	print(greet("Alice"))
	print(say_hello("Bob"))

	# Prove they're the same object
	print("Same function?", greet is say_hello)

	# Functions have attributes
	print("Name:", greet.__name__)
	print("Type:", type(greet))

>>>Output

Hello, Alice

Hello, Bob

Same function? True

Name: greet

Type: <class 'function'>

Functions in Collections

Since functions are objects, you can store them in lists, dictionaries, or any data structure. This enables powerful dispatch patterns:

	def add(a, b):
	    return a + b

	def subtract(a, b):
	    return a - b

	def multiply(a, b):
	    return a * b

	# Dictionary of operations
	operations = {
	    "+": add,
	    "-": subtract,
	    "*": multiply,
	}

	# Dispatch: call the right function
	def calculate(a, op, b):
	    if op in operations:
	        return operations[op](a, b)
	    return "Unknown operation"

	print("10 + 5 =", calculate(10, "+", 5))
	print("10 - 5 =", calculate(10, "-", 5))
	print("10 * 5 =", calculate(10, "*", 5))

>>>Output

10 + 5 = 15

10 - 5 = 5

10 * 5 = 50

This dispatch pattern replaces long if/elif chains. Adding a new operation means adding one dictionary entry - no changes to calculate(). This is the "open-closed principle" in action: open for extension, closed for modification.

Data engineers use this pattern constantly. Imagine processing different file formats: instead of a giant if/elif checking for CSV, JSON, Parquet, etc., you maintain a dictionary mapping format names to loader functions. Adding support for a new format means adding one entry to the dictionary. The main code never changes.
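
A minimal sketch of that loader registry, using only the standard library. The LOADERS dictionary and the load function are hypothetical names, and the loaders parse text rather than open files to keep the example self-contained.

```python
import csv
import io
import json

def load_csv(text):
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def load_json(text):
    """Parse JSON text into Python objects."""
    return json.loads(text)

# Hypothetical registry: format name -> loader function
LOADERS = {
    "csv": load_csv,
    "json": load_json,
}

def load(fmt, text):
    """Look up the loader for fmt and call it."""
    loader = LOADERS.get(fmt)
    if loader is None:
        raise ValueError(f"Unsupported format: {fmt}")
    return loader(text)

print(load("csv", "id,name\n1,Alice\n2,Bob\n"))
print(load("json", '[{"id": 1, "name": "Alice"}]'))
```

Supporting another format would mean writing one loader and adding one entry to LOADERS; load() itself stays untouched.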

Functions as Arguments

Functions that accept other functions as parameters are called "higher-order functions." They let you customize behavior without changing code:

	def apply_to_all(items, transform):
	    """Apply transform function to each item."""
	    return [transform(item) for item in items]

	def double(x):
	    return x * 2

	def square(x):
	    return x * x

	def make_negative(x):
	    return -abs(x)

	numbers = [1, 2, 3, 4, 5]

	print("Original:", numbers)
	print("Doubled:", apply_to_all(numbers, double))
	print("Squared:", apply_to_all(numbers, square))
	print("Negative:", apply_to_all(numbers, make_negative))

	# Also works with lambdas
	print("Plus 10:", apply_to_all(numbers, lambda x: x + 10))

>>>Output

Original: [1, 2, 3, 4, 5]

Doubled: [2, 4, 6, 8, 10]

Squared: [1, 4, 9, 16, 25]

Negative: [-1, -2, -3, -4, -5]

Plus 10: [11, 12, 13, 14, 15]

Function Factories

Functions can create and return new functions. This is called a "function factory" and is the foundation of decorators and closures:

	def make_multiplier(factor):
	    """Return a function that multiplies its argument by factor."""
	    def multiplier(x):
	        return x * factor
	    return multiplier

	# Create specialized functions
	double = make_multiplier(2)
	triple = make_multiplier(3)
	by_ten = make_multiplier(10)

	# Each remembers its factor
	print("double(5):", double(5))
	print("triple(5):", triple(5))
	print("by_ten(5):", by_ten(5))

	# Create a validator factory
	def make_range_checker(min_val, max_val):
	    def check(value):
	        return min_val <= value <= max_val
	    return check

	valid_percentage = make_range_checker(0, 100)
	print("50 valid %?", valid_percentage(50))
	print("150 valid %?", valid_percentage(150))

>>>Output

double(5): 10

triple(5): 15

by_ten(5): 50

50 valid %? True

150 valid %? False

Each returned function "closes over" its configuration values. The double function always uses factor 2, triple uses 3. This is closure in action - the inner function remembers the outer function's variables even after the outer function has finished executing.

Function factories are incredibly useful for creating configured versions of operations. Need a validator for percentages (0-100) and another for ages (0-150)? Create both from the same make_range_checker factory. Need discount calculators for different customer tiers? Create them from a make_discount function. The pattern eliminates duplicate code while keeping each function simple and focused.
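
The make_discount idea might look like the sketch below. The tier names and rates are invented for illustration.

```python
def make_discount(rate):
    """Return a function that applies a fixed discount rate (0.2 means 20% off)."""
    def apply_discount(price):
        return round(price * (1 - rate), 2)
    return apply_discount

# Hypothetical customer tiers
standard = make_discount(0.0)
silver = make_discount(0.10)
gold = make_discount(0.20)

for name, discount in [("standard", standard), ("silver", silver), ("gold", gold)]:
    print(name, discount(100))
```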

Assign (f = func): store in any variable
Collect ([f1, f2]): store in lists or dicts
Pass as argument (apply(f)): give to other functions
Return (return f): build function factories
Inspect (f.__name__): read function attributes

Python Quiz

> Look up a function from a dispatch dictionary and check its type. Pick the dict method that retrieves a value safely, and the built-in that reveals what kind of object a function is.

ops = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b
}
func = ops.___("+")
result = func(10, 3)
print(result)
print(___(func))

Options: get, pop, keys, type, len

Treating functions as first-class objects is the foundation of Python's flexibility. Once you see functions as values that can be stored and passed around, patterns like callbacks, strategies, and decorators become natural.

Dispatch dictionaries replace long if/elif chains with a data structure. Adding a new operation means adding one entry to the dictionary rather than modifying conditional logic throughout the function.

Function factories create specialized functions with configuration baked in. The returned function closes over its configuration values, making each generated function independent and predictable.

Helper Decomposition

Break large functions into testable parts

Real-world functions often start simple then grow unwieldy. Helper decomposition is the practice of breaking large functions into smaller, focused helpers. Each helper does one thing well, making code easier to test, debug, and maintain.

This pattern is essential in data engineering. An ETL function that extracts, validates, transforms, and loads data should not be a single 200-line function. Breaking it into helpers makes each step testable and the flow clear. When a bug appears, you can quickly identify which helper is responsible.

The principle is "single responsibility" - each function should do one thing and do it well. A function called validate_user_age should only validate age, not also format names or calculate statistics. When functions have single responsibilities, they become reusable building blocks that you can combine in different ways for different tasks.

Signs You Need to Decompose

Certain warning signs indicate that a function has grown too large and should be split into smaller, focused helpers.

Too long

Function exceeds 20-30 lines and is hard to follow at a glance.

Repeated patterns

You see the same logic copied in multiple places within the function.

Multiple tasks

The function validates, transforms, and aggregates all in one body.

Hard to name

You struggle to describe what the function does in a short name.

Complex test setup

Testing the function requires building elaborate mock data and fixtures.

Before: Monolithic Function

Consider this function that processes user records. It does validation, transformation, and aggregation all in one:

	def process_users_bad(users):
	    results = []
	    total_age = 0
	    for user in users:
	        if user.get("name") and user.get("age"):
	            age = user["age"]
	            if isinstance(age, str):
	                age = int(age)
	            if 0 < age < 150:
	                name = user["name"].strip().title()
	                results.append({"name": name, "age": age})
	                total_age += age
	    average = total_age / len(results) if results else 0
	    return {"users": results, "avg": average}

	users = [
	    {"name": "alice", "age": "25"},
	    {"name": "bob", "age": 30},
	    {"name": "", "age": 20},
	    {"name": "charlie", "age": -5},
	]
	print(process_users_bad(users))

>>>Output

{'users': [{'name': 'Alice', 'age': 25}, {'name': 'Bob', 'age': 30}], 'avg': 27.5}

This function works, but it is hard to test individual behaviors. What if age validation rules change? What if name formatting needs adjustment? Changes ripple through the entire function. To test age validation alone, you would need to construct full user dictionaries and parse through the entire output - far too much work for a simple unit test.

Another problem is readability. A new developer reading this code must trace through the entire loop to understand what it does. The business logic (what constitutes a valid age, how names should be formatted) is buried inside procedural code. Extracting these rules into named functions makes them explicit and self-documenting.

After: Decomposed Helpers

Breaking the function into focused helpers makes each piece testable and the main function a clear orchestration:

	def is_valid_user(user):
	    return bool(user.get("name") and user.get("age"))

	def normalize_age(age):
	    if isinstance(age, str):
	        age = int(age)
	    if 0 < age < 150:
	        return age
	    return None

	def transform_user(user):
	    age = normalize_age(user["age"])
	    if age is None:
	        return None
	    # strip() keeps the behavior identical to the monolithic version
	    return {"name": user["name"].strip().title(), "age": age}

	def calculate_average(users):
	    if not users:
	        return 0
	    total = 0
	    for user in users:
	        total += user["age"]
	    return total / len(users)

	def process_users(raw_users):
	    valid_users = []
	    for user in raw_users:
	        if is_valid_user(user):
	            transformed = transform_user(user)
	            if transformed is not None:
	                valid_users.append(transformed)
	    return {"users": valid_users, "avg": calculate_average(valid_users)}

	users = [
	    {"name": "alice", "age": "25"},
	    {"name": "bob", "age": 30},
	    {"name": "", "age": 20},
	    {"name": "charlie", "age": -5},
	]
	print(process_users(users))

>>>Output

{'users': [{'name': 'Alice', 'age': 25}, {'name': 'Bob', 'age': 30}], 'avg': 27.5}

TIP

Each helper can be unit tested independently. Testing normalize_age() with edge cases is simple. Testing the monolithic version requires constructing full user records for every test.
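
To show how small such tests can be, here is a sketch that checks normalize_age() with plain assert statements. The helper is repeated so the snippet runs on its own; in a real project the checks would more likely live in a test framework such as pytest.

```python
def normalize_age(age):
    """Same helper as above: convert to int, return None if out of range."""
    if isinstance(age, str):
        age = int(age)
    if 0 < age < 150:
        return age
    return None

# Each edge case is a single line; no user dictionaries needed
assert normalize_age(30) == 30
assert normalize_age("25") == 25      # strings are converted
assert normalize_age(0) is None       # lower bound is exclusive
assert normalize_age(150) is None     # upper bound is exclusive
assert normalize_age(-5) is None
print("All normalize_age checks passed")
```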

Private Helpers Convention

By convention, helper functions meant only for internal use start with an underscore: _validate_input. This signals "don't call this directly" to other developers:

	def _parse_date(date_str):
	    """Private helper - parse date string."""
	    parts = date_str.split("-")
	    return {"year": int(parts[0]), "month": int(parts[1])}

	def _validate_record(record):
	    """Private helper - check required fields."""
	    required = ["id", "date", "amount"]
	    return all(field in record for field in required)

	def process_transactions(records):
	    """Public function - processes transaction records."""
	    valid = [r for r in records if _validate_record(r)]
	    for record in valid:
	        record["parsed_date"] = _parse_date(record["date"])
	    return valid

	transactions = [
	    {"id": 1, "date": "2024-01-15", "amount": 100},
	    {"id": 2, "amount": 50},
	]
	result = process_transactions(transactions)
	print("Processed:", result)

>>>Output

Processed: [{'id': 1, 'date': '2024-01-15', 'amount': 100, 'parsed_date': {'year': 2024, 'month': 1}}]

The underscore prefix, as in _parse_date, is purely a convention - Python doesn't enforce it. But it's a clear signal that these helpers are implementation details, not part of the public API.

•Well-Decomposed Code

Each function has one purpose
Functions are 5-20 lines
Easy to write unit tests
Changes are localized
Self-documenting through names

•Monolithic Code

Functions do many things
Functions span 100+ lines
Testing requires complex setup
Changes cause ripple effects
Needs extensive comments

Memoization with Dicts

Cache results to skip repeated work

Memoization is caching the results of expensive function calls. When the function is called again with the same arguments, you return the cached result instead of recomputing. This can dramatically improve performance for functions called repeatedly with the same inputs. The name comes from "memo" as in memorandum - you are writing down results for future reference.

Data engineers use memoization constantly. Looking up dimension data, parsing configuration, validating schemas - these operations are often repeated with identical inputs. Caching avoids redundant database queries, file reads, or computations. A function that takes 100ms to query a database can return instantly on subsequent calls with the same parameters.

The key insight is that pure functions - functions that always return the same output for the same input and have no side effects - are perfect candidates for memoization. If get_user_by_id(42) returns the same user object every time, there is no reason to recompute or re-query it. Store the result and reuse it.

Basic Memoization Pattern

The simplest approach uses a dictionary as a cache. Check if the input is in the cache; if not, compute and store the result:

	cache = {}

	def factorial(n):
	    if n in cache:
	        print(f"Cache hit: {n}")
	        return cache[n]
	    print(f"Computing: {n}")
	    result = 1 if n <= 1 else n * factorial(n - 1)
	    cache[n] = result
	    return result

	print("First call:", factorial(5))
	print()
	print("Second call:", factorial(5))

>>>Output

Computing: 5

Computing: 4

Computing: 3

Computing: 2

Computing: 1

First call: 120

Cache hit: 5

Second call: 120

The second call to factorial(5) returns instantly from the cache. For expensive operations like database lookups or API calls, this difference can be massive. Imagine a data pipeline processing a million records, each needing to look up the same hundred configuration values. Without memoization, that is a hundred million lookups. With memoization, it is just a hundred.

Encapsulated Memoization

A cleaner pattern keeps the cache inside the function using a mutable default argument or closure. This avoids polluting the global namespace:

	def get_user_name(user_id, _cache={}):
	    """Lookup user name with built-in cache."""
	    if user_id in _cache:
	        return _cache[user_id]

	    # Simulate expensive database lookup
	    print(f"  DB lookup for user {user_id}")
	    names = {1: "Alice", 2: "Bob", 3: "Charlie"}
	    name = names.get(user_id, "Unknown")
	    _cache[user_id] = name
	    return name

	# First lookups hit the "database"
	print("First lookups:")
	print(get_user_name(1))
	print(get_user_name(2))
	print(get_user_name(1))
	print()
	print("Second round (all cached):")
	print(get_user_name(1))
	print(get_user_name(2))

>>>Output

First lookups:

  DB lookup for user 1

Alice

  DB lookup for user 2

Bob

Alice

Second round (all cached):

Alice

Bob
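
The example above uses the mutable-default approach. The closure variant mentioned at the start of this section might look like the sketch below; make_cached_lookup and fake_db_lookup are hypothetical names.

```python
def make_cached_lookup(fetch):
    """Wrap fetch so each distinct key is fetched only once."""
    cache = {}  # lives in the closure, invisible to outside code

    def lookup(key):
        if key not in cache:
            cache[key] = fetch(key)
        return cache[key]

    return lookup

db_calls = []  # tracks how often the "database" is hit

def fake_db_lookup(user_id):
    """Stand-in for an expensive query."""
    db_calls.append(user_id)
    return {1: "Alice", 2: "Bob"}.get(user_id, "Unknown")

get_user_name = make_cached_lookup(fake_db_lookup)

print(get_user_name(1), get_user_name(2), get_user_name(1))
print("DB calls:", db_calls)
```

Unlike the default-argument version, the cache here cannot be overridden by a caller passing a second argument, and the same wrapper can be reused for any single-argument lookup.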

Memoizing Multi-Arg Calls

For functions with multiple arguments, use a tuple of arguments as the cache key:

	def power(base, exp, _cache={}):
	    """Calculate base^exp with memoization."""
	    key = (base, exp)
	    if key in _cache:
	        return _cache[key]

	    print(f"  Computing {base}^{exp}")
	    result = base ** exp
	    _cache[key] = result
	    return result

	print("Computing powers:")
	print(power(2, 10))
	print(power(3, 5))
	print(power(2, 10))
	print(power(2, 8))
	print(power(3, 5))

>>>Output

Computing powers:

  Computing 2^10

1024

  Computing 3^5

243

1024

  Computing 2^8

256

243

Config Lookup Example

A real-world pattern: caching expensive configuration lookups that happen repeatedly during data processing:

	def get_column_mapping(table_name, _cache={}):
	    """Get column mapping for a table."""
	    if table_name in _cache:
	        return _cache[table_name]

	    # Simulate reading from config file or database
	    print(f"  Loading config for {table_name}")
	    configs = {
	        "users": {"id": "user_id", "name": "user_name"},
	        "orders": {"id": "order_id", "total": "order_total"},
	    }
	    mapping = configs.get(table_name, {})
	    _cache[table_name] = mapping
	    return mapping

	def transform_record(table, record):
	    """Transform a record using cached column mapping."""
	    mapping = get_column_mapping(table)
	    return {mapping.get(k, k): v for k, v in record.items()}

	# Process multiple records - config loaded once
	print("Processing records:")
	records = [
	    {"id": 1, "name": "Alice"},
	    {"id": 2, "name": "Bob"},
	    {"id": 3, "name": "Charlie"},
	]
	for r in records:
	    print(transform_record("users", r))

>>>Output

Processing records:

  Loading config for users

{'user_id': 1, 'user_name': 'Alice'}

{'user_id': 2, 'user_name': 'Bob'}

{'user_id': 3, 'user_name': 'Charlie'}

The config is loaded once on the first record, then cached. Without memoization, processing a million records would mean a million config lookups.

TIP

For production code, consider functools.lru_cache which provides memoization with automatic cache size limits. But understanding dict-based memoization is essential for interviews and custom caching needs.

Python Quiz

> A memoized Fibonacci function checks the cache before computing. Pick the keyword that tests cache membership, and the built-in that counts how many results were cached.

cache = {}

def fib(n):
    if n ___ cache:
        return cache[n]
    if n <= 1:
        return n
    cache[n] = fib(n - 1) + fib(n - 2)
    return cache[n]

print(fib(6))
print(___(cache))

Options: in, not, len, sum

Memoization is most valuable for pure functions - functions that always return the same output for the same input. If a function has side effects or depends on external state, caching its results can cause incorrect behavior.
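
A small sketch of how caching a non-pure function goes wrong. The exchange_rates dictionary and get_rate function are made up for this example; the point is that the cached answer goes stale once the external state changes.

```python
exchange_rates = {"EUR": 1.10}  # external state that can change

def get_rate(currency, _cache={}):
    """Memoized lookup. Not pure: the answer depends on exchange_rates."""
    if currency not in _cache:
        _cache[currency] = exchange_rates[currency]
    return _cache[currency]

first = get_rate("EUR")
exchange_rates["EUR"] = 1.25   # the underlying data changes...
second = get_rate("EUR")       # ...but the cache still returns the old value

print("First:", first)
print("Second:", second, "(stale; current rate is", exchange_rates["EUR"], ")")
```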

The mutable default argument _cache={} persists between calls because Python evaluates default arguments once at definition time. This behavior is normally a pitfall, but for caching it is exploited deliberately to maintain state across calls.

Python's standard library provides functools.lru_cache as a production-quality memoization decorator. Understanding manual dict-based caching first makes it easier to reason about what lru_cache does internally and when to use it.
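
For comparison, here is a minimal sketch of the same Fibonacci memoization using functools.lru_cache. The maxsize of 128 is just the library default written out explicitly.

```python
from functools import lru_cache

@lru_cache(maxsize=128)  # keep at most 128 results, evicting the least recently used
def fib(n):
    """Naive recursive Fibonacci, made fast by caching."""
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(30))
print(fib.cache_info())  # reports hits, misses, maxsize, currsize
```

The decorator builds the cache key from the call arguments, so the arguments must be hashable, just as with the manual dictionary approach.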

Recursion Basics

Traverse nested data of any depth

Recursion is when a function calls itself. This technique elegantly solves problems that can be broken into smaller versions of the same problem. While it might seem strange at first, recursion is natural for tree traversal, nested data processing, and divide-and-conquer algorithms. Once you understand it, certain problems become almost trivial to solve.

Data engineers encounter recursion when traversing nested JSON from APIs, processing file system hierarchies, flattening deeply nested structures, and implementing certain algorithms. Parsing a JSON response with unknown nesting depth? Recursion handles it naturally. Walking a directory tree to find all files matching a pattern? Recursion is the obvious solution.

Recursion is also a favorite topic in technical interviews because it tests your ability to think about problems abstractly. The key mental shift is trusting that your function works correctly for smaller inputs - then using that assumption to solve the larger problem. This leap of faith is what makes recursion click.

The Two Parts of Recursion

Every recursive function must have two parts: a base case that stops the recursion, and a recursive case that calls itself with a smaller problem:

	def countdown(n):
	    """Count down from n to 1, then print Done!"""
	    # Base case
	    if n <= 0:
	        print("Done!")
	        return

	    # Recursive case
	    print(n)
	    countdown(n - 1)

	countdown(5)

>>>Output

5

4

3

2

1

Done!

Each call to countdown passes a smaller number. Eventually n reaches 0, hitting the base case and stopping. Without the base case, the function would call itself forever (until Python raises a RecursionError).

Return Values in Recursion

Recursive functions often compute and return values. Each call waits for its recursive call to return before computing its result:

	def factorial(n):
	    """Calculate n!."""
	    # Base case
	    if n <= 1:
	        return 1
	    # n! = n * (n-1)!
	    return n * factorial(n - 1)

	# Trace: factorial(5)
	# 5 * factorial(4)
	# 5 * 4 * factorial(3)
	# 5 * 4 * 3 * factorial(2)
	# 5 * 4 * 3 * 2 * 1 = 120

	print("5! =", factorial(5))
	print("4! =", factorial(4))
	print("10! =", factorial(10))

>>>Output

5! = 120

4! = 24

10! = 3628800

Base case

Identify the condition that stops the recursion and returns directly.

Move toward base

Each recursive call must use a smaller or simpler input than the current one.

Trust the call

Assume the recursive call works correctly for the smaller problem.

Combine results

Merge the recursive result with the current work to build the answer.

Test small first

Verify with trivial inputs like 0, 1, and 2 before trying larger values.

Recursion for Nested Data

Recursion shines when processing nested structures of unknown depth. This is exactly what data engineers face with JSON from APIs:

	def sum_nested(data):
	    """Sum nested numbers."""
	    total = 0
	    for item in data:
	        if isinstance(item, list):
	            # Recurse into list
	            total += sum_nested(item)
	        else:
	            # It's a number
	            total += item
	    return total

	# Arbitrary nesting depth
	nested = [1, [2, 3], [4, [5, 6]], 7]
	print("Sum:", sum_nested(nested))

	# Deeply nested
	deep = [[[1, 2], [3]], [[4, 5]]]
	print("Deep sum:", sum_nested(deep))

>>>Output

Sum: 28

Deep sum: 15

Flattening Nested Lists

A common data engineering task is flattening nested structures into a single list. Data often arrives nested from APIs or hierarchical databases, but processing requires flat lists. Recursion handles any nesting depth automatically - you do not need to know how deep the nesting goes:

	def flatten(nested):
	    """Flatten nested lists."""
	    result = []
	    for item in nested:
	        if isinstance(item, list):
	            # Recurse deeper
	            result.extend(flatten(item))
	        else:
	            # Base: add item directly
	            result.append(item)
	    return result

	data = [1, [2, [3, 4]], [5, 6], [[7]]]
	print("Flattened:", flatten(data))

	# Works with mixed types
	mixed = ["a", ["b", ["c", "d"]], "e"]
	print("Mixed:", flatten(mixed))

>>>Output

Flattened: [1, 2, 3, 4, 5, 6, 7]

Mixed: ['a', 'b', 'c', 'd', 'e']

Values in Nested Dicts

Another practical pattern is searching for a key in a nested dictionary structure. When processing API responses, the data you need is often buried several levels deep. Rather than writing response["data"]["user"]["profile"]["email"] and hoping each key exists, you can use a recursive search that finds the key wherever it lives:

	def find_key(data, target_key):
	    if isinstance(data, dict):
	        if target_key in data:
	            return data[target_key]
	        for value in data.values():
	            result = find_key(value, target_key)
	            if result is not None:
	                return result
	    elif isinstance(data, list):
	        for item in data:
	            result = find_key(item, target_key)
	            if result is not None:
	                return result
	    return None

	response = {"data": {"user": {"profile": {"email": "a@b.com"}}}}
	print("Email:", find_key(response, "email"))
	print("Missing:", find_key(response, "phone"))

>>>Output

Email: a@b.com

Missing: None

Recursion vs Iteration

Many problems can be solved with either recursion or loops. Each has trade-offs:

•Recursion

Natural for trees and nested data
Matches mathematical definitions
Code can be more elegant
Uses call stack memory
Risk of stack overflow

•Iteration (loops)

Better for linear sequences
More memory efficient
No stack overflow risk
Can be harder for nested data
Often more performant

Python also limits how deep recursion can go, which matters when data is deeply nested.

TIP

Python has a default recursion limit of 1000. For very deep recursion, raise the limit cautiously with sys.setrecursionlimit() or, more safely, convert to iteration with an explicit stack.
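
As a sketch of the explicit-stack approach, here is an iterative take on flattening nested lists. The name flatten_iterative is hypothetical; the stack holds partially consumed iterators, so the function can pause a list, descend into a sublist, and resume afterward.

```python
def flatten_iterative(nested):
    """Flatten nested lists using an explicit stack of iterators."""
    result = []
    stack = [iter(nested)]              # each entry is a partially consumed list
    while stack:
        for item in stack[-1]:
            if isinstance(item, list):
                stack.append(iter(item))  # descend; the parent resumes later
                break
            result.append(item)
        else:
            stack.pop()                 # current list is exhausted
    return result

print(flatten_iterative([1, [2, [3, 4]], [5, 6], [[7]]]))

# Build a list nested 5000 levels deep, well past the default recursion limit
deep = [42]
for _ in range(5000):
    deep = [deep]
print(flatten_iterative(deep))
```

The stack grows on the heap instead of the call stack, so depth is bounded by available memory rather than by the recursion limit.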

Common Mistakes

Even experienced developers make these mistakes with advanced function patterns:

✓Do

Keep lambdas to simple one-line expressions
Always define a base case for recursion
Use tuple() to create hashable cache keys
Balance decomposition: 5-20 lines per function

✗Don't

Write complex multi-step logic in a lambda
Forget to capture loop variables in closures
Use mutable types like lists as dictionary keys
Split into so many functions you lose readability
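The loop-variable pitfall from the list above can be sketched as follows. Closures look up the variable when they are called, not when they are defined, so every function created in a loop sees the final value unless the current value is captured explicitly. The names late, bound, and make_multiplier are illustrative only.

```python
# BAD: each lambda reads i when it is called, after the loop has finished
late = [lambda x: x * i for i in range(3)]
print([f(10) for f in late])   # [20, 20, 20]

# GOOD: a default argument captures the current value of i at definition time
bound = [lambda x, i=i: x * i for i in range(3)]
print([f(10) for f in bound])  # [0, 10, 20]


# ALSO GOOD: a factory function gives each closure its own scope
def make_multiplier(factor):
    return lambda x: x * factor


made = [make_multiplier(i) for i in range(3)]
print([f(10) for f in made])   # [0, 10, 20]
```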

Mistake: Complex Lambdas

Lambdas should be simple one-liners. When a lambda grows complex, it becomes harder to read than a named function.

	# BAD: Lambda too complex - hard to read
	process = lambda x: x.strip().lower().replace(" ", "_") if x else ""

	# GOOD: Named function is clearer
	def normalize_string(s):
	    """Normalize a string to snake_case."""
	    if not s:
	        return ""
	    return s.strip().lower().replace(" ", "_")

	# Both work, but the function is more readable
	test = " Hello World "
	print("Lambda:", process(test))
	print("Function:", normalize_string(test))

>>>Output

Lambda: hello_world

Function: hello_world

Mistake: Missing Base Case

Recursive functions must have a base case that stops the recursion. Without one, the function keeps calling itself until Python hits its recursion limit and raises a RecursionError.

	# BAD: No base case
	# def count_forever(n):
	#     print(n)
	#     count_forever(n + 1)  # never stops: ends in RecursionError

	# GOOD: Always have a base case
	def count_to_limit(n, limit):
	    """Count from n up to limit."""
	    if n > limit:
	        return
	    print(n)
	    count_to_limit(n + 1, limit)

	count_to_limit(1, 3)

>>>Output

1

2

3

Mistake: Unhashable Cache Keys

When implementing caching, only hashable types (strings, numbers, tuples of hashable items) can be dictionary keys. Lists and other mutable built-ins such as dicts and sets raise a TypeError.

	# BAD: Lists can't be dict keys
	# cache = {}
	# cache[[1, 2, 3]] = "result" # TypeError!

	# GOOD: Convert to tuple for cache key
	# (_cache={} is created once, when the function is defined, so it persists across calls)
	def process_items(items, _cache={}):
	    key = tuple(items)
	    if key in _cache:
	        return _cache[key]

	    result = sum(items) * 2
	    _cache[key] = result
	    return result

	print(process_items([1, 2, 3]))
	print(process_items([1, 2, 3]))

>>>Output

12

12

Debugging Recursion

When a recursive function misbehaves, the first thing to check is whether each call actually moves toward the base case. If the recursive argument goes the wrong direction, the function will call itself until Python raises a RecursionError.

This recursive function has a bug that causes infinite recursion. Can you spot it?

Debug Challenge

> This recursive factorial function never reaches its base case because each call passes n + 1 instead of moving toward n <= 1.

RecursionError: factorial calls itself with n + 1 instead of n - 1.

def factorial(n):
  if n <= 1:
    return 1
  return n * factorial(n + 1)
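
One possible fix, shown next to the buggy version so the difference is easy to see. The names factorial_buggy and factorial_fixed are chosen here for illustration only.

```python
def factorial_buggy(n):
    if n <= 1:
        return 1
    return n * factorial_buggy(n + 1)  # moves away from the base case


def factorial_fixed(n):
    """Corrected version: each call shrinks n toward the base case."""
    if n <= 1:
        return 1
    return n * factorial_fixed(n - 1)


try:
    factorial_buggy(5)
except RecursionError:
    print("buggy version hit the recursion limit")

print(factorial_fixed(5))  # 120
```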

You have learned lambdas, first-class functions, helper decomposition, memoization, and recursion. Now apply these patterns to a real architecture decision. In data pipelines, choosing the right function pattern can mean the difference between a system that scales gracefully and one that collapses under load.

Each function pattern has a specific role: lambdas handle simple one-liners, named helpers bring clarity to multi-step logic, memoization eliminates repeated computation, and recursion navigates unknown nesting depth. Choosing the wrong pattern creates code that is technically correct but expensive or impossible to maintain.

Production pipelines combine all of these patterns. A well-designed pipeline reads like a recipe: transform, validate, score, and store. Each step is a focused named function, making the data flow explicit and every step independently testable.

ETL Pipeline Architecture: Step 1

Your team processes 10 million customer records nightly. The pipeline must validate records, normalize names, compute loyalty tier scores, and handle nested JSON addresses. The current monolithic function takes 4 hours and is impossible to debug. You need to redesign it.

customer_records

customer_id	raw_name	tier_id	address_json
c_001	alice SMITH	gold	{"city": "Seattle"}
c_002	BOB jones	silver	{"loc": {"city": "NYC"}}
c_003	carol DAVIS	gold	{"addr": {"city": "LA"}}

Jul 2026

The monolithic process_all_records() function is 300 lines long and handles validation, normalization, scoring, and address parsing in one body. How do you restructure it?

The best architectural decisions come from understanding the tradeoffs of each approach before you commit. Decomposition, memoization, and recursion are not competing ideas -- they solve different problems and compose naturally in the same pipeline.

When you review production pipeline code, look for patterns that solve the wrong problem: lambdas used where named functions would aid debugging, list scans where dictionary lookups would be faster, and hardcoded paths where recursion would handle arbitrary depth.
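To make the list-scan point concrete, here is a small sketch contrasting the two approaches. The tier data and function names are invented for this example.

```python
tiers = [("gold", 3), ("silver", 2), ("bronze", 1)]  # assumed sample data


def score_by_scan(tier_id):
    """List scan: checks entries one by one, so cost grows with the list."""
    for name, score in tiers:
        if name == tier_id:
            return score
    return 0


tier_lookup = dict(tiers)  # build the dictionary once, up front


def score_by_lookup(tier_id):
    """Dictionary lookup: a single hash lookup regardless of size."""
    return tier_lookup.get(tier_id, 0)


print(score_by_scan("silver"), score_by_lookup("silver"))  # 2 2
```

With three tiers the difference is negligible; it matters when the lookup table is large and the function runs once per record.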

Function mastery is ultimately about matching the right abstraction to the problem at hand. With practice, you will recognize immediately which pattern fits each layer of a data system.

❯❯❯PUTTING IT ALL TOGETHER

> You are a senior data engineer at Databricks building a caching and retry system for expensive external API calls inside a data pipeline that must stay within strict per-request latency budgets.

lambda functions inline short transformations like key extraction directly inside sorted() and filter() calls without defining a named function.

Functions as objects let you pass retry handlers and fallback strategies into pipeline stages as configurable callbacks.

Helper decomposition splits the fetch, validate, and transform steps into focused functions that can be tested and swapped independently.

Memoization with a dict caches prior API responses by argument key so repeated calls return immediately without hitting the network again.
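A compact sketch of how these four pieces might fit together is shown below. Everything in it is hypothetical: flaky_fetch stands in for the external API, and with_retry and memoize are illustrative helpers rather than any library's API.

```python
import time


def with_retry(func, attempts=3, delay=0.0, on_failure=None):
    """Return a wrapper that retries func, then falls back to on_failure."""
    def wrapper(*args):
        for attempt in range(1, attempts + 1):
            try:
                return func(*args)
            except ConnectionError:
                if attempt < attempts:
                    time.sleep(delay)
        if on_failure is None:
            raise ConnectionError(f"gave up after {attempts} attempts")
        return on_failure(*args)  # configurable fallback callback
    return wrapper


def memoize(func):
    """Cache results in a dict keyed by the positional arguments."""
    cache = {}

    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper


calls = {"count": 0}


def flaky_fetch(user_id):
    """Stand-in for an external API: fails on the first call, then succeeds."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("simulated timeout")
    return {"user_id": user_id, "score": user_id * 10}


fetch = memoize(with_retry(flaky_fetch, attempts=3))

results = [fetch(uid) for uid in (7, 3, 7)]
top_first = sorted(results, key=lambda r: r["score"], reverse=True)
print(top_first[0])    # {'user_id': 7, 'score': 70}
print(calls["count"])  # 3
```

The underlying function runs three times: one failed attempt and one retry for user 7, then one call for user 3. The repeated request for user 7 is answered from the cache.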

KEY TAKEAWAYS

lambda creates anonymous functions - use for sorting keys, callbacks, and simple transformations

Lambda syntax: lambda args: expression - single expression only, no statements

Functions are first-class objects: assign to variables, store in collections, pass as arguments

Function factories return new functions - the foundation of closures and decorators

Helper decomposition breaks complex functions into focused, testable pieces

Prefix private helpers with underscore: _validate_input()

Memoization caches results using a dict - use for expensive repeated computations

Every recursive function needs a base case and must move toward it

Recursion is natural for nested structures - JSON traversal, tree processing

Use tuple() to convert lists to hashable cache keys

Functions that create functions

Category: Python
Difficulty: advanced
Duration: 40 minutes
Challenges: 0 hands-on challenges

Topics covered: Lambda Functions, Functions as Objects, Helper Decomposition, Memoization with Dicts, Recursion Basics
