Async Python for Data Engineers: Advanced
asyncio.TaskGroup: Structured Concurrency
asyncio.gather has a critical problem: task leaks. With the default return_exceptions=False, if one task raises an exception, gather propagates it to the caller immediately, but the other tasks keep running in the background. Nothing owns them and nothing cancels them, so they consume resources silently until they finish (or hang forever). asyncio.TaskGroup (Python 3.11+) solves this with structured concurrency: child tasks cannot outlive their parent scope.
asyncio.gather:
- All tasks start immediately and run concurrently
- If one raises: gather raises, others keep running (leaked)
- return_exceptions=True: exceptions become values, all run to completion
- No automatic cancellation on failure
- Task leaks: orphaned tasks consuming resources
- Available since Python 3.4

asyncio.TaskGroup:
- All tasks are created inside the group and owned by it
- If one raises: all remaining tasks are cancelled
- ExceptionGroup collects all exceptions raised (other than cancellations)
- Structured: child tasks cannot outlive the group scope
- No task leaks: scope exit guarantees cleanup
- Preferred default for new async code on 3.11+
Event Loop Internals: What Staff Candidates Know
Staff-level candidates know the event loop architecture well enough to debug problems that look like async issues but are actually event loop lifecycle issues. The three most common are nesting event loops in Jupyter or other notebooks (where a loop is already running), running async code from a sync context, and creating tasks that outlive their loop.
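Two of these lifecycle issues can be reproduced in a few lines. The sketch below is illustrative only: fetch_row_count is a hypothetical coroutine standing in for real I/O. It shows asyncio.run working from plain sync code, failing when a loop is already running (the situation inside a notebook cell), and asyncio.run_coroutine_threadsafe as one way for a sync worker thread to submit work to a loop that is already running.

```python
import asyncio


async def fetch_row_count():
    # Hypothetical coroutine standing in for a real async query.
    await asyncio.sleep(0.01)
    return 42


def sync_entry_point():
    # No loop is running in this thread, so asyncio.run can create one,
    # run the coroutine to completion, and close the loop afterwards.
    return asyncio.run(fetch_row_count())


async def nested_run_fails():
    # Inside a running loop, asyncio.run refuses to start a second one.
    coro = fetch_row_count()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


def call_from_worker_thread(loop):
    # Runs in a worker thread: hand the coroutine to the loop that is
    # running in the main thread and block this thread on the result.
    future = asyncio.run_coroutine_threadsafe(fetch_row_count(), loop)
    return future.result(timeout=1)


async def bridge_from_thread():
    loop = asyncio.get_running_loop()
    # to_thread keeps the loop free while the sync function blocks.
    return await asyncio.to_thread(call_from_worker_thread, loop)


if __name__ == "__main__":
    print(sync_entry_point())
    print(asyncio.run(nested_run_fails()))
    print(asyncio.run(bridge_from_thread()))
```

In a notebook, where the kernel already runs a loop, awaiting the coroutine directly at the top level of a cell is usually the simpler fix than trying to start a second loop.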
Testing Async Code: pytest-asyncio and AsyncMock
Async testing is a strong-hire signal in DE interviews. It shows you've maintained an async codebase, not just written async scripts. The key tools are pytest-asyncio for running async test functions, and AsyncMock (in unittest.mock since Python 3.8) for mocking async functions and clients.
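As a rough illustration of the mocking side, the sketch below uses only the standard library so it runs anywhere. The function load_user_names and the client.get_user method are hypothetical names invented for this example. With pytest-asyncio installed, the check function would typically be renamed to start with test_ and marked with @pytest.mark.asyncio instead of being driven by asyncio.run.

```python
import asyncio
from unittest.mock import AsyncMock


async def load_user_names(client, user_ids):
    # Hypothetical code under test: fetch users concurrently, return names.
    users = await asyncio.gather(*(client.get_user(uid) for uid in user_ids))
    return [user["name"] for user in users]


async def check_load_user_names():
    # Attributes of an AsyncMock are AsyncMocks too, so client.get_user
    # can be awaited without touching a real network client.
    client = AsyncMock()
    client.get_user.side_effect = lambda uid: {"id": uid, "name": f"user-{uid}"}

    names = await load_user_names(client, [1, 2])

    assert names == ["user-1", "user-2"]
    assert client.get_user.await_count == 2
    client.get_user.assert_any_await(1)


if __name__ == "__main__":
    asyncio.run(check_load_user_names())
```

Asserting on await_count and assert_any_await, rather than call counts alone, checks that the coroutine was actually awaited and not just created.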
Concurrency Decision Matrix: Async vs Threading vs Multiprocessing vs Spark
This is the most common staff-level async DE interview question. The interviewer describes a workload and asks which concurrency model you'd choose. The strong-hire answer gives specific criteria, not just the tool name.
| Approach | Best For | Why | Limitation |
|---|---|---|---|
| asyncio | Many concurrent I/O ops (API calls, DB queries) | Single thread, no per-task OS thread overhead, thousands of concurrent connections | A blocking call stalls every coroutine; CPU-bound work doesn't benefit; GIL still applies |
| threading | Can't refactor to async; concurrent.futures interop; legacy sync libs | GIL released during I/O; preemptive; simpler migration from sync code | GIL limits CPU parallelism; race conditions with shared state; higher overhead than async |
| multiprocessing | CPU-bound work (Pandas transforms, compression, ML inference) | Each process has its own interpreter and GIL; true parallelism; isolated memory | High overhead per process; IPC serialization cost; not for I/O-bound work |
| Spark / Dask | Dataset doesn't fit on one machine; distributed fault-tolerance needed | Horizontal scaling; built-in fault tolerance; SQL/DataFrame API | Overhead for small data; cluster management; little benefit when the data fits on one machine (rule of thumb: under ~10GB) |
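One way to practice giving criteria instead of tool names is to read the matrix as a decision procedure. The sketch below is a deliberate simplification: the Workload fields and the choose_model function are hypothetical names, and real decisions weigh more factors (team skills, existing infrastructure, latency targets) than three booleans.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    fits_on_one_machine: bool  # can a single host hold and process the data?
    cpu_bound: bool            # is time spent computing rather than waiting?
    has_async_client: bool     # do async libraries exist for the I/O involved?


def choose_model(workload: Workload) -> str:
    # Scale out only when a single machine is not enough.
    if not workload.fits_on_one_machine:
        return "spark/dask"
    # CPU-bound work needs separate processes to get past the GIL.
    if workload.cpu_bound:
        return "multiprocessing"
    # I/O-bound with async-capable clients: one thread, many connections.
    if workload.has_async_client:
        return "asyncio"
    # I/O-bound but stuck with blocking libraries: threads.
    return "threading"


if __name__ == "__main__":
    api_fanout = Workload(fits_on_one_machine=True, cpu_bound=False, has_async_client=True)
    print(choose_model(api_fanout))
```

The order of the checks mirrors the reasoning an interviewer usually wants to hear: data size first, then CPU versus I/O, then whether the libraries involved can be awaited.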
Shared State and Async Safety
asyncio's single-threaded cooperative model means most shared state is safe: coroutines can only switch at await points, so there are no preemptive race conditions between coroutines. An update that spans an await can still interleave with other coroutines, though, and the guarantee disappears entirely the moment you introduce threads (via asyncio.to_thread or ThreadPoolExecutor). Staff-level candidates understand exactly where the safety boundary is.
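The safety boundary can be sketched with a shared counter. In the example below, all function names are illustrative. A read-modify-write that contains an await loses updates even on a single thread, asyncio.Lock protects that critical section, and once work moves to threads via asyncio.to_thread a threading.Lock is the appropriate guard.

```python
import asyncio
import threading


async def unsafe_increment(state):
    current = state["count"]      # read
    await asyncio.sleep(0)        # suspension point: other coroutines run here
    state["count"] = current + 1  # write based on a possibly stale read


async def safe_increment(state, lock):
    # asyncio.Lock keeps the whole read-await-write sequence exclusive.
    async with lock:
        current = state["count"]
        await asyncio.sleep(0)
        state["count"] = current + 1


async def run_coroutines(n=100):
    unsafe = {"count": 0}
    await asyncio.gather(*(unsafe_increment(unsafe) for _ in range(n)))

    safe = {"count": 0}
    lock = asyncio.Lock()
    await asyncio.gather(*(safe_increment(safe, lock) for _ in range(n)))
    return unsafe["count"], safe["count"]


def thread_increment(state, lock, times):
    # Runs in a worker thread, where switches are preemptive,
    # so every update is guarded by a threading.Lock.
    for _ in range(times):
        with lock:
            state["count"] += 1


async def run_threads(workers=4, times=1000):
    state = {"count": 0}
    lock = threading.Lock()
    await asyncio.gather(
        *(asyncio.to_thread(thread_increment, state, lock, times) for _ in range(workers))
    )
    return state["count"]


if __name__ == "__main__":
    print(asyncio.run(run_coroutines()))
    print(asyncio.run(run_threads()))
```

Note that the two lock types are not interchangeable: asyncio.Lock coordinates coroutines on one loop, while threading.Lock coordinates OS threads.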
- Category: Python
- Difficulty: advanced
- Duration: 42 minutes
- Challenges: 0 hands-on challenges
Topics covered: asyncio.TaskGroup: Structured Concurrency, Event Loop Internals: What Staff Candidates Know, Testing Async Code: pytest-asyncio and AsyncMock, Concurrency Decision Matrix: Async vs Threading vs Multiprocessing vs Spark, Shared State and Async Safety