Data Engineering Lessons

136+ interactive data engineering lessons with real code execution. Learn SQL queries, Python for data engineering, and data modeling through hands-on practice. Every lesson includes challenges you solve by writing and running real code against live databases.

Data Modeling Lessons (11)

Keys & Identity - 22 min
Every record deserves a fingerprint
Topics: The Problem of Identity, Primary Keys: Data Identity, Foreign Keys, Composite Keys, Key Generation Strategies
Schema Types - 22 min
Choosing the right box for every value
Topics: The FLOAT Money Bug, String Types & Platform Traps, Temporal Types & DST, ENUM Traps, Type Review Framework
Relationships - 18 min
How tables talk to each other
Topics: What Are Relationships?, Cardinality Explained, Required vs Optional, Self-Referential Tables, Complex Patterns
Normalization - 15 min
Why copying data breaks everything
Topics: Data Gets Out of Sync, First Normal Form (1NF), Second Normal Form (2NF), Third Normal Form (3NF), Identifying Normal Form
Beyond 3NF - 19 min
Beyond third normal form
Topics: Boyce-Codd Normal Form, Fourth Normal Form (4NF), Fifth Normal Form (5NF), Strategic Denormalization, Denormalization Patterns
Star Schemas - 30 min
Stars, snowflakes, and facts between
Topics: The Star Schema, Types of Fact Tables, Types of Dimensions, Defining the Grain, Surrogate Keys
Nested Data - 15 min
When flat tables meet nested reality
Topics: The Nesting Decision, STRUCT: Embedded Objects, ARRAY: Ordered Collections, MAP: Dynamic Key-Value Pairs, Columnar Storage & Nesting
Event Streams - 27 min
Data that never forgets
Topics: Event-Driven Architecture, Immutable Append-Only Logs, Event Sourcing, Clickstream Modeling, Handling Late-Arriving Data
Pre-Aggregation - 23 min
Pre-computing answers before anyone asks
Topics: Why Pre-Aggregate?, Metric Types and Additivity, OLAP Cubes & Rollups, Granularity Design, Refresh Strategies & Materialized Views
Design Patterns - 21 min
Blueprints for building data systems
Topics: Medallion Architecture, Data Vault, One Big Table (OBT), Semantic Layers, Pipeline DAG Design

Pipeline Architecture Lessons (41)

How Data Moves: Beginner - beginner - 20 min
Nail the batch vs streaming question and defend your choice
Topics: Batch Processing, Stream Processing, File Ingestion, API Ingestion, Batch vs Streaming
How Data Moves: Intermediate - intermediate - 25 min
Survive the follow-up probes on batch, streaming, and hybrid
Topics: Batch Mechanics, Stream Guarantees, File Format Depth, API Patterns, Hybrid Architectures
How Data Moves: Advanced - advanced - 30 min
Handle the depth probes: idempotency, backpressure, and cost
Topics: Idempotent Pipelines, Backpressure, Late-Arriving Data, Dead Letter Queues, Cost of Freshness
Where Data Lives: Beginner - beginner - 20 min
Answer the storage questions: Parquet, partitioning, lake vs warehouse
Topics: Columnar vs Row, Compression, Partitioning, Lake vs Warehouse, Table Formats
Where Data Lives: Intermediate - intermediate - 25 min
Survive the storage follow-ups: encoding, small files, schema evolution
Topics: Encoding Types, The Small File Problem, Predicate Pushdown, Storage Tiering, Schema Evolution
Keeping Data Fresh: Beginner - beginner - 20 min
Answer the incremental loading question that follows every pipeline design
Topics: Full vs Incremental Loading, Change Data Capture, Slowly Changing Dimensions, Schema Evolution, Backfilling
Keeping Data Fresh: Intermediate - intermediate - 25 min
Master the incremental loading patterns that interviewers probe hardest
Topics: Merge Strategies, CDC Patterns, SCD in Pipelines, Schema Migration, Partition-Level Backfill
Distributed Compute: Beginner - beginner - 20 min
Answer the Spark architecture question that appears in every technical screen
Topics: Spark Execution Model, Distributed Primitives, Shuffle Operations, Memory Management, Small File Problem
Streaming Systems: Beginner - beginner - 20 min
Answer the Kafka and streaming questions with confidence
Topics: Event Platforms, Event-Driven Architecture, Late-Arriving Data, Dead Letter Queues, Micro-Batch vs True Streaming
Streaming Systems: Intermediate - intermediate - 25 min
Master offset management, consumer groups, and streaming failure modes
Topics: Consumer Groups and Offsets, Event Sourcing Patterns, Windowing and Watermarks, DLQ Patterns, Spark Streaming vs Flink

Python Lessons (42)

Python Foundations: Beginner - beginner - 18 min
Your first lines of Python start here
Topics: Variables and Assignment, Data Types, Print Statements, Basic Operators, Comments
Python Foundations: Intermediate - intermediate - 20 min
Decisions, loops, and reusable logic
Topics: Conditional Statements, Loops, Functions, Return Values, Variable Scope
Python Foundations: Advanced - advanced - 19 min
Lambdas, comprehensions, and more
Topics: Lambda Functions, List Comprehensions, Decorators, Generators, Context Managers
Python Expressions: Beginner - beginner - 27 min
Where every Python journey begins
Topics: How Computers Store Data, Variables and Naming, Assignment vs. Equality, Data Types and Strings, Operators and Readability
Python Expressions: Intermediate - intermediate - 38 min
Making decisions with data
Topics: Type Conversions, Comparison Operators, Logical Operators, Multiple Assignment, None and Identity
Python Expressions: Advanced - advanced - 40 min
Patterns for technical interviews
Topics: Multiple Assignment, Short-circuit Evaluation, Truthy and Falsy Values, Ternary Expressions, Walrus Operator (:=)
Control Flow: Beginner - beginner - 37 min
Making decisions in code
Topics: The if Statement, Branching with if-else, Chaining if-elif-else, Combining with and/or, Execution Flow
Control Flow: Intermediate - intermediate - 33 min
Writing cleaner conditional logic
Topics: Guard Clauses, Chained Comparisons, Pattern Matching with match-case, Conditional Assignment, Edge Case Handling
Control Flow: Advanced - advanced - 28 min
Elegant patterns for complex decisions
Topics: Boolean Simplification, De Morgan's Laws, State Machine Patterns, Dict-Based Dispatch, Decision Table Lookups
Loops: Beginner - beginner - 39 min
Repeating actions efficiently
Topics: Iterating with for Loops, range() Function, Loops with while, Using break and continue, Loop Variable Scope

Spark Lessons (12)

How a Spark Job Runs - beginner - 12 min
Your query is a promise. Something has to keep it.
Topics: The Cluster: Who Plans, Who Works; Partitions: The Unit of Parallelism; Transformations vs Actions; Cores and Slots; A Job's Life, End to End
How a Spark Job Runs: Stages and Plans - intermediate - 12 min
The boundaries between stages are where the cost lives.
Topics: Job, Stage, Task; Why Stages Exist At All; Reading Parallelism; Where the Driver Lives; spark-submit and the Config Surface
How a Spark Job Runs: Scheduler Internals - advanced - 14 min
The failure edges separate tuning from understanding.
Topics: DAGScheduler vs TaskScheduler, Task Failure and Retry, Speculative Execution, The Driver as Bottleneck, Locality and Scheduling Delay
Lazy Until You Ask - beginner - 13 min
You wrote a recipe. Nothing cooks until you call an action.
Topics: Nothing Runs Until an Action, Why Laziness Makes Spark Fast, The Action Catalog: What Actually Triggers a Run, The collect() Trap, Re-Execution: The Chain Runs Again Every Time
Reading the Plan: DAG, Stages, and explain() - intermediate - 14 min
The shape of the graph is the map of where your time goes.
Topics: The DAG: Your Plan as a Graph, Counting Stages Is Counting Shuffles, Logical vs Physical Plan: Reading explain(), Pipelining: Why Narrow Ops Are Nearly Free, DAG vs Lineage: The Plan and the Recovery History
Lineage as Fault Tolerance - advanced - 15 min
A partition is never data Spark trusts to survive. It is a recipe Spark can rebuild.
Topics: Lineage-Based Recovery: Rebuild, Don't Re-Read; The Recompute Cost: When Lineage Gets Expensive; Checkpointing: Cutting the Lineage; Cache vs Checkpoint vs Persist: Which Solves What; Determinism: Why Recompute-Based Recovery Can Break
Narrow, Wide, and the Shuffle - beginner - 13 min
One category is free. The other can run your whole bill.
Topics: Narrow Transformations: Each Piece Stays Home, Wide Transformations: When Rows Must Come Together, What 'Shuffle' Actually Means, Why Wide Is Expensive and Narrow Is Nearly Free, Spotting the Shuffle in Your Own Code
Inside the Shuffle - intermediate - 14 min
Two halves, a write and a read, with disk and the network in between.
Topics: The Shuffle Write: Staging Data by Key, The Shuffle Read: Fetching Across the Network, Spill: When the Shuffle Runs Out of Memory, The 200 Knob: spark.sql.shuffle.partitions, Why the Shuffle Dominates Runtime
Shuffle Internals and Elimination - advanced - 15 min
The cheapest shuffle is the one you engineered away.
Topics: Sort-Based Shuffle: One File, Not N Squared; The External Shuffle Service: Surviving a Dead Executor; Pricing a Shuffle: Bytes Moved to Wall-Clock; Eliminating a Shuffle; The Shuffle Tuning Knobs
The Optimizer Works For You - beginner - 13 min
You stopped telling Spark how, and started telling it what.
Topics: Declare What, Not How: Why DataFrames Beat RDDs; The Optimizer Exists: Your Query Is Rewritten; The RDD Escape Hatch and Its Cost; DataFrame, Dataset, RDD: Three APIs, Three Trade-offs; Seeing the Optimization Happen with explain()

SQL Lessons (30)

Query Structure: Beginner - beginner - 9 min
Your first SQL query, demystified
Topics: Tables, rows, and columns, SELECT and FROM basics, Selecting all columns (*), AS aliases for columns, Expressions in SELECT
Query Structure: Intermediate - intermediate - 28 min
CTEs: subqueries are a cry for help
Topics: CTEs (WITH clause), Query Execution Order, Subqueries for temp results, UNION and UNION ALL, ORDER BY and LIMIT
Query Structure: Advanced - advanced - 31 min
SQL operators nobody warned you about
Topics: Correlated subqueries, EXCEPT and EXCEPT ALL, INTERSECT and INTERSECT ALL, UNNEST for arrays, SELECT Without FROM
Data Types: Beginner - beginner - 19 min
INT, VARCHAR, and the lies we tell
Topics: Why data types matter, INTEGER for whole numbers, STRING types (VARCHAR), BOOLEAN for true/false, CAST for type conversion
Data Types: Intermediate - intermediate - 19 min
Where pennies vanish and NULLs defy
Topics: BOOLEAN and NULL logic, DECIMAL precision and scale, TIMESTAMP vs TIMESTAMP WITH TIME ZONE, Time zones and UTC handling, TRY_CAST and implicit casts
Data Types: Advanced - advanced - 21 min
Arrays, maps, and type optimization at scale
Topics: Type optimization at scale, Storage calculations and VARCHAR, MAP data type for key-value pairs, Accessing nested data (UNNEST), Compression and error handling
Filtering: Beginner - beginner - 35 min
WHERE: your database bouncer
Topics: WHERE clause for filtering rows, Equals and not equals (=, !=), Comparison operators (<, >), IN and NOT IN for list matching, AND for combining conditions
Filtering: Intermediate - intermediate - 24 min
Boolean logic: it's complicated
Topics: OR and CASE expressions, Operator precedence, LIKE for pattern matching, LIMIT and OFFSET for pagination, BETWEEN for range filtering
Filtering: Advanced - advanced - 19 min
Subqueries, regex, and other crimes
Topics: Correlated subqueries (EXISTS), NOT EXISTS for missing rows, NOT IN vs NULL gotchas, REGEXP_LIKE patterns, Regex operators and patterns
Aggregating: Beginner - beginner - 38 min
A million rows walk into a SUM...
Topics: GROUP BY for categorizing data, COUNT variations (*, col, DISTINCT), SUM and AVG calculations, MIN and MAX for extremes, HAVING for filtered groups
