Data Engineering Tools Hub
Tutorials, interview questions, and runnable practice for the tools that come up in real data engineering interviews. Everything is free and framed around what interviewers actually test.
Apache Spark and PySpark
Python API for Apache Spark: when to use it, how it maps to Spark SQL, and what interviewers test.
From SparkSession to write: DataFrame basics, transformations, and actions.
The PySpark questions that come up in real L4 to L6 data engineering interviews.
Shuffle, skew, AQE, broadcast joins, and the questions that separate seniors from staff.
Advanced Spark: Catalyst, physical plans, executor tuning, and incident debugging.
AI-driven, four-phase Spark interview simulation with code execution and a final verdict.
Hands-on PySpark problems grouped by category with real execution and grading.
PySpark problems by difficulty, with runnable code and tests.
Broadcast joins, anti joins, and multi-column join patterns.
Shuffle cost, skew, and the aggregation patterns interviewers probe.
dropDuplicates vs window dedup: when each is correct.
Performance notes and NOT IN pitfalls with NULLs.
The functions you reach for in interviews, organized by task.
Spark SQL functions reference for data engineers.
Syntax, execution plan, and use cases for anti joins.
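The NOT IN pitfall with NULLs mentioned above comes from SQL's three-valued logic, and it can be reproduced without a cluster. The sketch below is illustrative only: it uses Python's built-in sqlite3 as a stand-in for Spark SQL, and the table names `orders` and `blocked` are hypothetical. Spark SQL is expected to apply the same NULL semantics, but verify that on your own version.

```python
import sqlite3

# In-memory database; `orders` and `blocked` are made-up example tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER);
    CREATE TABLE blocked (customer_id INTEGER);
    INSERT INTO orders VALUES (1), (2), (3);
    INSERT INTO blocked VALUES (2), (NULL);
""")

# NOT IN: comparing against the NULL in `blocked` yields UNKNOWN,
# so no row can ever satisfy the predicate.
not_in = conn.execute(
    "SELECT customer_id FROM orders "
    "WHERE customer_id NOT IN (SELECT customer_id FROM blocked) "
    "ORDER BY customer_id"
).fetchall()

# NOT EXISTS (the anti-join form): the NULL row simply never matches,
# so unmatched customers are returned as intended.
not_exists = conn.execute(
    "SELECT o.customer_id FROM orders o "
    "WHERE NOT EXISTS ("
    "    SELECT 1 FROM blocked b WHERE b.customer_id = o.customer_id"
    ") "
    "ORDER BY o.customer_id"
).fetchall()

print(not_in)      # rows returned by NOT IN
print(not_exists)  # rows returned by NOT EXISTS
```

A single NULL in the subquery is enough to make NOT IN return nothing, which is why an anti join or NOT EXISTS is usually the safer choice when the key column is nullable.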
Transformation (dbt)
Orchestration (Airflow)
Streaming (Kafka)
Warehouse and Lakehouse
Run a Real Interview
- 01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.
- 02
76% of hiring managers reject on the coding task, not the resume
Source: HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper can still fail the live screen if they haven't done timed, executable practice.
- 03
Five problem shapes cover 80% of data engineer loops
The five shapes are dedup, sessionization, top-N-per-group, slowly changing dimensions, and partition tricks. Writing each shape by hand turns unfamiliar problems into pattern recognition.
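Two of these shapes, dedup and top-N-per-group, can be sketched in plain Python to show the underlying logic. This is a minimal illustration, not a PySpark solution: the `events` rows and the helper names `latest_per_key` and `top_n_per_group` are hypothetical. In Spark the same ideas are typically written with a window partitioned by the key and ordered by the sort column.

```python
from collections import defaultdict

# Hypothetical sample rows: several events per user.
events = [
    {"user": "a", "ts": 1, "score": 10},
    {"user": "a", "ts": 3, "score": 30},
    {"user": "b", "ts": 2, "score": 20},
    {"user": "a", "ts": 2, "score": 50},
    {"user": "b", "ts": 5, "score": 5},
]

def latest_per_key(rows, key, order_by):
    """Dedup: keep only the row with the largest `order_by` for each key."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[order_by] > latest[k][order_by]:
            latest[k] = row
    return latest

def top_n_per_group(rows, key, order_by, n):
    """Top-N-per-group: sort each group descending and keep the first n rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return {
        k: sorted(g, key=lambda r: r[order_by], reverse=True)[:n]
        for k, g in groups.items()
    }

deduped = latest_per_key(events, "user", "ts")
top2 = top_n_per_group(events, "user", "score", 2)

print({k: v["ts"] for k, v in deduped.items()})
print({k: [r["score"] for r in v] for k, v in top2.items()})
```

Both helpers follow the same pattern: partition by a key, order within the partition, then filter by position. Recognizing that pattern is what makes the two problems feel like variations of one shape.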