Data Engineering Tools Hub

Tutorials, interview questions, and runnable practice for the tools that come up in real data engineering interviews. Everything is free and framed around what interviewers actually test.

Apache Spark and PySpark

What Is PySpark?→

Python API for Apache Spark: when to use it, how it maps to Spark SQL, and what interviewers test.

PySpark Tutorial→

From SparkSession to write: DataFrame basics, transformations, and actions.

PySpark Interview Questions→

The PySpark questions that come up in real L4 to L6 data engineering interviews.

Spark Interview Questions→

Shuffle, skew, AQE, broadcast joins, and the questions that separate seniors from staff.

Spark Interview Questions for L5 to L7→

Advanced Spark: Catalyst, physical plans, executor tuning, and incident debugging.

Spark Mock Interview→

AI-driven, four-phase Spark interview simulation with code execution and a final verdict.

PySpark Practice Problems→

Hands-on PySpark problems grouped by category with real execution and grading.

PySpark Coding Practice→

PySpark problems by difficulty, with runnable code and tests.

PySpark Joins→

Broadcast joins, anti joins, and multi-column join patterns.

PySpark GroupBy→

Shuffle cost, skew, and the aggregation patterns interviewers probe.

PySpark Drop Duplicates→

dropDuplicates vs window dedup: when each is correct.

PySpark isin()→

Performance notes and NOT IN pitfalls with NULLs.

PySpark Functions Cheat Sheet→

The functions you reach for in interviews, organized by task.

Spark SQL Functions→

Spark SQL functions reference for data engineers.

Spark SQL LEFT ANTI JOIN→

Syntax, execution plan, and use cases for anti joins.

Transformation (dbt)

dbt Tutorial→

Beginner to interview-ready: models, tests, sources, materializations, and project structure.

dbt Interview Questions→

The dbt questions interviewers use to check real experience vs tutorial knowledge.

Orchestration (Airflow)

Airflow DAG Reference→

DAG patterns, task dependencies, XCom, and operators for data engineering interviews.

Airflow Interview Questions→

Scheduler, executor, SLA, and backfill questions that actually get asked.

Streaming (Kafka)

Kafka Interview Questions→

Topics, partitions, consumer groups, exactly-once, and ordering guarantees.

Warehouse and Lakehouse

Snowflake Interview Questions→

Warehouses, micro-partitions, clustering, time travel, and cost tuning.

Databricks Interview Questions→

Delta Lake, Unity Catalog, Workflows, and Photon-era Databricks questions.

Why Practice

Run a Real Interview

01
Active recall beats re-reading by 50%
The Dunlosky et al. (2013) review of study techniques ranks practice testing in its top tier, while re-reading and highlighting rank near the bottom.
02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.
03
System design is graded on the calls you defend out loud
Ingestion, batch vs streaming, the bronze/silver/gold layers, idempotency, backfill, and replay. The signal is sketching the pipeline and naming its failure modes, not drawing the boxes.
