
dbt Tutorial for Data Engineers

Before dbt launched in 2016, transformation layers lived inside stored procedures, Airflow PythonOperators calling raw SQL, or hand-rolled Jinja templating stitched together with cron. Tristan Handy's team at Fishtown Analytics looked at that mess and asked a simple question: what if a SELECT statement were a first-class artifact with tests, lineage, and version control baked in? The answer became dbt. Ten years later it's the default transformation layer for nearly every warehouse-native stack. This tutorial walks through the concepts in roughly the order they were introduced, so you understand why each one had to exist.

2016: dbt first release
41%: SQL share of DE interviews
6: core concepts to master
854: SQL problems you can run

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

What Is dbt?

Understanding what dbt replaced tells you what it is. Before 2016, transformation logic lived in three painful places: stored procedures, orchestrator tasks calling raw SQL, and ad-hoc templating scripts glued together with cron. dbt collapsed all three into SELECT statements with lineage and tests.

dbt Transforms Data Inside Your Warehouse

dbt (data build tool) runs SQL transformations directly in your data warehouse. Unlike traditional ETL tools that extract data, transform it in an external system, and load it back, dbt follows the ELT pattern: your data is already in the warehouse, and dbt transforms it in place. This means you write standard SQL (with some Jinja templating), and dbt handles compiling it, running it in the right order, and managing dependencies. You do not need to learn a new language or framework. If you know SQL, you can use dbt.

SQL + Jinja = Reusable Transformations

dbt models are SQL SELECT statements with optional Jinja templating. Jinja lets you add variables, conditionals, loops, and macros to your SQL. A simple example: {{ ref('stg_orders') }} references another dbt model and automatically creates a dependency. dbt figures out the build order from these references. Macros let you write reusable SQL snippets: a date_spine macro that generates a calendar table, a deduplicate macro that removes duplicates using a standard pattern. This is how teams standardize their SQL without copying and pasting.
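The deduplicate macro mentioned above can be sketched as a hand-rolled Jinja macro (dbt_utils ships a more robust version; the macro and argument names here are illustrative):

```sql
-- macros/deduplicate.sql (illustrative; dbt_utils.deduplicate is the
-- battle-tested equivalent)
{% macro deduplicate(relation, partition_by, order_by) %}
    select *
    from (
        select
            *,
            row_number() over (
                partition by {{ partition_by }}
                order by {{ order_by }}
            ) as row_num
        from {{ relation }}
    ) ranked
    where row_num = 1
{% endmacro %}
```

A model would then call it as `{{ deduplicate(ref('stg_orders'), 'order_id', 'updated_at desc') }}`, and every team member deduplicates the same way instead of copy-pasting window functions.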

dbt Is Not an Orchestrator

dbt does not schedule itself. It does not extract data from sources. It does not load data into the warehouse. It only handles the T in ELT: transforming data that is already in the warehouse. In production, dbt runs inside an orchestrator like Airflow, Dagster, or Prefect. The orchestrator handles scheduling, dependency management with non-dbt tasks, and alerting. Understanding this boundary is important for interviews: dbt is a transformation tool, not a pipeline tool.

Core dbt Concepts

The six building blocks of every dbt project.

Models

A model is a SQL SELECT statement saved as a .sql file in the models/ directory. When you run dbt run, dbt compiles each model (resolving Jinja templates), determines the build order from ref() dependencies, and executes the SQL in your warehouse. Each model creates a table or view (configurable). Models are the core building block: staging models clean raw data, intermediate models join and transform, and mart models serve final outputs to analysts. The naming convention matters: stg_ for staging, int_ for intermediate, fct_ for fact tables, dim_ for dimensions.

SELECT
    user_id,
    MIN(created_at) AS first_order_date
FROM {{ ref('stg_orders') }}
GROUP BY user_id

Tests

dbt has two types of tests. Schema tests are declared in YAML and check properties of a column: unique, not_null, accepted_values, and relationships (referential integrity). Custom tests are SQL queries that return rows where the test fails (if the query returns zero rows, the test passes). Running dbt test after dbt run catches data quality issues before downstream consumers see them. Tests are the reason dbt gained traction: they bring software engineering testing practices to SQL transformations.

models/schema.yml:

columns:
  - name: user_id
    tests:
      - unique
      - not_null
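A custom test is just a SELECT that returns the offending rows; if it returns nothing, the test passes. A minimal sketch (the file name, model, and business rule are assumptions for illustration):

```sql
-- tests/assert_no_negative_amounts.sql (hypothetical)
-- dbt runs this file with `dbt test`. Any row returned is a failure,
-- so we select exactly the rows that violate the rule.
select
    order_id,
    amount
from {{ ref('stg_orders') }}
where amount < 0
```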

Sources

Sources define raw data tables that dbt reads from but does not create. You declare sources in YAML files with the database, schema, and table name. Staging models reference sources using {{ source('stripe', 'payments') }} instead of hardcoding table names. This indirection means you can change the source location in one place without updating every model that reads from it. Sources also support freshness checks: dbt can warn you if a source table has not been updated recently.

{{ source('stripe', 'payments') }} in a staging model references the raw payments table
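The matching YAML declaration might look like this (the database, schema, and freshness thresholds are placeholder assumptions; only the `stripe` / `payments` names come from the example above):

```yaml
# models/staging/stripe/_sources.yml (file name is illustrative)
version: 2

sources:
  - name: stripe
    database: raw          # placeholder warehouse database
    schema: stripe         # placeholder schema holding the raw tables
    loaded_at_field: created_at
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: payments
```

With this in place, `dbt source freshness` checks how stale the raw payments table is, and renaming the raw schema means editing one YAML file instead of every staging model.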

Documentation

dbt generates documentation from your project automatically. Model descriptions, column descriptions, and test results are compiled into a static website (dbt docs generate, dbt docs serve). The documentation includes a DAG visualization showing how models depend on each other at the model level (column-level lineage is a dbt Cloud feature, not part of the open-source docs site). For teams, this means documentation stays in sync with the code because it is generated from the same YAML files that define tests and sources.

dbt docs generate && dbt docs serve (opens a browser with full project documentation)

Materializations

Materializations control how dbt builds a model in the warehouse. Table: drops and recreates the table on every run (good for small to medium models). View: creates a SQL view (good for simple transformations, no storage cost, but query time on read). Incremental: appends only new or updated rows (good for large tables where rebuilding is expensive). Ephemeral: inlines the SQL as a CTE in downstream models (no table or view created). Choosing the right materialization is a common interview topic.

{{ config(materialized='incremental', unique_key='order_id') }}
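Putting the config line above into a full incremental model looks roughly like this (the model and column names are illustrative):

```sql
-- models/marts/fct_orders.sql (illustrative)
{{ config(materialized='incremental', unique_key='order_id') }}

select
    order_id,
    user_id,
    amount,
    updated_at
from {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- On incremental runs, only pull rows newer than what is already
  -- built. {{ this }} refers to the existing target table; on the
  -- first run this whole block is skipped and the table is built
  -- from scratch.
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```

Because `unique_key` is set, rows whose `order_id` already exists are updated rather than duplicated.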

Seeds

Seeds are CSV files in your dbt project that dbt loads into your warehouse as tables. Use seeds for small, static reference data: country codes, status mappings, employee lists. Do not use seeds for large datasets or data that changes frequently. Seeds are version-controlled with your dbt project, which means changes to reference data go through the same code review process as model changes.

seeds/country_codes.csv loaded with dbt seed, referenced as {{ ref('country_codes') }}

dbt Project Structure

How a well-organized dbt project looks.

models/staging/

Staging models that clean and rename raw source data. One model per source table.

models/intermediate/

Models that join staging models, apply business logic, and prepare data for marts.

models/marts/

Final models consumed by analysts. Fact and dimension tables, aggregated metrics.

models/schema.yml

YAML files defining tests, documentation, and column descriptions for each model.

macros/

Reusable SQL snippets (Jinja macros) shared across models.

seeds/

CSV files with static reference data loaded into the warehouse.

tests/

Custom SQL tests that go beyond built-in schema tests.

dbt_project.yml

Project configuration: name, version, model paths, materializations, and variables.
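A minimal dbt_project.yml tying the structure above together might look like this (the project and profile names are placeholders):

```yaml
# dbt_project.yml (minimal sketch; names are placeholders)
name: my_project
version: '1.0.0'
profile: my_warehouse      # must match a profile in profiles.yml

model-paths: ["models"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
test-paths: ["tests"]

models:
  my_project:
    staging:
      +materialized: view    # cheap, always fresh
    marts:
      +materialized: table   # fast for BI queries
```

Folder-level `+materialized` defaults like these mean individual models only need a config() block when they deviate from the convention.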

dbt in Data Engineering Interviews

How dbt knowledge shows up in interview rounds.

dbt in system design interviews

When designing a data warehouse or analytics platform in an interview, mentioning dbt shows you understand the modern data stack. Describe the role of dbt in the transformation layer: raw data lands in staging tables (loaded by Fivetran, Airbyte, or custom pipelines), dbt transforms it into clean models, and BI tools query the mart layer. This demonstrates you think about the full pipeline, not just individual queries.

dbt vs stored procedures

Interviewers sometimes ask why teams choose dbt over stored procedures. The answer is version control, testing, documentation, and modularity. dbt projects live in Git, changes go through code review, tests run in CI/CD, and documentation is auto-generated. Stored procedures can do the same transformations, but they lack these software engineering practices. This comparison shows you understand the tradeoffs, not just the tool.

Incremental models and interview questions

Incremental materializations are the most interview-relevant dbt concept. They test your understanding of idempotency, late-arriving data, and the is_incremental() macro. An interviewer might ask: 'How would you handle a fact table with 10 billion rows that gets 5 million new rows per day?' Knowing that dbt incremental models use a unique_key for upserts and a timestamp filter for new rows puts you ahead of candidates who only know full table refreshes.
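A sketch of an answer for that 10-billion-row question, combining the upsert key with a lookback window for late-arriving data (table and column names are hypothetical, and date-arithmetic syntax such as dateadd varies by warehouse):

```sql
-- models/marts/fct_events.sql (hypothetical)
{{ config(materialized='incremental', unique_key='event_id') }}

select
    event_id,
    user_id,
    event_type,
    event_ts
from {{ ref('stg_events') }}

{% if is_incremental() %}
  -- Only scan recent rows, with a 3-day lookback to catch
  -- late-arriving events. Re-processing the overlap window is
  -- safe (idempotent) because unique_key upserts on event_id.
  where event_ts > (select dateadd(day, -3, max(event_ts)) from {{ this }})
{% endif %}
```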

dbt Tutorial FAQ

Do I need to know dbt for data engineering interviews?

It depends on the role. Roles titled 'Analytics Engineer' almost always test dbt knowledge. DE roles at companies using the modern data stack (Snowflake/BigQuery + Fivetran + dbt + Looker/Metabase) frequently ask about dbt. DE roles at companies with custom Python/Spark pipelines may not mention dbt at all. Check the job description. If dbt is listed, expect at least one question about models, tests, or materializations.

What is the difference between dbt Core and dbt Cloud?

dbt Core is the open-source command-line tool. You install it locally, run dbt run and dbt test from the terminal, and manage deployment yourself (usually via Airflow or CI/CD). dbt Cloud is a hosted platform by dbt Labs that provides a web IDE, scheduling, CI/CD, and a documentation portal. For interviews, dbt Core concepts are what matter. dbt Cloud adds convenience but does not change the fundamentals.

Can dbt replace Airflow?

No. dbt handles SQL transformations inside the warehouse. Airflow handles orchestration: scheduling, dependency management across different systems, and running non-SQL tasks (Python scripts, API calls, file transfers). In production, dbt typically runs as a task inside an Airflow DAG. They solve different problems and complement each other.

How long does it take to learn dbt?

If you already know SQL, you can write your first dbt model in 30 minutes. Understanding models, tests, sources, and materializations takes a weekend of focused learning. Becoming comfortable with Jinja macros, incremental models, and project structure takes 1 to 2 weeks of practice. The learning curve is gentle because dbt builds on SQL knowledge you already have.

Ten Years Later, It's Still SELECT Statements

dbt's superpower is making SQL version-controlled and testable. Your superpower is writing the SQL correctly in the first place.
