Before dbt launched in 2016, transformation layers lived inside stored procedures, Airflow PythonOperators calling raw SQL, or hand-rolled Jinja templating stitched together with cron. Tristan Handy's team at Fishtown Analytics looked at that mess and asked a simple question: what if a SELECT statement was a first-class artifact with tests, lineage, and version control baked in? The answer became dbt. Ten years later it's the default transformation layer for nearly every warehouse-native stack. This tutorial walks the concepts in roughly the order they were introduced, so you understand why each one had to exist.
[Stat highlights from the original page: dbt first release; SQL share of data engineering interviews; core concepts to master; SQL problems you can run. Source: DataDriven analysis of 1,042 verified data engineering interview rounds.]
Understanding what dbt replaced tells you what it is. Before 2016, transformation logic lived in three painful places. dbt collapsed all three into SELECT statements with lineage and tests.
dbt (data build tool) runs SQL transformations directly in your data warehouse. Unlike traditional ETL tools that extract data, transform it in an external system, and load it back, dbt follows the ELT pattern: your data is already in the warehouse, and dbt transforms it in place. This means you write standard SQL (with some Jinja templating), and dbt handles compiling it, running it in the right order, and managing dependencies. You do not need to learn a new language or framework. If you know SQL, you can use dbt.
dbt models are SQL SELECT statements with optional Jinja templating. Jinja lets you add variables, conditionals, loops, and macros to your SQL. A simple example: {{ ref('stg_orders') }} references another dbt model and automatically creates a dependency. dbt figures out the build order from these references. Macros let you write reusable SQL snippets: a date_spine macro that generates a calendar table, a deduplicate macro that removes duplicates using a standard pattern. This is how teams standardize their SQL without copying and pasting.
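A deduplicate macro of the kind mentioned above might look like the sketch below. The macro body and parameter names are illustrative, not dbt's shipped code (the dbt-utils package ships its own deduplicate macro with a different signature):

```sql
-- macros/deduplicate.sql (a sketch; names are illustrative)
-- Keeps the most recent row per key using ROW_NUMBER().
{% macro deduplicate(relation, partition_by, order_by) %}
    SELECT *
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY {{ partition_by }}
                   ORDER BY {{ order_by }} DESC
               ) AS row_num
        FROM {{ relation }}
    ) AS deduped
    WHERE row_num = 1
{% endmacro %}
```

A model could then call {{ deduplicate(ref('stg_orders'), 'order_id', 'updated_at') }} instead of repeating the window-function pattern by hand.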
dbt does not schedule itself. It does not extract data from sources. It does not load data into the warehouse. It only handles the T in ELT: transforming data that is already in the warehouse. In production, dbt runs inside an orchestrator like Airflow, Dagster, or Prefect. The orchestrator handles scheduling, dependency management with non-dbt tasks, and alerting. Understanding this boundary is important for interviews: dbt is a transformation tool, not a pipeline tool.
The six building blocks of every dbt project.
A model is a SQL SELECT statement saved as a .sql file in the models/ directory. When you run dbt run, dbt compiles each model (resolving Jinja templates), determines the build order from ref() dependencies, and executes the SQL in your warehouse. Each model creates a table or view (configurable). Models are the core building block: staging models clean raw data, intermediate models join and transform, and mart models serve final outputs to analysts. The naming convention matters: stg_ for staging, int_ for intermediate, fct_ for fact tables, dim_ for dimensions.
```sql
SELECT user_id, MIN(created_at) AS first_order_date
FROM {{ ref('stg_orders') }}
GROUP BY user_id
```

dbt has two types of tests. Schema tests are declared in YAML and check properties of a column: unique, not_null, accepted_values, and relationships (referential integrity). Custom tests are SQL queries that return rows where the test fails (if the query returns zero rows, the test passes). Running dbt test after dbt run catches data quality issues before downstream consumers see them. Tests are the reason dbt gained traction: they bring software engineering testing practices to SQL transformations.
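A custom test of the kind described above is just a SQL file in the tests/ directory; this sketch assumes a stg_orders model with an amount column (both hypothetical):

```sql
-- tests/assert_no_negative_amounts.sql (illustrative)
-- Returns the failing rows; if this query returns zero rows, the test passes.
SELECT order_id, amount
FROM {{ ref('stg_orders') }}
WHERE amount < 0
```

dbt test picks this file up automatically alongside the schema tests declared in YAML.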
```yaml
# models/schema.yml (model name first_orders is illustrative)
version: 2
models:
  - name: first_orders
    columns:
      - name: user_id
        tests: [unique, not_null]
```

Sources define raw data tables that dbt reads from but does not create. You declare sources in YAML files with the database, schema, and table name. Staging models reference sources using {{ source('stripe', 'payments') }} instead of hardcoding table names. This indirection means you can change the source location in one place without updating every model that reads from it. Sources also support freshness checks: dbt can warn you if a source table has not been updated recently.
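The source declaration behind that reference might look like this; the database, schema, and loaded_at_field values are assumptions for the sketch:

```yaml
# models/sources.yml (database, schema, and _loaded_at column are assumed)
version: 2
sources:
  - name: stripe
    database: raw
    schema: stripe
    tables:
      - name: payments
        loaded_at_field: _loaded_at   # timestamp column used for freshness checks
        freshness:
          warn_after: {count: 24, period: hour}
```

With this in place, dbt source freshness warns when the payments table goes more than 24 hours without new data.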
{{ source('stripe', 'payments') }} in a staging model references the raw payments table.

dbt generates documentation from your project automatically. Model descriptions, column descriptions, and test results are compiled into a static website (dbt docs generate, dbt docs serve). The documentation includes a DAG visualization showing how models depend on each other, so you can trace where each model's data originates. For teams, this means documentation stays in sync with the code because it is generated from the same YAML files that define tests and sources.
dbt docs generate && dbt docs serve (opens a browser with full project documentation)

Materializations control how dbt builds a model in the warehouse. Table: drops and recreates the table on every run (good for small to medium models). View: creates a SQL view (good for simple transformations; no storage cost, but the query runs every time the view is read). Incremental: inserts or upserts only new and changed rows (good for large tables where rebuilding is expensive). Ephemeral: inlines the SQL as a CTE in downstream models (no table or view created). Choosing the right materialization is a common interview topic.
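Materializations can also be set per folder in dbt_project.yml instead of per model; a sketch, assuming a project named my_project:

```yaml
# dbt_project.yml excerpt (project name my_project is an assumption)
models:
  my_project:
    staging:
      +materialized: view    # cheap to rebuild, no storage cost
    marts:
      +materialized: table   # fast for BI reads
```

Individual models can still override the folder default with an inline config() block.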
{{ config(materialized='incremental', unique_key='order_id') }}

Seeds are CSV files in your dbt project that dbt loads into your warehouse as tables. Use seeds for small, static reference data: country codes, status mappings, employee lists. Do not use seeds for large datasets or data that changes frequently. Seeds are version-controlled with your dbt project, which means changes to reference data go through the same code review process as model changes.
seeds/country_codes.csv loaded with dbt seed, referenced as {{ ref('country_codes') }}

How a well-organized dbt project looks.
models/staging/: Staging models that clean and rename raw source data. One model per source table.
models/intermediate/: Models that join staging models, apply business logic, and prepare data for marts.
models/marts/: Final models consumed by analysts. Fact and dimension tables, aggregated metrics.
models/schema.yml: YAML files defining tests, documentation, and column descriptions for each model.
macros/: Reusable SQL snippets (Jinja macros) shared across models.
seeds/: CSV files with static reference data loaded into the warehouse.
tests/: Custom SQL tests that go beyond built-in schema tests.
dbt_project.yml: Project configuration: name, version, model paths, materializations, and variables.
How dbt knowledge shows up in interview rounds.
When designing a data warehouse or analytics platform in an interview, mentioning dbt shows you understand the modern data stack. Describe the role of dbt in the transformation layer: raw data lands in staging tables (loaded by Fivetran, Airbyte, or custom pipelines), dbt transforms it into clean models, and BI tools query the mart layer. This demonstrates you think about the full pipeline, not just individual queries.
Interviewers sometimes ask why teams choose dbt over stored procedures. The answer is version control, testing, documentation, and modularity. dbt projects live in Git, changes go through code review, tests run in CI/CD, and documentation is auto-generated. Stored procedures can do the same transformations, but they lack these software engineering practices. This comparison shows you understand the tradeoffs, not just the tool.
Incremental materializations are the most interview-relevant dbt concept. They test your understanding of idempotency, late-arriving data, and the is_incremental() macro. An interviewer might ask: 'How would you handle a fact table with 10 billion rows that gets 5 million new rows per day?' Knowing that dbt incremental models use a unique_key for upserts and a timestamp filter for new rows puts you ahead of candidates who only know full table refreshes.
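The pattern that question points at can be sketched as a single incremental model; the model and column names here are illustrative:

```sql
-- models/fct_orders.sql (names are illustrative)
{{ config(materialized='incremental', unique_key='order_id') }}

SELECT order_id, user_id, amount, updated_at
FROM {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- On incremental runs, only scan rows newer than the newest row
  -- already in the target table; {{ this }} refers to that table.
  WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
```

On the first run the filter is skipped and the table is built in full; afterward, unique_key tells dbt to upsert matching rows rather than append duplicates. For late-arriving data, a common refinement is to widen the filter with a lookback window (for example, MAX(updated_at) minus a few days) so delayed rows are still picked up.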
dbt's superpower is making SQL version-controlled and testable. Your superpower is writing the SQL correctly in the first place.