Data Engineering Interview Prep

dbt Interview Questions for Data Engineers (2026)

dbt has become the standard transformation layer in the modern data stack. Interviewers test whether you understand materializations, incremental models, testing strategies, and how to structure a dbt project that scales. These are the questions that come up.

Covers dbt Core 1.7+, dbt Cloud, and the patterns that separate production-quality dbt projects from tutorial-level code.

What Interviewers Expect

dbt questions test whether you have built and maintained a real dbt project or just run through the tutorial. The depth of your answers about incremental models, testing, and project structure reveals your experience level immediately.

Junior candidates should explain what dbt does, how ref() works, and the difference between table and view materializations. You should be able to write a basic model and schema test.

Mid-level candidates need to design incremental models with proper merge strategies, implement snapshot SCDs, write custom tests, and explain project structure conventions (staging/intermediate/marts).

Senior candidates must discuss CI/CD for dbt, cross-project dependencies (dbt Mesh), performance optimization for large models, and how to manage schema evolution across environments.

Core Concepts Interviewers Test

Models, Sources, and ref()

Models are SQL SELECT statements that dbt materializes into tables or views. Sources are external tables declared in YAML that dbt does not manage. ref() creates dependencies between models and enables dbt to build the DAG. Interviewers test whether you understand that ref() is not just a convenience function: it controls execution order and enables environment-aware schema resolution.
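A minimal sketch of the distinction, using illustrative source and model names (`raw`, `stg_customers`, `dim_customers` are placeholders):

```sql
-- models/staging/stg_customers.sql
-- source() points at an external table that dbt does NOT build
SELECT
    id AS customer_id,
    created_at
FROM {{ source('raw', 'customers') }}

-- models/marts/dim_customers.sql (separate file)
-- ref() creates a DAG edge: dbt builds stg_customers before this model,
-- and resolves the schema per environment (dev vs prod)
SELECT
    customer_id,
    created_at
FROM {{ ref('stg_customers') }}
```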

Materializations

Table: full rebuild every run. View: no data stored, query runs on read. Incremental: appends or merges only new/changed rows. Ephemeral: inlined as a CTE, no database object created. Interviewers ask when each is appropriate. The answer depends on data volume, query frequency, freshness requirements, and compute cost.
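Materializations can be set per model in a `config()` block or per directory in `dbt_project.yml`. A common convention, shown here with a hypothetical project name, is views for staging and tables for marts:

```yaml
# dbt_project.yml (my_project is a placeholder)
models:
  my_project:
    staging:
      +materialized: view    # cheap to build, query runs at read time
    marts:
      +materialized: table   # expensive to build, cheap to query
```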

Incremental Models

Incremental models process only new or changed data by filtering with an is_incremental() block. The merge strategy determines how new rows interact with existing ones: append, merge (upsert), delete+insert, or insert_overwrite. Interviewers love incremental models because they expose whether you understand idempotency, late-arriving data, and the tradeoff between efficiency and correctness.

Snapshots (Slowly Changing Dimensions)

Snapshots track how source data changes over time by maintaining valid_from and valid_to timestamps. dbt supports timestamp and check strategies for detecting changes. Interviewers ask about SCD Type 2 implementation: how dbt closes old records and opens new ones when a tracked column changes.

Testing and Data Quality

dbt tests are assertions about your data: unique, not_null, accepted_values, and relationships are built-in. Custom tests are SQL queries that return rows that fail the assertion. Interviewers test whether you build testing into your pipeline or treat it as an afterthought. Strong candidates describe a testing strategy that catches bad data before it reaches dashboards.
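A representative schema file wiring up the built-in tests (model and column names are illustrative):

```yaml
# models/marts/schema.yml
version: 2
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned']
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customers')
              field: customer_id
```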

Jinja and Macros

dbt uses Jinja templating for dynamic SQL: loops, conditionals, and reusable macros. Interviewers ask you to write a macro that generates SQL dynamically (pivot columns, environment-specific logic, or DRY patterns). The key is knowing when Jinja adds clarity versus when it makes SQL unreadable.

dbt Interview Questions with Guidance

Q1

Explain the difference between table, view, incremental, and ephemeral materializations. Give a use case for each.

A strong answer includes:

Table: dimension tables or aggregations queried frequently; full rebuild is acceptable because data volume is manageable. View: lightweight transformations where you want the freshest data and query latency is not critical. Incremental: large fact tables where rebuilding from scratch takes too long; only new or changed rows are processed. Ephemeral: staging transformations that are referenced by multiple downstream models but do not need their own database object. A strong answer discusses the compute cost tradeoff: tables are expensive to build but cheap to query; views are cheap to build but expensive to query.

Q2

Walk through how you would build an incremental model for a fact_orders table that receives 10M rows per day.

A strong answer includes:

Define the model with materialized='incremental' and unique_key='order_id'. In the is_incremental() block, filter for rows where updated_at > the max updated_at in the existing table. Use the merge strategy for upserts so that updated orders replace old versions. Handle late-arriving data by subtracting a lookback window (e.g., 3 days) from the filter. A strong answer discusses what happens on the first run (full load), how to handle schema changes (full refresh), and the tradeoff between lookback window size and processing cost.

Q3

What is the purpose of dbt tests? Describe your testing strategy for a production dbt project.

A strong answer includes:

dbt tests are data quality assertions that run after model builds. Strategy: every model gets unique and not_null tests on primary keys. Foreign key relationships are tested with the relationships test. Business logic is tested with custom tests (e.g., revenue must be positive, date ranges must not overlap). Use dbt-expectations or dbt-utils for advanced patterns. Run tests in CI before merging PRs. A strong answer mentions test severity levels (warn vs error), test selection with tags, and the difference between schema tests and data tests.
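A singular (custom) test is just a SELECT that returns the failing rows; zero rows means the test passes. A sketch, assuming a `fct_orders` model with a `total_amount` column:

```sql
-- tests/assert_revenue_is_positive.sql
-- Downgrade to a warning instead of a hard failure if desired
{{ config(severity='warn') }}

SELECT
    order_id,
    total_amount
FROM {{ ref('fct_orders') }}
WHERE total_amount < 0
```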

Q4

How do you handle schema changes in dbt incremental models?

A strong answer includes:

By default, dbt does not add new columns to incremental models. Set on_schema_change to 'append_new_columns' to add new columns with null backfill, 'sync_all_columns' to add new columns and drop removed ones, or 'fail' to halt on schema change. A full refresh (dbt run with the full-refresh flag) rebuilds from scratch and picks up all schema changes. A strong answer discusses the operational impact: who monitors for schema changes, how breaking changes in source systems propagate through the DAG, and why a full refresh is sometimes the safest option.
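The behavior is controlled per model. A minimal sketch of the config, plus the full-refresh escape hatch:

```sql
-- models/marts/fct_orders.sql (illustrative)
{{
  config(
    materialized='incremental',
    unique_key='order_id',
    -- alternatives: 'sync_all_columns', 'fail', 'ignore' (the default)
    on_schema_change='append_new_columns'
  )
}}

-- To rebuild from scratch and pick up all schema changes:
--   dbt run --select fct_orders --full-refresh
```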

Q5

Explain dbt sources and the source() function. How do they differ from ref()?

A strong answer includes:

Sources are external tables not managed by dbt, declared in YAML with database, schema, and table names. source() references them and enables freshness checks (dbt source freshness). ref() references dbt-managed models and creates build dependencies. The distinction matters because dbt can rebuild models but cannot rebuild sources. A strong answer mentions source freshness as an alerting mechanism: if a source has not been updated in the expected window, downstream models should not run on stale data.
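A sketch of a source declaration with freshness thresholds (database, schema, and table names are placeholders):

```yaml
# models/staging/sources.yml
version: 2
sources:
  - name: raw
    database: raw_db
    schema: public
    loaded_at_field: _loaded_at
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders
      - name: customers
```

Running `dbt source freshness` then compares `_loaded_at` against these thresholds and fails the check when data is stale.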

Q6

Write a Jinja macro that generates a CASE statement to map status codes to human-readable labels.

A strong answer includes:

A strong answer writes a macro that accepts a column name and a dictionary of mappings, then generates CASE WHEN {{column}} = 'A' THEN 'Active' for each entry. The macro should handle a default/else case. Bonus: parameterize the else value. The key insight interviewers look for is whether you use Jinja for DRY patterns (this mapping used in 10 models) versus overengineering simple SQL. If the mapping is used in one model, inline SQL is better than a macro.
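One way to write that macro (names are illustrative), looping over a dictionary and parameterizing the ELSE value:

```sql
-- macros/map_status.sql
{% macro map_status(column_name, mapping, default='Unknown') %}
    case
    {% for code, label in mapping.items() %}
        when {{ column_name }} = '{{ code }}' then '{{ label }}'
    {% endfor %}
        else '{{ default }}'
    end
{% endmacro %}

-- Usage in a model:
-- select {{ map_status('status_code', {'A': 'Active', 'I': 'Inactive'}) }} as status_label
```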

Q7

How do you structure a dbt project? Describe your directory organization and naming conventions.

A strong answer includes:

Standard structure: staging/ (one-to-one with source tables, light renaming and type casting), intermediate/ (business logic joins and aggregations), marts/ (final models consumed by BI tools). Each layer has its own schema.yml for documentation and tests. Naming convention: stg_<source>__<table>, int_<entity>_<verb>, fct_<entity> for facts, dim_<entity> for dimensions. A strong answer mentions using dbt packages (dbt_utils, codegen), custom schema macros for environment routing, and the importance of keeping staging models thin.
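A typical layout under these conventions (file names are illustrative):

```
models/
├── staging/
│   ├── stg_shopify__orders.sql
│   ├── stg_shopify__customers.sql
│   └── schema.yml
├── intermediate/
│   └── int_orders_enriched.sql
└── marts/
    ├── fct_orders.sql
    ├── dim_customers.sql
    └── schema.yml
```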

Q8

What is the difference between dbt Core and dbt Cloud? When would you choose each?

A strong answer includes:

dbt Core is the open-source CLI. You manage scheduling (Airflow, cron), environments, and CI/CD yourself. dbt Cloud is a managed SaaS: it provides a web IDE, job scheduling, CI on PR, documentation hosting, and environment management. Choose Core when you have existing orchestration (Airflow), want full control, or have budget constraints. Choose Cloud when you want faster onboarding, built-in CI, and managed infrastructure. A strong answer notes that dbt Cloud's semantic layer and dbt Mesh (cross-project references) are Cloud-only features that matter for large organizations.

Q9

How do you implement CI/CD for a dbt project?

A strong answer includes:

On every pull request: run dbt build (models + tests) against a CI schema using a slim CI approach (only modified models and their downstream dependencies, identified by dbt ls with the state:modified+ selector). On merge to main: deploy to production by running the full build or changed models only. Use defer to reference production models for unmodified upstream dependencies. A strong answer mentions state comparison (manifest.json diffing), the importance of deterministic environments (pinned dbt and package versions), and cost management for CI runs against large warehouses.
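A sketch of the slim CI commands, assuming production artifacts (`manifest.json`) have been downloaded to a local `./prod-artifacts` directory; the artifact storage and job wiring are up to your CI system:

```shell
# On a pull request: build only modified models and their downstream
# dependents, deferring unmodified upstream refs to production objects
dbt build --select state:modified+ --defer --state ./prod-artifacts

# On merge to main: full production deployment
dbt build --target prod
```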

Q10

Explain dbt snapshots. How would you implement SCD Type 2 for a customer dimension table?

A strong answer includes:

Configure a snapshot with strategy='timestamp' using the updated_at column, or strategy='check' specifying which columns to monitor. dbt adds dbt_valid_from and dbt_valid_to columns. When a tracked column changes, dbt closes the old record (sets dbt_valid_to) and inserts a new record (dbt_valid_from = now, dbt_valid_to = null). To query the current state: WHERE dbt_valid_to IS NULL. To query historical state: WHERE some_date BETWEEN dbt_valid_from AND COALESCE(dbt_valid_to, '9999-12-31'). A strong answer discusses snapshot scheduling frequency and the risk of missing intermediate state changes.
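A sketch of the timestamp-strategy snapshot (source and column names are illustrative):

```sql
-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}

{{
  config(
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='timestamp',
    updated_at='updated_at'
  )
}}

SELECT * FROM {{ source('raw', 'customers') }}

{% endsnapshot %}

-- Current state of each customer:
--   SELECT * FROM snapshots.customers_snapshot WHERE dbt_valid_to IS NULL
```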

Worked Example: Incremental Model with Lookback

This is the pattern interviewers expect when they ask you to build an incremental model that handles late-arriving data.

-- models/marts/fct_orders.sql
{{
  config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge',
    on_schema_change='append_new_columns'
  )
}}

SELECT
    order_id,
    customer_id,
    order_date,
    total_amount,
    status,
    updated_at
FROM {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- Lookback window handles late-arriving updates.
  -- DATEADD is Snowflake/SQL Server syntax; BigQuery uses DATE_SUB,
  -- Postgres uses interval arithmetic.
  WHERE updated_at > (
    SELECT DATEADD(day, -3, MAX(updated_at))
    FROM {{ this }}
  )
{% endif %}

The 3-day lookback window means dbt reprocesses the last 3 days of data on every run. This catches orders that were updated after initial load. The merge strategy swaps old rows for updated ones by matching on order_id. The tradeoff: processing 3 extra days of data versus missing late updates.

Common Mistakes in dbt Interviews

Building incremental models without a unique_key, resulting in duplicate rows on every run instead of upserts

Using ref() inside Jinja loops without understanding that it creates dependencies the DAG must resolve

Skipping tests on staging models because 'the data comes from a trusted source' and then discovering bad data in production marts

Writing complex Jinja macros when a simple SQL CASE statement would be clearer and easier to maintain

Not using dbt source freshness checks, allowing pipelines to run on stale data without alerting

Putting business logic in staging models instead of keeping them as thin, one-to-one mappings of source tables

dbt Interview Questions FAQ

How common are dbt questions in data engineering interviews?

Very common for roles at companies using the modern data stack (Snowflake/BigQuery + dbt + Airflow/Dagster). If the job description mentions dbt, expect 3 to 5 questions. Even if it does not mention dbt specifically, knowing it demonstrates familiarity with analytics engineering best practices.

Do I need to know SQL well to answer dbt interview questions?

Yes. dbt is SQL-first. Every model is a SELECT statement. If you struggle with window functions, CTEs, or join semantics, dbt questions will expose that. Practice SQL fundamentals first, then layer dbt concepts on top.

Should I learn dbt Core or dbt Cloud for interview prep?

Learn dbt Core first. It teaches you the fundamentals without the managed service abstractions. Once you understand profiles.yml, dbt_project.yml, model files, and the CLI commands, dbt Cloud concepts are easy to learn. Most interview questions are about dbt concepts, not the specific interface.

What dbt packages should I know about for interviews?

Know dbt_utils (surrogate_key, pivot, unpivot, star), dbt_expectations (Great Expectations-style tests), and codegen (auto-generate model and source YAML). Mentioning packages shows you understand the ecosystem and do not reinvent common patterns.

Practice dbt and SQL Interview Questions

dbt is SQL. Sharpen your SQL skills with real challenges and build the transformation expertise interviewers look for.