The complete study guide for the Databricks Data Engineer Associate certification. Exam breakdown, study plan, practice questions, and the real overlap between cert topics and data engineering interview questions.
Key numbers for the Databricks Certified Data Engineer Associate exam. Know these before you start studying.
Questions: 45, multiple choice
Duration: 90 minutes, proctored
Passing score: ~70%, scaled scoring
Cost: $200 USD per attempt
Validity: 2 years, then recertify
Format: remote, online proctored
Weight your study time proportionally. The domain percentages tell you exactly how many questions to expect from each area.
ELT with Spark SQL and Python: Spark SQL queries, Python DataFrame transformations, reading/writing data, ELT patterns with notebooks and jobs. This is the heaviest section because it tests hands-on engineering skills: writing correct SQL against Delta tables, understanding lazy evaluation, and knowing when to use SQL vs Python for a transformation.
Databricks Lakehouse Platform: Workspace architecture, cluster types, repos, notebooks, DBFS, Unity Catalog basics. Covers the Databricks runtime environment, job clusters vs all-purpose clusters, and how the platform layers compute over cloud storage. Expect questions about when to use SQL warehouses vs interactive clusters.
Data Governance: Unity Catalog, data access controls, row/column-level security, data discovery, and lineage. Tests whether you understand three-level namespaces (catalog.schema.table), managing permissions with GRANT statements, and how Unity Catalog tracks column-level lineage automatically.
Incremental Data Processing: Structured Streaming, Auto Loader, COPY INTO, trigger modes, watermarking, and checkpointing. Focuses on the practical patterns for ingesting data incrementally rather than full-reload: when to use Auto Loader vs COPY INTO, how checkpointing enables exactly-once semantics, and streaming trigger intervals.
Production Pipelines: Delta Live Tables (DLT), Workflows/Jobs, multi-task jobs, error handling, monitoring. Tests your understanding of orchestrating production workloads: defining expectations in DLT, setting up retry policies, and configuring alerts for pipeline failures.
These ten concepts appear repeatedly across exam domains. Deep understanding of each is non-negotiable for passing.
Delta Lake provides ACID guarantees on top of cloud object storage through a transaction log (_delta_log). Every write creates a new JSON commit file. Readers never see partial writes. This is the foundation of the Lakehouse and appears across multiple exam domains.
Query previous versions of a Delta table using VERSION AS OF or TIMESTAMP AS OF. Used for auditing, rollbacks, and reproducing ML training datasets. The exam tests syntax and practical scenarios like recovering accidentally deleted data.
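The two time-travel forms, sketched against a hypothetical Delta table named `orders`:

```sql
-- Read the table as it existed at an earlier version
SELECT * FROM orders VERSION AS OF 12;

-- Or as of a point in time
SELECT * FROM orders TIMESTAMP AS OF '2024-01-15';

-- Roll the table back to that version (writes a new commit; history is preserved)
RESTORE TABLE orders TO VERSION AS OF 12;
```

`DESCRIBE HISTORY orders` lists the available versions and their timestamps.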
The workhorse of upsert patterns. Matches source rows to target rows and executes INSERT, UPDATE, or DELETE in a single atomic operation. Know the syntax cold: MERGE INTO target USING source ON condition WHEN MATCHED THEN UPDATE WHEN NOT MATCHED THEN INSERT.
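As a sketch, an upsert against hypothetical `customer_dim` and `daily_extract` tables:

```sql
MERGE INTO customer_dim AS t
USING daily_extract AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET *      -- overwrite all columns from the matching source row
WHEN NOT MATCHED THEN
  INSERT *;         -- insert source rows with no matching target key
```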
Z-ordering colocates related data in the same files for faster reads on filtered queries. Liquid clustering is the newer, automatic replacement that adapts to query patterns. Know the difference: Z-ordering requires OPTIMIZE, liquid clustering is incremental.
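Side by side, using a hypothetical `events` table:

```sql
-- Z-ordering: an explicit maintenance operation you must run or schedule
OPTIMIZE events ZORDER BY (region);

-- Liquid clustering: declared once on the table, then maintained incrementally
ALTER TABLE events CLUSTER BY (region);
```

The two are mutually exclusive on a given table: it uses either partitioning with Z-ordering or liquid clustering, not both.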
Incrementally ingests new files from cloud storage using file notification or directory listing mode. Handles schema evolution automatically. The exam contrasts Auto Loader with COPY INTO: Auto Loader scales better for high-volume, continuous ingestion.
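In SQL, Auto Loader-style incremental ingestion is typically expressed as a streaming table reading with `read_files`; the table name and source path below are hypothetical:

```sql
CREATE OR REFRESH STREAMING TABLE bronze_events
AS SELECT *
FROM STREAM read_files(
  '/Volumes/raw/landing/events/',  -- hypothetical source directory
  format => 'json'
);
```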
Centralized governance layer for data and AI assets. Provides three-level namespace (catalog.schema.table), fine-grained access control, automated lineage, and data sharing. The exam tests GRANT/REVOKE syntax and understanding of securable objects.
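A sketch of the GRANT pattern, assuming a hypothetical `main.sales.orders` table and an `analysts` group:

```sql
-- Access requires privileges at every level of the three-level namespace
GRANT USE CATALOG ON CATALOG main TO `analysts`;
GRANT USE SCHEMA ON SCHEMA main.sales TO `analysts`;
GRANT SELECT ON TABLE main.sales.orders TO `analysts`;

-- Revocation mirrors it
REVOKE SELECT ON TABLE main.sales.orders FROM `analysts`;
```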
Databricks Workflows orchestrate multi-task jobs with dependency graphs. Tasks can be notebooks, Python scripts, dbt models, or DLT pipelines. Know job cluster vs all-purpose cluster cost implications and how to configure retries.
Spark Structured Streaming treats a stream as an unbounded DataFrame. Key concepts: triggers (availableNow, processingTime), output modes (append, complete, update), watermarking for late data, and checkpointing for fault tolerance.
Declarative framework for building ETL pipelines. Define tables with @dlt.table decorators and quality expectations with @dlt.expect. DLT manages orchestration, error handling, and lineage. Know the difference between streaming and materialized views in DLT.
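The same expectation pattern in DLT's SQL dialect, with hypothetical table names:

```sql
CREATE OR REFRESH STREAMING TABLE silver_orders (
  -- Drop rows that violate the rule instead of failing the pipeline
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM(LIVE.bronze_orders);
```

Omitting the `ON VIOLATION` clause records violations without dropping rows; `ON VIOLATION FAIL UPDATE` aborts the update instead.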
The Bronze (raw), Silver (cleaned), Gold (business-level) layering pattern. Not Databricks-specific, but central to the certification. Know when to apply each layer: Bronze preserves raw fidelity, Silver enforces schema and deduplication, Gold aggregates for consumption.
A structured timeline for candidates with prior data engineering experience. Allocate 1 to 2 hours daily for the best results.
The best reason to pursue this cert: most of what you study maps directly to questions asked at top data engineering interviews. This is not a vanity credential.
Scenario-based questions matching the exam format. Each includes guidance on the reasoning behind the correct approach.
A data engineer needs to ingest JSON files that arrive continuously in a cloud storage directory. The schema of these files occasionally changes with new fields added. Which approach best handles this requirement?
Auto Loader with schema evolution enabled (cloudFiles.schemaEvolutionMode = addNewColumns). COPY INTO does not natively handle schema evolution. Manual schema definition breaks when new fields appear. Auto Loader also scales better for continuous ingestion by tracking which files have already been processed via checkpointing.
A pipeline appends data to a Delta table daily. After a bad upstream push, 50,000 incorrect rows were written yesterday. The team needs to restore the table to its state before the bad write. What is the most efficient approach?
Use Delta Lake time travel: RESTORE TABLE table_name TO VERSION AS OF <version before the bad write>. This is an atomic operation that creates a new version pointing to the old data files. Alternatives like DELETE with filters work but are slower and require knowing exactly which rows are bad.
A team wants to implement SCD Type 1 (overwrite with latest value) for a customer dimension table. Source data arrives as daily CSV extracts. Which SQL pattern is most appropriate?
MERGE INTO customer_dim USING daily_extract ON customer_dim.customer_id = daily_extract.customer_id WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *. This handles both updates to existing customers and inserts for new customers in one atomic operation. The exam expects you to distinguish SCD Type 1 (overwrite) from Type 2 (add new row with versioning).
A Structured Streaming job reads from a Kafka topic and writes to a Delta table. The job must process all available data once per hour rather than continuously. Which trigger configuration should be used?
Use trigger(availableNow=True) in a scheduled job that runs hourly. This processes all available data since the last checkpoint, then stops. trigger(processingTime='1 hour') keeps the cluster running between batches, wasting resources. trigger(once=True) is deprecated in favor of availableNow, which also handles multiple batches more efficiently.
A data engineer is creating a Delta Live Tables pipeline. One table must have no null values in the order_id column, and rows violating this rule should be quarantined rather than dropped or failing the pipeline. How should this be implemented?
Use @dlt.expect_or_drop('valid_order_id', 'order_id IS NOT NULL') to drop invalid rows, then capture quarantined rows with a separate DLT table reading from the event log. Alternatively, @dlt.expect alone tracks violations without dropping. The key distinction: expect (warn), expect_or_drop (filter), expect_or_fail (abort).
A query on a large Delta table filters by region and date columns. The table is partitioned by date but queries are still slow when filtering by region. What optimization would improve read performance without repartitioning?
Apply Z-ordering on the region column: OPTIMIZE table_name ZORDER BY (region). Z-ordering colocates rows with similar region values in the same data files, allowing data skipping to prune more files. If the table supports it, liquid clustering (CLUSTER BY) is the modern alternative that optimizes incrementally during writes.
A workspace has three teams that need different levels of access to a shared dataset. Team A needs full access, Team B needs read-only access, and Team C should only see aggregated views, not raw data. How should this be configured in Unity Catalog?
Grant ALL PRIVILEGES on the table to Team A, SELECT to Team B, and create an aggregated view with SELECT access for Team C. Unity Catalog enforces access at the view level, so Team C sees only the aggregated output. Row-level and column-level security through dynamic views add further granularity. The exam tests GRANT syntax and securable object hierarchy.
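A sketch of the Team C setup, with hypothetical catalog, schema, and group names:

```sql
-- Team C gets SELECT on an aggregated view only, never on the base table
CREATE VIEW main.sales.regional_totals AS
SELECT region, SUM(amount) AS total_amount
FROM main.sales.transactions
GROUP BY region;

GRANT SELECT ON VIEW main.sales.regional_totals TO `team_c`;

-- A dynamic view adds row-level filtering based on group membership
CREATE VIEW main.sales.transactions_filtered AS
SELECT * FROM main.sales.transactions
WHERE is_account_group_member('team_a')  -- Team A members see every row
   OR region = 'EMEA';                   -- everyone else sees EMEA rows only
```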
A data engineer is building an ELT pipeline that reads JSON data, flattens nested arrays, and writes to a Delta table. The JSON contains an array field called items. Which Spark SQL function is most appropriate for flattening?
Use EXPLODE(items) to create one row per array element. For arrays of structs, EXPLODE followed by dot notation (items.field_name) extracts nested fields. LATERAL VIEW EXPLODE is the Hive-style syntax. The exam also tests POSEXPLODE (includes array index) and handling of null arrays (EXPLODE_OUTER to preserve parent rows).
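A sketch against a hypothetical `orders` table where `items` is `ARRAY<STRUCT<sku STRING, qty INT>>`:

```sql
-- One output row per array element
SELECT order_id, item.sku, item.qty
FROM orders
LATERAL VIEW EXPLODE(items) exploded AS item;

-- OUTER keeps orders whose items array is NULL or empty (item comes back NULL)
SELECT order_id, item.sku
FROM orders
LATERAL VIEW OUTER EXPLODE(items) exploded AS item;
```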
A production job using an all-purpose cluster costs significantly more than expected. The job runs on a schedule three times per day and does not require interactive access. What change would reduce costs?
Switch from an all-purpose cluster to a job cluster. Job clusters are created for the job run and terminated after completion, at a DBU rate roughly two to four times lower than all-purpose compute. All-purpose clusters stay running and are designed for interactive development. The exam frequently tests cost optimization through cluster type selection.
A pipeline needs to process data from a source table that receives both inserts and updates. The pipeline should only process rows that changed since the last run. Which feature of Delta Lake enables this efficiently?
Use Change Data Feed (CDF) on the source table. Enable with ALTER TABLE SET TBLPROPERTIES (delta.enableChangeDataFeed = true). Then read changes with spark.readStream.option('readChangeFeed', 'true'). CDF provides _change_type (insert, update_preimage, update_postimage, delete) for each changed row, enabling efficient incremental processing.
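The same flow as a sketch, with a hypothetical `source_tbl`:

```sql
-- Turn on CDF (applies to commits from this point forward)
ALTER TABLE source_tbl SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

-- Batch read of all changes starting at version 5
SELECT * FROM table_changes('source_tbl', 5);
```

Each returned row carries `_change_type`, `_commit_version`, and `_commit_timestamp` columns alongside the table's own columns.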
Confusing Auto Loader with COPY INTO
Auto Loader uses file notification or directory listing with checkpointing for continuous ingestion. COPY INTO is a SQL command that reprocesses files idempotently. Auto Loader scales better for large, continuous workloads. COPY INTO is simpler for periodic, smaller batches.
Using all-purpose clusters for production jobs
Job clusters run at a DBU rate roughly two to four times lower than all-purpose compute and are purpose-built for automated workloads. All-purpose clusters are for interactive development. The exam tests this cost distinction directly.
Memorizing syntax without understanding when to use each tool
The exam is scenario-based. Knowing MERGE syntax is not enough. You need to know when MERGE is the right choice vs INSERT OVERWRITE, when Auto Loader beats COPY INTO, and when DLT is preferable to a notebook job.
Ignoring the Data Governance domain
At 19% of the exam, governance is not optional. Unity Catalog, GRANT/REVOKE, three-level namespaces, and data access policies are heavily tested. Many candidates underweight this domain and lose easy points.
Studying only SQL and skipping Python
The ELT domain (29%) tests both Spark SQL and Python. You need to be comfortable with PySpark DataFrame operations, UDFs, and Python-based notebook workflows alongside SQL.
Not taking enough practice exams
The exam uses specific scenario-based framing. Practice exams calibrate your understanding of how Databricks phrases questions and which details they consider the best answer among plausible options.
DataDriven covers SQL, Python, data modeling, and pipeline design at interview difficulty. The same topics you study for the cert, tested the way interviewers frame them.