Databricks-specific interview problems for data engineer roles. Delta Lake MERGE INTO, OPTIMIZE, and Z-order. Time travel for backfill. Photon vectorized execution engine. Unity Catalog for governance. Auto Loader for streaming ingestion. Structured Streaming on Delta tables. The patterns Databricks data engineer candidates need beyond generic Spark.

Databricks data engineer interviews layer Databricks-specific technology on top of generic Spark expertise. Candidates fluent in PySpark, Spark SQL, and Spark optimization still need to know Delta Lake, Photon, Unity Catalog, Auto Loader, and Databricks SQL. The interview round assumes the candidate will join a Databricks team and ship production code on the Databricks Runtime.

Delta Lake is the table format. ACID transactions on object storage. Schema enforcement and evolution. Time travel via VERSION AS OF or TIMESTAMP AS OF. MERGE INTO for upserts that stay idempotent on re-run. OPTIMIZE for file compaction (the small-file problem on streaming writes). ZORDER BY, a clause of OPTIMIZE, for multi-dimensional clustering (improves data skipping when queries filter on the clustered columns). The data engineer interview question typically combines these: design an ingestion pipeline that writes hourly to a Delta table, uses MERGE INTO for upsert idempotency, runs OPTIMIZE on a daily schedule for compaction, and applies ZORDER BY user_id and event_date for query performance.
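The pipeline in that question can be sketched as two SQL statements issued from scheduled jobs. This is a minimal sketch under stated assumptions, not a definitive implementation: the table names `events` and `events_staging`, the key `event_id`, and all helper names are hypothetical, and `spark` is assumed to be an active SparkSession on a Databricks cluster.

```python
def merge_sql(target: str, staging: str, key: str) -> str:
    """Upsert statement; re-running it with the same staged rows leaves the target unchanged."""
    return (
        f"MERGE INTO {target} AS t "
        f"USING {staging} AS s "
        f"ON t.{key} = s.{key} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


def optimize_sql(target: str, zorder_cols: list[str]) -> str:
    """Compaction statement that also clusters files on the given columns."""
    return f"OPTIMIZE {target} ZORDER BY ({', '.join(zorder_cols)})"


def hourly_upsert(spark, target="events", staging="events_staging", key="event_id"):
    # Hourly job: upsert the latest staged batch into the Delta table.
    spark.sql(merge_sql(target, staging, key))


def daily_compaction(spark, target="events"):
    # Daily job: compact small files and Z-order on the common filter columns.
    spark.sql(optimize_sql(target, ["user_id", "event_date"]))
```

One design note: Delta does not allow Z-ordering on a partition column, so if the table is partitioned by event_date, the Z-order list would shrink to user_id.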

Photon is the vectorized execution engine. A native engine written in C++ that runs supported operators directly on columnar batches, in place of Spark's JVM-based execution (Tungsten and whole-stage code generation). Speedups of around 2-3x are often quoted for workloads dominated by SQL aggregations and joins; the actual gain depends on the workload. Not all operators are Photon-supported; unsupported ones, including UDFs, fall back to the JVM engine. The data engineer interview question: when does Photon help, when does it not, and how do you verify Photon is being used (look for Photon-prefixed operators in the explain plan).
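The verification step can be sketched as a small plan inspector. A minimal sketch: the helper names are hypothetical, `spark` is assumed to be an active SparkSession, and the check relies only on the convention stated above that Photon operators carry a `Photon` prefix in the plan text.

```python
import re


def explain_text(spark, query: str) -> str:
    # EXPLAIN returns a single-row, single-column DataFrame holding the plan text.
    return spark.sql(f"EXPLAIN FORMATTED {query}").collect()[0][0]


def photon_operators(plan_text: str) -> list[str]:
    """Return the distinct operator names in a plan that start with 'Photon'."""
    return sorted(set(re.findall(r"\bPhoton\w+", plan_text)))


def uses_photon(plan_text: str) -> bool:
    """True if at least one Photon-prefixed operator appears in the plan."""
    return bool(photon_operators(plan_text))
```

A plan that mixes Photon-prefixed and plain operators indicates a partial fallback, which is usually the interesting case to discuss.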

Unity Catalog is Databricks's governance layer. Three-level namespace: catalog.schema.table. Centralized access control, row filters and column masks, audit logging, lineage. The data engineer interview question: how does Unity Catalog handle a multi-tenant setup, what is the difference between catalogs and schemas, how do row filters restrict which rows a user sees, and how do column masks redact sensitive values.
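Row filters and column masks are both SQL functions attached to a table. The sketch below builds the statements as strings; every object name in the usage (`main.sales.orders`, the function names, the group names) is hypothetical, and the statements would be run with `spark.sql` by a user holding the required privileges.

```python
def fqn(catalog: str, schema: str, name: str) -> str:
    """Three-level Unity Catalog name: catalog.schema.object."""
    return f"{catalog}.{schema}.{name}"


def row_filter_statements(table: str, func: str, column: str,
                          admin_group: str, visible_value: str) -> list[str]:
    """Admins see every row; everyone else sees only rows where column = visible_value."""
    return [
        f"CREATE OR REPLACE FUNCTION {func}({column} STRING) "
        f"RETURN IS_ACCOUNT_GROUP_MEMBER('{admin_group}') OR {column} = '{visible_value}'",
        f"ALTER TABLE {table} SET ROW FILTER {func} ON ({column})",
    ]


def column_mask_statements(table: str, func: str, column: str,
                           reader_group: str) -> list[str]:
    """Members of reader_group see the real value; everyone else sees a redacted string."""
    return [
        f"CREATE OR REPLACE FUNCTION {func}({column} STRING) "
        f"RETURN CASE WHEN IS_ACCOUNT_GROUP_MEMBER('{reader_group}') "
        f"THEN {column} ELSE '***' END",
        f"ALTER TABLE {table} ALTER COLUMN {column} SET MASK {func}",
    ]
```

For example, `row_filter_statements(fqn("main", "sales", "orders"), fqn("main", "sales", "region_filter"), "region", "admins", "US")` yields a filter function plus the ALTER TABLE statement that attaches it.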

Auto Loader is the streaming ingestion connector. The cloudFiles source detects new files in cloud storage (S3, ADLS, GCS) and exposes them as a stream, typically written to a Delta table. Two file detection modes: directory listing (simple to set up but slower for very large directories) and file notification (uses cloud notification and queue services for low-latency detection at scale). The data engineer interview question: when to use Auto Loader versus Kafka Connect for file-based ingestion, and how to handle schema evolution with cloudFiles.
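A typical Auto Loader job can be sketched as below. This is a sketch under assumptions: the JSON file format, the paths, and the helper names are hypothetical, and `spark` is assumed to be a SparkSession on the Databricks Runtime, where the cloudFiles source is available.

```python
def autoloader_options(file_format: str, schema_location: str,
                       use_notifications: bool = False) -> dict:
    """Options for the cloudFiles source."""
    return {
        "cloudFiles.format": file_format,
        # Where Auto Loader persists the inferred schema between runs.
        "cloudFiles.schemaLocation": schema_location,
        # New columns are added to the schema; the stream stops and must be restarted.
        "cloudFiles.schemaEvolutionMode": "addNewColumns",
        # "true" switches from directory listing to file notification mode.
        "cloudFiles.useNotifications": str(use_notifications).lower(),
    }


def start_ingest(spark, source_path, target_table, schema_location, checkpoint_location):
    stream = (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options("json", schema_location))
        .load(source_path)
    )
    return (
        stream.writeStream
        .option("checkpointLocation", checkpoint_location)
        .option("mergeSchema", "true")   # let the Delta table accept the new columns
        .trigger(availableNow=True)      # process the backlog, then stop
        .toTable(target_table)
    )
```

With `addNewColumns`, a file carrying an unseen column fails the running stream once, records the new schema, and succeeds on restart, so the job should be configured to retry.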

Structured Streaming on Delta is the production streaming pattern. Read from Kafka or Auto Loader as source, run on a micro-batch trigger (processing-time or available-now; continuous mode does not support a Delta sink), write to Delta with append or with MERGE inside foreachBatch, use a watermark to bound late-arriving data, and use a checkpoint for fault tolerance. The data engineer interview question: design a streaming pipeline that reads from Kafka, dedups on a composite key, MERGEs into a Delta gold table, and handles late events up to 7 days.
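One way to approach that design question is sketched below. It is a sketch under assumptions, not a reference solution: the composite key (`user_id`, `event_id`), the `event_ts` timestamp column, the `gold.events` table, and the function names are all hypothetical. The pyspark and delta imports sit inside the functions so the module loads without a cluster.

```python
KEYS = ["user_id", "event_id"]  # hypothetical composite key


def merge_condition(keys: list[str]) -> str:
    """Join predicate between target alias t and source alias s."""
    return " AND ".join(f"t.{k} = s.{k}" for k in keys)


def upsert_batch(batch_df, batch_id, target_table="gold.events", keys=KEYS):
    from delta.tables import DeltaTable
    from pyspark.sql import Window, functions as F

    # Keep one row per key within the micro-batch: MERGE fails if several
    # source rows match the same target row.
    w = Window.partitionBy(*keys).orderBy(F.col("event_ts").desc())
    latest = (
        batch_df.withColumn("_rn", F.row_number().over(w))
        .filter("_rn = 1")
        .drop("_rn")
    )
    target = DeltaTable.forName(batch_df.sparkSession, target_table)
    (
        target.alias("t")
        .merge(latest.alias("s"), merge_condition(keys))
        # A late event must not overwrite a newer row already in the table.
        .whenMatchedUpdateAll(condition="s.event_ts >= t.event_ts")
        .whenNotMatchedInsertAll()
        .execute()
    )


def start_pipeline(spark, brokers, topic, schema_ddl, checkpoint):
    from pyspark.sql import functions as F

    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", brokers)
        .option("subscribe", topic)
        .load()
        # schema_ddl must declare event_ts as TIMESTAMP for the watermark to work.
        .select(F.from_json(F.col("value").cast("string"), schema_ddl).alias("e"))
        .select("e.*")
        .withWatermark("event_ts", "7 days")      # accept events up to 7 days late
        .dropDuplicates(KEYS + ["event_ts"])      # state is dropped past the watermark
    )
    return (
        events.writeStream
        .foreachBatch(upsert_batch)
        .option("checkpointLocation", checkpoint)
        .trigger(processingTime="1 minute")
        .start()
    )
```

The trade-off worth naming in an interview: a 7-day watermark keeps 7 days of dedup state, so state store size grows with key cardinality over that window.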

Databricks-specific PySpark questions also appear. The DBUtils library (dbutils.fs, dbutils.secrets, dbutils.notebook). Magic commands in notebooks (%sql, %python, %md). Cluster types (all-purpose for interactive, jobs for production, SQL warehouses for BI). Databricks Runtime versions and what they include. The data engineer who has built production pipelines on Databricks knows these; the candidate without that experience often stumbles on the workflow questions.
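The DBUtils calls named above look roughly like this in practice. A small sketch: `dbutils` is injected by the Databricks notebook runtime and is passed in explicitly here so the helpers can be exercised elsewhere; the secret scope `etl`, the key `api_token`, and the child notebook path `./load_events` are hypothetical.

```python
def api_token(dbutils) -> str:
    # Secret values are redacted if printed in notebook output.
    return dbutils.secrets.get(scope="etl", key="api_token")


def landing_files(dbutils, landing_path: str) -> list[str]:
    # dbutils.fs.ls returns one entry per file or directory under the path.
    return [f.path for f in dbutils.fs.ls(landing_path)]


def run_daily_load(dbutils, landing_path: str, run_date: str) -> str:
    """Run the child notebook only when there is something to load."""
    if not landing_files(dbutils, landing_path):
        return "skipped"
    # Arguments: notebook path, timeout in seconds, parameters for the child notebook.
    return dbutils.notebook.run("./load_events", 3600, {"run_date": run_date})
```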

Databricks Interview Problems

Databricks-specific interview problems for data engineer interview prep.

Common questions

What is Delta Lake?
An open-source table format, originally developed by Databricks, that provides ACID transactions on object storage (S3, ADLS, GCS). Features: schema enforcement and evolution, time travel via VERSION AS OF or TIMESTAMP AS OF, MERGE INTO for upsert, OPTIMIZE for file compaction, ZORDER BY (a clause of OPTIMIZE) for multi-dimensional clustering. The default table format on Databricks.
What is Photon and when does it help?
Databricks's vectorized execution engine. A native engine written in C++ that runs supported operators (SQL aggregations, joins, certain transformations) in place of Spark's JVM-based execution for those operators. Speedups of around 2-3x are often quoted; the actual gain depends on the workload. Not all operators are Photon-supported; unsupported ones, including UDFs, fall back to the JVM engine. Verify Photon is being used by looking for Photon-prefixed operators in EXPLAIN.
What is Unity Catalog?
Databricks's governance layer. Three-level namespace: catalog.schema.table. Centralized access control with row filters and column masks, audit logging, lineage tracking, data discovery. Replaces the older per-workspace Hive metastore with a metastore that is shared and governed across workspaces.
What is Auto Loader and when is it used?
Databricks's streaming ingestion connector for file-based sources. The cloudFiles source detects new files in cloud storage and exposes them as a stream, typically written to a Delta table. Two detection modes: directory listing (simple to set up, slower for very large directories) and file notification (low-latency detection at scale via cloud notification services). Use Auto Loader for file-based ingestion; use Kafka Connect for event-stream ingestion.
How does MERGE INTO work on a Delta table?
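MERGE INTO target USING source ON target.pk = source.pk WHEN MATCHED AND source.op = 'DELETE' THEN DELETE WHEN MATCHED THEN UPDATE SET col = source.col WHEN NOT MATCHED AND source.op != 'DELETE' THEN INSERT (pk, col) VALUES (source.pk, source.col). Clauses are evaluated in order, so the conditional DELETE must come before the unconditional UPDATE; the condition on the INSERT clause keeps delete records for unknown keys from being inserted. Idempotent on re-run if the source is deterministic and has at most one row per key (MERGE fails when several source rows match the same target row). The DeltaTable.forPath(spark, path).merge(...) PySpark API does the same thing programmatically.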
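The same statement through the PySpark API, as a minimal sketch. The column names `pk`, `col`, and `op` come from the SQL above; the function name is hypothetical. `target` is assumed to be a `delta.tables.DeltaTable` (for example from `DeltaTable.forPath(spark, path)`) and `source_df` a DataFrame with those three columns.

```python
def apply_changes(target, source_df):
    """Apply a change feed to a Delta table: delete, update, or insert by primary key."""
    (
        target.alias("t")
        .merge(source_df.alias("s"), "t.pk = s.pk")
        # Clause order matters: the conditional delete is checked before the update.
        .whenMatchedDelete(condition="s.op = 'DELETE'")
        .whenMatchedUpdate(set={"col": "s.col"})
        # Skip delete records whose key is not in the target.
        .whenNotMatchedInsert(
            condition="s.op != 'DELETE'",
            values={"pk": "s.pk", "col": "s.col"},
        )
        .execute()
    )
```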
What is Z-ORDER BY in Delta?
A multi-dimensional clustering technique that co-locates related rows in the same files. OPTIMIZE table ZORDER BY (user_id, event_date) reorders the table's files so queries filtering on user_id, event_date, or both touch fewer files. Useful for tables with multi-dimensional access patterns. Trade-off: OPTIMIZE itself takes time and compute.
How does a data engineer use time travel on Delta?
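SELECT * FROM table VERSION AS OF 100 or SELECT * FROM table TIMESTAMP AS OF '2026-05-27'. Reads the table state at the specified version or time. Useful for debugging (what did the table look like before the bug?), backfill (re-process from a past state), and audit. Retention is configured per table with two properties: delta.logRetentionDuration (default 30 days) controls how long the transaction log history is kept, and delta.deletedFileRetentionDuration (default 7 days) controls how long VACUUM preserves removed data files. Once VACUUM has run with the defaults, versions older than about 7 days are no longer readable even though they still appear in the history.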
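For the debugging use case, a snapshot can be compared with the current table. A minimal sketch: the helper names are hypothetical, and `spark` is assumed to be an active SparkSession with the Delta table registered under `table`.

```python
def snapshot_sql(table: str, version=None, timestamp=None) -> str:
    """Build a time travel query for exactly one of version or timestamp."""
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        return f"SELECT * FROM {table} VERSION AS OF {int(version)}"
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


def rows_added_since(spark, table: str, version: int):
    """Rows present in the current table that were not in the older snapshot."""
    current = spark.table(table)
    old = spark.sql(snapshot_sql(table, version=version))
    return current.exceptAll(old)
```

DESCRIBE HISTORY on the table lists the available versions and their timestamps, which is the usual way to pick the version to compare against.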
How does Structured Streaming write to Delta?
df.writeStream.format('delta').option('checkpointLocation', '/checkpoint/path').outputMode('append').trigger(processingTime='1 minute').start('/delta/path'). Append mode for inserts only. The Delta sink supports append and complete output modes; for MERGE-INTO-like upserts, use foreachBatch and run a MERGE inside each micro-batch. Checkpoint location stores progress for fault tolerance. Trigger controls micro-batch frequency.