Question 1

What is Delta Lake?

Accepted Answer

An open-source table format, originally developed by Databricks, that provides ACID transactions on object storage (S3, ADLS, GCS) by pairing Parquet data files with a transaction log. Features: schema enforcement and evolution, time travel via VERSION AS OF or TIMESTAMP AS OF, MERGE INTO for upserts, OPTIMIZE for file compaction, and ZORDER BY for multi-dimensional clustering. It is the default table format on Databricks.

Question 2

What is Photon and when does it help?

Accepted Answer

Databricks's native vectorized execution engine, written in C++. It runs supported operators (SQL aggregations, joins, certain transformations) on columnar batches in native code, replacing Spark's JVM-based Tungsten code generation for those operators. Speedups of roughly 2-3x are commonly cited, but the gain depends on the workload. Not all operators are Photon-supported; unsupported ones, such as UDFs, fall back to the JVM engine. To verify that Photon is being used, look for Photon-prefixed operators in the EXPLAIN output.

Question 3

What is Unity Catalog?

Accepted Answer

Databricks's governance layer. Three-level namespace: catalog.schema.table. Centralized access control with row filters and column masks, audit logging, lineage tracking, and data discovery. Replaces the older Hive metastore, which was scoped to a single workspace, with a metastore that can be shared across workspaces.

Question 4

What is Auto Loader and when is it used?

Accepted Answer

Databricks's streaming ingestion connector for file-based sources. The cloudFiles source detects new files in cloud storage and streams them to a Delta table. Two detection modes: directory listing (simple to set up, but slower for very large directories) and file notification (faster detection at scale, using cloud notification services). Use Auto Loader for file-based ingestion; for event streams, read from a message bus such as Kafka instead.
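To make the directory-listing mode concrete, here is a toy model in plain Python. It is only a sketch of the idea, not Auto Loader's implementation: the function name `discover_new_files` and the in-memory `seen` set are hypothetical stand-ins for the file-tracking state Auto Loader keeps in its checkpoint.

```python
import os
import tempfile


def discover_new_files(directory, seen):
    """Directory-listing detection: list everything, return names not yet seen."""
    listing = sorted(os.listdir(directory))  # cost grows with directory size
    new_files = [name for name in listing if name not in seen]
    seen.update(new_files)
    return new_files


with tempfile.TemporaryDirectory() as landing:
    seen = set()  # hypothetical stand-in for checkpointed file state
    for name in ("a.json", "b.json"):
        open(os.path.join(landing, name), "w").close()
    first = discover_new_files(landing, seen)   # both files are new
    open(os.path.join(landing, "c.json"), "w").close()
    second = discover_new_files(landing, seen)  # only the late arrival
    third = discover_new_files(landing, seen)   # nothing new
```

Every pass lists the whole directory even when nothing has arrived, which is why this mode slows down as the directory grows and why notification-based detection scales better.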

Question 5

How does MERGE INTO work on a Delta table?

Accepted Answer

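MERGE INTO target USING source ON target.pk = source.pk WHEN MATCHED AND source.op = 'DELETE' THEN DELETE WHEN MATCHED THEN UPDATE SET col = source.col WHEN NOT MATCHED AND source.op != 'DELETE' THEN INSERT (pk, col) VALUES (source.pk, source.col). Clauses are evaluated in order, so the conditional DELETE must come before the unconditional UPDATE. The guard on the insert clause keeps a DELETE row whose key is absent from the target from being inserted as a new row. With that guard, the statement is idempotent on re-run if the source is deterministic. Each target row may match at most one source row, so deduplicate the source by key first. The DeltaTable.forPath(spark, path).merge(source, condition) PySpark API does the same thing programmatically.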
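The clause ordering and the idempotence claim can be checked with a small plain-Python model. This is a sketch of the semantics only, not how Delta executes a merge; `merge_into` and the dict-based table are hypothetical, and the source is assumed to hold at most one row per key. The not-matched branch here skips DELETE rows, which is what makes a re-run a no-op; without that guard a replayed delete row would be inserted as a new row.

```python
def merge_into(target, source):
    """Apply change rows to `target` (a dict of pk -> col) and return a new dict.

    Matching is decided against the table as it was before the merge,
    and clauses are tried in order for each source row.
    """
    result = dict(target)
    for row in source:
        pk, op, col = row["pk"], row["op"], row["col"]
        if pk in target:
            if op == "DELETE":        # WHEN MATCHED AND op = 'DELETE' THEN DELETE
                del result[pk]
            else:                     # WHEN MATCHED THEN UPDATE
                result[pk] = col
        elif op != "DELETE":          # WHEN NOT MATCHED (guarded) THEN INSERT
            result[pk] = col
    return result


target = {1: "a", 2: "b"}
source = [
    {"pk": 1, "op": "UPSERT", "col": "a2"},
    {"pk": 2, "op": "DELETE", "col": None},
    {"pk": 3, "op": "UPSERT", "col": "c"},
]
once = merge_into(target, source)
twice = merge_into(once, source)  # replaying the same batch changes nothing
```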

Question 6

What is ZORDER BY in Delta?

Accepted Answer

A multi-dimensional clustering technique that co-locates related rows in the same files. OPTIMIZE table ZORDER BY (user_id, event_date) rewrites the table's files so that each file covers a narrow range of both columns; queries filtering on user_id, event_date, or both can then skip files using per-file min/max statistics. Useful for tables with multi-dimensional access patterns. Trade-offs: OPTIMIZE itself takes time and compute, and the clustering becomes less effective as more columns are added.
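The effect on file skipping can be seen with a small model that sorts rows along a Z-order curve (bit interleaving) and compares it with an ordinary sort. This is a simplified sketch, not Delta's implementation; the helper names, the 8x8 grid of (x, y) values, and the four-rows-per-file layout are all assumptions chosen for illustration.

```python
def interleave_bits(x, y, bits=3):
    """Z-order (Morton) value: alternate the bits of x and y."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def write_files(rows, rows_per_file):
    """Split rows into files and keep per-file min/max stats for each column."""
    files = []
    for start in range(0, len(rows), rows_per_file):
        chunk = rows[start:start + rows_per_file]
        xs = [r[0] for r in chunk]
        ys = [r[1] for r in chunk]
        files.append({"x": (min(xs), max(xs)), "y": (min(ys), max(ys))})
    return files


def files_touched(files, column, value):
    """Count files whose min/max range could contain `value`."""
    return sum(1 for f in files if f[column][0] <= value <= f[column][1])


rows = [(x, y) for x in range(8) for y in range(8)]
linear = write_files(sorted(rows), 4)  # sorted by x, then y
zordered = write_files(sorted(rows, key=lambda r: interleave_bits(*r)), 4)

linear_hits = (files_touched(linear, "x", 3), files_touched(linear, "y", 5))
zorder_hits = (files_touched(zordered, "x", 3), files_touched(zordered, "y", 5))
```

With 16 files in each layout, the plain sort reads 2 files for the filter on x but 8 for the filter on y, while the Z-ordered layout reads 4 files for either filter. Z-ordering gives up a little on the leading column to make every clustered column reasonably selective.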

Question 7

How does a data engineer use time travel on Delta?

Accepted Answer

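SELECT * FROM table VERSION AS OF 100 or SELECT * FROM table TIMESTAMP AS OF '2026-05-27'. Reads the table state at the specified version or time. Useful for debugging (what did the table look like before the bug?), backfill (re-process from a past state), and audit. Retention is configured per table: the transaction log is kept for 30 days by default (delta.logRetentionDuration), but VACUUM deletes data files that are no longer referenced by the current version once they are older than 7 days by default (delta.deletedFileRetentionDuration), so older versions may become unreadable after a VACUUM.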
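Why retention limits time travel is easier to see with a toy transaction log. The sketch below is a plain-Python model under simplifying assumptions, not Delta's implementation: `VersionedTable` is hypothetical, file names are made up, and `vacuum` ignores the retention window and deletes every unreferenced file immediately.

```python
class VersionedTable:
    """Toy transaction log: each commit records files added and removed."""

    def __init__(self):
        self.log = []         # log[v] = (added_files, removed_files)
        self.storage = set()  # data files physically present

    def commit(self, add=(), remove=()):
        self.storage.update(add)  # removed files stay on disk for now
        self.log.append((set(add), set(remove)))
        return len(self.log) - 1  # version number of this commit

    def files_at(self, version):
        """Replay the log up to `version` to find the live files."""
        live = set()
        for added, removed in self.log[: version + 1]:
            live |= added
            live -= removed
        return live

    def read(self, version):
        files = self.files_at(version)
        missing = files - self.storage
        if missing:
            raise FileNotFoundError(f"version {version} needs {sorted(missing)}")
        return sorted(files)

    def vacuum(self):
        """Delete files the latest version no longer references."""
        self.storage &= self.files_at(len(self.log) - 1)


table = VersionedTable()
v0 = table.commit(add=["part-0"])
v1 = table.commit(add=["part-1"])
# A compaction-style rewrite: one new file replaces the two small ones.
v2 = table.commit(add=["part-0-1-compacted"], remove=["part-0", "part-1"])
before = table.read(v1)  # old version is still readable
table.vacuum()           # after this, only v2 can be read
```

Reading `v1` after the vacuum raises `FileNotFoundError` in this model, because the log still describes the version but the files it points to are gone.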

Question 8

How does Structured Streaming write to Delta?

Accepted Answer

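df.writeStream.format('delta').option('checkpointLocation', '/checkpoint/path').outputMode('append').trigger(processingTime='1 minute').start('/delta/path'). Append mode adds new rows only; complete mode rewrites the table on every micro-batch and is meant for aggregations. The Delta sink does not support update mode, so for MERGE-INTO-like semantics use foreachBatch and run a merge in each micro-batch. The checkpoint location stores progress for fault tolerance. The trigger controls micro-batch frequency.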
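The role of the checkpoint can be illustrated with a toy micro-batch loop in plain Python. This is a sketch under simplifying assumptions rather than how Structured Streaming works internally: `run_stream`, the JSON offset file, and the list-based source and sink are all hypothetical, and the model does not cover a crash between the write and the checkpoint update.

```python
import json
import os
import tempfile


def run_stream(source, sink, checkpoint_path, batch_size, max_batches=None):
    """Append `source` to `sink` in micro-batches, resuming from the checkpoint."""
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            offset = json.load(f)["offset"]  # pick up where the last run stopped
    batches = 0
    while offset < len(source) and (max_batches is None or batches < max_batches):
        batch = source[offset:offset + batch_size]
        sink.extend(batch)                   # append-only write
        offset += len(batch)
        with open(checkpoint_path, "w") as f:
            json.dump({"offset": offset}, f)  # record progress after the write
        batches += 1
    return offset


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, "offset.json")
    source, sink = list(range(10)), []
    # First run stops after two micro-batches, as if the job had been killed.
    stopped_at = run_stream(source, sink, checkpoint, batch_size=3, max_batches=2)
    # The restart reads the checkpoint and continues without re-appending rows.
    resumed_to = run_stream(source, sink, checkpoint, batch_size=3)
```

Restarting with the same checkpoint resumes at offset 6 and finishes at 10 with no duplicated rows; pointing the restart at a fresh checkpoint path would reprocess the source from the beginning.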

Databricks Interview Problems
