Data Engineering Interview Prep
Databricks interviews go beyond Spark fundamentals. Interviewers test your knowledge of the lakehouse architecture: Delta Lake transactions, Unity Catalog governance, Delta Live Tables, and the platform-specific features that separate Databricks from raw Spark.
Covers Delta Lake 3.x, Unity Catalog, DLT, Auto Loader, Photon, and the production patterns that interviewers expect.
Databricks questions test whether you have built production pipelines on the platform or just run notebooks in a tutorial. The strongest candidates explain Delta Lake mechanics, governance patterns, and cost optimization with confidence.
Junior candidates should explain Delta Lake (what it adds over Parquet), the medallion architecture (bronze/silver/gold), and the difference between notebooks and jobs.
Mid-level candidates need to discuss Auto Loader, structured streaming with Delta, OPTIMIZE and VACUUM, and Unity Catalog basics. You should be able to design a medallion pipeline and explain your data quality strategy.
Senior candidates must design lakehouse architectures with governance, discuss DLT pipeline design, compare liquid clustering with Z-ordering, implement CI/CD with Databricks Asset Bundles, and articulate when Databricks is the right choice versus Snowflake or other platforms.
Delta Lake adds ACID transactions, schema enforcement, and time travel to data lakes. It stores data as Parquet files with a transaction log (_delta_log) that tracks every change. Interviewers test whether you understand that Delta Lake is not a database: it is an open storage format. The transaction log enables features like MERGE, UPDATE, DELETE, and concurrent reads/writes that raw Parquet cannot support.
Unity Catalog provides centralized governance across all Databricks workspaces. It manages a three-level namespace (catalog.schema.table), fine-grained access control, data lineage, and audit logging. Interviewers ask about the difference between the legacy Hive metastore (workspace-level) and Unity Catalog (account-level), and how it enables data sharing across teams without copying data.
DLT is a declarative framework for building ETL pipelines. You define expectations (data quality rules) and transformations, and DLT handles orchestration, error handling, and monitoring. Interviewers test whether you understand the difference between DLT's streaming tables (continuous ingestion) and materialized views (batch recomputation), and how expectations quarantine bad records.
Notebooks are interactive environments for development and exploration. Jobs are scheduled, parameterized executions of notebooks or Python/JAR files. Interviewers ask why production pipelines should use jobs (not notebooks): jobs have retries, alerting, cluster policies, dependency management, and audit trails. A common mistake is deploying notebooks as production workloads without proper error handling.
Databricks Structured Streaming processes data incrementally as it arrives. It integrates with Delta Lake for exactly-once guarantees using checkpointing. Trigger modes control processing frequency: trigger.availableNow processes all available data then stops (replaces trigger.once), while continuous triggers process with low latency. Interviewers want to hear about watermarks, output modes, and the trigger.availableNow pattern for cost-efficient streaming.
Photon is Databricks' C++ vectorized execution engine that replaces the JVM-based Spark SQL engine for supported operations. It dramatically accelerates scan-heavy and aggregation-heavy queries. Interviewers ask when Photon helps (SQL workloads, Delta Lake operations) and when it does not (Python UDFs, ML training). Z-ordering and liquid clustering optimize data layout for query performance.
Explain Delta Lake's transaction log. How does it enable ACID transactions on a data lake?
Every write to a Delta table creates a new JSON file in the _delta_log directory. Each log entry records which Parquet files were added and removed. Reads reconstruct the table state by replaying the log. ACID is achieved through optimistic concurrency control: writers check that no conflicting changes occurred since they read the log. If a conflict is detected, the write fails and can be retried. A strong answer mentions checkpoint files (every 10 commits by default) that compact the log for faster reads, and explains that this design enables time travel by reading the log at a specific version.
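The log-replay mechanic above can be sketched in a few lines of plain Python. This is a toy model, not the real log format (real commits also carry metadata, protocol, and transaction actions), but it shows how reads reconstruct table state and how time travel falls out of replaying to an earlier version:

```python
# Toy sketch of Delta's log replay: each commit in _delta_log/ lists "add"
# and "remove" actions; the live table is the set of added Parquet files
# minus the removed ones. File names and commits here are illustrative.
commits = [
    [{"add": {"path": "part-000.parquet"}}],    # 000000.json
    [{"add": {"path": "part-001.parquet"}}],    # 000001.json
    [{"remove": {"path": "part-000.parquet"}},  # 000002.json: an UPDATE
     {"add": {"path": "part-002.parquet"}}],    # rewrote part-000
]

def table_state(commits, as_of_version=None):
    """Replay the log up to a version -- this is how time travel works."""
    live = set()
    for version, actions in enumerate(commits):
        if as_of_version is not None and version > as_of_version:
            break
        for action in actions:
            if "add" in action:
                live.add(action["add"]["path"])
            if "remove" in action:
                live.discard(action["remove"]["path"])
    return sorted(live)

print(table_state(commits))                    # current state: 001 and 002
print(table_state(commits, as_of_version=1))   # time travel: 000 and 001
```

Note that `remove` does not delete the Parquet file from storage; it only drops it from the live state, which is why time travel works until VACUUM physically deletes old files.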
What is the difference between Z-ordering and liquid clustering? When would you choose each?
Z-ordering reorganizes data within files so that rows with similar values in the Z-ordered columns are co-located. It runs as a manual OPTIMIZE operation and can be expensive because it rewrites files. Liquid clustering (introduced in Delta Lake 3.0) is an incremental, automatic approach: it clusters data on write and during OPTIMIZE without requiring a full rewrite. Choose Z-ordering for tables stuck on older Delta versions. Choose liquid clustering for new tables: it handles partial updates, avoids up-front partition and file-size tuning, and lets you change clustering keys without rewriting the table. A strong answer notes that Z-ordering requires periodic OPTIMIZE runs while liquid clustering integrates into the write path.
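The difference shows up directly in how each layout is declared. A minimal sketch (SQL strings you would pass to `spark.sql` on a Databricks cluster; table and column names are hypothetical):

```python
# Z-ordering: a manual, periodic OPTIMIZE that rewrites files.
zorder_sql = """
OPTIMIZE events
ZORDER BY (user_id, event_date)
"""

# Liquid clustering: declared on the table, applied incrementally on write.
liquid_sql = """
CREATE TABLE events_v2 (user_id BIGINT, event_date DATE, payload STRING)
CLUSTER BY (user_id, event_date)
"""

# Clustering keys can later change without a full table rewrite.
alter_sql = "ALTER TABLE events_v2 CLUSTER BY (event_date)"
```

The ALTER TABLE at the end is the operational selling point: changing Z-order columns means re-running OPTIMIZE ZORDER over the whole table, while liquid clustering only applies the new keys to data written from that point on.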
How does Unity Catalog change data governance compared to the Hive metastore?
The Hive metastore is workspace-scoped: each workspace has its own metastore, making cross-workspace data sharing difficult. Unity Catalog is account-scoped: one catalog serves all workspaces. It adds fine-grained access control (column-level, row-level), automatic lineage tracking, and centralized audit logs. A strong answer discusses the three-level namespace (catalog.schema.table), external locations for managing access to raw storage, and how Unity Catalog enables data mesh patterns where teams own their data products while governance remains centralized.
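Governance in Unity Catalog is expressed as plain SQL grants against the three-level namespace. A small sketch (catalog, schema, and group names are hypothetical; each string would be passed to `spark.sql` or run in a SQL warehouse):

```python
# Privileges cascade down the namespace: a group needs USE CATALOG and
# USE SCHEMA before a table-level SELECT grant is usable.
grants = [
    "GRANT USE CATALOG ON CATALOG sales TO `analysts`",
    "GRANT USE SCHEMA ON SCHEMA sales.retail TO `analysts`",
    "GRANT SELECT ON TABLE sales.retail.orders TO `analysts`",
]
# On a cluster: for sql in grants: spark.sql(sql)
```

Because these grants live at the account level, the same `analysts` group sees the same permissions from every workspace, which is exactly what the workspace-scoped Hive metastore could not provide.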
Walk through how you would build a medallion architecture pipeline on Databricks.
Bronze layer: ingest raw data from sources (Kafka, files, APIs) into Delta tables with minimal transformation. Append-only, preserving the raw record plus metadata (ingestion timestamp, source). Silver layer: clean, deduplicate, and apply schema enforcement. MERGE operations handle late-arriving data. Data quality checks reject or quarantine bad records. Gold layer: business-level aggregations and dimension tables optimized for BI tool consumption. A strong answer discusses using Auto Loader for bronze ingestion, DLT for orchestrating all three layers, and the specific Delta Lake features at each layer (schema evolution at bronze, MERGE at silver, Z-ordering at gold).
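The silver-layer MERGE for late-arriving data can be sketched as follows (a hedged example, with hypothetical table and column names; the string would be passed to `spark.sql` on a Databricks cluster):

```python
# Upsert deduplicated bronze records into silver. The updated_at guard makes
# the MERGE idempotent: replaying the same batch, or an older late-arriving
# version of a record, never overwrites newer data.
merge_sql = """
MERGE INTO silver.orders AS t
USING bronze_deduped AS s
  ON t.order_id = s.order_id
WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""
# On a cluster: spark.sql(merge_sql)
```

Idempotency is the point interviewers probe: a pipeline that can safely reprocess a batch after a failure is far easier to operate than one where replays create duplicates.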
Explain Auto Loader. How does it differ from a manual listing of cloud storage files?
Auto Loader (cloudFiles) discovers new files in cloud storage incrementally using directory listing (polling, the default) or file notification (event-based, opt-in). File notification mode uses cloud events (SQS/SNS on S3, Azure Event Grid) to detect new files without scanning the directory. This matters for directories with millions of files where listing is slow and expensive. Auto Loader tracks processed files in a checkpoint, ensuring exactly-once processing. A strong answer compares it to spark.read: Auto Loader handles new-file discovery, schema inference, schema evolution, and checkpoint management automatically.
You have a Delta table with 10TB of data and queries are slow. Walk through your optimization approach.
Step 1: Check the query profile for full table scans. Add partition pruning filters if the table is partitioned. Step 2: If the table is not partitioned, add liquid clustering on the most common filter columns. Step 3: Run OPTIMIZE to compact small files (the small file problem is the most common cause of slow Delta queries). Step 4: Check if Photon is enabled for the cluster. Step 5: Review the query itself for inefficiencies: unnecessary joins, SELECT *, or missing predicate pushdown. A strong answer mentions VACUUM to clean up old files (but warns about the interaction with time travel retention) and ANALYZE TABLE to update statistics for the optimizer.
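The maintenance steps above translate into a short SQL sequence (strings you would pass to `spark.sql`; the table name and retention window are hypothetical):

```python
# Routine Delta maintenance, in the order you would run it.
maintenance = [
    # Compact small files -- the most common cause of slow Delta queries.
    "OPTIMIZE sales.orders",
    # Refresh statistics so the optimizer can prune and order joins well.
    "ANALYZE TABLE sales.orders COMPUTE STATISTICS FOR ALL COLUMNS",
    # Delete unreferenced files. 168 hours = 7 days: the window must exceed
    # your longest query, your streaming checkpoint lag, and any time-travel
    # window you still need -- shrinking it risks FileNotFoundException.
    "VACUUM sales.orders RETAIN 168 HOURS",
]
# On a cluster: for sql in maintenance: spark.sql(sql)
```

Running OPTIMIZE before ANALYZE matters: compaction changes the file layout the statistics describe, so collecting stats first would leave them stale.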
What are Delta Live Tables expectations? How do they handle data quality?
Expectations are data quality rules defined inline with DLT pipelines: @dlt.expect('valid_amount', 'amount > 0') warns on violations (records still pass through), @dlt.expect_or_drop removes violating rows, and @dlt.expect_or_fail stops the pipeline. Expectations produce metrics (number of passing/failing records) visible in the DLT UI. A strong answer discusses the quarantine pattern: routing failed records to a separate table for investigation rather than dropping them silently, and how expectations replace external data quality tools like Great Expectations for DLT pipelines.
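The quarantine pattern can be sketched in plain Python, independent of the DLT runtime. This mirrors what @dlt.expect_or_drop does internally, except that instead of discarding violations it keeps them for a side table (function and field names here are illustrative):

```python
# Split a batch by a data quality rule and emit pass/fail metrics,
# mirroring the counts the DLT UI surfaces per expectation.
def apply_expectation(records, name, predicate):
    passed = [r for r in records if predicate(r)]
    quarantined = [r for r in records if not predicate(r)]
    metrics = {"expectation": name,
               "passed": len(passed),
               "failed": len(quarantined)}
    return passed, quarantined, metrics

orders = [{"id": 1, "amount": 50}, {"id": 2, "amount": -5}]
good, bad, stats = apply_expectation(
    orders, "valid_amount", lambda r: r["amount"] > 0)
# good -> order 1 only; bad -> order 2, written to a quarantine table
```

In a real pipeline the `quarantined` list would be written to its own Delta table so analysts can investigate violations instead of discovering silent data loss later.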
Compare Databricks SQL warehouses with interactive clusters. When do you use each?
SQL warehouses are optimized for SQL workloads: they use Photon, scale automatically based on query concurrency, and bill only while running. Interactive clusters support notebooks, Python, Scala, and arbitrary workloads. Use SQL warehouses for BI tool connections, dashboards, and ad-hoc SQL queries. Use interactive clusters for development, ML training, and PySpark jobs. A strong answer mentions serverless SQL warehouses (faster startup, managed by Databricks) versus classic warehouses (you manage the configuration) and the cost implications of each.
How do you implement CI/CD for Databricks jobs?
Store notebooks and job definitions in Git (Databricks Repos or external). Use the Databricks CLI or REST API to deploy jobs programmatically. In CI: run unit tests on transformation logic, validate notebook syntax, and optionally run integration tests against a dev workspace. In CD: promote job definitions from dev to staging to production using environment-specific configurations (cluster sizes, catalogs, schemas). A strong answer mentions the Databricks Asset Bundles (DABs) framework for declarative job/pipeline definitions, and the importance of separating code (notebooks/scripts) from infrastructure (cluster configs, permissions) in version control.
Explain the difference between trigger.once, trigger.availableNow, and continuous trigger modes in Structured Streaming.
trigger.once (deprecated) processes all available data in one micro-batch, then stops. trigger.availableNow processes all available data across multiple micro-batches (better for large backlogs), then stops. Continuous trigger processes data as it arrives with minimal latency but uses the cluster continuously. For cost-efficient streaming, use trigger.availableNow scheduled via a job: it processes new data, stops, and the cluster can terminate. A strong answer discusses how this pattern gives near-real-time freshness (e.g., every 15 minutes) without the cost of an always-on cluster, and notes that trigger.availableNow respects maxFilesPerTrigger for rate limiting.
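Why availableNow replaced trigger.once can be shown with a pure-Python sketch of the batching behavior (the file names and cap are illustrative):

```python
# trigger.availableNow drains the current backlog in rate-limited
# micro-batches, then stops; trigger.once ignored rate limits and
# took the whole backlog in a single (potentially huge) batch.
def drain_available_now(backlog, max_files_per_trigger):
    batches = []
    while backlog:
        batches.append(backlog[:max_files_per_trigger])
        backlog = backlog[max_files_per_trigger:]
    return batches

files = [f"file-{i}.json" for i in range(5)]
batches = drain_available_now(files, max_files_per_trigger=2)
# -> 3 micro-batches of sizes 2, 2, 1; trigger.once would produce one 5-file batch
```

Bounded micro-batches are what make the scheduled-job pattern safe after downtime: a large backlog is processed in controlled chunks rather than one batch that can exhaust cluster memory.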
This is the standard bronze ingestion pattern interviewers expect. Auto Loader discovers new files, infers schema, and writes to Delta with exactly-once guarantees.
# Bronze ingestion with Auto Loader
from pyspark.sql import functions as F

(spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders_schema")
.option("cloudFiles.inferColumnTypes", "true")
.option("cloudFiles.schemaEvolutionMode", "addNewColumns")
.load("/mnt/raw/orders/")
.withColumn("_ingested_at", F.current_timestamp())  # ingestion time, for debugging/reprocessing
.withColumn("_source_file", F.input_file_name())    # file provenance (newer runtimes expose _metadata.file_path instead)
.writeStream
.format("delta")
.option("checkpointLocation", "/mnt/checkpoints/orders")
.option("mergeSchema", "true")
.trigger(availableNow=True)
.toTable("bronze.raw_orders")
)
# After ingestion, compact the small files created by streaming writes.
# Note: OPTIMIZE's WHERE clause only accepts partition-column predicates,
# so on an unpartitioned bronze table run it against the whole table.
spark.sql("OPTIMIZE bronze.raw_orders")

Key details: schemaEvolutionMode handles new fields in the source JSON without breaking the pipeline. The _ingested_at and _source_file metadata columns enable debugging and reprocessing. trigger(availableNow=True) processes all available files then stops, keeping cluster costs low. The OPTIMIZE at the end compacts small files created during ingestion.
Confusing Delta Lake (open-source storage format) with Databricks (commercial platform). Delta Lake runs on any Spark deployment, not just Databricks
Not running OPTIMIZE on Delta tables, allowing small files to accumulate and degrade query performance dramatically
Using notebooks for production workloads without job scheduling, retries, or alerting, then being surprised when failures go unnoticed
Ignoring Unity Catalog in architecture discussions, which signals unfamiliarity with current Databricks best practices
Setting VACUUM retention shorter than the longest running query or streaming checkpoint, causing FileNotFoundException errors
Treating Databricks as 'just Spark' without understanding Photon, Auto Loader, DLT, and other platform-specific features that interviewers expect you to know
Databricks runs on Spark and SQL. Build fluency with real challenges and walk into your interview ready to discuss lakehouse architecture, Delta Lake, and production pipeline design.