Data Engineering Interview Prep
Snowflake dominates the cloud data warehouse market, and interviews reflect that. You will face questions about the architecture, performance tuning, data loading, cost management, and the features that make Snowflake different from traditional warehouses.
Covers micro-partitions, clustering keys, time travel, Snowpipe, streams/tasks, and the cost optimization strategies interviewers want to hear about.
Snowflake questions go beyond SQL syntax. Interviewers test whether you understand Snowflake's architecture and can use its features to build reliable, cost-effective pipelines.
Junior candidates should explain the three-layer architecture, the difference between warehouses and databases, and how to load data with COPY INTO. You should know what time travel is and when to use it.
Mid-level candidates need to discuss micro-partition pruning, clustering keys, Snowpipe versus batch loading, and streams/tasks for CDC. You should be able to optimize a slow query using the query profile.
Senior candidates must design cost-efficient architectures: warehouse sizing strategies, multi-cluster scaling, role-based access control, data sharing, and governance features. You should articulate the tradeoffs between Snowflake and alternatives like Databricks or BigQuery.
Snowflake separates storage (cloud object storage), compute (virtual warehouses), and cloud services (metadata, authentication, optimization). This separation means you can scale compute independently of storage. Interviewers test whether you understand the implications: multiple teams can query the same data with different warehouse sizes without contention, and you only pay for compute when queries are running.
Virtual warehouses are clusters of compute resources. They come in T-shirt sizes (XS through 6XL), each doubling the resources and cost of the previous size. Multi-cluster warehouses auto-scale by adding clusters when queries queue. Interviewers ask about warehouse sizing, auto-suspend (stop paying when idle), and the difference between scaling up (bigger warehouse) and scaling out (more clusters).
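A warehouse definition touching these knobs might look like the following sketch (the warehouse name and settings are illustrative, not a recommendation):

```sql
-- Hypothetical warehouse: suspends after 60s idle, resumes on demand,
-- and scales out to up to 3 clusters under concurrency pressure.
CREATE WAREHOUSE IF NOT EXISTS analytics_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND = 60            -- seconds of inactivity before suspending
  AUTO_RESUME = TRUE
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3        -- scaling out (more clusters), not up
  SCALING_POLICY = 'STANDARD';
```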
Snowflake stores data in immutable micro-partitions, each holding 50 to 500MB of uncompressed data (the stored size is smaller because data is always compressed). Each micro-partition stores column-level min/max metadata. When you filter on a column, Snowflake skips entire micro-partitions where the filter value falls outside the min/max range. This is partition pruning. Interviewers test whether you understand that the natural ordering of data affects pruning efficiency and that clustering keys improve it.
Clustering keys define the sort order of data within micro-partitions. Good clustering keys match your most common filter columns (typically date or region). Automatic clustering maintains the sort order as new data arrives. Interviewers ask about the tradeoff: clustering improves read performance but costs compute credits for ongoing maintenance. It is only worth it for large tables (multi-TB) with predictable query patterns.
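As a sketch, defining and then inspecting a clustering key (table and column names are assumptions):

```sql
-- Cluster on the most common filter column, typically a date.
ALTER TABLE sales.orders CLUSTER BY (order_date);

-- Inspect clustering quality (depth, overlap) for that key.
SELECT SYSTEM$CLUSTERING_INFORMATION('sales.orders', '(order_date)');
```

Low average depth in the clustering information output means pruning should work well on that column.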
Time travel lets you query historical data versions: SELECT * FROM table AT(TIMESTAMP => '2026-01-14 10:00:00'). The retention period is configurable: 0 to 1 day on Standard edition (default 1), and up to 90 days on Enterprise edition and above. After time travel expires, data enters the 7-day fail-safe period (accessible only by Snowflake support). Interviewers ask about the storage cost implications and how to use time travel for data recovery.
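The common time travel forms, sketched on a hypothetical orders table (the query ID is a placeholder):

```sql
-- Query the table as of a point in time.
SELECT * FROM orders AT (TIMESTAMP => '2026-01-14 10:00:00'::TIMESTAMP_LTZ);

-- Query the table as it was just before a specific statement ran
-- (placeholder query ID shown).
SELECT * FROM orders BEFORE (STATEMENT => '01a2b3c4-0000-1111-2222-333344445555');

-- Recover an accidentally dropped table within the retention window.
UNDROP TABLE orders;
```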
CLONE creates a metadata-only copy of a table, schema, or database. No data is duplicated; the clone references the same micro-partitions. Changes to either the original or clone create new micro-partitions only for the modified data. Interviewers test whether you understand that cloning is instant and nearly free, and that it is the standard approach for creating dev/test environments from production data.
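At the table level, a clone is a single statement; it can also be combined with time travel (names assumed):

```sql
-- Metadata-only copy: instant, no data duplicated up front.
CREATE TABLE orders_backup CLONE orders;

-- Clone a schema as it existed 24 hours ago (offset in seconds).
CREATE SCHEMA analytics_snapshot CLONE analytics AT (OFFSET => -86400);
```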
Explain Snowflake's architecture and how it differs from traditional data warehouses.
Snowflake separates storage, compute, and cloud services into independent layers. Traditional warehouses (Teradata, Oracle) tightly couple compute and storage on the same nodes. This separation means Snowflake can scale compute without moving data, run multiple workloads concurrently without contention, and charge separately for storage and compute. A strong answer mentions that the cloud services layer handles query optimization, metadata management, and access control, and that it runs 24/7 (unlike warehouses which can be suspended).
You have a query that scans 500GB but only needs data from the last 7 days. The table has 3 years of data. How do you optimize it?
First, check if the table is clustered by date. If not, add a clustering key on the date column. This reorganizes micro-partitions so that date ranges are co-located, enabling partition pruning. The query should filter on the date column in the WHERE clause so Snowflake can skip micro-partitions outside the 7-day range. A strong answer quantifies the improvement: with good clustering, the query might scan 3GB instead of 500GB. Also mention checking the query profile to verify partition pruning is working (Partitions scanned vs Partitions total).
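A sketch of the optimized query, with the verification step as a comment (table and column names are assumptions):

```sql
-- Filter directly on the clustering key so pruning can apply.
SELECT order_id, amount
FROM sales.orders
WHERE order_date >= DATEADD(day, -7, CURRENT_DATE());

-- Then open this query's profile and compare "Partitions scanned"
-- vs "Partitions total" on the TableScan node: a small fraction
-- scanned confirms pruning is working.
```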
What is the difference between scaling up and scaling out in Snowflake? When would you choose each?
Scaling up means using a larger warehouse (e.g., Medium to Large). This helps queries that are complex or scan large amounts of data by providing more compute per query. Scaling out means adding clusters to a multi-cluster warehouse. This helps when many concurrent queries are queuing, because each cluster handles queries independently. Choose scaling up for slow individual queries. Choose scaling out for high-concurrency workloads where queries are fast but wait in line. A strong answer mentions that scaling up does not always help: a query bottlenecked on work that cannot be parallelized gains little from a bigger warehouse.
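The two moves are one ALTER statement each (warehouse names are illustrative):

```sql
-- Scale up: more compute per query, for slow individual queries.
ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'LARGE';

-- Scale out: more clusters, for concurrency and queuing.
ALTER WAREHOUSE bi_wh SET MIN_CLUSTER_COUNT = 1, MAX_CLUSTER_COUNT = 4;
```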
Explain Snowpipe and how it differs from the COPY INTO command for data loading.
COPY INTO is a batch operation: you run it manually or on a schedule to load files from a stage into a table. Snowpipe is a continuous, event-driven loader: it watches a stage for new files (via cloud notifications like S3 SQS) and loads them automatically within minutes. COPY INTO uses your warehouse compute; Snowpipe uses Snowflake-managed serverless compute. A strong answer discusses the cost model: Snowpipe charges per-file processing fees that can exceed warehouse costs for very high file volumes, making COPY INTO cheaper for batch loads of large files.
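Side by side, the two loading styles look like this sketch (stage, table, and pipe names are assumptions):

```sql
-- Batch load: runs on your warehouse, on your schedule.
COPY INTO raw.orders
FROM @raw_stage/orders/
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
ON_ERROR = 'SKIP_FILE';

-- Continuous load: serverless, triggered by cloud storage notifications.
CREATE PIPE raw.orders_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw.orders
  FROM @raw_stage/orders/
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
```

Note that the pipe body is itself a COPY INTO statement; Snowpipe changes who runs it and when, not the load semantics.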
How do Snowflake streams and tasks work together for change data capture?
A stream tracks row-level changes (inserts, updates, deletes) on a source table. A task is a scheduled SQL statement that can be triggered on a cron schedule or when a stream has new data. Together, they implement CDC: the stream captures changes since the last consumption, and the task processes those changes into a target table. A strong answer mentions that consuming a stream (reading from it in a DML statement within a task) advances the stream offset, so changes are processed exactly once. Also discuss the CHANGES clause and the difference between standard and append-only streams.
A team is running 200 queries per hour against the same warehouse and experiencing queuing. Diagnose and fix the problem.
Check warehouse load history (or QUERY_HISTORY) for queries that spent time in a queued state. Solutions: enable multi-cluster warehouses with auto-scale mode to add clusters during peak demand. Set MIN_CLUSTER_COUNT and MAX_CLUSTER_COUNT. If some queries are large scans while others are small lookups, separate them into different warehouses by workload type. A strong answer also checks for query optimization opportunities: are any of the 200 queries scanning full tables when they could use partition pruning? Are result caches being used (identical queries should hit the 24-hour result cache)?
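The diagnosis step can be sketched with the INFORMATION_SCHEMA query history table function (queued time is reported in milliseconds):

```sql
-- Find queries that spent significant time queued in the last hour.
SELECT query_id,
       warehouse_name,
       queued_overload_time / 1000 AS queued_seconds,
       total_elapsed_time / 1000  AS elapsed_seconds
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(
       END_TIME_RANGE_START => DATEADD(hour, -1, CURRENT_TIMESTAMP())))
WHERE queued_overload_time > 0
ORDER BY queued_overload_time DESC;
```

High queued_seconds relative to elapsed_seconds points at concurrency, not query speed, as the problem.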
Explain zero-copy cloning. How would you use it to create a dev environment?
CREATE DATABASE dev_db CLONE prod_db creates an instant, full copy of the production database with no additional storage cost. Dev queries run on a separate warehouse, so there is no performance impact on production. As developers modify tables in dev_db, only the changed micro-partitions consume additional storage. A strong answer mentions that cloning preserves grants, so you may need to revoke production-level privileges on the clone. Also discuss using time travel to clone a database at a specific point in time for reproducible testing.
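A sketch of the full dev-environment setup, pinning the clone to a point in time for reproducibility (database, role, and timestamp are illustrative):

```sql
-- Instant dev copy of production as of a fixed point in time.
CREATE DATABASE dev_db CLONE prod_db
  AT (TIMESTAMP => '2026-01-14 00:00:00'::TIMESTAMP_LTZ);

-- Child-object grants carry over from production; tighten them for dev.
REVOKE ALL ON DATABASE dev_db FROM ROLE reporting_role;
GRANT USAGE ON DATABASE dev_db TO ROLE data_engineer;
```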
How do you manage costs in Snowflake? What are the biggest cost drivers?
The two biggest cost drivers are warehouse compute time and storage. For compute: set auto-suspend aggressively (60 seconds for ad-hoc warehouses), right-size warehouses (do not use XL for small queries), use resource monitors to set credit limits. For storage: set appropriate time travel retention (1 day instead of 90 for transient tables), monitor clone storage growth, and drop old temporary tables. A strong answer mentions the ACCOUNT_USAGE schema for historical cost analysis, the importance of warehouse utilization metrics, and that materialized views and automatic clustering also consume credits.
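Resource monitors put a hard ceiling on compute spend; a minimal sketch (quota and names are illustrative):

```sql
-- Cap monthly credits and suspend attached warehouses at the limit.
CREATE RESOURCE MONITOR monthly_cap
  WITH CREDIT_QUOTA = 100
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 80 PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND;

ALTER WAREHOUSE adhoc_wh SET RESOURCE_MONITOR = monthly_cap;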
What are transient and temporary tables in Snowflake? When should you use each?
Transient tables have no fail-safe period (only time travel, and only up to 1 day). Temporary tables exist only for the duration of the session and have no fail-safe. Both reduce storage costs. Use transient tables for staging data, ETL intermediate tables, or any data that can be recreated. Use temporary tables for session-scoped scratch work. A strong answer notes that transient tables persist across sessions (unlike temporary), and that the lack of fail-safe means Snowflake support cannot recover accidentally deleted data.
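The difference is one keyword at creation time (names are assumptions):

```sql
-- Transient: persists across sessions, no fail-safe, time travel capped at 1 day.
CREATE TRANSIENT TABLE staging.orders_raw (payload VARIANT)
  DATA_RETENTION_TIME_IN_DAYS = 0;

-- Temporary: visible only to this session, dropped automatically at session end.
CREATE TEMPORARY TABLE scratch_orders AS
SELECT * FROM staging.orders_raw WHERE payload IS NOT NULL;
```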
Explain Snowflake's role-based access control (RBAC). How do you structure roles for a data engineering team?
Snowflake uses a hierarchy: users are granted roles, roles are granted privileges on objects, and roles can be granted to other roles. Standard structure: SYSADMIN owns databases and warehouses, a DATA_ENGINEER role has read/write on production schemas, a DATA_ANALYST role has read-only access to marts, and a DATA_LOADER role has write access to raw/staging schemas. A strong answer mentions the principle of least privilege, using SECURITYADMIN for role management, and the importance of separating production read access from write access even for engineers.
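A sketch of that structure in grants; the role names and schemas are illustrative, not a standard:

```sql
-- Role management happens under SECURITYADMIN.
USE ROLE SECURITYADMIN;
CREATE ROLE data_engineer;
CREATE ROLE data_analyst;

-- Analysts read marts; engineers read and write curated schemas.
GRANT USAGE ON DATABASE prod TO ROLE data_analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA prod.marts TO ROLE data_analyst;
GRANT ALL ON SCHEMA prod.curated TO ROLE data_engineer;

-- Roll roles up the hierarchy so SYSADMIN retains visibility.
GRANT ROLE data_analyst TO ROLE data_engineer;
GRANT ROLE data_engineer TO ROLE SYSADMIN;
```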
This pattern implements change data capture natively in Snowflake, without external tools. Interviewers expect you to know this as an alternative to Debezium or Fivetran.
-- 1. Create a stream on the source table
CREATE OR REPLACE STREAM raw_orders_stream
ON TABLE raw.orders
SHOW_INITIAL_ROWS = FALSE;
-- 2. Create a target table for the curated layer
CREATE TABLE IF NOT EXISTS curated.orders (
order_id STRING PRIMARY KEY,
customer_id STRING,
amount NUMBER(12,2),
status STRING,
loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);
-- 3. Create a task that runs every 5 minutes
-- WHEN clause means it only runs when new data exists
CREATE OR REPLACE TASK process_orders_task
WAREHOUSE = etl_wh
SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('raw_orders_stream')
AS
MERGE INTO curated.orders t
USING raw_orders_stream s
ON t.order_id = s.order_id
WHEN MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
UPDATE SET t.amount = s.amount,
t.status = s.status,
t.loaded_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
INSERT (order_id, customer_id, amount, status)
VALUES (s.order_id, s.customer_id, s.amount, s.status);
-- 4. Enable the task
ALTER TASK process_orders_task RESUME;

The WHEN clause prevents the task from starting (and consuming warehouse credits) when there is no new data. The MERGE handles both inserts and updates idempotently: a standard stream represents an update as a DELETE plus an INSERT (flagged via METADATA$ISUPDATE), which is why the matched branch also checks METADATA$ACTION = 'INSERT'; source deletes are intentionally ignored here. Once the task consumes the stream, the stream offset advances automatically.
Not understanding the cost model: leaving warehouses running 24/7 when they could auto-suspend after 60 seconds of inactivity
Using clustering keys on small tables (under 1TB) where the maintenance cost exceeds the query performance benefit
Confusing time travel (user-accessible) with fail-safe (Snowflake-support-only), leading to incorrect data recovery expectations
Running COPY INTO without setting an ON_ERROR option such as CONTINUE or SKIP_FILE, causing the entire load to abort on a single bad row
Not using result caching: running identical analytical queries repeatedly instead of using Snowflake's 24-hour result cache
Creating full table copies instead of using zero-copy cloning, wasting storage and time
Snowflake runs SQL. Sharpen your SQL skills with real challenges and walk into your interview ready to discuss architecture, tuning, and cost optimization.