Tech-Specific Question Hub

AWS Glue Interview Questions

AWS Glue interview questions for data engineer roles at AWS-native companies. Glue is AWS's managed serverless ETL service combined with a Hive-Metastore-compatible Data Catalog used by Athena, Redshift Spectrum, EMR, and other AWS analytics services. 30+ questions covering Glue ETL job patterns (PySpark on Glue), crawlers, the Data Catalog, job bookmarks for incremental processing, Glue Streaming, Glue 4.0 performance improvements, and cost optimization. Pair with the complete data engineer interview preparation framework and the guide on how to pass the AWS Data Engineer interview.

The Short Answer
Glue questions appear in 84% of AWS data engineer loops. The depth ranges from L4 understanding the difference between a Glue job and a Glue crawler to L6 designing the entire AWS analytics platform with Glue as the central orchestration and ETL layer. Strong candidates know when Glue is the right choice (vs EMR, vs Lambda, vs Step Functions), can size worker types for given workloads, and understand the cost model well enough to predict job cost from a Spark job shape.
Updated April 2026 · By The DataDriven Team

AWS Glue Topic Frequency in Interviews

From 156 reported AWS DE loops in 2024-2026 that included Glue questions.

| Topic | Test Frequency | Depth Expected |
| --- | --- | --- |
| Glue ETL jobs (PySpark) | 94% | Spark on Glue, transformation patterns |
| Glue Data Catalog | 89% | Hive-Metastore-compatible, integration with Athena / Redshift Spectrum |
| Worker type choice (G.1X, G.2X, G.4X) | 82% | Right-sizing for workload |
| Job bookmarks | 76% | Incremental processing, state management |
| Glue crawlers | 72% | Schema discovery, partition discovery, when to use vs explicit definitions |
| Cost optimization | 67% | Worker count, runtime, Glue Flex execution |
| Glue 4.0 vs 3.0 vs 2.0 | 63% | Performance improvements, Spark version, runtime features |
| Job triggers and scheduling | 61% | EventBridge, on-demand, scheduled triggers |
| Iceberg integration in Glue | 58% | Reading and writing Iceberg tables |
| IAM roles and permissions for Glue | 54% | Cross-account access, S3 permissions |
| Error handling and retries | 53% | Failed-job recovery, partial-success patterns |
| Glue Streaming | 47% | Spark Structured Streaming on Glue |
| Glue Studio (visual editor) | 39% | Pros and cons vs code-first development |
| Glue DataBrew (data prep) | 28% | Newer service, light-touch transformations |

Glue ETL Jobs: The Core Pattern

A Glue ETL job is a PySpark (or Scala) script that reads from a source, transforms, and writes to a sink. Glue handles cluster provisioning (you don't manage EC2), Spark version (Glue 4.0 ships Spark 3.3), and connector libraries. You write the transformation logic; Glue runs it.

The job structure: read from source (S3, JDBC, or another Glue catalog table) using DynamicFrame (Glue's wrapper around DataFrame) or DataFrame directly. Transform via standard Spark SQL or PySpark transformations. Write to sink (S3 in Parquet typically, JDBC for warehouse loads, Glue catalog for catalog-aware writes). Most jobs use DataFrame for readability; the DynamicFrame abstraction is helpful for schema-on-read scenarios but can be skipped for simpler workloads.
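The read-transform-write shape above can be sketched as a minimal Glue job. This is an illustrative skeleton, not a production script: the database, table, and bucket names are made up, and the `awsglue` modules are only available inside the Glue runtime.

```python
import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a catalog table as a DynamicFrame, then drop to a plain DataFrame
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="events")  # hypothetical names
df = dyf.toDF()

# Standard PySpark transformations
clean = df.dropDuplicates(["event_id"]).filter("event_ts IS NOT NULL")

# Write Parquet to S3; wrap back into a DynamicFrame for the Glue writer
glue_context.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(clean, glue_context, "clean"),
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/events/"},
    format="parquet",
)
job.commit()
```

The `toDF()` / `fromDF()` round-trip is the common pattern when you want DynamicFrame readers and writers but standard DataFrame transformations in the middle.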

Worker type sizing is the first cost lever. G.1X (4 vCPU, 16 GB) for most light workloads. G.2X (8 vCPU, 32 GB) for moderate workloads with shuffles. G.4X (16 vCPU, 64 GB) for heavy workloads and large state. G.8X (32 vCPU, 128 GB) for ML feature engineering or very large joins. Worker count determines total DPU (each G.1X worker bills as 1 DPU, G.2X as 2, G.4X as 4, G.8X as 8) and trades runtime against cost; doubling workers roughly halves runtime up to a saturation point.
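Because Glue bills per DPU-hour, job cost is predictable from worker count, DPU per worker, and runtime. A minimal sketch of the arithmetic; the $0.44/DPU-hour figure is an assumed us-east-1 list price, so check current pricing for your region.

```python
# Assumed Glue price per DPU-hour (us-east-1 list price; verify for your region)
PRICE_PER_DPU_HOUR = 0.44

def glue_job_cost(workers: int, dpu_per_worker: int, runtime_hours: float) -> float:
    """Cost = workers x DPU-per-worker x hours x price-per-DPU-hour."""
    return workers * dpu_per_worker * runtime_hours * PRICE_PER_DPU_HOUR

# 20 G.2X workers (2 DPU each) running for 1.5 hours:
print(round(glue_job_cost(20, 2, 1.5), 2))  # 26.4
```

Being able to do this arithmetic out loud is exactly what the cost-model questions probe.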

Six Real AWS Glue Interview Questions

L4

Design a Glue job for daily incremental load from S3 to Snowflake

Read from S3 raw landing (date-partitioned Parquet). Transform: dedup, type-cast, normalize. Write to S3 staging (partitioned by ingest_date). Use Snowflake's COPY INTO to load from S3 staging to target table. Schedule via EventBridge trigger. Use job bookmarks to avoid re-processing S3 files already loaded. Cover: failure handling (retry policy, dead-letter prefix for malformed records), idempotency (target Snowflake table uses MERGE on a deterministic key).
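The bookmark-driven incremental read described above hinges on two things: enabling bookmarks on the job (`--job-bookmark-option job-bookmark-enable`) and giving each bookmarked read a stable `transformation_ctx`. A sketch under those assumptions; paths and names are placeholders, and the script only runs inside the Glue runtime.

```python
import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)  # bookmark state is scoped to the job name

# With bookmarks enabled, only files unseen by prior successful runs are read;
# state is keyed to the transformation_ctx string, so keep it stable
raw = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://landing/events/"], "recurse": True},
    format="parquet",
    transformation_ctx="read_landing_events",
)

# Dedup on a deterministic key so reprocessing stays idempotent
df = raw.toDF().dropDuplicates(["event_id"])

# Stage partitioned Parquet for Snowflake's COPY INTO to pick up
glue_context.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(df, glue_context, "staged"),
    connection_type="s3",
    connection_options={"path": "s3://staging/events/",
                        "partitionKeys": ["ingest_date"]},
    format="parquet",
)

job.commit()  # advances the bookmark only if the run succeeded
```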
L5

When would you use Glue ETL vs EMR vs Lambda for transformation?

Glue ETL: serverless Spark for transformations >5 minutes, <100 GB. Best when Spark is the right tool but you don't want to manage EMR clusters. EMR: when workload requires Hadoop ecosystem tools (Hive, HBase, Hudi), when you have long-running clusters with consistent load, when Spark configuration tuning matters. Lambda: for small transformations <15 minutes and <10 GB memory, S3 event-triggered, single-record or micro-batch. The honest answer: Glue is the default for managed Spark in 2026; EMR for Hadoop-ecosystem needs; Lambda for lightweight event-driven work.
L5

How do Glue job bookmarks work and when do they fail?

Job bookmarks track which S3 files (or JDBC primary keys) have been processed by previous job runs. Subsequent runs skip already-processed input and process only new data. Bookmarks fail when: source changes structure (file paths, schema), upstream systems modify already-processed files (modification time changes invalidate the bookmark), the job logic is non-idempotent so reprocessing produces wrong output. Mitigation: design source for bookmark compatibility (immutable file paths, stable partition structure), bookmark restoration via reset for known-bad runs, monitoring of processed vs available file count.
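The "bookmark restoration via reset" mitigation maps to two boto3 Glue calls; the job name below is a placeholder and the calls require AWS credentials.

```python
import boto3

glue = boto3.client("glue")

# Inspect the current bookmark (run/attempt counters and per-source state)
entry = glue.get_job_bookmark(JobName="daily-events-load")["JobBookmarkEntry"]
print(entry["Run"], entry["Attempt"])

# Reset so the next run reprocesses all input from scratch;
# safe only if the job's writes are idempotent
glue.reset_job_bookmark(JobName="daily-events-load")
```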
L5

Design a Glue Streaming job for real-time enrichment

Source: Kinesis Data Stream or MSK. Glue Streaming job runs Spark Structured Streaming with micro-batch trigger (typically 1 minute). Transform: enrich with reference data (cached in Spark, refreshed periodically), apply business logic, write to sink. Sink options: S3 (event-time partitioned), Redshift (streaming ingestion), DynamoDB (item-by-item). Cover: checkpoint location in S3 for restart, exactly-once via deterministic logic + idempotent sink, watermark for handling late events.
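The micro-batch pattern above maps to `GlueContext.forEachBatch`. A sketch assuming a Kinesis source registered in the Data Catalog; database, table, and bucket names are made up, and the reference-data refresh strategy is deliberately simplified (cached once at job start).

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Streaming source registered in the Data Catalog (hypothetical names)
clicks = glue_context.create_data_frame.from_catalog(
    database="streaming_db",
    table_name="clicks_stream",
    additional_options={"startingPosition": "TRIM_HORIZON"},
)

# Reference data cached at job start; periodic refresh left as an exercise
users = spark.read.parquet("s3://curated/users/").cache()

def process_batch(batch_df, batch_id):
    # Enrich each micro-batch and write event-time-partitioned Parquet
    (batch_df.join(users, "user_id", "left")
             .write.mode("append")
             .partitionBy("event_date")
             .parquet("s3://curated/clicks_enriched/"))

glue_context.forEachBatch(
    frame=clicks,
    batch_function=process_batch,
    options={
        "windowSize": "60 seconds",                       # micro-batch trigger
        "checkpointLocation": "s3://checkpoints/clicks/",  # restart state
    },
)
job.commit()
```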
L5

Right-size Glue workers for a 1TB daily transformation

Estimate: 1 TB of Parquet input typically expands 3-5x when decompressed in memory, so the Spark working set is roughly 3-5 TB. With G.2X workers (~32 GB each), holding that fully in memory would take ~100-160 workers; because Spark spills to disk, half that count often works at a longer runtime. Remember the billing ratio: each G.1X worker is 1 DPU, each G.2X worker is 2 DPU, so 100 G.2X workers bill as 200 DPU. Discuss: if shuffle-heavy (joins on high-cardinality columns), add workers; if compute-bound (regex, complex transformations), use fewer but larger workers (G.4X or G.8X) for more CPU per executor. Always prototype on a 1% sample first.
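The sizing estimate above is just arithmetic, and interviewers expect you to do it on the spot. A sketch; the 4x expansion factor is an assumption you would calibrate per dataset, and spilling means the real worker count can be lower at the cost of runtime.

```python
import math

def workers_needed(input_tb: float, expansion: float,
                   usable_gb_per_worker: float) -> int:
    """Workers required to hold the full Spark working set in memory."""
    working_set_gb = input_tb * 1024 * expansion
    return math.ceil(working_set_gb / usable_gb_per_worker)

# 1 TB Parquet input, assumed 4x in-memory expansion, G.2X (~32 GB each):
workers = workers_needed(1, 4, 32)
print(workers, "G.2X workers =", workers * 2, "DPU")  # 128 G.2X workers = 256 DPU
```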
L6

Design the Glue + Athena + Redshift Spectrum analytics platform

Glue Data Catalog as the central catalog. Glue crawlers discover schema for raw S3 data; explicit table definitions for curated data. Glue ETL jobs transform raw to curated, register tables in catalog. Athena queries curated data directly via catalog. Redshift Spectrum queries curated data via external tables backed by catalog. Quicksight consumes from Athena and Redshift. Cover: catalog permissions via Lake Formation for fine-grained access, column-level security, cross-account access patterns, the lock-in vs interoperability trade-off (Glue Catalog is AWS-proprietary; alternative is self-managed Hive Metastore or Unity Catalog).

Glue 4.0: What Changed

Glue 4.0 (released late 2022, mature in 2024-2026) shipped Spark 3.3, Python 3.10, and several performance improvements that closed the historical gap with self-managed Spark on EMR.

Performance: Glue 4.0 jobs run roughly 30-40% faster than Glue 3.0 on equivalent workloads, primarily due to Spark 3.3's adaptive query execution and improved code generation. New connectors: native Iceberg, Hudi, Delta Lake support out of the box (no custom JARs required). Streaming: Spark Structured Streaming improvements including better state management and lower micro-batch latency.
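The native Iceberg support works by passing the job parameter `--datalake-formats iceberg` and configuring an Iceberg catalog through Spark conf. A sketch under those assumptions; the catalog alias, database, table, and bucket names are made up, and this only runs on a Glue 4.0 cluster with the Iceberg jars loaded.

```python
from pyspark.sql import SparkSession

# Requires the Glue job parameter: --datalake-formats iceberg
spark = (SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog",
            "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse",
            "s3://my-bucket/warehouse/")
    .getOrCreate())

# Write (and register) an Iceberg table backed by the Glue Data Catalog
df = spark.read.parquet("s3://staging/orders/")
df.writeTo("glue_catalog.analytics_db.orders").createOrReplace()
```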

In interviews, mentioning Glue 4.0 specifically signals you know the current generation. Defaulting to Glue 3.0 patterns is a yellow flag in 2026; you should know what changed and why upgrading matters.

How Glue Connects to the Rest of the Cluster

Glue is the ETL component in any AWS data engineer stack (see how to pass the AWS Data Engineer interview) and the Hive-Metastore equivalent behind Redshift Spectrum and Athena (see the AWS Redshift interview prep guide). The system design framework from how to pass the system design round applies to Glue-based architectures with substitutions (Glue ETL replaces self-managed Spark, Glue Catalog replaces Hive Metastore).

For comparison with non-AWS equivalents, see Google BigQuery interview prep (GCP equivalent: Dataflow + BigQuery) and the Azure Data Engineer interview guide (Azure equivalent: Data Factory + Synapse). The concepts transfer; the service names differ.

Data Engineer Interview Prep FAQ

Should I use DynamicFrame or DataFrame in Glue jobs?
DataFrame for most workloads; DynamicFrame when you genuinely benefit from schema-on-read or Glue-specific transformations (resolveChoice, drop_fields with path syntax). DataFrame is the standard Spark API and easier to debug, test, and port if you ever migrate off Glue. Most production teams default to DataFrame.
Is Glue Studio (visual editor) production-ready?
Light usage only. Glue Studio is fine for early prototyping or for analyst-level users. For production-quality jobs, code-first development with Git versioning is essential. The Studio-generated code is verbose and harder to maintain than hand-written PySpark.
When would I use Glue Flex execution?
Glue Flex (released 2022) runs jobs at lower cost (~50% discount) but with no SLA on start time. Best for: non-time-sensitive batch jobs, backfills, dev / staging environments. Not appropriate for: production pipelines with strict SLAs, customer-facing data refreshes.
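Flex is opted into via the `ExecutionClass` field when defining or starting a job. A boto3 sketch; the job name, role ARN, and script path are placeholders, and the calls require AWS credentials.

```python
import boto3

glue = boto3.client("glue")

# All names, ARNs, and paths below are placeholders
glue.create_job(
    Name="nightly-backfill",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    Command={"Name": "glueetl",
             "ScriptLocation": "s3://my-scripts/backfill.py",
             "PythonVersion": "3"},
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=10,
    ExecutionClass="FLEX",  # discounted capacity, no start-time SLA
)

# Flex can also be chosen per run on a job defined as STANDARD:
glue.start_job_run(JobName="nightly-backfill", ExecutionClass="FLEX")
```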
How does Glue cost compare to self-managed Spark on EC2?
Glue is more expensive per DPU-hour than equivalent EC2 capacity (~30-50% premium). The trade-off is operational simplicity: Glue handles cluster provisioning, Spark version management, autoscaling. For workloads under 100 DPU-hours per day, Glue typically wins on total cost (including ops time). At higher scale, self-managed Spark on EMR can be more cost-effective.
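The "Glue wins at low volume, self-managed wins at scale" claim can be made concrete with a back-of-envelope break-even model. Every constant below is an assumption for illustration; plug in your own prices and ops estimates.

```python
# All constants are assumptions -- substitute your own numbers
GLUE_DPU_HOUR = 0.44     # assumed Glue list price per DPU-hour
EC2_EQUIV_HOUR = 0.30    # assumed cost of equivalent EC2/EMR capacity
OPS_HOURS_PER_DAY = 0.5  # assumed daily ops time for self-managed Spark
OPS_HOURLY_RATE = 100.0  # assumed loaded engineer cost per hour

def daily_cost_glue(dpu_hours: float) -> float:
    return dpu_hours * GLUE_DPU_HOUR

def daily_cost_self_managed(dpu_hours: float) -> float:
    # Raw compute plus the ops time Glue would have absorbed
    return dpu_hours * EC2_EQUIV_HOUR + OPS_HOURS_PER_DAY * OPS_HOURLY_RATE

# Under these assumptions Glue wins at low daily volume...
print(daily_cost_glue(100), daily_cost_self_managed(100))  # 44.0 80.0
# ...and self-managed wins at high volume:
print(daily_cost_glue(500), daily_cost_self_managed(500))  # 220.0 200.0
```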
What's the difference between Glue ETL and Glue DataBrew?
Glue ETL: code-first PySpark for full transformation control. Glue DataBrew: visual, no-code data preparation for analyst-level users. They serve different audiences. DataBrew is rare in production data engineering pipelines.
Can Glue handle real-time streaming workloads?
Yes, via Glue Streaming (Spark Structured Streaming on Glue). Best for micro-batch workloads (triggers around one minute). For sub-second latency requirements, Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) is the right choice. Glue Streaming is the right default for most streaming use cases that can tolerate minute-level latency.
Are Glue crawlers worth running in production?
For raw landing zones with unpredictable schemas, yes. For curated tables with stable schemas, prefer explicit table definitions in the Data Catalog over crawler-discovered schemas. Crawler discoveries occasionally produce surprising results (column type changes, partition discoveries gone wrong) that break downstream consumers.
How is the AWS Glue Data Catalog different from a Hive Metastore?
Glue Data Catalog is API-compatible with Hive Metastore for most operations. Differences: Glue is AWS-managed (no operational burden), supports cross-account sharing, integrates with IAM and Lake Formation for security, has higher quotas. Most Hive-aware engines (Spark, Trino, Presto) work with Glue Catalog as a drop-in replacement.

Practice AWS Glue ETL Patterns

Drill the system design patterns relevant to AWS Glue interviews in our practice sandbox.


More Data Engineer Interview Prep Guides

Continue your prep

Data Engineer Interview Prep: explore the full guide

50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.
