AWS Glue Interview Questions
AWS Glue Topic Frequency in Interviews
Based on 156 reported AWS data engineer interview loops from 2024-2026 that included Glue questions.
| Topic | Test Frequency | Depth Expected |
|---|---|---|
| Glue ETL jobs (PySpark) | 94% | Spark on Glue, transformation patterns |
| Worker type choice (G.1X, G.2X, G.4X) | 82% | Right-sizing for workload |
| Glue Data Catalog | 89% | Hive-Metastore-compatible, integration with Athena / Redshift Spectrum |
| Glue crawlers | 72% | Schema discovery, partition discovery, when to use vs explicit definitions |
| Job bookmarks | 76% | Incremental processing, state management |
| Glue Streaming | 47% | Spark Structured Streaming on Glue |
| Glue 4.0 vs 3.0 vs 2.0 | 63% | Performance improvements, Spark version, runtime features |
| Glue Studio (visual editor) | 39% | Pros and cons vs code-first development |
| Glue DataBrew (data prep) | 28% | Newer service, light-touch transformations |
| Iceberg integration in Glue | 58% | Reading and writing Iceberg tables |
| Cost optimization | 67% | Worker count, runtime, Glue Flex execution |
| IAM roles and permissions for Glue | 54% | Cross-account access, S3 permissions |
| Job triggers and scheduling | 61% | EventBridge, on-demand, scheduled triggers |
| Error handling and retries | 53% | Failed-job recovery, partial-success patterns |
Glue ETL Jobs: The Core Pattern
A Glue ETL job is a PySpark (or Scala) script that reads from a source, transforms, and writes to a sink. Glue handles cluster provisioning (you don't manage EC2), Spark version (Glue 4.0 ships Spark 3.3), and connector libraries. You write the transformation logic; Glue runs it.
A typical job has three stages. Read from a source (S3, JDBC, or a Glue catalog table) using either a DynamicFrame (Glue's schema-flexible counterpart to the Spark DataFrame, convertible in both directions) or a DataFrame directly. Transform with standard Spark SQL or PySpark operations. Write to a sink: typically Parquet on S3, JDBC for warehouse loads, or a catalog-aware write that updates the Glue catalog. Most jobs use DataFrame for readability; DynamicFrame helps in schema-on-read scenarios but can be skipped for simpler workloads.
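The read, transform, write structure described above can be sketched as a minimal Glue PySpark script. This is an illustration under stated assumptions, not a reference implementation: the database `raw_db`, the table `orders`, its columns, the `target_bucket` job parameter, and the `output_path` helper are all hypothetical names. The Glue-only imports sit inside `main()` so the file can also be imported outside the Glue runtime.

```python
import sys


def output_path(bucket: str, table: str) -> str:
    """Build the S3 prefix for a curated table (hypothetical layout)."""
    return f"s3://{bucket}/curated/{table}/"


def main() -> None:
    # These modules are only available inside the Glue runtime.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from pyspark.sql import functions as F

    # target_bucket is an assumed custom job parameter (--target_bucket).
    args = getResolvedOptions(sys.argv, ["JOB_NAME", "target_bucket"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read: catalog table -> DynamicFrame -> DataFrame.
    orders = glue_context.create_dynamic_frame.from_catalog(
        database="raw_db", table_name="orders"  # hypothetical names
    ).toDF()

    # Transform: plain PySpark on the DataFrame.
    daily_revenue = (
        orders.filter(F.col("status") == "COMPLETE")
        .withColumn("order_date", F.to_date("created_at"))
        .groupBy("order_date")
        .agg(F.sum("amount").alias("revenue"))
    )

    # Write: partitioned Parquet on S3 (full overwrite of the target prefix).
    daily_revenue.write.mode("overwrite").partitionBy("order_date").parquet(
        output_path(args["target_bucket"], "daily_revenue")
    )
    job.commit()


# Glue passes --JOB_NAME on the command line; skip the job body elsewhere.
if "--JOB_NAME" in sys.argv:
    main()
```

Converting to a DataFrame immediately after the catalog read reflects the readability preference mentioned above; a job that needs schema-on-read handling would keep working on the DynamicFrame for longer.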
Worker type sizing is the first cost lever. G.1X (4 vCPU, 16 GB, 1 DPU) suits most light workloads. G.2X (8 vCPU, 32 GB, 2 DPU) suits moderate workloads with shuffles. G.4X (16 vCPU, 64 GB, 4 DPU) suits heavy workloads and large state. G.8X (32 vCPU, 128 GB, 8 DPU) suits ML feature engineering or very large joins. Worker count trades runtime against cost: Glue bills per DPU-hour, so doubling the workers roughly halves the runtime at about the same cost, up to a saturation point beyond which extra workers add cost without shortening the job.
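The runtime-versus-cost trade-off can be made concrete with a little arithmetic. The sketch below assumes per-DPU-hour billing at a price of 0.44 USD (an assumed list price; check current pricing for your region) and uses made-up runtimes purely for illustration. `estimate_job_cost` is a hypothetical helper, not part of any AWS SDK.

```python
# DPUs per worker for the G-series worker types.
DPU_PER_WORKER = {"G.1X": 1, "G.2X": 2, "G.4X": 4, "G.8X": 8}

# Assumption: price per DPU-hour in USD; verify against current regional pricing.
ASSUMED_PRICE_PER_DPU_HOUR = 0.44


def estimate_job_cost(
    worker_type: str,
    num_workers: int,
    runtime_minutes: float,
    price_per_dpu_hour: float = ASSUMED_PRICE_PER_DPU_HOUR,
) -> float:
    """Estimate one job run's cost as DPUs x hours x price."""
    dpus = DPU_PER_WORKER[worker_type] * num_workers
    return dpus * (runtime_minutes / 60) * price_per_dpu_hour


# Illustrative runtimes (not measurements):
baseline = estimate_job_cost("G.1X", 10, 60)   # 10 workers, 60 minutes
scaled = estimate_job_cost("G.1X", 20, 30)     # doubled workers, halved runtime
saturated = estimate_job_cost("G.1X", 40, 25)  # past saturation: little speedup

print(f"baseline={baseline:.2f} scaled={scaled:.2f} saturated={saturated:.2f}")
```

While scaling is close to linear, the faster run costs about the same as the baseline. Once the job stops speeding up, each added worker only raises the bill, which is the saturation point to look for when right-sizing.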
Six Real AWS Glue Interview Questions
Design a Glue job for daily incremental load from S3 to Snowflake
When would you use Glue ETL vs EMR vs Lambda for transformation?
How do Glue job bookmarks work and when do they fail?
Design a Glue Streaming job for real-time enrichment
Right-size Glue workers for a 1TB daily transformation
Design the Glue + Athena + Redshift Spectrum analytics platform
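For the job-bookmark question in the list above, it can help to have a mental model to reason from. The following is a deliberately simplified toy model in plain Python, not Glue's actual implementation: it assumes a bookmark is just a high-watermark over S3 object last-modified timestamps, and that the state only advances when the run commits (in a real job, the `job.commit()` call). The `Bookmark` class and `incremental_run` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Bookmark:
    """Toy stand-in for the state kept between runs of one job."""
    high_watermark: float = 0.0  # newest last-modified timestamp committed so far


def incremental_run(
    objects: Dict[str, float],
    bookmark: Bookmark,
    process: Callable[[List[str]], None],
    commit: bool = True,
) -> List[str]:
    """Process only objects newer than the bookmark; advance it on commit.

    objects maps an object key to its last-modified time (epoch seconds).
    """
    new_keys = sorted(k for k, ts in objects.items() if ts > bookmark.high_watermark)
    process(new_keys)
    if commit and new_keys:
        bookmark.high_watermark = max(objects[k] for k in new_keys)
    return new_keys


bucket = {"day1.parquet": 100.0, "day2.parquet": 200.0}
state = Bookmark()
processed: List[str] = []

first = incremental_run(bucket, state, processed.extend)   # picks up both files
second = incremental_run(bucket, state, processed.extend)  # nothing new

bucket["day3.parquet"] = 300.0
uncommitted = incremental_run(bucket, state, processed.extend, commit=False)
retry = incremental_run(bucket, state, processed.extend)   # day3 is seen again
```

Even this toy version surfaces two failure modes worth raising in an interview: a run that does its work but never commits will reprocess the same input next time, and a file that arrives with a timestamp older than the watermark is silently skipped.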
Glue 4.0: What Changed
Glue 4.0 (released late 2022, mature in 2024-2026) shipped Spark 3.3, Python 3.10, and several performance improvements that closed the historical gap with self-managed Spark on EMR.
Performance: Glue 4.0 jobs run roughly 30-40% faster than Glue 3.0 on equivalent workloads, primarily due to Spark 3.3's adaptive query execution and improved code generation. New connectors: native Iceberg, Hudi, Delta Lake support out of the box (no custom JARs required). Streaming: Spark Structured Streaming improvements including better state management and lower micro-batch latency.
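As an illustration of the native Iceberg support, the sketch below builds the Spark settings commonly used to point an Iceberg catalog at the Glue Data Catalog. It assumes the job was started with the `--datalake-formats iceberg` job parameter so the Iceberg libraries are on the classpath. The catalog name `glue_catalog`, the warehouse path, and the helper names `iceberg_conf` and `build_session` are assumptions for the example.

```python
from typing import Dict


def iceberg_conf(warehouse: str, catalog: str = "glue_catalog") -> Dict[str, str]:
    """Spark settings that back an Iceberg catalog with the Glue Data Catalog."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    }


def build_session(warehouse: str):
    """Create a SparkSession with the Iceberg settings applied.

    Must run before any other SparkSession or SparkContext is created,
    otherwise the settings may not take effect.
    """
    from pyspark.sql import SparkSession  # available in the Glue runtime

    builder = SparkSession.builder
    for key, value in iceberg_conf(warehouse).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Hypothetical warehouse location, shown only to print the resulting settings.
for key, value in iceberg_conf("s3://example-bucket/warehouse/").items():
    print(f"{key}={value}")
```

With a session built this way, tables are addressed as `glue_catalog.<database>.<table>` in Spark SQL. The same key/value pairs can instead be supplied through the job's `--conf` parameter, which keeps the script free of environment-specific settings.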
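In interviews, mentioning Glue 4.0 specifically signals you know the modern runtime. Defaulting to Glue 3.0 patterns is a yellow flag in 2026; you should know what changed and why upgrading matters. Glue 5.0 (generally available since December 2024, built on Spark 3.5) is newer still, so be ready to say which version your examples target.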
How Glue Connects to the Rest of the Cluster
Glue is the ETL component of the AWS stack covered in "How to pass the AWS Data Engineer interview", and its Data Catalog is the Hive Metastore equivalent for Athena and Redshift Spectrum (see "AWS Redshift interview prep"). The system design framework from "How to pass the system design round" applies to Glue-based architectures with substitutions (Glue ETL replaces self-managed Spark, Glue Catalog replaces Hive Metastore).
For comparison with non-AWS equivalents, see "Google BigQuery interview prep" (GCP equivalent: Dataflow + BigQuery) and "How to pass the Azure Data Engineer interview" (Azure equivalent: Data Factory + Synapse). The concepts transfer; the service names differ.
Data engineer interview prep FAQ
Should I use DynamicFrame or DataFrame in Glue jobs?
Is Glue Studio (visual editor) production-ready?
When would I use Glue Flex execution?
How does Glue cost compare to self-managed Spark on EC2?
What's the difference between Glue ETL and Glue DataBrew?
Can Glue handle real-time streaming workloads?
Are Glue crawlers worth running in production?
How is the AWS Glue Data Catalog different from a Hive Metastore?
Adjacent Data Engineer Interview Prep Reading
The full AWS data engineer loop framework with Glue as central ETL.
Glue Data Catalog integration with Redshift Spectrum.
Pillar guide covering every round in the Data Engineer loop, end to end.