AWS Glue interview questions for data engineer roles at AWS-native companies. Glue is AWS's managed serverless ETL service, paired with a Hive-Metastore-compatible Data Catalog used by Athena, Redshift Spectrum, EMR, and other AWS analytics services. 30+ questions covering Glue ETL job patterns (PySpark on Glue), crawlers, the Data Catalog, job bookmarks for incremental processing, Glue Streaming, Glue 4.0 performance improvements, and cost optimization. Pair this with the complete data engineer interview preparation framework and the guide to passing the AWS Data Engineer interview.
From 156 reported AWS DE loops in 2024-2026 that included Glue questions.
| Topic | Test Frequency | Depth Expected |
|---|---|---|
| Glue ETL jobs (PySpark) | 94% | Spark on Glue, transformation patterns |
| Worker type choice (G.1X, G.2X, G.4X) | 82% | Right-sizing for workload |
| Glue Data Catalog | 89% | Hive-Metastore-compatible, integration with Athena / Redshift Spectrum |
| Glue crawlers | 72% | Schema discovery, partition discovery, when to use vs explicit definitions |
| Job bookmarks | 76% | Incremental processing, state management |
| Glue Streaming | 47% | Spark Structured Streaming on Glue |
| Glue 4.0 vs 3.0 vs 2.0 | 63% | Performance improvements, Spark version, runtime features |
| Glue Studio (visual editor) | 39% | Pros and cons vs code-first development |
| Glue DataBrew (data prep) | 28% | Newer service, light-touch transformations |
| Iceberg integration in Glue | 58% | Reading and writing Iceberg tables |
| Cost optimization | 67% | Worker count, runtime, Glue Flex execution |
| IAM roles and permissions for Glue | 54% | Cross-account access, S3 permissions |
| Job triggers and scheduling | 61% | EventBridge, on-demand, scheduled triggers |
| Error handling and retries | 53% | Failed-job recovery, partial-success patterns |
A Glue ETL job is a PySpark (or Scala) script that reads from a source, transforms, and writes to a sink. Glue handles cluster provisioning (you don't manage EC2), Spark version (Glue 4.0 ships Spark 3.3), and connector libraries. You write the transformation logic; Glue runs it.
The job structure: read from source (S3, JDBC, or another Glue catalog table) using DynamicFrame (Glue's wrapper around DataFrame) or DataFrame directly. Transform via standard Spark SQL or PySpark transformations. Write to sink (S3 in Parquet typically, JDBC for warehouse loads, Glue catalog for catalog-aware writes). Most jobs use DataFrame for readability; the DynamicFrame abstraction is helpful for schema-on-read scenarios but can be skipped for simpler workloads.
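The read-transform-write structure above can be sketched as a minimal Glue PySpark script. This is a non-runnable sketch (the `awsglue` libraries exist only inside the Glue runtime), and the database, table, and bucket names are hypothetical placeholders:

```python
# Minimal Glue ETL job sketch: catalog read -> DataFrame transform -> S3 write.
# Runs only inside the Glue runtime; names below are hypothetical.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read a Data Catalog table as a DynamicFrame (schema-on-read wrapper)
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders_raw")

# Drop to a plain DataFrame for standard Spark transformations
df = dyf.toDF().filter("order_status = 'COMPLETED'")

# Write partitioned Parquet to the sink
(df.write.mode("overwrite")
   .partitionBy("order_date")
   .parquet("s3://example-bucket/curated/orders/"))

job.commit()
```

The `job.init` / `job.commit` pair is what makes job bookmarks work: state is only advanced when the job commits, which is why skipping `job.commit()` is a classic cause of reprocessed data.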
Worker type sizing is the first cost lever. G.1X (4 vCPU, 16 GB) suits most light workloads; G.2X (8 vCPU, 32 GB) moderate workloads with shuffles; G.4X (16 vCPU, 64 GB) heavy workloads with large state; G.8X (32 vCPU, 128 GB) ML feature engineering or very large joins. Worker count is the second lever: each worker maps to a fixed number of DPUs (G.1X = 1 DPU, G.2X = 2, and so on), so cost scales with workers × DPUs per worker × runtime. Doubling workers roughly halves runtime up to a saturation point, which keeps cost roughly flat until the job stops scaling linearly.
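The workers × DPUs × runtime tradeoff is easy to reason about with a back-of-envelope cost model. A sketch, assuming the published $0.44/DPU-hour on-demand rate (us-east-1 at the time of writing; verify for your region) and ignoring the per-run minimum billing duration:

```python
# Back-of-envelope Glue cost model. Rate and billing minimums are
# assumptions: $0.44/DPU-hour is the us-east-1 on-demand price at the
# time of writing, and the 1-minute billing minimum is ignored.
DPU_PER_WORKER = {"G.1X": 1, "G.2X": 2, "G.4X": 4, "G.8X": 8}
RATE_PER_DPU_HOUR = 0.44

def glue_job_cost(worker_type: str, num_workers: int, runtime_minutes: float) -> float:
    """Estimated on-demand cost in USD for a single job run."""
    dpu_hours = DPU_PER_WORKER[worker_type] * num_workers * runtime_minutes / 60
    return round(dpu_hours * RATE_PER_DPU_HOUR, 2)

# Doubling workers roughly halves runtime, so DPU-hours (and cost)
# stay flat while the job still scales linearly:
print(glue_job_cost("G.2X", 10, 60))  # 20 DPU-hours -> 8.8
print(glue_job_cost("G.2X", 20, 30))  # same 20 DPU-hours -> 8.8
```

This is why "just add workers" is often free in dollar terms but not always: past the saturation point, runtime stops halving and DPU-hours start climbing.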
Glue 4.0 (released late 2022, mature in 2024-2026) shipped Spark 3.3, Python 3.10, and several performance improvements that closed the historical gap with self-managed Spark on EMR.
Performance: Glue 4.0 jobs run roughly 30-40% faster than Glue 3.0 on equivalent workloads, primarily due to Spark 3.3's adaptive query execution and improved code generation. New connectors: native Iceberg, Hudi, Delta Lake support out of the box (no custom JARs required). Streaming: Spark Structured Streaming improvements including better state management and lower micro-batch latency.
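The out-of-the-box format support is switched on per job rather than per account. A sketch of the Iceberg wiring on Glue 4.0, assuming a Glue-Catalog-backed Iceberg catalog named `glue_catalog` and a hypothetical warehouse bucket (this only runs inside the Glue runtime):

```python
# Job parameters (set in the job configuration, not in code):
#   --datalake-formats iceberg
#   --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
#   --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
#   --conf spark.sql.catalog.glue_catalog.warehouse=s3://example-bucket/warehouse/
#   --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
#   --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO

# With that configuration, Iceberg tables are plain Spark SQL targets;
# database and table names below are hypothetical.
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue_catalog.sales_db.orders_iceberg
    (order_id bigint, order_status string, order_date date)
    USING iceberg
    PARTITIONED BY (order_date)
""")

# DataFrameWriterV2 append into the Iceberg table
df.writeTo("glue_catalog.sales_db.orders_iceberg").append()
```

On Glue 3.0 the same setup required bundling Iceberg JARs yourself; the `--datalake-formats` parameter is the Glue 4.0 simplification interviewers tend to probe for.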
In interviews, mentioning Glue 4.0 specifically signals you know the current generation. Defaulting to Glue 3.0 patterns is a yellow flag in 2026; you should know what changed and why upgrading matters.
Glue is the ETL component in any AWS data engineer interview stack and the Hive-Metastore equivalent behind Redshift Spectrum and Athena. The framework from the system design round applies to Glue-based architectures with substitutions (Glue ETL replaces self-managed Spark, Glue Catalog replaces the Hive Metastore).
For comparison with non-AWS equivalents, see the Google BigQuery interview prep (GCP equivalent: Dataflow + BigQuery) and the Azure Data Engineer interview guide (Azure equivalent: Data Factory + Synapse). The concepts transfer; the service names differ.
Continue your prep

- The full AWS data engineer loop framework, with Glue as the central ETL service.
- Glue Data Catalog integration with Redshift Spectrum.
- Pillar guide covering every round in the Data Engineer loop, end to end.
- The full SQL interview question bank, indexed by topic, difficulty, and company.
- BigQuery internals, slot-based pricing, partitioning, and clustering interview prep.
- Redshift sort keys, dist keys, compression, and RA3 architecture interview prep.
- Postgres MVCC, indexing, partitioning, and replication interview prep.
- Apache Flink stateful streaming, watermarks, exactly-once, checkpointing interview prep.
- Hadoop ecosystem (HDFS, MapReduce, YARN, Hive) interview prep, including modern relevance.

50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.