AWS Glue Interview Questions

AWS Glue interview questions for data engineer roles at AWS-native companies. Glue is AWS's managed serverless ETL service combined with a Hive-Metastore-compatible Data Catalog used by Athena, Redshift Spectrum, EMR, and other AWS analytics services. 30+ questions covering Glue ETL job patterns (PySpark on Glue), crawlers, Data Catalog, job bookmarks for incremental processing, Glue Streaming, Glue 4.0 performance improvements, and cost optimization. Pair this guide with the complete data engineer interview preparation framework and the guide on how to pass the AWS Data Engineer interview.

AWS Glue Topics That Show Up in Interviews

Topics ranked roughly by how often they appear in AWS-heavy data engineer loops.

Topic	Frequency	Depth Expected
Glue ETL jobs (PySpark)	Very common	Spark on Glue, transformation patterns
Worker type choice (G.1X, G.2X, G.4X)	Common	Right-sizing for workload
Glue Data Catalog	Very common	Hive-Metastore-compatible, integration with Athena / Redshift Spectrum
Glue crawlers	Common	Schema discovery, partition discovery, when to use vs explicit definitions
Job bookmarks	Common	Incremental processing, state management
Glue Streaming	Occasional	Spark Structured Streaming on Glue
Glue 4.0 vs 3.0 vs 2.0	Common	Performance improvements, Spark version, runtime features
Glue Studio (visual editor)	Occasional	Pros and cons vs code-first development
Glue DataBrew (data prep)	Rare	Newer service, light-touch transformations
Iceberg integration in Glue	Common	Reading and writing Iceberg tables
Cost optimization	Common	Worker count, runtime, Glue Flex execution
IAM roles and permissions for Glue	Common	Cross-account access, S3 permissions
Job triggers and scheduling	Common	EventBridge, on-demand, scheduled triggers
Error handling and retries	Common	Failed-job recovery, partial-success patterns

Glue ETL Jobs: The Core Pattern

A Glue ETL job is a PySpark (or Scala) script that reads from a source, transforms, and writes to a sink. Glue handles cluster provisioning (you don't manage EC2), Spark version (Glue 4.0 ships Spark 3.3), and connector libraries. You write the transformation logic; Glue runs it.

The job structure: read from source (S3, JDBC, or another Glue catalog table) using DynamicFrame (Glue's wrapper around DataFrame) or DataFrame directly. Transform via standard Spark SQL or PySpark transformations. Write to sink (S3 in Parquet typically, JDBC for warehouse loads, Glue catalog for catalog-aware writes). Most jobs use DataFrame for readability; the DynamicFrame abstraction is helpful for schema-on-read scenarios but can be skipped for simpler workloads.
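The read, transform, write structure can be sketched as a short script. This is a minimal illustration under assumptions, not a reference implementation: the database `raw_db`, table `events`, key column `event_id`, partition column `ingest_date`, and the bucket path are all hypothetical. The Glue and Spark imports sit inside the function because those modules are only available in the Glue runtime.

```python
def sink_options(path, partition_keys):
    """Build S3 connection options for a partitioned write (pure helper)."""
    return {"path": path, "partitionKeys": list(partition_keys)}


def run_job(argv):
    # These modules exist only inside the Glue runtime, so import lazily.
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read: catalog table -> DynamicFrame. The transformation_ctx names
    # the bookmark state kept for this source.
    source = glue_context.create_dynamic_frame.from_catalog(
        database="raw_db",        # hypothetical
        table_name="events",      # hypothetical
        transformation_ctx="source",
    )

    # Transform: convert to a plain DataFrame for standard Spark operations.
    df = source.toDF().dropDuplicates(["event_id"])  # hypothetical key

    # Write: back to a DynamicFrame, then Parquet on S3.
    glue_context.write_dynamic_frame.from_options(
        frame=DynamicFrame.fromDF(df, glue_context, "curated"),
        connection_type="s3",
        connection_options=sink_options(
            "s3://example-bucket/curated/events/", ["ingest_date"]
        ),
        format="parquet",
        transformation_ctx="sink",
    )
    job.commit()  # persists bookmark state when bookmarks are enabled
```

In a Glue script the entry point would be `run_job(sys.argv)`. The round trip through `toDF()` and `DynamicFrame.fromDF()` is one way to keep the transformation in standard Spark while still using Glue's catalog-aware readers and writers.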

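Worker type sizing is the first cost lever. G.1X (4 vCPU, 16 GB, 1 DPU per worker) suits most light workloads. G.2X (8 vCPU, 32 GB, 2 DPU) suits moderate workloads with shuffles. G.4X (16 vCPU, 64 GB, 4 DPU) suits heavy workloads and large state. G.8X (32 vCPU, 128 GB, 8 DPU) suits ML feature engineering or very large joins. Billing is per DPU-hour, so total DPUs equal the number of workers times the DPU weight of the worker type. Worker count trades off runtime against cost: doubling workers roughly halves runtime up to a saturation point, after which extra workers add cost without much speedup.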
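The runtime-versus-cost trade-off is easy to check with arithmetic. The sketch below assumes a rate of $0.44 per DPU-hour (an assumed list price; check current pricing for your region) and uses made-up runtimes to show the saturation effect.

```python
# DPU weight per worker type: a G.1X worker is 1 DPU, larger types scale up.
DPU_PER_WORKER = {"G.1X": 1, "G.2X": 2, "G.4X": 4, "G.8X": 8}


def job_cost(worker_type, workers, runtime_hours, rate_per_dpu_hour=0.44):
    """Cost = DPUs x hours x rate. The default rate is an assumed list price."""
    dpus = DPU_PER_WORKER[worker_type] * workers
    return round(dpus * runtime_hours * rate_per_dpu_hour, 2)


# Doubling workers while runtime halves leaves cost flat...
baseline = job_cost("G.1X", 10, 2.0)
scaled = job_cost("G.1X", 20, 1.0)
# ...but past saturation the runtime barely shrinks and cost climbs.
saturated = job_cost("G.1X", 40, 0.9)
```

Here `baseline` and `scaled` cost the same, while `saturated` costs nearly twice as much for a 10% faster run. That is the pattern to look for when tuning worker count.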

Six Real AWS Glue Interview Questions

Design a Glue job for daily incremental load from S3 to Snowflake

Read from S3 raw landing (date-partitioned Parquet). Transform: dedup, type-cast, normalize. Write to S3 staging (partitioned by ingest_date). Use Snowflake’s COPY INTO to load from S3 staging to target table. Schedule via EventBridge trigger. Use job bookmarks to avoid re-processing S3 files already loaded. Cover: failure handling (retry policy, dead-letter prefix for malformed records), idempotency (target Snowflake table uses MERGE on a deterministic key).
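The dedup, dead-letter, and idempotent-merge ideas above can be modeled in a few lines of plain Python. This is a toy sketch of the logic rather than Glue or Snowflake code: the column names `order_id`, `updated_at`, and `amount` are hypothetical, and the `merge` function only imitates what a MERGE on a deterministic key achieves.

```python
def clean_batch(records, key="order_id", version="updated_at"):
    """Split a raw batch into deduplicated good rows and dead-letter rows."""
    latest, dead_letter = {}, []
    for rec in records:
        try:
            row = {key: rec[key], version: rec[version],
                   "amount": float(rec["amount"])}  # type-cast
        except (KeyError, TypeError, ValueError):
            dead_letter.append(rec)  # malformed: park it, do not fail the job
            continue
        current = latest.get(row[key])
        if current is None or row[version] > current[version]:
            latest[row[key]] = row  # dedup: keep the newest version per key
    return list(latest.values()), dead_letter


def merge(target, rows, key="order_id"):
    """Upsert by key, like a MERGE: re-running the same batch is a no-op."""
    for row in rows:
        target[row[key]] = row
    return target


raw = [
    {"order_id": 1, "updated_at": "2026-01-01", "amount": "10.5"},
    {"order_id": 1, "updated_at": "2026-01-02", "amount": "11.0"},
    {"order_id": 2, "updated_at": "2026-01-01", "amount": "oops"},
]
good, bad = clean_batch(raw)
table = merge(merge({}, good), good)  # loaded twice, still one row
```

Loading the same batch twice leaves `table` unchanged, which is the property that makes a retried job run safe.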

When would you use Glue ETL vs EMR vs Lambda for transformation?

Glue ETL: serverless Spark, as a rule of thumb for transformations that run longer than 5 minutes on less than 100 GB per run. Best when Spark is the right tool but you don’t want to manage EMR clusters. EMR: when the workload requires Hadoop ecosystem tools (Hive, HBase, Presto), when you have long-running clusters with consistent load, or when Spark configuration tuning matters. Lambda: for small transformations that fit within the 15-minute timeout and 10 GB memory limit, typically S3 event-triggered, single-record or micro-batch. In short: Glue is the default for managed Spark in 2026, EMR for Hadoop-ecosystem needs, Lambda for lightweight event-driven work.
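The rule of thumb above can be written as a small routing function. This is only a sketch of the heuristic: the function name and parameters are hypothetical, and a real decision would also weigh team skills, existing infrastructure, and cost.

```python
def choose_engine(runtime_min, memory_gb, event_triggered=False,
                  needs_hadoop_tools=False, steady_cluster_load=False):
    """Rule-of-thumb routing between Lambda, EMR, and Glue."""
    if needs_hadoop_tools or steady_cluster_load:
        return "EMR"
    # Lambda only fits inside its 15-minute and 10 GB limits.
    if event_triggered and runtime_min < 15 and memory_gb < 10:
        return "Lambda"
    return "Glue"
```

For example, `choose_engine(2, 1, event_triggered=True)` routes to Lambda, while a 45-minute Spark job with no Hadoop dependencies falls through to Glue.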

How do Glue job bookmarks work and when do they fail?

Job bookmarks track which S3 files (or, for JDBC sources, which bookmark key values) have been processed by previous job runs. Subsequent runs skip already-processed input and process only new data. Bookmarks fail when: the source changes structure (file paths, schema); upstream systems modify already-processed files (a changed modification time makes the file look new again); or the job logic is non-idempotent, so reprocessing produces wrong output. Mitigation: design the source for bookmark compatibility (immutable file paths, stable partition structure), reset the bookmark after known-bad runs, and monitor processed versus available file counts.
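The modified-file failure mode is easier to see in a toy model. The sketch below is a simplified illustration of bookmark-style state tracking, not Glue's internal implementation: the function names are hypothetical and modification times are plain integers.

```python
def select_new_files(listing, bookmark):
    """Return files to process: unseen paths, or paths whose mtime changed.

    listing:  {path: last_modified} for the source prefix
    bookmark: {path: last_modified} recorded by previous successful runs
    """
    return sorted(p for p, mtime in listing.items() if bookmark.get(p) != mtime)


def commit(bookmark, listing, processed):
    """Record processed files; call only after the run succeeds."""
    bookmark.update({p: listing[p] for p in processed})
    return bookmark


bookmark = {}
day_1 = {"raw/dt=01/a.parquet": 100}
run_1 = select_new_files(day_1, bookmark)
commit(bookmark, day_1, run_1)

day_2 = {"raw/dt=01/a.parquet": 100, "raw/dt=02/b.parquet": 200}
run_2 = select_new_files(day_2, bookmark)  # only the new file
commit(bookmark, day_2, run_2)

# Upstream rewrites an already-processed file: it is picked up again.
day_3 = {"raw/dt=01/a.parquet": 300, "raw/dt=02/b.parquet": 200}
run_3 = select_new_files(day_3, bookmark)
```

In this model `run_3` reprocesses `a.parquet`. Whether that is harmless depends on the sink being idempotent, which is why immutable source paths matter.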

Design a Glue Streaming job for real-time enrichment

Source: Kinesis Data Stream or MSK. Glue Streaming job runs Spark Structured Streaming with micro-batch trigger (typically 1 minute). Transform: enrich with reference data (cached in Spark, refreshed periodically), apply business logic, write to sink. Sink options: S3 (event-time partitioned), Redshift (streaming ingestion), DynamoDB (item-by-item). Cover: checkpoint location in S3 for restart, exactly-once via deterministic logic + idempotent sink, watermark for handling late events.
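The watermark and idempotent-sink points can be illustrated with a toy micro-batch loop in plain Python. This is a simplified model of the semantics, not Spark Structured Streaming code: the field names (`id`, `ts`, `device`), the `reference` lookup, and the 120-second lateness allowance are all hypothetical.

```python
def process_micro_batch(events, state, sink, reference, allowed_lateness=120):
    """Toy model of one micro-batch: watermark, enrich, idempotent upsert."""
    # The watermark trails the max event time seen in earlier batches.
    watermark = state["max_event_time"] - allowed_lateness
    dropped = []
    for event in events:
        if event["ts"] < watermark:
            dropped.append(event["id"])  # too late: behind the watermark
            continue
        enriched = dict(event, plant=reference.get(event["device"], "unknown"))
        sink[event["id"]] = enriched  # keyed write: a replay overwrites, never duplicates
    state["max_event_time"] = max(
        [state["max_event_time"]] + [event["ts"] for event in events]
    )
    return dropped


reference = {"dev-1": "plant-a"}  # hypothetical lookup table
state, sink = {"max_event_time": 0}, {}
batch_1 = [{"id": "a", "ts": 1000, "device": "dev-1"},
           {"id": "b", "ts": 1050, "device": "dev-2"}]
batch_2 = [{"id": "c", "ts": 1000, "device": "dev-1"},
           {"id": "d", "ts": 900, "device": "dev-1"}]

dropped_1 = process_micro_batch(batch_1, state, sink, reference)
dropped_2 = process_micro_batch(batch_2, state, sink, reference)
dropped_replay = process_micro_batch(batch_1, state, sink, reference)  # restart replay
```

Event `d` arrives more than 120 seconds behind the newest event and is dropped, while replaying `batch_1` after a restart leaves the sink with the same three rows. In a real job the checkpoint location holds the equivalent of `state`.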

Right-size Glue workers for a 1TB daily transformation

Estimate: 1 TB of Parquet input typically expands ~3-5x in memory, so the Spark working set is roughly 3-5 TB. With G.2X workers (32 GB each), holding all of that in memory at once would take ~100-150 workers. In Glue, a G.1X worker counts as 1 DPU and a G.2X worker as 2 DPU, so that in-memory ceiling is ~200-300 DPU. Spark processes partitions in waves and spills to disk, so the job does not need the full working set resident; a starting target of ~75-100 DPU is reasonable, adjusted after a test run. Discuss: if shuffle-heavy (joins on high-cardinality columns), increase workers; if compute-bound (regex, complex transformations), reduce workers but use G.4X or G.8X for more CPU per executor. Always prototype on a 1% sample first.
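The memory estimate can be worked through in code. This sketch computes only the upper bound, meaning the workers needed to hold the whole expanded dataset in memory at once. It assumes 32 GB per G.2X worker and that a G.2X worker counts as 2 DPU; Spark processes data in stages and can spill to disk, so practical targets are usually lower than this ceiling.

```python
import math


def in_memory_workers(input_tb, expansion, worker_mem_gb=32):
    """Workers needed to hold the whole expanded dataset in memory at once."""
    working_set_gb = input_tb * 1000 * expansion
    return math.ceil(working_set_gb / worker_mem_gb)


low = in_memory_workers(1, 3)    # 3x expansion
high = in_memory_workers(1, 5)   # 5x expansion
dpu_ceiling = (low * 2, high * 2)  # G.2X assumed to be 2 DPU per worker
```

With these inputs the bound is 94 to 157 workers. The useful interview move is to state the ceiling, explain why the job does not need it, and commit to measuring on a sample.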

Design the Glue + Athena + Redshift Spectrum analytics platform

Glue Data Catalog as the central catalog. Glue crawlers discover schema for raw S3 data; explicit table definitions for curated data. Glue ETL jobs transform raw to curated and register tables in the catalog. Athena queries curated data directly via the catalog. Redshift Spectrum queries curated data via external tables backed by the catalog. QuickSight consumes from Athena and Redshift. Cover: catalog permissions via Lake Formation for fine-grained access, column-level security, cross-account access patterns, and the lock-in vs interoperability trade-off (Glue Catalog is AWS-proprietary; the alternative is a self-managed Hive Metastore or Unity Catalog).

Glue 4.0: What Changed

Glue 4.0 (released late 2022, mature in 2024-2026) shipped Spark 3.3, Python 3.10, and several performance improvements that closed the historical gap with self-managed Spark on EMR.

Performance: Glue 4.0 jobs run noticeably faster than Glue 3.0 on equivalent workloads, mostly because of Spark 3.3's adaptive query execution and improved code generation. New connectors: native Iceberg, Hudi, Delta Lake support out of the box (no custom JARs required). Streaming: Spark Structured Streaming improvements including better state management and lower micro-batch latency.

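Mentioning Glue 4.0 specifically signals you know the modern runtime. Defaulting to Glue 3.0 patterns will read as out-of-date in 2026; know what changed and why upgrading matters. Glue 5.0 has since been released, so check the current AWS release notes and be ready to say which version you have actually run.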

How Glue Connects to the Rest of the Cluster

Glue is the ETL component in any AWS data engineering stack (see the guide on how to pass the AWS Data Engineer interview) and the Hive-Metastore equivalent for Redshift Spectrum and Athena (see AWS Redshift interview prep). The system design framework from the guide on how to pass the system design round applies to Glue-based architectures with substitutions (Glue ETL replaces self-managed Spark, Glue Catalog replaces Hive Metastore).

For comparison with non-AWS equivalents, see the Google BigQuery interview prep guide (GCP equivalent: Dataflow + BigQuery) and the guide on how to pass the Azure Data Engineer interview (Azure equivalent: Data Factory + Synapse). The concepts transfer; the service names differ.

Before the Batch Is Lost

> We run bottling and canning lines across a few hundred breweries, and every line streams sensor readings (fill level, temperature, capper torque) at about 4 billion events a day, some of them stuck at a constant value or spraying out-of-range garbage. Plant operators need to catch a line drifting out of spec within seconds so they can stop it before a whole batch is scrapped, while supply chain needs an exact daily count of good units produced per SKU that reconciles for finance, and a stuck or garbage reading can neither halt the line monitors nor corrupt that count. Many plants have flaky connectivity, so readings arrive late and out of order, and the daily numbers still have to include them.

Data engineer interview prep FAQ

Should I use DynamicFrame or DataFrame in Glue jobs?

DataFrame for most workloads; DynamicFrame when you need schema-on-read or Glue-specific transformations (resolveChoice, drop_fields with path syntax). DataFrame is the standard Spark API and easier to debug, test, and port if you ever migrate off Glue. Most production teams default to DataFrame.

Is Glue Studio (visual editor) production-ready?

Light usage only. Glue Studio is fine for early prototyping or for analyst-level users. For production-quality jobs, code-first development with Git versioning is essential. The Studio-generated code is verbose and harder to maintain than hand-written PySpark.

When would I use Glue Flex execution?

Glue Flex (released 2022) runs jobs on spare capacity at a lower per-DPU-hour rate (roughly a third off the standard list price) but with no SLA on start time. Best for: non-time-sensitive batch jobs, backfills, dev / staging environments. Not appropriate for: production pipelines with strict SLAs, customer-facing data refreshes.

How does Glue cost compare to self-managed Spark on EC2?

Glue is more expensive per DPU-hour than equivalent EC2 capacity (~30-50% premium). The trade-off is operational simplicity: Glue handles cluster provisioning, Spark version management, autoscaling. For workloads under 100 DPU-hours per day, Glue typically wins on total cost (including ops time). At higher scale, self-managed Spark on EMR can be more cost-effective.
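The break-even reasoning can be sketched as a small cost model. Every number here is an assumption for illustration: a Glue rate of $0.44 per DPU-hour, a 40% premium over raw compute (the midpoint of the range above), and a hypothetical $15 per day of engineering time to operate a self-managed cluster.

```python
def daily_cost_glue(dpu_hours, glue_rate=0.44):
    """Glue cost: pay only for DPU-hours used."""
    return dpu_hours * glue_rate


def daily_cost_self_managed(dpu_hours, glue_rate=0.44, premium=0.40,
                            ops_cost_per_day=15.0):
    """Raw compute is cheaper per unit but carries a fixed ops cost."""
    # If Glue carries `premium` over raw compute, raw compute costs
    # glue_rate / (1 + premium) per equivalent DPU-hour.
    return dpu_hours * glue_rate / (1 + premium) + ops_cost_per_day


def cheaper_option(dpu_hours):
    glue = daily_cost_glue(dpu_hours)
    self_managed = daily_cost_self_managed(dpu_hours)
    return "Glue" if glue <= self_managed else "self-managed"
```

Under these assumed inputs, `cheaper_option(50)` favors Glue and `cheaper_option(500)` favors self-managed. The crossover point moves with the ops cost you plug in, which is the number worth estimating honestly.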

What’s the difference between Glue ETL and Glue DataBrew?

Glue ETL: code-first PySpark for full transformation control. Glue DataBrew: visual, no-code data preparation for analyst-level users. They serve different audiences. DataBrew is rare in production data engineering pipelines.

Can Glue handle real-time streaming workloads?

Yes, via Glue Streaming (Spark Structured Streaming on Glue). Best for micro-batch workloads (1-minute triggers). For sub-second latency requirements, Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics for Apache Flink) is the right choice. Glue Streaming is the right default for most streaming use cases that can tolerate micro-batch latency.

Are Glue crawlers worth running in production?

For raw landing zones with unpredictable schemas, yes. For curated tables with stable schemas, prefer explicit table definitions in the Data Catalog over crawler-discovered schemas. Crawler discoveries occasionally produce surprising results (column type changes, partition discoveries gone wrong) that break downstream consumers.

How is the AWS Glue Data Catalog different from a Hive Metastore?

Glue Data Catalog is API-compatible with Hive Metastore for most operations. Differences: Glue is AWS-managed (no operational burden), supports cross-account sharing, integrates with IAM and Lake Formation for security, has higher quotas. Most Hive-aware engines (Spark, Trino, Presto) work with Glue Catalog as a drop-in replacement.
