AWS Glue Interview Questions

AWS Glue interview questions for data engineer roles at AWS-native companies. Glue is AWS's managed serverless ETL service combined with a Hive-Metastore-compatible Data Catalog used by Athena, Redshift Spectrum, EMR, and other AWS analytics services. 30+ questions covering Glue ETL job patterns (PySpark on Glue), crawlers, Data Catalog, job bookmarks for incremental processing, Glue Streaming, Glue 4.0 performance improvements, and cost optimization. Pair with the complete data engineer interview preparation framework and the guide on how to pass the AWS Data Engineer interview.

AWS Glue Topics That Show Up in Interviews

Topics ranked roughly by how often they appear in AWS-heavy data engineer loops.

| Topic | Frequency | Depth Expected |
| --- | --- | --- |
| Glue ETL jobs (PySpark) | Very common | Spark on Glue, transformation patterns |
| Worker type choice (G.1X, G.2X, G.4X) | Common | Right-sizing for workload |
| Glue Data Catalog | Very common | Hive-Metastore-compatible, integration with Athena / Redshift Spectrum |
| Glue crawlers | Common | Schema discovery, partition discovery, when to use vs explicit definitions |
| Job bookmarks | Common | Incremental processing, state management |
| Glue Streaming | Occasional | Spark Structured Streaming on Glue |
| Glue 4.0 vs 3.0 vs 2.0 | Common | Performance improvements, Spark version, runtime features |
| Glue Studio (visual editor) | Occasional | Pros and cons vs code-first development |
| Glue DataBrew (data prep) | Rare | Newer service, light-touch transformations |
| Iceberg integration in Glue | Common | Reading and writing Iceberg tables |
| Cost optimization | Common | Worker count, runtime, Glue Flex execution |
| IAM roles and permissions for Glue | Common | Cross-account access, S3 permissions |
| Job triggers and scheduling | Common | EventBridge, on-demand, scheduled triggers |
| Error handling and retries | Common | Failed-job recovery, partial-success patterns |

Glue ETL Jobs: The Core Pattern

A Glue ETL job is a PySpark (or Scala) script that reads from a source, transforms, and writes to a sink. Glue handles cluster provisioning (you don't manage EC2), Spark version (Glue 4.0 ships Spark 3.3), and connector libraries. You write the transformation logic; Glue runs it.

The job structure: read from source (S3, JDBC, or another Glue catalog table) using DynamicFrame (Glue's wrapper around DataFrame) or DataFrame directly. Transform via standard Spark SQL or PySpark transformations. Write to sink (S3 in Parquet typically, JDBC for warehouse loads, Glue catalog for catalog-aware writes). Most jobs use DataFrame for readability; the DynamicFrame abstraction is helpful for schema-on-read scenarios but can be skipped for simpler workloads.
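As a rough illustration of the read, transform, write structure described above, here is a minimal sketch of a Glue 4.0 PySpark job. The database `raw_db`, table `events`, columns `event_id` and `event_ts`, and the bucket name are hypothetical placeholders, and `staging_path` is a helper invented for this example. The Glue and Spark imports sit inside `run_job` only so that the module can be imported and the helper tested on a machine without the Glue libraries.

```python
def staging_path(bucket: str, table: str) -> str:
    """Build the S3 prefix the job writes to (hypothetical layout)."""
    return f"s3://{bucket}/staging/{table}/"


def run_job(argv):
    # Glue/Spark imports are deferred so this module imports cleanly outside Glue.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from pyspark.sql import functions as F

    args = getResolvedOptions(argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read: catalog table -> DynamicFrame -> plain Spark DataFrame.
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="raw_db",                 # hypothetical
        table_name="events",               # hypothetical
        transformation_ctx="read_events",  # required for job bookmarks to track this read
    )
    df = dyf.toDF()

    # Transform: dedup and type-cast with standard PySpark.
    cleaned = (
        df.dropDuplicates(["event_id"])
          .withColumn("event_ts", F.col("event_ts").cast("timestamp"))
          .withColumn("ingest_date", F.current_date())
    )

    # Write: date-partitioned Parquet on S3.
    (cleaned.write.mode("append")
            .partitionBy("ingest_date")
            .parquet(staging_path("example-bucket", "events")))

    job.commit()  # persists bookmark state when bookmarks are enabled
```

In the deployed script, the last line would be `run_job(sys.argv)` (with `import sys` at the top). Converting to a DataFrame right after the read keeps everything downstream in the standard Spark API.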

Worker type sizing is the first cost lever. G.1X (4 vCPU, 16 GB) suits most light workloads. G.2X (8 vCPU, 32 GB) suits moderate workloads with shuffles. G.4X (16 vCPU, 64 GB) suits heavy workloads and large state. G.8X (32 vCPU, 128 GB) suits ML feature engineering or very large joins. Cost is billed in DPUs, and each worker counts as 1, 2, 4, or 8 DPU for G.1X through G.8X respectively. The number of workers trades off runtime against cost; doubling workers roughly halves runtime up to a saturation point.
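The runtime-versus-cost trade-off can be made concrete with a small cost model. This is a sketch under stated assumptions: the price per DPU-hour is an illustrative figure rather than a quoted rate, the runtimes are invented, and per-second billing minimums are ignored.

```python
# DPUs per worker for each Glue worker type.
DPU_PER_WORKER = {"G.1X": 1, "G.2X": 2, "G.4X": 4, "G.8X": 8}

# ASSUMPTION: illustrative price in USD per DPU-hour; check current regional pricing.
PRICE_PER_DPU_HOUR = 0.44


def job_cost(worker_type: str, workers: int, runtime_hours: float,
             price: float = PRICE_PER_DPU_HOUR) -> float:
    """Cost of one run = workers * DPUs per worker * hours * price."""
    dpu_hours = workers * DPU_PER_WORKER[worker_type] * runtime_hours
    return round(dpu_hours * price, 2)


# 10 G.1X workers for 1 hour vs. 20 workers for 30 minutes: same DPU-hours.
baseline = job_cost("G.1X", workers=10, runtime_hours=1.0)
doubled = job_cost("G.1X", workers=20, runtime_hours=0.5)
# Past the saturation point, doubling again barely shortens the run.
saturated = job_cost("G.1X", workers=40, runtime_hours=0.4)
```

While scaling is close to linear, adding workers buys a faster run at roughly the same cost (`baseline` equals `doubled`). Once runtime stops halving, the extra workers are pure spend (`saturated` is higher).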

Six Real AWS Glue Interview Questions

L4

Design a Glue job for daily incremental load from S3 to Snowflake

Read from S3 raw landing (date-partitioned Parquet). Transform: dedup, type-cast, normalize. Write to S3 staging (partitioned by ingest_date). Use Snowflake’s COPY INTO to load from S3 staging to target table. Schedule via EventBridge trigger. Use job bookmarks to avoid re-processing S3 files already loaded. Cover: failure handling (retry policy, dead-letter prefix for malformed records), idempotency (target Snowflake table uses MERGE on a deterministic key).
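One way to express the bookmark and retry settings from this design is in the job definition itself. The sketch below builds the keyword arguments for boto3's `glue.create_job`; the job name, role ARN, script path, and the worker count, retry, and timeout values are hypothetical placeholders chosen for illustration.

```python
def glue_job_definition(name: str, role_arn: str, script_s3_path: str,
                        workers: int = 10) -> dict:
    """Keyword arguments for boto3's glue.create_job (values are placeholders)."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",               # Spark ETL job type
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",
        "WorkerType": "G.1X",
        "NumberOfWorkers": workers,
        "MaxRetries": 1,                     # retry once on failure
        "Timeout": 120,                      # minutes
        "DefaultArguments": {
            # Skip S3 objects that earlier successful runs already processed.
            "--job-bookmark-option": "job-bookmark-enable",
        },
    }


job = glue_job_definition(
    name="daily-s3-to-staging",                                   # hypothetical
    role_arn="arn:aws:iam::123456789012:role/GlueJob",            # hypothetical
    script_s3_path="s3://example-bucket/scripts/daily_load.py",   # hypothetical
)
```

With credentials configured, the definition could be submitted via `boto3.client("glue").create_job(**job)`. Keeping it as a plain dict makes it easy to review and to diff in version control.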
L5

When would you use Glue ETL vs EMR vs Lambda for transformation?

Glue ETL: serverless Spark for transformations >5 minutes, <100 GB. Best when Spark is the right tool but you don’t want to manage EMR clusters. EMR: when the workload requires Hadoop ecosystem tools (Hive, HBase, Hudi), when you have long-running clusters with consistent load, when Spark configuration tuning matters. Lambda: for small transformations <15 minutes and <10 GB memory, S3 event-triggered, single-record or micro-batch. In short: Glue is the default for managed Spark in 2026, EMR for Hadoop-ecosystem needs, Lambda for lightweight event-driven work.
L5

How do Glue job bookmarks work and when do they fail?

Job bookmarks track which S3 files (or JDBC primary keys) have been processed by previous job runs. Subsequent runs skip already-processed input and process only new data. Bookmarks fail when: source changes structure (file paths, schema), upstream systems modify already-processed files (modification time changes invalidate the bookmark), the job logic is non-idempotent so reprocessing produces wrong output. Mitigation: design source for bookmark compatibility (immutable file paths, stable partition structure), bookmark restoration via reset for known-bad runs, monitoring of processed vs available file count.
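The failure mode where upstream rewrites an already-processed file is easier to explain with a toy model. The class below is not how Glue implements bookmarks internally; it is a simplified stand-in (all names are invented for this example) that tracks a last-modified time per path and shows why a modified file gets picked up again.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBookmark:
    """Toy model only: remembers the last-modified time seen for each path."""
    seen: dict = field(default_factory=dict)

    def new_files(self, listing: dict) -> list:
        """Return paths that are new or modified since the last commit."""
        return sorted(p for p, mtime in listing.items()
                      if self.seen.get(p) != mtime)

    def commit(self, listing: dict) -> None:
        """Record the listing as processed (analogous to job.commit())."""
        self.seen.update(listing)

    def reset(self) -> None:
        """Forget all state so the next run reprocesses everything."""
        self.seen.clear()


bm = ToyBookmark()
day1 = {"raw/dt=01/a.parquet": 100, "raw/dt=01/b.parquet": 105}
run1 = bm.new_files(day1)
bm.commit(day1)

# Upstream rewrites a.parquet in place and adds c.parquet.
day2 = {"raw/dt=01/a.parquet": 200, "raw/dt=01/b.parquet": 105,
        "raw/dt=02/c.parquet": 210}
run2 = bm.new_files(day2)
```

In the second run, `a.parquet` is selected again because its modification time changed. If the downstream write appends rather than merges, those rows are counted twice, which is why immutable file paths and idempotent sinks matter.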
L5

Design a Glue Streaming job for real-time enrichment

Source: Kinesis Data Stream or MSK. Glue Streaming job runs Spark Structured Streaming with micro-batch trigger (typically 1 minute). Transform: enrich with reference data (cached in Spark, refreshed periodically), apply business logic, write to sink. Sink options: S3 (event-time partitioned), Redshift (streaming ingestion), DynamoDB (item-by-item). Cover: checkpoint location in S3 for restart, exactly-once via deterministic logic + idempotent sink, watermark for handling late events.
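The "deterministic logic plus idempotent sink" argument can be sketched without Spark. The following is a plain-Python toy model, not Glue or Structured Streaming code: the event shape, the `process_batch` helper, and the lateness threshold are all assumptions made for illustration. It shows how a keyed upsert makes a replayed micro-batch harmless, and how a watermark drops events that arrive too late.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, int, str]  # (event_id, event_time_seconds, payload)


def process_batch(batch: List[Event], sink: Dict[str, Event],
                  max_event_time: int, allowed_lateness: int) -> int:
    """Apply one micro-batch; return the updated max event time seen."""
    # Watermark is derived from event times seen in earlier batches.
    watermark = max_event_time - allowed_lateness
    for event in batch:
        event_id, event_time, _ = event
        if event_time < watermark:
            continue                 # too late: dropped by the watermark
        sink[event_id] = event       # keyed upsert: a replay overwrites, never duplicates
    return max([max_event_time] + [e[1] for e in batch])


sink: Dict[str, Event] = {}
batch1 = [("e1", 100, "a"), ("e2", 130, "b")]
batch2 = [("e3", 400, "c"), ("e4", 5, "late")]

seen = process_batch(batch1, sink, max_event_time=0, allowed_lateness=120)
seen = process_batch(batch1, sink, seen, allowed_lateness=120)  # replay after a restart
seen = process_batch(batch2, sink, seen, allowed_lateness=120)
```

After the replay the sink still holds one row per event ID, and `e4` is discarded because its event time falls behind the watermark. In a real job the same effect usually comes from a MERGE or keyed write at the sink combined with the S3 checkpoint.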
L5

Right-size Glue workers for a 1TB daily transformation

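Estimate: 1 TB of Parquet input typically expands 3-5x in memory, giving a 3-5 TB Spark working set. Holding all of that in memory at once on G.2X workers (32 GB each, less after overhead) would take roughly 100-150 workers. Glue bills by DPU: each G.1X worker is 1 DPU, each G.2X worker is 2 DPU, and so on, so that upper bound is 200-300 DPU. Spark processes partitions in waves and spills to disk, so the full working set rarely needs to be resident at once; a starting target of ~75-100 DPU (about 40-50 G.2X workers) is reasonable, then tune from job metrics. Discuss: if shuffle-heavy (joins on high-cardinality columns), increase workers; if compute-bound (regex, complex transformations), reduce workers but use G.4X or G.8X for more CPU per executor. Always prototype on a 1% sample first.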
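The sizing estimate is simple arithmetic, and writing it down helps when an interviewer changes the inputs. This sketch assumes the documented mapping of 2 DPU per G.2X worker and uses nominal worker memory. The `resident_fraction` parameter is a modelling assumption introduced here (how much of the working set must be in memory at once), not a Glue setting.

```python
import math

DPU_PER_WORKER = {"G.1X": 1, "G.2X": 2, "G.4X": 4, "G.8X": 8}
MEMORY_GB = {"G.1X": 16, "G.2X": 32, "G.4X": 64, "G.8X": 128}


def workers_for_working_set(input_tb: float, expansion: float,
                            worker_type: str, resident_fraction: float = 1.0) -> int:
    """Workers needed to hold `resident_fraction` of the expanded data in memory.

    ASSUMPTION: uses nominal worker memory; real executor memory is lower.
    """
    working_set_gb = input_tb * 1000 * expansion * resident_fraction
    return math.ceil(working_set_gb / MEMORY_GB[worker_type])


low = workers_for_working_set(1, 3, "G.2X")    # 3 TB fully resident
high = workers_for_working_set(1, 5, "G.2X")   # 5 TB fully resident
# ASSUMPTION: only about a third of the working set is resident at any time.
practical = workers_for_working_set(1, 4, "G.2X", resident_fraction=1 / 3)
practical_dpu = practical * DPU_PER_WORKER["G.2X"]
```

The fully resident bounds come out at 94 and 157 G.2X workers. The one-third assumption gives 42 workers, or 84 DPU. The right fraction depends on the job's shuffle and caching behaviour, which is why prototyping on a sample matters.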
L6

Design the Glue + Athena + Redshift Spectrum analytics platform

Glue Data Catalog as the central catalog. Glue crawlers discover schema for raw S3 data; explicit table definitions for curated data. Glue ETL jobs transform raw to curated and register tables in the catalog. Athena queries curated data directly via the catalog. Redshift Spectrum queries curated data via external tables backed by the catalog. QuickSight consumes from Athena and Redshift. Cover: catalog permissions via Lake Formation for fine-grained access, column-level security, cross-account access patterns, the lock-in vs interoperability trade-off (Glue Catalog is AWS-proprietary; the alternative is a self-managed Hive Metastore or Unity Catalog).

Glue 4.0: What Changed

Glue 4.0 (released late 2022, mature in 2024-2026) shipped Spark 3.3, Python 3.10, and several performance improvements that closed the historical gap with self-managed Spark on EMR.

Performance: Glue 4.0 jobs run noticeably faster than Glue 3.0 on equivalent workloads, mostly because of Spark 3.3's adaptive query execution and improved code generation. New connectors: native Iceberg, Hudi, Delta Lake support out of the box (no custom JARs required). Streaming: Spark Structured Streaming improvements including better state management and lower micro-batch latency.
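To make the "no custom JARs" point concrete, here is a sketch of the job arguments commonly used to turn on Iceberg support in a Glue 4.0 job. The catalog name `glue_catalog` and the warehouse path are placeholders, and the helper function is invented for this example. The Spark settings follow the pattern in the AWS and Iceberg documentation, but verify them against the versions you run.

```python
def iceberg_job_arguments(warehouse_s3_path: str,
                          catalog_name: str = "glue_catalog") -> dict:
    """Default arguments enabling Iceberg on a Glue 4.0 job (paths are placeholders)."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    spark_conf = [
        "spark.sql.extensions="
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        f"{prefix}=org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.warehouse={warehouse_s3_path}",
        f"{prefix}.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl=org.apache.iceberg.aws.s3.S3FileIO",
    ]
    return {
        # Loads the Iceberg libraries bundled with the runtime.
        "--datalake-formats": "iceberg",
        # Glue takes a single --conf key, so further settings are chained inside its value.
        "--conf": " --conf ".join(spark_conf),
    }


args = iceberg_job_arguments("s3://example-bucket/warehouse/")  # hypothetical bucket
```

These would go into the job's default arguments. Inside the script, tables are then addressed as `glue_catalog.<database>.<table>` through Spark SQL or the DataFrame API.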

Mentioning the Glue version specifically signals you know the current runtime. Defaulting to Glue 3.0 patterns will read as out-of-date in 2026; know what changed in 4.0 and why upgrading matters, and check whether the team has already moved to a newer runtime such as Glue 5.0.

How Glue Connects to the Rest of the Cluster

Glue is the ETL component in any AWS data engineering stack (see how to pass the AWS Data Engineer interview) and the Hive-Metastore equivalent for Redshift Spectrum (see AWS Redshift interview prep) and Athena. The system design framework from how to pass the system design round applies to Glue-based architectures with substitutions (Glue ETL replaces self-managed Spark, Glue Catalog replaces Hive Metastore).

For comparison with non-AWS equivalents, see Google BigQuery interview prep (GCP equivalent: Dataflow + BigQuery) and the how to pass the Azure Data Engineer interview guide (Azure equivalent: Data Factory + Synapse). The concepts transfer; the service names differ.

Two Hundred Million Redirects

> Our link shortener does about 200 million redirects a day. Every redirect fires a click event and we need to serve two consumers from that stream: a real-time dashboard that shows per-link clicks within the last hour, and a nightly batch aggregate that powers the analytics API for date-range queries. Traffic is very spiky and some links go viral. Design the pipeline.

Data engineer interview prep FAQ

Should I use DynamicFrame or DataFrame in Glue jobs?
DataFrame for most workloads; DynamicFrame when you need schema-on-read or Glue-specific transformations (resolveChoice, drop_fields with path syntax). DataFrame is the standard Spark API and easier to debug, test, and port if you ever migrate off Glue. Most production teams default to DataFrame.
Is Glue Studio (visual editor) production-ready?
Light usage only. Glue Studio is fine for early prototyping or for analyst-level users. For production-quality jobs, code-first development with Git versioning is essential. The Studio-generated code is verbose and harder to maintain than hand-written PySpark.
When would I use Glue Flex execution?
Glue Flex (released 2022) runs jobs on spare capacity at a lower per-DPU-hour rate (roughly a third cheaper than standard execution at list prices) but with no SLA on start time. Best for: non-time-sensitive batch jobs, backfills, dev / staging environments. Not appropriate for: production pipelines with strict SLAs, customer-facing data refreshes.
How does Glue cost compare to self-managed Spark on EC2?
Glue is more expensive per DPU-hour than equivalent EC2 capacity (~30-50% premium). The trade-off is operational simplicity: Glue handles cluster provisioning, Spark version management, autoscaling. For workloads under 100 DPU-hours per day, Glue typically wins on total cost (including ops time). At higher scale, self-managed Spark on EMR can be more cost-effective.
What’s the difference between Glue ETL and Glue DataBrew?
Glue ETL: code-first PySpark for full transformation control. Glue DataBrew: visual, no-code data preparation for analyst-level users. They serve different audiences. DataBrew is rare in production data engineering pipelines.
Can Glue handle real-time streaming workloads?
Yes, via Glue Streaming (Spark Structured Streaming on Glue). Best for micro-batch workloads (1-minute triggers). For sub-second latency requirements, Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics for Apache Flink) is the right choice. Glue Streaming is the right default for most streaming-with-tolerable-latency use cases.
Are Glue crawlers worth running in production?
For raw landing zones with unpredictable schemas, yes. For curated tables with stable schemas, prefer explicit table definitions in the Data Catalog over crawler-discovered schemas. Crawler discoveries occasionally produce surprising results (column type changes, partition discoveries gone wrong) that break downstream consumers.
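For the explicit-definition route, a table can be registered directly through the Data Catalog API instead of being inferred by a crawler. The sketch below builds a `TableInput` for boto3's `glue.create_table` describing Parquet data on S3. The table name, location, and columns are hypothetical, and the helper function is invented for this example.

```python
def parquet_table_input(name: str, location: str, columns: list,
                        partition_keys: list) -> dict:
    """TableInput for boto3's glue.create_table, describing Parquet data on S3."""
    def to_columns(pairs):
        return [{"Name": n, "Type": t} for n, t in pairs]

    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "PartitionKeys": to_columns(partition_keys),
        "StorageDescriptor": {
            "Columns": to_columns(columns),
            "Location": location,
            "InputFormat":
                "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat":
                "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary":
                    "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
            },
        },
    }


table = parquet_table_input(
    name="orders",                                    # hypothetical
    location="s3://example-bucket/curated/orders/",   # hypothetical
    columns=[("order_id", "bigint"), ("amount", "decimal(12,2)")],
    partition_keys=[("order_date", "date")],
)
```

With credentials configured, this could be applied via `boto3.client("glue").create_table(DatabaseName="curated_db", TableInput=table)`, where `curated_db` is again a placeholder. Because the schema lives in code, a type change shows up in review instead of as a surprise after a crawler run.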
How is the AWS Glue Data Catalog different from a Hive Metastore?
Glue Data Catalog is API-compatible with Hive Metastore for most operations. Differences: Glue is AWS-managed (no operational burden), supports cross-account sharing, integrates with IAM and Lake Formation for security, has higher quotas. Most Hive-aware engines (Spark, Trino, Presto) work with Glue Catalog as a drop-in replacement.

Practice AWS Glue ETL Patterns

  1. Active recall beats re-reading by 50%

    Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.

  2. 76% of hiring managers reject on the coding task, not the resume

    From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.

  3. Five problem shapes cover 80% of data engineer loops

    Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition.
