Tech-Specific Question Hub

Postgres Interview Questions

PostgreSQL interview questions for data engineer roles. Postgres is the most common OLTP source system that data pipelines pull from, so understanding its internals (MVCC, indexing, replication) is essential for designing CDC pipelines, debugging source-system performance issues, and modeling source data appropriately. This guide covers 40+ questions on Postgres-specific topics that show up in data engineer loops. Pair with the complete data engineer interview preparation framework.

The Short Answer
Postgres questions appear in 60% of data engineer loops, more than any other OLTP database. The depth ranges from L3 query syntax to L6 designing a CDC pipeline that handles Postgres replication slot management. Strong candidates know MVCC well enough to predict when long-running transactions will cause bloat and block VACUUM, can read EXPLAIN ANALYZE output to identify performance issues, and design schemas that work well with Postgres's indexing and partitioning models.
Updated April 2026 · By The DataDriven Team

Postgres Topic Frequency in Interviews

From 234 reported data engineer loops in 2024-2026 that included Postgres-specific questions.

Topic | Test Frequency | Depth Expected
MVCC and transaction isolation | 76% | Why long transactions block VACUUM, isolation level effects
Indexing strategies (B-tree, GIN, GiST, BRIN) | 82% | Choose index type for query pattern
EXPLAIN ANALYZE plan reading | 78% | Identify scans, joins, costs, actual vs estimated
Partitioning (range, list, hash) | 63% | Native Postgres partitioning, when to use each
Replication (logical vs physical) | 57% | Streaming replication, logical decoding, replica lag
JSONB operators and indexing | 62% | Postgres JSON, GIN indexes on JSONB
LATERAL joins | 51% | Row-correlated subqueries, top-N per group patterns
Window functions specifics | 67% | Postgres-specific behavior, IGNORE NULLS limitations
CDC via Debezium / logical replication | 44% | Replication slots, snapshot phase, LSN tracking
VACUUM and autovacuum | 53% | Bloat, table maintenance, statistics
Common Table Expressions (CTEs) | 71% | Pre-Postgres-12 materialization vs inline behavior
Stored procedures and PL/pgSQL | 32% | When to push logic to database vs application
Connection pooling (PgBouncer) | 47% | Why connections matter, transaction vs session pooling
Row-level security (RLS) | 29% | Policy design for multi-tenant

MVCC: The Most-Tested Postgres Concept

Multi-Version Concurrency Control is how Postgres handles concurrent reads and writes without locking. Every row update creates a new version (tuple) with a transaction ID; the old version remains until VACUUM cleans it up. Reads see the version visible to their transaction's snapshot.

The implication for data engineering: long-running transactions on the source database prevent VACUUM from cleaning up dead tuples, leading to table bloat and degraded performance. CDC pipelines that hold replication slots open are a common cause: if the slot lags significantly, dead tuples accumulate and must be retained until the slot catches up.

Strong candidates know to monitor replication slot lag, understand why bloat is a separate problem from disk space (bloat slows queries even with disk available), and can describe the autovacuum tuning that prevents these issues. The L5 signal is explaining the connection between MVCC, replication slots, autovacuum, and downstream pipeline reliability.
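One way to watch for this in practice is the pg_replication_slots view (standard in Postgres 10+; slot names will vary per deployment):

```sql
-- Retained WAL per replication slot. A growing number means the slot
-- is holding back WAL cleanup -- and, with it, dead-tuple cleanup.
SELECT
  slot_name,
  active,
  pg_size_pretty(
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
  ) AS retained_wal
FROM pg_replication_slots;
```

An inactive slot whose retained_wal keeps growing is the classic failure mode: the consumer died, but Postgres keeps WAL until the slot is advanced or dropped.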

Indexing Strategies: Match Index to Query

Postgres supports multiple index types, each optimized for different query patterns. The interview tests whether you can match the right index to the right query.

B-tree: default, supports equality and range queries on ordered data. Right answer 80% of the time. Multi-column B-tree indexes follow leftmost-prefix rules: an index on (a, b, c) supports queries on a, on (a, b), and on (a, b, c), but not on b alone.
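A minimal sketch of the leftmost-prefix rule (table and column names are hypothetical):

```sql
CREATE TABLE t (a int, b int, c int, d text);
CREATE INDEX idx_t_abc ON t (a, b, c);

-- Can use idx_t_abc: a leftmost prefix of the key is constrained
SELECT * FROM t WHERE a = 1;
SELECT * FROM t WHERE a = 1 AND b = 2;
SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3;

-- Typically cannot use it efficiently: a is unconstrained,
-- so there is no contiguous range of index entries to scan
SELECT * FROM t WHERE b = 2;
```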

GIN (Generalized Inverted Index): for JSONB, arrays, full-text search. Larger than B-tree, slower to update, but enables containment and key-existence queries on semi-structured data.
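A sketch of the GIN-on-JSONB pattern, assuming a hypothetical events table:

```sql
CREATE TABLE events (id bigint, payload jsonb);
CREATE INDEX idx_events_payload ON events USING GIN (payload);

-- Containment: rows whose payload contains this sub-document
SELECT id FROM events WHERE payload @> '{"type": "click"}';

-- Key existence
SELECT id FROM events WHERE payload ? 'utm_source';
```

If you only ever need @> containment, GIN (payload jsonb_path_ops) builds a smaller, faster index at the cost of not supporting the key-existence operators.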

GiST: for geometric data (PostGIS), range types, tsvector. Used in geospatial analytics workloads.

BRIN (Block Range INdex): for very large tables with naturally-ordered data (e.g., timestamps that increase with insert order). Tiny size, low maintenance, but only useful when data has correlation with physical storage. Common pattern for event logs partitioned by date.
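A sketch, assuming a hypothetical event_log table whose created_at correlates with insert order:

```sql
-- BRIN stores only min/max per block range (128 pages by default),
-- so the index stays tiny even on billions of rows
CREATE INDEX idx_event_log_created_brin
ON event_log USING BRIN (created_at);

-- Range predicates skip whole block ranges whose min/max can't match
SELECT count(*) FROM event_log
WHERE created_at >= '2026-01-01' AND created_at < '2026-02-01';
```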

Partial indexes: indexes with a WHERE clause. Useful for indexing only the rows matching a predicate (e.g., CREATE INDEX idx_active_users ON users (last_login) WHERE active = true). Smaller than a full index, faster to maintain, and useful when queries consistently filter on the predicate.

Eight Real Postgres Interview Questions

L4

Design indexes for fact_orders frequently queried by date and customer_id

Composite B-tree on (order_date, customer_id): supports queries filtered by date alone or by date+customer_id. Single-column index on customer_id: supports queries filtering by customer alone. Two separate indexes vs one composite: depends on query ratio. Discuss the trade-off: composite indexes serve more query patterns but are larger and slower to maintain.
-- Composite for date-filtered queries
CREATE INDEX idx_orders_date_customer
ON fact_orders (order_date, customer_id);

-- Single-column for customer-only queries
CREATE INDEX idx_orders_customer
ON fact_orders (customer_id);

-- Partial for active orders only (if query pattern matches)
CREATE INDEX idx_active_orders
ON fact_orders (order_date)
WHERE status = 'active';
L4

EXPLAIN ANALYZE shows a Seq Scan on a 100M row table. How would you fix it?

Identify the WHERE clause and check whether an index exists on the filtered column. If not, create one. If one exists, check whether it is being skipped because the predicate wraps the column in a function (e.g., WHERE date(created_ts) = '2026-01-01' defeats a B-tree index on created_ts, because the condition is on an expression rather than the indexed column). Fix by rewriting the predicate to be sargable: created_ts >= '2026-01-01' AND created_ts < '2026-01-02'.
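A sketch of the rewrite, assuming a B-tree index on created_ts:

```sql
-- Non-sargable: the function hides the column from the planner,
-- so it falls back to a sequential scan
SELECT * FROM events WHERE date(created_ts) = '2026-01-01';

-- Sargable rewrite: a range on the bare column can use the index
SELECT * FROM events
WHERE created_ts >= '2026-01-01'
  AND created_ts <  '2026-01-02';

-- Alternative: an expression index that matches the original predicate
CREATE INDEX idx_events_created_date ON events (date(created_ts));
```

The expression-index alternative works when created_ts is timestamp without time zone; casting timestamptz to date is not immutable and cannot be indexed directly.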
L5

Use LATERAL join to find top 3 most recent orders per customer

LATERAL allows a subquery to reference columns from preceding tables in the FROM clause. Common pattern for top-N per group when window functions are awkward. Often performs better than the ROW_NUMBER approach for small N because it can use index seeks rather than sorting all rows.
SELECT
  c.customer_id,
  c.name,
  recent.order_id,
  recent.order_date,
  recent.amount
FROM customers c
CROSS JOIN LATERAL (
  SELECT order_id, order_date, amount
  FROM orders o
  WHERE o.customer_id = c.customer_id
  ORDER BY order_date DESC
  LIMIT 3
) recent;
L5

Design a CDC pipeline from Postgres to Snowflake using Debezium

Postgres logical replication slot -> Debezium reads WAL changes -> Kafka -> Snowflake consumer (or S3 landing then Snowpipe). Cover: replication slot lag monitoring (slots that lag prevent WAL cleanup, fill the disk), schema evolution propagation, snapshot phase for initial backfill (Debezium does an initial table scan before tailing the WAL), heartbeat events to prevent slot stalls during low-activity periods.
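Slot setup and lag alerting can be sketched as follows (slot name hypothetical; Debezium usually creates its own slot, and logical decoding requires wal_level = logical on the source):

```sql
-- Create a logical replication slot using the built-in pgoutput plugin
SELECT pg_create_logical_replication_slot('debezium_slot', 'pgoutput');

-- Alert on slot lag: a lagging or inactive slot forces Postgres to
-- retain WAL and dead tuples, eventually filling the disk
SELECT
  slot_name,
  active,
  pg_size_pretty(
    pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
  ) AS slot_lag
FROM pg_replication_slots;
```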
L5

Why is autovacuum important and when does it fail?

Autovacuum runs VACUUM on tables when their dead-tuple ratio exceeds a threshold. Without it, dead tuples accumulate, table bloat grows, and queries slow down. Autovacuum fails when: long-running transactions prevent VACUUM from cleaning up tuples (because those tuples might still be visible to the long transaction); replication slot lag holds back cleanup; autovacuum is too conservative for write-heavy tables and falls behind. Tuning involves per-table autovacuum_vacuum_threshold, autovacuum_vacuum_scale_factor, and autovacuum_vacuum_cost_limit settings.
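Per-table tuning and a quick bloat-pressure check might look like this (the threshold values are illustrative, not recommendations):

```sql
-- Trigger autovacuum at ~1% dead tuples instead of the default 20%
ALTER TABLE fact_orders SET (
  autovacuum_vacuum_scale_factor = 0.01,
  autovacuum_vacuum_threshold    = 1000
);

-- Which tables are accumulating dead tuples, and when were they
-- last vacuumed?
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
```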
L5

Design partitioning strategy for a 2TB events table

Range partition by event_date (most common). Each partition is a separate physical table; the parent table is logical. Queries filtered by event_date use partition pruning to scan only relevant partitions. Indexes are per-partition, which keeps them small and fast. Discuss: native Postgres partitioning vs pg_partman extension for automatic partition creation; partition retention policies (DROP PARTITION is constant-time; DELETE is expensive).
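A minimal native-partitioning sketch (table names hypothetical; pg_partman automates the per-month DDL):

```sql
CREATE TABLE events (
  event_id   bigint NOT NULL,
  event_date date   NOT NULL,
  payload    jsonb
) PARTITION BY RANGE (event_date);

-- One partition per month; WHERE event_date = ... prunes to just it
CREATE TABLE events_2026_01 PARTITION OF events
  FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');

-- Retention: dropping an expired partition is cheap metadata work,
-- unlike DELETE, which generates dead tuples and WAL for every row
DROP TABLE events_2025_01;  -- hypothetical expired partition
```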
L5

When would you use JSONB vs separate columns vs an EAV model?

JSONB: when the schema is genuinely variable (different fields per record), when most queries don't filter on the JSON contents, and when you want flexible evolution without migrations. Separate columns: when the fields are stable and many queries filter on specific fields (better index support). EAV (entity-attribute-value): rarely the right answer; it usually signals a fundamental modeling problem. Strong candidates push back on JSONB-everything: it sacrifices query performance for schema flexibility.
L6

Design a multi-tenant Postgres schema with row-level security

Single Postgres cluster, schema-per-tenant or a shared schema with a tenant_id column. RLS policies on every table filter rows by tenant_id from the session context (set via SET app.tenant_id at connection time). Discuss the trade-off: schema-per-tenant has stronger isolation but is harder to query cross-tenant for analytics; a shared schema with RLS has cleaner cross-tenant query support, but RLS bugs are catastrophic. Most teams pick shared-with-RLS for <1000 tenants, schema-per-tenant for larger enterprise customers.
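A shared-schema RLS sketch (table, policy, and setting names are hypothetical; note that superusers and roles with BYPASSRLS skip policies entirely):

```sql
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
ALTER TABLE orders FORCE ROW LEVEL SECURITY;  -- apply even to the table owner

-- Every query on orders is implicitly filtered to the session's tenant
CREATE POLICY tenant_isolation ON orders
  USING (tenant_id = current_setting('app.tenant_id')::bigint);

-- The application sets the tenant per connection, or per transaction
-- with SET LOCAL (safer behind a transaction-pooling PgBouncer)
SET app.tenant_id = '42';
SELECT count(*) FROM orders;  -- sees only tenant 42's rows
```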

How Postgres Connects to the Rest of the Cluster

Postgres knowledge appears across this guide cluster wherever CDC and source-system patterns come up. The system design round framework includes Postgres as a common source in ETL pipelines. The streaming data engineer interview guide covers Debezium-based CDC patterns that depend on Postgres internals.

For SQL fluency at the dialect-portable level, see the complete SQL interview question bank. For dialect comparison, see Google BigQuery interview prep (analytics warehouse) and AWS Redshift interview prep (AWS native analytics).

Data Engineer Interview Prep FAQ

Why does Postgres come up in data engineer interviews so often?
Postgres is the most common OLTP source system in modern stacks. Most data pipelines start with reading from a Postgres database (production application data) and writing to an analytics warehouse. Understanding the source means understanding Postgres.
Should I learn Postgres internals deeply or just SQL syntax?
Both, with internals weighted more for L5+. Syntax is necessary; internals (MVCC, indexing, partitioning, replication) are what differentiates senior candidates. The CDC pipeline questions specifically require internal knowledge.
Is JSONB performance comparable to native columns?
Close but not identical. JSONB with GIN indexes can match native column performance for containment queries (does JSON have field X). Native columns are still faster for typed comparisons and complex aggregations. Default to native columns; use JSONB when schema flexibility justifies the performance cost.
Should I use stored procedures in Postgres for ETL?
Light usage is fine for simple transformations; heavy usage typically signals an application that should be a separate service. Stored procedures are hard to test, hard to version, and limit your team's flexibility. Most modern data engineering pushes ETL logic to dbt or Spark, leaving Postgres for OLTP.
How important is PostGIS for data engineer roles?
Important for geospatial-heavy companies (Lyft, Uber, DoorDash, Instacart, real estate platforms). For most general data engineer roles, PostGIS knowledge is not required. Know that it exists and what types of queries it enables.
What's the difference between PostgreSQL and Aurora Postgres?
Aurora Postgres is AWS's managed Postgres service with custom storage layer for performance and availability. SQL-compatible with vanilla Postgres but with extensions (Aurora-specific monitoring, automated failover, read replicas with shared storage). Most cloud-deployed Postgres at companies with AWS investment is Aurora; the SQL knowledge transfers directly.
How does logical replication differ from physical replication?
Physical replication: byte-by-byte copy of WAL to replica; replica is identical to primary. Best for HA. Logical replication: replays SQL-level changes to a replica that can have a different schema. Best for CDC, cross-version replication, and selective table replication.
Are Postgres extensions production-ready?
Many are. PostGIS, pg_partman, pg_cron, hypopg, pg_stat_statements all see heavy production use. Some extensions are AWS-Aurora-specific (e.g., aws_s3 for COPY FROM S3); others are widely available. Check support before depending on an extension in production.

Practice Postgres-Flavored SQL

Drill SQL patterns relevant to Postgres source systems and analytics warehouses in our practice sandbox.


More Data Engineer Interview Prep Guides


Data Engineer Interview Prep: explore the full guide

50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.
