Concepts

OLAP vs OLTP: What Interviewers Test

Baseline knowledge for DE interviews. Understand why warehouses exist separately from application databases and how storage layout affects query performance.

Detailed Comparison

Purpose

OLAP

Analytical processing. Complex queries across large datasets: 'What was revenue by region for Q4?' Read-heavy, scans millions of rows.

OLTP

Transactional processing. Individual operations: 'Insert this order,' 'Update this address.' Many small, fast operations.

Storage Layout

OLAP

Columnar. Each column stored together. A query needing 3 of 200 columns reads only those 3. Analytical scans are fast.

OLTP

Row-based. All columns for a row stored together. Fetching a full record by ID is one disk read. Point lookups are fast.

Query Patterns

OLAP

Few queries scanning millions of rows. GROUP BY, aggregation, window functions. Seconds to minutes. Throughput-optimized.

OLTP

Millions of queries touching few rows each. INSERT, UPDATE, SELECT by key. Milliseconds. Latency-optimized.

Schema

OLAP

Denormalized (star schema, wide tables). Fewer joins, faster queries. Redundancy is acceptable.

OLTP

Normalized (3NF). Minimize redundancy, prevent update anomalies. Designed for consistent transactional access.

Examples

OLAP

Snowflake, BigQuery, Redshift, ClickHouse, DuckDB.

OLTP

PostgreSQL, MySQL, Oracle, SQL Server (application use).

Concurrency

OLAP

Low. Dozens to hundreds of concurrent queries, each resource-heavy.

OLTP

High. Thousands to millions of concurrent transactions, each lightweight.

Row vs Columnar Storage

The physical difference between row and columnar storage is the single most important concept behind OLAP vs OLTP. Everything else follows from it.

How Row Storage Works

In a row-oriented database, all column values for a single row are stored together on disk. Imagine a table with columns (user_id, name, email, city, signup_date, last_login, plan_type, revenue). Row storage writes each complete row as a contiguous block. When you SELECT * FROM users WHERE user_id = 42, the database finds that one block and reads everything in a single disk operation. This is ideal for transactional access where you need the full record. But when you run SELECT city, SUM(revenue) FROM users GROUP BY city, the database must read every byte of every row, including name, email, and all the other columns you do not need. For a table with 200 columns and 100 million rows, that is a massive waste of I/O.

How Columnar Storage Works

In a columnar database, all values for a single column are stored together. All 100 million user_id values sit in one contiguous block. All 100 million city values sit in another. When you run that GROUP BY city query, the database reads only the city column and the revenue column, skipping the other 198 columns entirely. On a 200-column table, that can mean reading 1% of the data. Columnar storage also compresses dramatically better because values in the same column tend to be similar. A column of country codes (US, US, US, CA, CA, US) compresses far better than a row containing (42, 'Jane', 'jane@co.com', 'US', '2024-01-15'). Compression ratios of 10:1 are common, further reducing I/O.
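The I/O difference described above can be made concrete with a toy simulation. This is a pure-Python sketch, not how any real engine is implemented: the table shape (10,000 rows, 200 columns) and the two "needed" columns are invented for illustration.

```python
# Toy model of row vs columnar layout: count how many values must be
# touched to aggregate 2 columns out of 200 on a 10,000-row table.

NUM_ROWS, NUM_COLS = 10_000, 200

# Row storage: each row is a contiguous record holding all 200 values.
row_store = [[f"r{r}c{c}" for c in range(NUM_COLS)] for r in range(NUM_ROWS)]

# Columnar storage: each column is a contiguous list of all 10,000 values.
col_store = {c: [f"r{r}c{c}" for r in range(NUM_ROWS)] for c in range(NUM_COLS)}

# Suppose the query needs only columns 3 ("city") and 7 ("revenue").
needed = [3, 7]

# A row engine must scan every value of every row to extract the two fields.
row_values_read = sum(len(row) for row in row_store)      # 2,000,000 values

# A columnar engine reads just the two column blocks.
col_values_read = sum(len(col_store[c]) for c in needed)  # 20,000 values

print(row_values_read, col_values_read)  # columnar touches 1% of the data
```

The 100:1 ratio here is exactly the "3 of 200 columns" argument from the comparison table, just made countable.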

Why This Matters for Interviews

Interviewers ask about columnar vs row storage to test whether you understand the physical reality beneath SQL. Knowing that Snowflake is columnar is table stakes. Explaining why a GROUP BY on a columnar system skips irrelevant columns, and why that matters for cost and speed, shows real understanding. The best answers connect storage layout to practical decisions: partitioning strategy, column pruning, and why SELECT * is expensive on wide tables in OLAP systems.

Real Systems Mapped

Knowing which real databases fall into which category is expected in interviews. Here is how the major systems classify.

PostgreSQL (OLTP)

Row-oriented relational database. The default choice for application backends. Excellent for transactional workloads with strong ACID guarantees.

MySQL (OLTP)

Row-oriented. Powers most web applications. InnoDB engine provides ACID transactions. Optimized for high-concurrency point lookups and writes.

Snowflake (OLAP)

Cloud-native columnar warehouse. Separates storage and compute. Pay per query. Dominant in enterprise analytics.

BigQuery (OLAP)

Google's serverless columnar warehouse. No infrastructure to manage. Charges by data scanned, making column selection directly affect cost.

Redshift (OLAP)

AWS columnar warehouse. Cluster-based with provisioned or serverless options. Tightly integrated with the AWS ecosystem.

ClickHouse (OLAP)

Open-source columnar database built for real-time analytics. Extremely fast aggregation queries. Used for event analytics and log analysis.

DuckDB (OLAP)

In-process columnar database. Think 'SQLite for analytics.' Runs on a laptop with no server. Reads Parquet, CSV, and JSON natively.

Oracle (OLTP, with OLAP features)

Enterprise row-oriented database with some columnar features (In-Memory Column Store). Primarily OLTP but can handle mixed workloads.

CockroachDB (OLTP)

Distributed SQL database. Row-oriented with strong consistency across regions. Built for globally distributed transactional applications.

Apache Druid (OLAP)

Real-time columnar analytics database. Sub-second queries on streaming and batch data. Used for user-facing analytics dashboards.

Query Pattern Examples

The same SQL runs at very different speeds depending on the underlying storage engine. These examples show why.

Query That Runs Fast on OLAP, Slow on OLTP

SELECT
  region,
  product_category,
  DATE_TRUNC('month', order_date) AS month,
  SUM(revenue) AS total_revenue,
  COUNT(DISTINCT customer_id) AS unique_customers
FROM orders
WHERE order_date >= '2025-01-01'
GROUP BY region, product_category, DATE_TRUNC('month', order_date)
ORDER BY total_revenue DESC;

This query scans millions of rows but only needs 4 columns out of a potentially wide orders table. On a columnar OLAP system, it reads only region, product_category, order_date, revenue, and customer_id. On a row-oriented OLTP system, it reads every column of every matching row. The OLAP system also benefits from columnar compression and vectorized execution, processing batches of values in a single CPU instruction. On a 500M-row table, this might take 3 seconds on Snowflake and 45 minutes on PostgreSQL.

Query That Runs Fast on OLTP, Slow on OLAP

SELECT *
FROM orders
WHERE order_id = 'ord_8f3a2b1c';

A point lookup by primary key. On a row-oriented OLTP system with a B-tree index, this is one index lookup plus one disk read to fetch the entire row. Sub-millisecond. On a columnar OLAP system, the database must reconstruct the full row by reading from every column file and stitching the values together. Even with metadata pruning, this is far more I/O than necessary for a single row. OLAP systems are not designed for this access pattern.

Query That Exposes the Trade-Off

SELECT
  customer_id,
  order_id,
  order_date,
  status,
  total_amount
FROM orders
WHERE customer_id = 'cust_42'
ORDER BY order_date DESC
LIMIT 20;

Fetching recent orders for a single customer. On OLTP, this hits an index on customer_id and returns 20 rows instantly. On OLAP, this might still be fast if the system supports efficient filtering, but it is doing more work than necessary because columnar storage is optimized for scanning many rows, not fetching a few. In an interview, this is a good example to show you understand that the 'right' database depends on the access pattern.

Window Function on Large Dataset

SELECT
  customer_id,
  order_date,
  revenue,
  SUM(revenue) OVER (
    PARTITION BY customer_id
    ORDER BY order_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS cumulative_revenue
FROM orders
WHERE order_date >= '2024-01-01';

Running totals per customer across a year of data. On OLAP, the engine reads only three columns (customer_id, order_date, revenue), sorts within partitions using columnar execution, and computes the window in a single pass. On OLTP, the same query reads all columns, sorts in memory or on disk, and processes row by row. Window functions over millions of rows are a core OLAP use case and rarely belong on a transactional database.

How Interviewers Test This

Interviewers rarely ask "define OLAP." They create scenarios that require you to apply the concept. Here are the patterns they use.

They describe a workload and ask which system to use

The interviewer says: 'We have an application with 10,000 transactions per second and a reporting dashboard that runs hourly aggregations. How would you architect the data layer?' They want you to separate OLTP (application database) from OLAP (warehouse). The application writes to PostgreSQL. A pipeline replicates data to Snowflake. Dashboards query Snowflake. This is the standard pattern and they want to hear you articulate it clearly.

They ask why you cannot just use PostgreSQL for everything

This tests whether you understand the physical storage difference. The answer is that PostgreSQL is row-oriented and scans all columns even when you only need a few. For small datasets (under 50M rows) it works fine for analytics. Beyond that, the I/O cost makes analytical queries impractical. Knowing the inflection point shows practical experience.

They ask about denormalization trade-offs

OLTP schemas are normalized to prevent update anomalies. OLAP schemas are denormalized for query speed. The interviewer wants you to explain why denormalization is acceptable in a warehouse (data is loaded in bulk, not updated row-by-row) and dangerous in a transactional system (concurrent updates to redundant data cause inconsistencies).

They ask you to optimize a slow query

Sometimes the answer is not 'add an index' but 'this query belongs on a different system.' A GROUP BY scanning 2 billion rows on an OLTP database is a design problem, not a tuning problem. Recognizing when to move a workload from OLTP to OLAP is a signal of engineering maturity.

How Indexing Differs Between OLAP and OLTP

OLTP and OLAP systems use fundamentally different indexing strategies. Understanding this shows interviewers you know what is happening beneath the query optimizer.

B-Tree Index (OLTP)

The workhorse of row-oriented databases. B-trees organize data in a balanced tree structure that allows point lookups in O(log n) time. PostgreSQL and MySQL use B-trees as their default index type. Ideal for equality checks (WHERE id = 42) and range scans (WHERE date BETWEEN x AND y). Not useful for full-column aggregations because the index does not store the column values in a scannable format.
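As a rough stand-in for a B-tree's O(log n) point lookup, binary search over a sorted key array captures the same complexity argument. This is a sketch: real B-trees are shallow trees of disk pages, and the keys and row data below are invented.

```python
import bisect

keys = sorted([5, 12, 42, 77, 103, 250])  # sorted primary-key column
heap = {42: ("ord_42", "shipped"), 77: ("ord_77", "pending")}  # toy row data

def btree_lookup(key):
    """O(log n) position search over sorted keys, then one fetch of the row."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return heap.get(key)
    return None  # key not present in the index

print(btree_lookup(42))  # ('ord_42', 'shipped')
```

The point is the access pattern: one logarithmic search plus one row fetch, regardless of table size — which is why this shines for point lookups and does nothing for full-column aggregations.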

Zone Maps / Min-Max Indexes (OLAP)

Columnar databases store min and max values for each data block (micro-partition in Snowflake, row group in Parquet). When a query filters WHERE order_date > '2025-01-01', the engine checks each block's max value. If max < '2025-01-01', the entire block is skipped. No explicit index creation needed. This works automatically on sorted or clustered data. Poor clustering (random insertion order) degrades zone map effectiveness.
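The pruning decision can be sketched in a few lines. The block boundaries and row counts below are invented; real systems track these statistics per micro-partition or row group automatically.

```python
# Sketch of min-max (zone map) pruning: per-block min/max on order_date
# lets the engine discard whole blocks without reading any row data.
# ISO date strings compare correctly as plain strings.

blocks = [
    {"min": "2024-01-01", "max": "2024-06-30", "rows": 1_000_000},
    {"min": "2024-07-01", "max": "2024-12-31", "rows": 1_000_000},
    {"min": "2025-01-02", "max": "2025-06-30", "rows": 1_000_000},
]

def blocks_to_scan(predicate_min):
    # For WHERE order_date > predicate_min, a block survives only if its
    # max value could satisfy the predicate.
    return [b for b in blocks if b["max"] > predicate_min]

survivors = blocks_to_scan("2025-01-01")
print(len(survivors))  # 1 of 3 blocks scanned
```

Note what happens if the data is not clustered by date: min/max ranges widen to cover almost everything, no block can be skipped, and the zone map stops helping — the degradation mentioned above.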

Bitmap Index (OLAP)

Used for low-cardinality columns (status, country, gender). Creates a bit vector for each distinct value. Extremely fast for combining multiple filters with AND/OR. ClickHouse and Oracle use bitmap indexes. Not suitable for high-cardinality columns (user_id) because the number of bit vectors becomes impractical.
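A minimal illustration of the idea, using Python integers as bit vectors (bit i set means row i holds that value). The column data is invented.

```python
# Sketch of a bitmap index on two low-cardinality columns, and why
# combining filters is a single bitwise AND.

rows_status  = ["paid", "paid", "refunded", "paid", "refunded"]
rows_country = ["US",   "CA",   "US",       "US",   "CA"]

def build_bitmaps(column):
    bitmaps = {}
    for i, v in enumerate(column):
        bitmaps[v] = bitmaps.get(v, 0) | (1 << i)  # set bit i for value v
    return bitmaps

status_idx  = build_bitmaps(rows_status)
country_idx = build_bitmaps(rows_country)

# WHERE status = 'paid' AND country = 'US' combines two bit vectors.
match = status_idx["paid"] & country_idx["US"]
matching_rows = [i for i in range(len(rows_status)) if match >> i & 1]
print(matching_rows)  # [0, 3]
```

This also makes the high-cardinality problem obvious: a bitmap index on user_id would need one bit vector per distinct user.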

Inverted Index (OLAP / Search)

Maps values back to the rows containing them. Druid and Elasticsearch use inverted indexes for fast filtering on string columns. Useful for interactive dashboards where users filter by category, region, or tag. Different from B-trees because they are optimized for filtering, not sorting.

Interview Questions with Guidance

What is the difference between OLAP and OLTP?

OLAP: analytical queries aggregating large datasets. OLTP: transactional operations on individual records. Key differences: storage layout (columnar vs row), query patterns (scans vs lookups), schema (denormalized vs normalized), concurrency model.

Why are columnar databases faster for analytics?

Columnar storage reads only needed columns, skipping the rest. For aggregating one column from a 200-column table, columnar reads 1/200th of the data. Columnar also compresses better because similar values are stored together.

Can PostgreSQL work as an OLAP database?

PostgreSQL is row-oriented and not optimized for analytics. But for datasets in the tens of millions of rows, it works adequately with proper indexing. For large-scale analytics, use a dedicated OLAP system. Knowing this boundary shows practical judgment.

When would you denormalize an OLTP database?

When a critical query path requires multiple joins creating unacceptable latency. Add a denormalized summary table updated asynchronously. Trade-off: faster reads, more complex writes, potential inconsistency.

How does a data warehouse relate to OLAP?

A data warehouse IS an OLAP system. Columnar storage, denormalized schemas, designed for analytical queries. It receives data from OLTP source systems, transforms it, and serves analytics.

How would you move data from an OLTP system to an OLAP system?

Change Data Capture (CDC) for near-real-time, or scheduled batch extraction. Tools like Debezium capture row-level changes from the OLTP write-ahead log and stream them to a warehouse or lake. Batch ETL tools like Fivetran or Airbyte run on a schedule. The choice depends on latency requirements and data volume.

What is a materialized view and how does it relate to OLAP vs OLTP?

A materialized view precomputes and stores query results. In OLTP, it can speed up expensive analytical queries without moving data to a separate system. In OLAP, materialized views accelerate frequently run dashboards. The trade-off is storage cost and refresh latency. Good to mention when an interviewer asks about performance optimization.

Explain vectorized execution in columnar databases

Columnar engines process data in batches (vectors) rather than row-by-row. A batch of 1,024 integer values can be summed in a single CPU SIMD instruction. Row-oriented engines process one row at a time with per-row function call overhead. This is why columnar databases can aggregate billions of values in seconds. Mentioning this shows you understand why columnar is fast at the CPU level, not just the I/O level.
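Pure Python cannot show real SIMD, but the batching idea can be sketched as an analogy: the built-in sum over a slice plays the role of a tight batch kernel, versus paying per-row overhead in an explicit loop. The batch size of 1,024 mirrors the figure above; everything else is illustrative.

```python
# Analogy for vectorized execution: hand the engine whole batches instead
# of processing one row at a time.

values = list(range(1_000_000))
BATCH = 1024

def rowwise_total(vals):
    total = 0
    for v in vals:       # per-row overhead on every single value
        total += v
    return total

def vectorized_total(vals):
    total = 0
    for i in range(0, len(vals), BATCH):
        total += sum(vals[i:i + BATCH])  # one call per 1,024-value batch
    return total

print(vectorized_total(values) == rowwise_total(values))  # True
```

In a real columnar engine the batch kernel is compiled code operating on a contiguous column vector, which is where the SIMD win comes from.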

How does partitioning differ between OLAP and OLTP systems?

OLTP partitions by key range or hash to distribute writes evenly and keep indexes small. OLAP partitions by time (date columns) to enable partition pruning on analytical queries. A query for 'last 30 days' skips all older partitions. Different goals: OLTP partitions for write throughput, OLAP partitions for read efficiency.
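The OLAP side of this — time-based pruning — can be sketched with a dictionary of monthly partitions. Partition names and row counts here are invented; real planners compare the filter against partition bounds in metadata before touching any data.

```python
# Sketch of time-based partition pruning: a table split by month, where a
# recent-date filter touches only the matching partition.

partitions = {
    "2025-01": ["row"] * 3,
    "2025-02": ["row"] * 4,
    "2025-03": ["row"] * 5,
}

def scan(month_filter):
    # The planner keeps only partitions that can satisfy month >= filter;
    # everything else is skipped without being read.
    return {m: rows for m, rows in partitions.items() if m >= month_filter}

touched = scan("2025-03")
print(list(touched))  # ['2025-03'] — two of three partitions skipped
```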

How Compression Works in Columnar vs Row Storage

Compression is a major reason columnar databases are faster and cheaper for analytics. Understanding these techniques helps you explain why during interviews.

Run-Length Encoding (RLE)

OLAP

Replaces consecutive repeated values with the value and a count. A column with ['US', 'US', 'US', 'US', 'CA', 'CA'] becomes [('US', 4), ('CA', 2)]. Extremely effective on sorted, low-cardinality columns. If your country column has 10 distinct values across 1 billion rows and the data is sorted by country, RLE can compress it to almost nothing.
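The transformation described above is small enough to write out directly. A minimal sketch using the example column from the text:

```python
from itertools import groupby

# Run-length encoding: collapse consecutive repeats into (value, count).

def rle_encode(column):
    return [(v, len(list(g))) for v, g in groupby(column)]

def rle_decode(pairs):
    return [v for v, n in pairs for _ in range(n)]

col = ["US", "US", "US", "US", "CA", "CA"]
encoded = rle_encode(col)
print(encoded)  # [('US', 4), ('CA', 2)]
assert rle_decode(encoded) == col  # round-trips losslessly
```

Note that RLE only wins on *consecutive* repeats — which is exactly why sorting or clustering the column first matters so much.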

Dictionary Encoding

OLAP

Builds a dictionary mapping each distinct value to a small integer. Instead of storing 'United States' 500 million times, the column stores the integer 3 and looks up the string when needed. Reduces storage dramatically for string columns. Parquet and ORC use dictionary encoding by default when the number of distinct values is low enough.

Delta Encoding

OLAP

Stores the difference between consecutive values instead of absolute values. Timestamps like [1000, 1001, 1002, 1003] become [1000, 1, 1, 1]. Works well on sorted numeric columns where values increase gradually. Combined with bit-packing, delta encoding can represent millions of timestamps in a few kilobytes.

Page-Level Compression (General)

Both

Row-oriented databases compress entire pages (8KB in PostgreSQL) using algorithms like LZ4 or zstd. This helps, but mixed data types on the same page limit compression ratios. A page containing integers, strings, dates, and booleans does not compress as well as a page containing only integers. This is why columnar databases achieve 5x to 10x better compression than row databases on the same data.

Common Interview Mistakes

These are the answers that make interviewers wince. Avoid them.

Saying 'OLAP is for big data and OLTP is for small data'

Size is not the distinction. An OLTP system can hold billions of rows and handle millions of transactions. The difference is the access pattern: OLAP reads many rows to produce aggregates, OLTP reads or writes a few rows at a time. Some OLTP databases are larger than some OLAP databases.

Confusing OLAP with a specific product

OLAP is a workload pattern, not a product. Snowflake is an OLAP system. BigQuery is an OLAP system. A Parquet file queried by DuckDB is OLAP. The term describes the access pattern (analytical queries), not the vendor.

Claiming columnar storage is always better

Columnar storage is better for analytical queries. It is worse for transactional workloads. Inserting a single row into a columnar table writes to every column file separately. Point lookups require reading from every column file. The right storage layout depends on the workload, and strong candidates say this explicitly.

Forgetting about hybrid workloads

In practice, most organizations need both OLAP and OLTP. The architecture question is how data flows between them: CDC, batch ETL, or streaming. Interviewers reward candidates who think about the full system, not just one side.

Frequently Asked Questions

What is OLAP?
Online Analytical Processing. Systems optimized for complex queries scanning large datasets. Data warehouses like Snowflake, BigQuery, and Redshift are OLAP systems using columnar storage.

What is OLTP?
Online Transaction Processing. Systems handling individual operations at high concurrency and low latency. Application databases like PostgreSQL and MySQL are OLTP systems using row storage.

Do interviews ask about OLAP vs OLTP?
Yes. Foundational concept in data modeling and system design interviews. Interviewers want to hear you understand why different systems exist and when to use each.

What is a columnar database?
A database storing data by column rather than by row. All values for a column are stored together, making analytical queries fast and enabling better compression. Examples: Snowflake, BigQuery, ClickHouse, DuckDB.

Can one database handle both OLAP and OLTP workloads?
Some databases attempt this (called HTAP: Hybrid Transactional/Analytical Processing). TiDB and SingleStore are examples. In practice, most teams separate the systems because optimizing for both simultaneously creates compromises in each. For interviews, the safe answer is to separate them and explain why.

What is the relationship between OLAP and data warehouses?
A data warehouse is an OLAP system. It uses columnar storage, denormalized schemas, and is designed for analytical queries. The terms are closely related but not identical: OLAP describes the workload pattern, and 'data warehouse' describes the system built to handle that pattern.

How does DuckDB fit into the OLAP vs OLTP picture?
DuckDB is an in-process columnar (OLAP) database. It runs inside your application or notebook with no server. It is designed for analytical queries on datasets that fit on one machine (up to hundreds of GBs). Think of it as SQLite for analytics: embedded, zero-config, and columnar.

What is partition pruning and why does it matter for OLAP?
Partition pruning means the query engine skips entire partitions of data that cannot contain relevant rows. If your data is partitioned by month and you query the last 7 days, the engine reads only the current month's partition. This can reduce data scanned by 90%+ and directly reduces query time and cost on usage-based systems like BigQuery.
