Data Engineering Concepts: Glossary for Interviews and Production (2026)

Use this as a glossary, not a textbook. It covers 18 concepts grouped by tier (foundation, infrastructure, architecture), with the depth required at each role level. Each concept has a definition tight enough to use in an interview and a link to the deeper read.

Core concepts covered: 18, across three tiers (foundation, infrastructure, architecture)

Grain: the most-tested concept at senior level

Idempotency: the most-missed concept in senior loops

Foundation concepts (every interview, every level)

Five concepts that show up in every loop. Junior interviews mostly stop here. If these are weak, the rest of your prep doesn't matter.

Foundation · /etl-vs-elt

ETL and ELT

Extract-Transform-Load vs Extract-Load-Transform. ETL transforms in flight; ELT lands raw and transforms in the warehouse. ELT won because cloud warehouse compute is cheap and elastic. ETL still wins for PII redaction at the edge and non-SQL transformations. The interview signal is naming the conditions that flip the answer, not picking a side.

ELT is the 2026 default; ETL is the exception
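
The "PII redaction at the edge" case can be sketched in a few lines. This is a minimal illustration, not a production design: the function name `redact_in_flight`, the field names, and the demo key are all hypothetical, and a real pipeline would pull the key from a secrets manager. It replaces PII values with a keyed hash before the record is loaded anywhere, which is the step that has to happen in flight rather than in the warehouse.

```python
import hashlib
import hmac


def redact_in_flight(record, secret_key, pii_fields=("email", "phone")):
    """Replace PII values with keyed hashes before the record is loaded."""
    cleaned = dict(record)  # copy, so the caller's record is not mutated
    for field in pii_fields:
        if cleaned.get(field) is not None:
            digest = hmac.new(
                secret_key, str(cleaned[field]).encode("utf-8"), hashlib.sha256
            )
            cleaned[field] = digest.hexdigest()
    return cleaned


# Hypothetical source record and key, for illustration only.
raw = {"user_id": 7, "email": "a@example.com", "plan": "pro"}
loaded = redact_in_flight(raw, b"demo-key")
```

Because the hash is keyed and deterministic, the same email always maps to the same token, so downstream joins on the redacted column still work while the raw value never reaches the warehouse.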

Foundation · /batch-vs-streaming

Batch vs streaming

Batch processes bounded datasets on a schedule. Streaming processes unbounded event streams continuously. Pick batch when latency can be measured in hours, streaming when it's measured in seconds, hybrid when analytics needs both. Most candidates over-pick streaming; the SLA decides, not the ambition.

Let the SLA decide
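
To make the bounded-versus-unbounded distinction concrete, here is a small sketch of the same per-minute count done both ways. The function names and the event tuples are made up for illustration, and the streaming version assumes events arrive in timestamp order with no late data, which real stream processors do not get to assume.

```python
from collections import defaultdict


def batch_counts(events, window_seconds=60):
    """Batch: the dataset is bounded, so aggregate everything and return once."""
    counts = defaultdict(int)
    for ts, _payload in events:
        counts[ts - ts % window_seconds] += 1
    return dict(counts)


def streaming_counts(event_iter, window_seconds=60):
    """Streaming: emit each tumbling window when a later event closes it.

    Assumes in-order arrival (no late data), a simplification for this sketch.
    """
    current_start, count = None, 0
    for ts, _payload in event_iter:
        start = ts - ts % window_seconds
        if current_start is not None and start != current_start:
            yield current_start, count
            count = 0
        current_start = start
        count += 1
    if current_start is not None:
        # Final flush happens only because this demo stream ends.
        yield current_start, count


events = [(0, "a"), (30, "b"), (61, "c"), (125, "d"), (130, "e")]
batch_result = batch_counts(events)
stream_result = list(streaming_counts(iter(events)))
```

Both paths produce the same totals here; the difference is when the answer becomes available. The batch function returns only after the whole input has been read, while the generator yields each window as soon as it closes.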

Foundation · /concepts/olap-vs-oltp

OLAP vs OLTP

Transaction systems (OLTP: Postgres, MySQL) handle high-volume low-latency reads and writes for applications. Analytical systems (OLAP: Snowflake, BigQuery) handle complex aggregations on large datasets. Data engineers build pipelines that move data from OLTP to OLAP, and the schema decisions differ at each end.

Different storage, different schema goals

Foundation · /concepts/acid-properties

ACID properties

Atomicity, consistency, isolation, durability. The transactional guarantees that distinguish 'your write succeeded' from 'your write probably succeeded.' Increasingly relevant on the analytics side too, because lakehouse table formats (Delta, Iceberg, Hudi) brought ACID to object storage.

ACID is now table stakes for analytics too
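
Atomicity is the easiest of the four to demonstrate. The sketch below uses Python's built-in `sqlite3` module with a hypothetical `accounts` table; the table, the `transfer` helper, and the amounts are illustrative assumptions. The second transfer fails a CHECK constraint halfway through, and the first half of that transaction is rolled back with it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move funds in one transaction: both updates commit, or neither does."""
    try:
        with conn:  # commits on success, rolls back if the block raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = transfer(conn, "a", "b", 60)
second = transfer(conn, "a", "b", 60)  # would overdraw account a
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The credit to `b` in the failed transfer is executed and then undone, so the balances afterwards reflect only the first transfer. That is the difference between "succeeded" and "probably succeeded."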

Foundation · /concepts/idempotent

Idempotency

Running the pipeline once or five times on the same input leaves the target in the same state. Retries and backfills shouldn't produce duplicates or corrupt downstream state. Implement with MERGE, partition overwrites, or natural-key dedup. The interview signal is naming why this matters before being prompted.

Required for any production retry policy
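
A minimal sketch of the MERGE-style approach, using Python's built-in `sqlite3` module. SQLite has no MERGE statement, so this uses its `INSERT ... ON CONFLICT DO UPDATE` upsert as a stand-in; the `orders` table and the `load_orders` helper are hypothetical. The point is that loading the same batch twice leaves the table unchanged.

```python
import sqlite3


def load_orders(conn, rows):
    """Upsert rows keyed on order_id so a retry cannot create duplicates."""
    conn.executemany(
        """
        INSERT INTO orders (order_id, amount)
        VALUES (?, ?)
        ON CONFLICT(order_id) DO UPDATE SET amount = excluded.amount
        """,
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, amount REAL)")

batch = [("o-1", 10.0), ("o-2", 25.5)]
load_orders(conn, batch)
load_orders(conn, batch)  # simulated retry of the same batch

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

A plain `INSERT` in the same position would either fail on the retry or, without the primary key, double the row count and the revenue total.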

ETL vs ELT side by side

The most-asked tradeoff comparison in DE interviews. Naming the conditions that flip the answer is the signal.

Dimension	ETL	ELT
Transformation location	Outside the warehouse (in flight)	Inside the warehouse
Storage of raw data	Often discarded or staged briefly	Preserved indefinitely
When it wins	PII redaction at the edge, non-SQL transforms (ML scoring)	Warehouse compute and storage are cheap, and the transforms fit in SQL
Optionality	Lower, since once transformed, raw is gone	Higher: re-transform anytime
Typical tooling	Spark, Airflow + custom Python	Snowflake / BigQuery / dbt
2026 default	Niche use cases only	Almost universal at modern data teams

Why these two tradeoffs show up everywhere

ETL-vs-ELT and batch-vs-streaming aren't academic dichotomies. They're the two decisions every data platform re-litigates as it grows. A team that picks ELT-everywhere in year one usually adds a thin ETL layer back in once GDPR deletion requests or a PII-heavy source show up. A team that picks batch-everywhere adds streaming back in once product wants a fraud-detection or personalization feature with a latency budget batch can't hit. The interview question is really asking whether you've seen a system evolve, not whether you can define the terms.

In practice, "ELT" means a managed pipeline layer sits between the source systems and the warehouse, handling connector maintenance and schema drift so the transformation logic can live in SQL where the rest of the team can read it. Kleene.ai is one example of that layer: it connects 600+ business tools (CRMs, ERPs, ad platforms) into a governed warehouse and adds AI-assisted modeling and natural-language querying on top, aimed at lean teams that don't have a dedicated pipeline engineer to build the ELT layer by hand. Recognizing this category matters for interviews too: when a team says "we use a managed ELT tool," the follow-up question is what they built themselves on top of it, not whether the tool was the right choice.

Infrastructure concepts (mid-level interviews)

Seven concepts that mid-level interviews and system design rounds focus on. Practice designing schemas from product specs to internalize the grain question.

Infrastructure · /data-modeling/dimensional-modeling

Dimensional modeling

Fact tables hold measures and foreign keys; dimensions hold descriptive context. Star schema denormalizes for analytics; snowflake normalizes the dimensions; Data Vault separates raw integration (hubs, links, satellites) from business rules. State the grain before you draw anything. The grain question separates senior from mid more than any other.

Grain is the load-bearing decision
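
One way to make "state the grain" testable is to treat the grain as a uniqueness claim and check it. The sketch below is illustrative only: the `fact_order_lines` rows and the `grain_violations` helper are hypothetical. It returns every grain key that appears more than once.

```python
from collections import Counter


def grain_violations(rows, grain_columns):
    """Return grain-key tuples that appear more than once in the fact rows."""
    counts = Counter(tuple(row[c] for c in grain_columns) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical fact table declared at the grain "one row per order line".
fact_order_lines = [
    {"order_id": 1, "line_no": 1, "sku": "A", "qty": 2},
    {"order_id": 1, "line_no": 2, "sku": "B", "qty": 1},
    {"order_id": 2, "line_no": 1, "sku": "A", "qty": 5},
]

declared_grain = grain_violations(fact_order_lines, ["order_id", "line_no"])
too_coarse = grain_violations(fact_order_lines, ["order_id"])
```

The table is clean at the declared grain but not at "one row per order." Joining it to an order-level table and summing would double-count order 1, which is the classic bug a stated grain prevents.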

Infrastructure · /data-modeling/scd-type-2

Slowly changing dimensions

Type 1 overwrites. Type 2 keeps history with valid_from/valid_to. Type 6 is a hybrid of Types 1, 2, and 3: versioned history rows that also carry the current value. Pick by query pattern: need point-in-time correctness? Type 2. Only need today's value? Type 1. The most common interview mistake is implementing Type 2 when Type 1 would do.

Don't over-engineer SCD
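
For readers who have not implemented Type 2, here is a minimal in-memory sketch. The `CustomerRow` class, the helper names, and the sample data are hypothetical; a warehouse implementation would do the same thing with a MERGE. It uses `valid_to = None` to mark the current version and treats validity windows as half-open (`valid_from` inclusive, `valid_to` exclusive).

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class CustomerRow:
    customer_id: str
    city: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_type2_change(history, customer_id, new_city, change_date):
    """Close the current row and append a new version if the value changed."""
    current = next(
        (r for r in history
         if r.customer_id == customer_id and r.valid_to is None),
        None,
    )
    if current is not None:
        if current.city == new_city:
            return history  # unchanged value: no new version, so re-runs are safe
        current.valid_to = change_date
    history.append(CustomerRow(customer_id, new_city, change_date))
    return history


def city_as_of(history, customer_id, as_of):
    """Point-in-time lookup: the version whose window contains as_of."""
    for r in history:
        if (r.customer_id == customer_id and r.valid_from <= as_of
                and (r.valid_to is None or as_of < r.valid_to)):
            return r.city
    return None


history = [CustomerRow("c-1", "Austin", date(2024, 1, 1))]
apply_type2_change(history, "c-1", "Denver", date(2025, 6, 1))
apply_type2_change(history, "c-1", "Denver", date(2025, 6, 1))  # re-run: no-op
```

A Type 1 version of the same change would be a single overwrite of `city`, which is why it is the better choice when nobody asks point-in-time questions.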

Infrastructure · /concepts/data-warehouse-design

Data warehouses

Columnar storage, massively parallel processing, separation of storage and compute. Snowflake, BigQuery, Redshift, Databricks SQL are the platforms. The differences that matter: pricing model (per-second vs per-slot), concurrency handling, dialect quirks. Know one deeply; recognize the others.

Deep on one engine, recognition on three more

Infrastructure · /data-lakehouse

Data lakes and lakehouses

Lakes store raw data in object storage at low cost. Lakehouses add ACID via table formats: Delta Lake, Iceberg, Hudi. The two-tier lake-plus-warehouse model is collapsing into a single lakehouse at most companies; Iceberg is the format vendors are converging on.

Iceberg is the cross-vendor convergence

Infrastructure · /data-modeling/medallion-architecture

Medallion architecture

Bronze for raw, silver for cleaned and conformed, gold for business-ready. The layer boundaries are quality gates; failures isolate instead of cascading. Not always right (single-source pipelines don't need three layers), but the default mental model for any modern lakehouse.

Bronze-silver-gold layers as quality gates
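
The "layer boundaries are quality gates" idea can be sketched as two small functions. Everything here is an illustrative assumption: the row shape, the merchant set, and the two validation rules. Bad rows are quarantined at the bronze-to-silver boundary, so the gold aggregate is built only from rows that passed.

```python
def to_silver(bronze_rows, known_merchants):
    """Split raw rows into clean silver rows and a quarantine list."""
    silver, quarantine = [], []
    for row in bronze_rows:
        line_total = sum(row["line_amounts"])
        if row["merchant_id"] not in known_merchants:
            quarantine.append((row["order_id"], "unknown merchant"))
        elif abs(line_total - row["total"]) > 0.005:
            quarantine.append((row["order_id"], "total mismatch"))
        else:
            silver.append({
                "order_id": row["order_id"],
                "merchant_id": row["merchant_id"],
                "total": row["total"],
            })
    return silver, quarantine


def to_gold(silver_rows):
    """Aggregate clean rows into a revenue-per-merchant table."""
    revenue = {}
    for row in silver_rows:
        merchant = row["merchant_id"]
        revenue[merchant] = revenue.get(merchant, 0.0) + row["total"]
    return revenue


# Hypothetical raw orders; orders 2 and 3 are deliberately broken.
bronze = [
    {"order_id": 1, "merchant_id": "m-1", "line_amounts": [5.0, 7.5], "total": 12.5},
    {"order_id": 2, "merchant_id": "m-9", "line_amounts": [3.0], "total": 3.0},
    {"order_id": 3, "merchant_id": "m-1", "line_amounts": [4.0], "total": 40.0},
]
silver, quarantine = to_silver(bronze, known_merchants={"m-1", "m-2"})
gold = to_gold(silver)
```

The quarantine list keeps the reason alongside the key, so a failed row can be traced and replayed without touching the reporting table.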

Infrastructure · /pipeline/architecture

Orchestration

Schedules, retries, dependency graphs. Airflow is the default and what you'll interview on. Dagster and Prefect for newer builds. The interview is rarely about syntax; it's about retries, idempotent restarts, and the failure modes of long-running schedules.

Failure modes > syntax
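
The interaction between retries and idempotent restarts is easier to see in a toy runner than in Airflow syntax. The sketch below is not how Airflow works internally; `run_dag`, the task functions, and the shared `state` dict are hypothetical. It uses the standard library's `graphlib.TopologicalSorter` for dependency ordering and simulates a task that fails after it has already written its output.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, dependencies, max_retries=2):
    """Run tasks in dependency order, retrying each failure up to max_retries."""
    attempts = {}
    for name in TopologicalSorter(dependencies).static_order():
        for attempt in range(1, max_retries + 2):
            attempts[name] = attempt
            try:
                tasks[name]()
                break
            except RuntimeError:
                if attempt == max_retries + 1:
                    raise  # retries exhausted: fail the run
    return attempts


state = {"loaded": set(), "calls": 0}


def extract():
    pass


def load():
    state["calls"] += 1
    # Overwrite-style write: adding the same partition twice changes nothing.
    state["loaded"].add("2026-01-01")
    if state["calls"] == 1:
        raise RuntimeError("transient failure after the write")


def report():
    pass


attempts = run_dag(
    tasks={"extract": extract, "load": load, "report": report},
    dependencies={"load": {"extract"}, "report": {"load"}},
)
```

The `load` task needed two attempts, and because its write is idempotent the retry left exactly one partition behind. With an append-style write, the same retry would have duplicated the data.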

Infrastructure · /data-quality

Data quality

Six dimensions: accuracy, completeness, consistency, timeliness, uniqueness, validity. Practical checks happen at layer boundaries: schema validation on ingestion, row count reconciliation, business rule assertions, freshness monitoring. dbt tests, Great Expectations, and Soda automate most of it.

Checks at boundaries, not inside
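
As a rough illustration of what those boundary checks look like before a tool automates them, the sketch below runs five of them on one batch. The column names, thresholds, and the `boundary_checks` function are assumptions made for the example, not the API of dbt, Great Expectations, or Soda.

```python
from datetime import datetime, timedelta, timezone


def boundary_checks(rows, source_row_count, now, max_age):
    """Return the names of failed checks for a batch crossing a layer boundary."""
    failures = []
    required = {"order_id": int, "amount": float, "loaded_at": datetime}
    # Schema validation: every required column present with the expected type.
    if any(not isinstance(r.get(col), typ)
           for r in rows for col, typ in required.items()):
        failures.append("schema")
    # Row count reconciliation against the count reported by the source.
    if len(rows) != source_row_count:
        failures.append("row_count")
    # Uniqueness on the primary key.
    ids = [r.get("order_id") for r in rows]
    if len(ids) != len(set(ids)):
        failures.append("uniqueness")
    # Business rule assertion: amounts are never negative.
    if any(isinstance(r.get("amount"), float) and r["amount"] < 0 for r in rows):
        failures.append("business_rule")
    # Freshness: the newest row must be younger than max_age.
    stamps = [r["loaded_at"] for r in rows
              if isinstance(r.get("loaded_at"), datetime)]
    if not stamps or now - max(stamps) > max_age:
        failures.append("freshness")
    return failures


now = datetime(2026, 1, 2, 12, 0, tzinfo=timezone.utc)
good = [
    {"order_id": 1, "amount": 9.5, "loaded_at": now - timedelta(minutes=5)},
    {"order_id": 2, "amount": 3.0, "loaded_at": now - timedelta(minutes=4)},
]
bad = [
    {"order_id": 1, "amount": -2.0, "loaded_at": now - timedelta(hours=30)},
    {"order_id": 1, "amount": 4.0, "loaded_at": now - timedelta(hours=30)},
]
good_result = boundary_checks(good, 2, now, timedelta(hours=24))
bad_result = boundary_checks(bad, 3, now, timedelta(hours=24))
```

Returning the list of failed check names, rather than raising on the first one, lets the caller decide whether a failure blocks the load or only raises an alert.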

Batch vs streaming side by side

The second-most-asked tradeoff. Most candidates over-pick streaming; the SLA decides.

Dimension	Batch	Streaming
Latency	Hours to days	Seconds to minutes
Data shape	Bounded datasets (fixed end)	Unbounded event streams
Cost per record	Low (amortized over batch)	Higher (per-event processing)
Failure recovery	Re-run the batch	Replay from checkpoint or offset
When it wins	Reporting, dashboards, ML training	Fraud detection, alerts, real-time personalization
Related architecture	Lambda (a batch layer plus a streaming speed layer)	Kappa (streaming only; reprocess by replaying the log)

How infrastructure choices become architecture problems

Foundation and infrastructure concepts like grain, SCD type, warehouse choice, and orchestration are decisions you make once, at design time. Architecture concepts are what happens after dozens of those decisions accumulate across a real organization, once no single person can hold the whole system in their head anymore. Medallion architecture exists because a raw-to-gold pipeline with no quality gates fails in ways that are expensive to trace. Data lineage exists because "what breaks if I change this column" stops being answerable by memory once a warehouse has a few hundred models. Data contracts exist because schema drift from an upstream team is the single most common cause of a broken dashboard, and data observability only tells you it broke, not that it was about to.

Analysts Are Slowing the Store Down

> We run an e-commerce marketplace where the analytics team queries the production database directly, and that load is degrading the live application. Move analytics onto its own warehouse using a replication path that adds no load to the production system, while a merchant-facing dashboard still shows each seller their new orders within a couple of minutes on a path of its own. A small fraction of orders arrive with broken merchant references or totals that do not add up, so those have to be held back and caught before they reach the reporting tables.

This is also why architecture-tier questions (governance, lineage, catalog, mesh, contracts) dominate senior loops: they're really asking whether you understand data engineering as an organizational problem, not just a technical one. A senior engineer who can design a star schema but has never had to explain to another team why their untested schema change broke a production dashboard is missing that organizational half of the job.

Architecture concepts (senior+ interviews)

Six concepts that senior loops test. Don't reach for these until your foundation is solid: talking about data mesh with wobbly grain knowledge backfires.

Architecture · /data-governance

Data governance

Technical implementation of access control, PII classification, audit logging, and retention. Row-level security, column masking, GDPR-style deletion pipelines. DEs build the infrastructure; legal and security define the policy. Interviews ask about implementation, not policy.

DE owns the how, not the why

Architecture · /data-observability

Data observability

Monitoring data systems the way you'd monitor an application: freshness, volume, schema stability, distribution drift. Tools: Monte Carlo, Bigeye, Elementary. The newer term for what good data teams have always done, with a vendor category attached.

Same discipline, new tooling category

Architecture · /concepts/data-lineage

Data lineage

Where this column came from and what depends on it. Answers the 'if this is wrong, what broke' question. dbt provides model-level lineage out of the box through ref() dependencies; column-level lineage generally needs a paid tier or a separate tool. DataHub, Marquez, and OpenLineage cover the broader pipeline.

Required for incident response at scale
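
At its core, the "what broke" question is a graph traversal. The sketch below assumes lineage is available as a list of (upstream, downstream) edges; the model names and the `downstream_of` helper are hypothetical, and real lineage tools expose this through their own APIs.

```python
from collections import deque


def downstream_of(edges, start):
    """Breadth-first walk over (upstream, downstream) edges from one node."""
    children = {}
    for upstream, downstream in edges:
        children.setdefault(upstream, []).append(downstream)
    seen, queue = [], deque([start])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


# Hypothetical model-level lineage for a small warehouse.
lineage_edges = [
    ("raw.orders", "stg_orders"),
    ("raw.customers", "stg_customers"),
    ("stg_orders", "fct_orders"),
    ("stg_customers", "fct_orders"),
    ("fct_orders", "dash_revenue"),
]
impacted = downstream_of(lineage_edges, "raw.orders")
```

The result is ordered nearest-first, which matches how an incident is usually triaged: check the staging model, then the fact table, then the dashboard.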

Architecture · /concepts/data-catalog

Data catalog

Searchable index of every dataset, with schemas, descriptions, owners, and usage statistics. The point is discoverability so analysts can find data without asking the engineering team. Atlan, DataHub, Alation, plus cloud-native options like AWS Glue Catalog.

Self-service discoverability

Architecture · /data-mesh

Data mesh

Decentralized data ownership. Domain teams publish data products with quality SLAs; a platform team provides self-serve infrastructure; federated governance sets global standards. Works at large organizations where the central team has become a bottleneck. Often misapplied at small ones.

Right for scale, wrong for stage

Architecture · emerging

Data contracts

Explicit schema and SLA agreements between data producers and consumers. Producers commit to schema stability and freshness; consumers commit to documented expectations. Enforced via schema validation in CI/CD. Solves the silent breakage problem that data observability only detects. Emerging concept; expect to see it in 2027 interviews.

Producer-consumer contract as code
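
A contract check in CI can be as simple as comparing a proposed schema against the agreed one. The sketch below is a simplified assumption of what such a check might look like; the contract layout, the type names, and `contract_violations` are hypothetical rather than any specific tool's format. It treats removed columns and changed types as breaking, and added columns as safe.

```python
def contract_violations(contract, proposed_schema):
    """List breaking changes: a removed column or a changed type."""
    problems = []
    for column, expected_type in contract["columns"].items():
        if column not in proposed_schema:
            problems.append(f"missing column: {column}")
        elif proposed_schema[column] != expected_type:
            problems.append(
                f"type change: {column} {expected_type} -> {proposed_schema[column]}"
            )
    return problems


# Hypothetical contract published by the producing team.
orders_contract = {
    "dataset": "orders",
    "freshness_minutes": 60,
    "columns": {"order_id": "int", "amount": "decimal", "created_at": "timestamp"},
}

additive_change = {
    "order_id": "int", "amount": "decimal",
    "created_at": "timestamp", "channel": "string",
}
breaking_change = {"order_id": "string", "amount": "decimal"}

ok = contract_violations(orders_contract, additive_change)
broken = contract_violations(orders_contract, breaking_change)
```

Failing the producer's build when the list is non-empty is what turns a broken dashboard into a blocked pull request.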

Cloud data warehouses compared

The five platforms you should recognize. Know one deeply; recognize the others. The differences that matter: pricing model, concurrency, dialect.

Platform	Pricing model	What it's known for
Snowflake	Per-second compute via virtual warehouses	Best-in-class concurrency, multi-cloud, separation of storage/compute
BigQuery	Per slot-hour (capacity) or per TiB scanned (on-demand)	Serverless, deep Google ecosystem integration, ANSI SQL
Databricks SQL	Per-compute via SQL warehouses	Lakehouse-native, Photon vectorized, MLflow integration
Redshift	Per-node-hour (RA3) or per-second (Serverless)	AWS-native, mature Postgres-style SQL, getting better with Serverless
DuckDB	Free (single-node)	Embedded analytics, replaces pandas for medium-data workloads

How the warehouse choice actually gets made

In an interview, "which warehouse would you pick" sounds like a technical question. In practice it's rarely a green-field decision. Most data engineers inherit a warehouse chosen by whoever set up the company's AWS or GCP account years earlier, and the real skill is working within those constraints rather than re-litigating the platform choice. When the choice is live (a startup's first warehouse, or a migration), it's decided by existing cloud spend and team SQL fluency more often than by feature comparisons between Snowflake, BigQuery, Databricks SQL, and Redshift. Knowing the differences signals competence; leading with a platform opinion before understanding the constraints reads as junior.

Concepts and tools both matter less than sequence. Learning data contracts before you can state a grain, or picking a warehouse opinion before you've worked within someone else's constraints, produces the same failure mode: vocabulary without judgment. Fixing that sequencing is the point of grouping these 18 concepts into foundation, infrastructure, and architecture tiers rather than listing them alphabetically.

Learning order if you're starting from scratch

Four phases of concept learning. Don't skip foundations to chase architecture. Skipping causes senior-loop failures.

Weeks 1–6

Foundations first (junior roles)

SQL fluency. Python fluency. ETL vs ELT. Batch vs streaming. OLAP vs OLTP. ACID. Idempotency. These show up in every interview at every level. Junior interviews mostly stop here. If your foundation is weak, no amount of architecture vocabulary will save you in a senior loop.

5 concepts cover 80% of junior interviews

Weeks 7–14

Infrastructure next (mid-level)

Dimensional modeling. Star schema. SCDs. Data warehousing. Lakehouse formats. Medallion architecture. Orchestration. Data quality. Idempotency in depth. Mid-level interviews and system design rounds live here. Practice designing schemas from product specs.

Schema + orchestration = mid-level signal

Weeks 15+

Architecture last (senior+)

Governance. Observability. Lineage. Catalog. Data mesh. Contracts. Senior loops and architecture conversations test these. Don't reach for them until the foundation is solid: the failure mode at senior loops is talking about data mesh while having a wobbly grasp of grain.

Foundation gaps disqualify senior candidates

Cross-cutting

Tools: deep on one, recognition on three

Concepts matter more than tools, but interviewers do ask. One warehouse (Snowflake, BigQuery, or Postgres) deeply. Airflow conceptually. dbt for transformations. Spark conceptually for big-data interviews. One cloud (usually AWS). Anything beyond that is optional; pretending to know it backfires.

Depth > breadth

Data engineering concepts FAQ

What are the most important concepts to learn first?

SQL fluency, Python fluency, dimensional modeling with grain, and the batch-vs-streaming decision. These four cover the foundation of every data engineering interview. Everything else builds on them, and trying to learn medallion architecture before you can articulate grain is the canonical mistake.

ETL or ELT, what's the right answer?

Almost always ELT in 2026. Cloud warehouse compute is cheap, storage is cheap, and loading raw preserves optionality. ETL still wins for PII redaction at the edge and for transformations that aren't expressible in SQL (ML scoring, heavy text processing). Naming the conditions that flip the answer is what the interviewer is listening for.

How do I prepare for the concepts portion of an interview?

Read the deep dive on each concept once. Then explain it out loud, to a wall or a phone recorder, without looking. The gap between recognition and recall is where interview answers fail. Recognition feels like understanding; only recall transfers.

Which tools should I know?

Concepts matter more than tools, but interviewers ask. One warehouse (Snowflake, BigQuery, or Postgres) deeply. Airflow conceptually. dbt for transformations. Spark conceptually for big-data interviews. One cloud (usually AWS). Anything beyond that is optional; pretending to know it backfires.

Are these concepts the same across all companies?

The concepts are universal. The vocabulary and emphasis vary. Netflix and Meta talk about 'data products' (data mesh terminology). Amazon talks about 'data lakes' and 'data sources' (older terminology). Google talks about 'data platforms' (less ideological). The underlying ideas are the same. Learn the concepts; adapt the words to the team's language during interviews.

What concept is most often missing from interview prep?

Idempotency. Candidates can define dimensional modeling and know what a lakehouse is, but cannot describe how they'd make a backfill safe to re-run. Idempotency comes up in nearly every senior pipeline-design interview and is the most common gap in otherwise-strong candidates.

Why practice

Reading isn't the work

Active recall beats re-reading by 50%

Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.

76% of hiring managers reject on the coding task, not the resume

From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.

System design is graded on the calls you defend out loud

Ingestion, batch vs streaming, the bronze/silver/gold layers, idempotency, backfill and replay. Sketching the pipeline and naming the failure modes is the signal, not the boxes.
