Data Engineering Concepts

A glossary of the concepts that show up in interviews and on the job. Each definition is short enough to use; each link goes to the deeper read when you need it.

Use this as a glossary, not a textbook. The order is roughly the order you'd learn the concepts in: foundations first, infrastructure in the middle, architecture-level concepts last.

The depth required shifts with role level. Juniors get asked for definitions and tradeoffs. Seniors get asked how they'd implement the concept with specific tools and what fails when they do. Both versions live in the deep dives.

The concepts that show up everywhere

ETL and ELT. Extract-Transform-Load versus Extract-Load-Transform. ETL transforms in flight; ELT lands raw and transforms in the warehouse. ELT won because cloud warehouse compute is cheap and elastic. ETL still wins for PII redaction at the edge and non-SQL transformations. The interview signal is naming the conditions that flip the answer, not picking a side.
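A minimal sketch of the contrast, using Python's built-in sqlite3 as a stand-in warehouse. The table names, sample rows, and the redact helper are illustrative assumptions, not any platform's API. The ETL path hashes the email before anything is stored; the ELT path lands raw rows and does the transformation in SQL afterwards.

```python
import hashlib
import sqlite3

# Hypothetical source rows; the email column stands in for PII.
SOURCE_ROWS = [
    {"order_id": 1, "email": "ana@example.com", "amount_cents": 1250},
    {"order_id": 2, "email": "bo@example.com", "amount_cents": 800},
]


def redact(row):
    """ETL-style transform in flight: hash the email before it is stored."""
    out = dict(row)
    out["email"] = hashlib.sha256(row["email"].encode()).hexdigest()
    return out


def run_etl(conn):
    # Transform first, then load: the raw email never lands in the target.
    conn.execute(
        "CREATE TABLE orders_clean (order_id INTEGER, email TEXT, amount_cents INTEGER)"
    )
    conn.executemany(
        "INSERT INTO orders_clean VALUES (:order_id, :email, :amount_cents)",
        [redact(r) for r in SOURCE_ROWS],
    )


def run_elt(conn):
    # Load raw first, then transform with SQL inside the "warehouse".
    conn.execute(
        "CREATE TABLE orders_raw (order_id INTEGER, email TEXT, amount_cents INTEGER)"
    )
    conn.executemany(
        "INSERT INTO orders_raw VALUES (:order_id, :email, :amount_cents)",
        SOURCE_ROWS,
    )
    conn.execute(
        "CREATE TABLE orders_by_domain AS "
        "SELECT substr(email, instr(email, '@') + 1) AS domain, "
        "SUM(amount_cents) AS total_cents "
        "FROM orders_raw GROUP BY domain"
    )


conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)
etl_emails = [r[0] for r in conn.execute("SELECT email FROM orders_clean")]
elt_totals = conn.execute(
    "SELECT domain, total_cents FROM orders_by_domain"
).fetchall()
```

The ELT path keeps the raw table around, so a new question (say, totals by customer) is another SQL statement rather than a re-extract. The ETL path is the one to reach for when the raw value must never be stored at all.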

Batch versus streaming. Batch processes bounded datasets on a schedule. Streaming processes unbounded event streams continuously. Pick batch when latency can be measured in hours, streaming when it's measured in seconds, hybrid when analytics needs both. Most candidates over-pick streaming; the SLA decides, not the ambition.
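One way to see the difference in code, as a rough sketch: the batch function reads a bounded list and returns once, while the streaming function consumes events one at a time and emits a result each time a tumbling window closes. The event tuples, the 60-second window, and the assumption that events arrive in timestamp order are all illustrative; real stream processors also handle late and out-of-order data.

```python
from collections import defaultdict

# Hypothetical events: (timestamp_seconds, key, value)
EVENTS = [(0, "a", 5), (30, "b", 7), (65, "a", 1), (130, "a", 2)]


def batch_totals(events):
    """Batch: the whole bounded dataset is available, so aggregate once."""
    totals = defaultdict(int)
    for _, key, value in events:
        totals[key] += value
    return dict(totals)


def stream_window_totals(events, window_seconds=60):
    """Streaming: emit (window_start, total) as each tumbling window closes.

    Assumes events arrive in timestamp order.
    """
    current_window = None
    total = 0
    for ts, _, value in events:
        window = ts - ts % window_seconds
        if current_window is not None and window != current_window:
            yield current_window, total
            total = 0
        current_window = window
        total += value
    # A real stream never ends; this flush exists only so the finite demo
    # input reports its last open window.
    if current_window is not None:
        yield current_window, total


batch_result = batch_totals(EVENTS)
stream_result = list(stream_window_totals(EVENTS))
```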

Dimensional modeling. Fact tables hold measures and foreign keys; dimensions hold descriptive context. Star schema denormalizes for analytics; snowflake normalizes the dimensions; data vault separates integration from business. State the grain before you draw anything. The grain question separates senior from mid more than any other.
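A small star schema makes the grain point concrete. The sketch below uses sqlite3 with made-up table and column names; the grain is written down as a comment on the fact table before any measure is defined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        product_name TEXT,
        category TEXT
    );
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY,
        calendar_date TEXT,
        month TEXT
    );
    -- Grain: one row per order line (order_id, line_number).
    CREATE TABLE fact_order_line (
        order_id INTEGER,
        line_number INTEGER,
        product_key INTEGER REFERENCES dim_product(product_key),
        date_key INTEGER REFERENCES dim_date(date_key),
        quantity INTEGER,
        amount_cents INTEGER,
        PRIMARY KEY (order_id, line_number)
    );
    INSERT INTO dim_product VALUES
        (1, 'Widget', 'Hardware'),
        (2, 'Manual', 'Books');
    INSERT INTO dim_date VALUES
        (20240101, '2024-01-01', '2024-01'),
        (20240201, '2024-02-01', '2024-02');
    INSERT INTO fact_order_line VALUES
        (100, 1, 1, 20240101, 2, 2000),
        (100, 2, 2, 20240101, 1, 500),
        (101, 1, 1, 20240201, 1, 1000);
    """
)

# Measures come from the fact table; the dimensions supply the group-by context.
revenue_by_category_month = conn.execute(
    """
    SELECT p.category, d.month, SUM(f.amount_cents)
    FROM fact_order_line AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
    """
).fetchall()
```

Because the grain is one row per order line, summing amount_cents is safe at any rollup. If the fact table mixed order-level and line-level rows, the same query would double count.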

Slowly Changing Dimensions. Type 1 overwrites. Type 2 keeps history with valid_from and valid_to. Type 6 combines Types 1, 2, and 3: history rows plus a current-value column on every row. Pick by query pattern: do you need point-in-time correctness? Type 2. Do you only need today's value? Type 1. The most common interview mistake is implementing Type 2 when Type 1 would do.
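A Type 2 change can be sketched in a few lines of plain Python. The row layout, the 9999-12-31 sentinel for the open-ended current row, and the function names are assumptions for illustration; in a warehouse the same logic is usually a MERGE statement.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # assumed sentinel meaning "still current"


def apply_scd2(history, customer_id, new_city, effective):
    """Return a new history list with a Type 2 change applied.

    Closes the current row when the value changed and appends a new current
    row. An unchanged value is a no-op, which keeps reruns safe.
    """
    updated = []
    current_matches = False
    for row in history:
        is_current = (
            row["customer_id"] == customer_id and row["valid_to"] == OPEN_END
        )
        if is_current and row["city"] == new_city:
            current_matches = True
            updated.append(row)
        elif is_current:
            updated.append({**row, "valid_to": effective})
        else:
            updated.append(row)
    if not current_matches:
        updated.append(
            {
                "customer_id": customer_id,
                "city": new_city,
                "valid_from": effective,
                "valid_to": OPEN_END,
            }
        )
    return updated


def city_as_of(history, customer_id, as_of):
    """Point-in-time lookup: valid_from is inclusive, valid_to is exclusive."""
    for row in history:
        if (
            row["customer_id"] == customer_id
            and row["valid_from"] <= as_of < row["valid_to"]
        ):
            return row["city"]
    return None


history = apply_scd2([], 7, "Lisbon", date(2023, 1, 1))
history = apply_scd2(history, 7, "Porto", date(2024, 6, 1))
history = apply_scd2(history, 7, "Porto", date(2024, 7, 1))  # unchanged: no-op
```

Type 1 would be a plain overwrite of the city on a single row. If no query ever calls something like city_as_of, that simpler version is the one to build.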

Data warehouses. Columnar storage, massively parallel processing, separation of storage and compute. Snowflake, BigQuery, Redshift, Databricks SQL are the platforms. The differences that matter: pricing model (per-second versus per-slot), concurrency handling, dialect quirks. Know one deeply; recognize the others.

Data lakes and lakehouses. Lakes store raw data in object storage at low cost. Lakehouses add ACID via table formats: Delta Lake, Iceberg, Hudi. The two-tier lake-plus-warehouse model is collapsing into a single lakehouse at many companies, and Iceberg is the format vendors are converging on.

Medallion architecture. Bronze for raw, silver for cleaned and conformed, gold for business-ready. The layer boundaries are quality gates; failures isolate instead of cascading. Not always right (single-source pipelines don't need three layers), but the default mental model for any modern lakehouse.
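The layer boundaries can be sketched as two functions with a quality gate between them. The sample records and field names below are made up; the point is that a bad bronze row is set aside at the silver boundary instead of breaking the gold aggregate.

```python
# Hypothetical bronze records: raw strings exactly as they arrived.
BRONZE = [
    {"order_id": "1", "amount": "12.50", "country": "pt"},
    {"order_id": "2", "amount": "oops", "country": "PT"},
    {"order_id": "1", "amount": "12.50", "country": "pt"},  # duplicate delivery
    {"order_id": "3", "amount": "7.25", "country": "es"},
]


def to_silver(bronze_rows):
    """Clean and conform: cast types, normalize codes, dedupe, reject bad rows."""
    silver, rejected, seen = [], [], set()
    for row in bronze_rows:
        try:
            clean = {
                "order_id": int(row["order_id"]),
                "amount_cents": round(float(row["amount"]) * 100),
                "country": row["country"].upper(),
            }
        except (KeyError, ValueError):
            rejected.append(row)  # isolated here; gold never sees it
            continue
        if clean["order_id"] in seen:
            continue
        seen.add(clean["order_id"])
        silver.append(clean)
    return silver, rejected


def to_gold(silver_rows):
    """Business-ready aggregate: revenue in cents per country."""
    totals = {}
    for row in silver_rows:
        totals[row["country"]] = totals.get(row["country"], 0) + row["amount_cents"]
    return totals


silver, rejected = to_silver(BRONZE)
gold = to_gold(silver)
```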

Orchestration. Schedules, retries, dependency graphs. Airflow is the default and what you'll interview on. Dagster and Prefect for newer builds. The interview is rarely about syntax; it's about retries, idempotent restarts, and the failure modes of long-running schedules.
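To make the dependency-graph and retry ideas concrete without any orchestrator installed, here is a toy runner built on the standard library's graphlib. It is not Airflow's API; the task names, the run_dag helper, and the simulated transient failure are all hypothetical, and a real scheduler would add backoff, timeouts, and persisted state.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, dependencies, max_retries=2):
    """Run tasks in dependency order, retrying each up to max_retries times.

    dependencies maps a task name to the set of tasks it depends on.
    Returns the number of attempts each task needed.
    """
    attempts = {}
    for name in TopologicalSorter(dependencies).static_order():
        for attempt in range(1, max_retries + 2):
            attempts[name] = attempt
            try:
                tasks[name]()
                break
            except Exception:
                if attempt == max_retries + 1:
                    raise  # retries exhausted: fail the run, skip downstream


    return attempts


calls = {"extract": 0}
run_log = []


def extract():
    calls["extract"] += 1
    if calls["extract"] == 1:
        raise RuntimeError("simulated transient source timeout")
    run_log.append("extract")


def transform():
    run_log.append("transform")


def load():
    run_log.append("load")


TASKS = {"extract": extract, "transform": transform, "load": load}
DEPENDENCIES = {"transform": {"extract"}, "load": {"transform"}}

attempts = run_dag(TASKS, DEPENDENCIES)
```

The retry only helps because extract is safe to run twice. A task that appends rows on every attempt would turn the same retry into duplicates, which is why orchestration questions lead straight into idempotency.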

Idempotency. Same input, same output, regardless of how many times the pipeline runs. Retries and backfills shouldn't produce duplicates or corrupt downstream state. Implement with MERGE, partition overwrites, or natural-key dedup. The interview signal is naming why this matters before being prompted.
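Partition overwrite is the easiest of the three to show. The sketch below uses sqlite3 and a hypothetical daily_sales table: each run deletes the target day and reinserts it inside one transaction, so a retry or backfill converges to the same rows instead of appending duplicates.

```python
import sqlite3


def load_partition(conn, day, rows):
    """Replace one day's partition atomically so reruns are idempotent."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("DELETE FROM daily_sales WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_sales (day, store, total_cents) VALUES (?, ?, ?)",
            [(day, store, total) for store, total in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT, store TEXT, total_cents INTEGER)")

rows = [("lisbon", 1200), ("porto", 900)]
load_partition(conn, "2024-06-01", rows)
load_partition(conn, "2024-06-01", rows)  # retry or backfill of the same day

row_count = conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
total = conn.execute("SELECT SUM(total_cents) FROM daily_sales").fetchone()[0]
```

A plain INSERT in place of the delete-then-insert would leave four rows after the second run, which is the failure the definition above rules out.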

Data quality. Six dimensions: accuracy, completeness, consistency, timeliness, uniqueness, validity. Practical checks happen at layer boundaries: schema validation on ingestion, row count reconciliation, business rule assertions, freshness monitoring. dbt tests, Great Expectations, and Soda automate most of it.
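The boundary checks can be sketched as one function that returns a list of failures. This is a hand-rolled illustration, not the dbt, Great Expectations, or Soda API; the column names and rules are assumptions. It covers row count reconciliation, completeness, a business rule, and uniqueness.

```python
def run_checks(rows, expected_count, required=("order_id", "amount_cents")):
    """Return human-readable failures; an empty list means the batch passes."""
    failures = []

    # Row count reconciliation against what the source reported.
    if len(rows) != expected_count:
        failures.append(f"row count {len(rows)} != expected {expected_count}")

    for i, row in enumerate(rows):
        # Completeness: required columns must be present and non-null.
        missing = [c for c in required if row.get(c) is None]
        if missing:
            failures.append(f"row {i}: missing {missing}")
        # Validity (business rule): amounts cannot be negative.
        elif row["amount_cents"] < 0:
            failures.append(f"row {i}: negative amount")

    # Uniqueness on the natural key.
    ids = [row.get("order_id") for row in rows]
    if len(ids) != len(set(ids)):
        failures.append("duplicate order_id")

    return failures


dirty = [
    {"order_id": 1, "amount_cents": 500},
    {"order_id": 1, "amount_cents": -5},
    {"order_id": 2, "amount_cents": None},
]
clean = [
    {"order_id": 1, "amount_cents": 500},
    {"order_id": 2, "amount_cents": 300},
]

dirty_failures = run_checks(dirty, expected_count=3)
clean_failures = run_checks(clean, expected_count=2)
```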

Data governance. The technical implementation of access control, PII classification, audit logging, and retention. Row-level security, column masking, GDPR-style deletion pipelines. Data engineers build the infrastructure; legal and security define the policy. Interviews ask about the implementation, not the policy.

Data observability. Monitoring data systems the way you'd monitor an application: freshness, volume, schema stability, distribution drift. Tools: Monte Carlo, Bigeye, Elementary. The newer term for what good data teams have always done, with a vendor category attached.

Data lineage. Where this column came from and what depends on it. Answers the "if this is wrong, what broke" question. dbt provides model-level lineage out of the box from the dependencies between SQL models; column-level lineage typically needs additional tooling. DataHub, Marquez, and OpenLineage cover the broader pipeline.
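The "what broke" question is a graph traversal. A minimal sketch, assuming lineage is stored as a mapping from each dataset to its direct downstream dependents; the dataset names are hypothetical and real lineage tools expose this through their own APIs.

```python
from collections import deque

# Hypothetical lineage edges: dataset -> datasets that read from it directly.
LINEAGE = {
    "raw.orders": ["stg.orders"],
    "stg.orders": ["fct.order_lines", "dim.customers"],
    "fct.order_lines": ["report.revenue"],
    "dim.customers": ["report.revenue"],
    "report.revenue": [],
}


def downstream(graph, start):
    """Breadth-first walk returning everything affected if `start` is wrong."""
    seen = set()
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


impacted = downstream(LINEAGE, "stg.orders")
```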

Data catalog. A searchable index of every dataset, with schemas, descriptions, owners, and usage statistics. The point is discoverability so analysts can find data without asking the engineering team. Atlan, DataHub, Alation, plus cloud-native options like AWS Glue Catalog.

Data mesh. Decentralized data ownership. Domain teams publish data products with quality SLAs; a platform team provides the self-serve infrastructure; federated governance sets global standards. Works at large organizations where the central team has become a bottleneck. Often misapplied at small ones.

OLAP versus OLTP. Transaction systems (OLTP: Postgres, MySQL) handle high-volume low-latency reads and writes for applications. Analytical systems (OLAP: Snowflake, BigQuery) handle complex aggregations on large datasets. Data engineers build the pipelines that move data from OLTP to OLAP, and the schema decisions differ at each end.

ACID properties. Atomicity, consistency, isolation, durability. The transactional guarantees that distinguish "your write succeeded" from "your write probably succeeded." Increasingly relevant on the analytics side too, because lakehouse table formats brought ACID to object storage.
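Atomicity and consistency are the easiest of the four to demonstrate. In this sqlite3 sketch (table name, CHECK constraint, and transfer helper are illustrative assumptions), a transfer is two updates in one transaction. When the second update violates the constraint, the first is rolled back too, so no half-applied write survives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as a single all-or-nothing transaction."""
    try:
        with conn:  # commit on success, rollback on any exception
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "a", "b", 60)
failed = transfer(conn, "a", "b", 60)  # would overdraw "a"; credit is undone
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```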

Learning order, if you're starting from scratch

Foundations. SQL fluency. Python fluency. ETL versus ELT. Batch versus streaming. OLAP versus OLTP. ACID. These show up in every loop at every level. Junior interviews mostly stop here.

Infrastructure. Dimensional modeling. Star schema. SCDs. Data warehousing. Orchestration. Data quality. Idempotency. Mid-level interviews and system design rounds live here.

Architecture. Lakehouse. Medallion. Data mesh. Governance. Observability. Lineage. Catalog. Senior loops and architecture conversations test these. Don't reach for them until the foundation is solid; the failure mode at senior loops is talking about data mesh while having a wobbly grasp of grain.

Common questions

What are the most important concepts to learn first?
SQL fluency, Python fluency, dimensional modeling with grain, and the batch-versus-streaming decision. These four cover the foundation of every data engineering interview. Everything else builds on them, and trying to learn medallion architecture before you can articulate grain is the canonical mistake.

ETL or ELT, what's the right answer?
Almost always ELT in 2026. Cloud warehouse compute is cheap, storage is cheap, and loading raw preserves optionality. ETL still wins for PII redaction at the edge and for transformations that aren't expressible in SQL (ML scoring, heavy text processing). Naming the conditions that flip the answer is what the interviewer is listening for.

How do I prepare for the concepts portion of an interview?
Read the deep dive on each concept once. Then explain it out loud, to a wall or a phone recorder, without looking. The gap between recognition and recall is where interview answers fail. Recognition feels like understanding; only recall transfers.

Which tools should I know?
Concepts matter more than tools, but interviewers do ask. One warehouse (Snowflake, BigQuery, or Postgres) deeply. Airflow conceptually. dbt for transformations. Spark conceptually for big-data interviews. One cloud (usually AWS). Anything beyond that is optional; pretending to know it backfires.

Why practice: reading isn't the work

  1. Active recall beats re-reading by 50%

    Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.

  2. 76% of hiring managers reject on the coding task, not the resume

    From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.

  3. Five problem shapes cover 80% of data engineer loops

    Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition.
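Two of the shapes named above, written by hand as a rough sketch. The 30-minute session gap, the tuple layouts, and the function names are assumptions; in an interview these are more often written in SQL with window functions, but the logic is the same.

```python
from collections import defaultdict


def sessionize(events, gap_seconds=1800):
    """Assign a per-user session number; a gap above gap_seconds starts a new one.

    events is an iterable of (user, timestamp_seconds) tuples.
    """
    out = []
    last_seen = {}
    session_no = defaultdict(int)
    for user, ts in sorted(events):
        if user not in last_seen or ts - last_seen[user] > gap_seconds:
            session_no[user] += 1
        last_seen[user] = ts
        out.append((user, ts, session_no[user]))
    return out


def top_n_per_group(rows, n):
    """Return the n largest values per group; rows are (group, value) tuples."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    return {g: sorted(vs, reverse=True)[:n] for g, vs in groups.items()}


sessions = sessionize([("u1", 0), ("u1", 600), ("u1", 4000), ("u2", 100)])
top2 = top_n_per_group([("a", 3), ("a", 9), ("a", 5), ("b", 1)], n=2)
```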

Where to go next