Data Engineering

Data Engineering Concepts

Here's the truth no one tells you: you don't need to memorize every concept in data engineering. You need fluency in the ten that show up everywhere and a working vocabulary for the rest. This page gives you both. Start at the top and read downward. The foundational concepts come first because they're what interviewers probe when you stumble on the advanced ones. Get the basics solid and the senior topics get easier.

By the numbers: 18% data modeling share · 84 lessons in library · 176 interactive atoms · 4 domains covered.

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

Core Data Engineering Concepts

You'll notice the depth you need shifts with the role level. Juniors get asked for definitions and trade-offs. Seniors get asked how they'd implement the concept with specific tools and failure modes. Read the summary, follow the deep-dive link when the concept feels fuzzy, and come back when you're ready for the next one.

ETL and ELT

Extract, Transform, Load vs Extract, Load, Transform. ETL transforms data before loading; ELT loads raw data and transforms inside the warehouse. Modern stacks favor ELT because cloud warehouses handle the compute. Understanding when to use each pattern is a standard interview question.

Pipeline Architecture
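The difference is easiest to see side by side. Here's a minimal sketch using an in-memory SQLite database as a stand-in for a cloud warehouse; the table names and sample data are purely illustrative:

```python
import sqlite3

# Source rows as extracted from an operational system (illustrative data).
raw_orders = [("o1", "  Widget ", 3), ("o2", "Gadget", 5)]

# ETL: transform in the pipeline, then load the cleaned result.
def etl(conn):
    cleaned = [(oid, name.strip().lower(), qty) for oid, name, qty in raw_orders]
    conn.execute("CREATE TABLE etl_orders (id TEXT, product TEXT, qty INT)")
    conn.executemany("INSERT INTO etl_orders VALUES (?, ?, ?)", cleaned)

# ELT: load raw data untouched, then transform inside the warehouse with SQL.
def elt(conn):
    conn.execute("CREATE TABLE raw_orders (id TEXT, product TEXT, qty INT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_orders)
    conn.execute("""
        CREATE TABLE stg_orders AS
        SELECT id, lower(trim(product)) AS product, qty FROM raw_orders
    """)

conn = sqlite3.connect(":memory:")
etl(conn)
elt(conn)
# Both paths yield the same cleaned rows; ELT also preserves the raw copy,
# so you can re-transform later without re-extracting.
assert conn.execute("SELECT product FROM etl_orders ORDER BY id").fetchall() == \
       conn.execute("SELECT product FROM stg_orders ORDER BY id").fetchall()
```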

Batch vs. Streaming

Batch pipelines process data in scheduled intervals (hourly, daily). Streaming pipelines process data continuously as it arrives. The choice depends on latency requirements, data volume, and infrastructure complexity. Most production systems use batch with streaming for specific low-latency use cases.

Pipeline Architecture
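A toy illustration of the trade-off: the same aggregation computed once per scheduled window versus incrementally on each arriving event (all names here are invented for the example):

```python
# Batch: accumulate a window of events, process them all on a schedule.
def batch_totals(events):
    totals = {}
    for user, amount in events:
        totals[user] = totals.get(user, 0) + amount
    return totals

# Streaming: update state incrementally as each event arrives.
class StreamingTotals:
    def __init__(self):
        self.totals = {}

    def on_event(self, user, amount):
        self.totals[user] = self.totals.get(user, 0) + amount
        return self.totals[user]  # answer available immediately (low latency)

events = [("alice", 10), ("bob", 5), ("alice", 7)]
stream = StreamingTotals()
for user, amount in events:
    stream.on_event(user, amount)

# Same final answer; streaming just had it earlier, at higher system complexity.
assert batch_totals(events) == stream.totals == {"alice": 17, "bob": 5}
```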

Data Modeling

How you structure tables in a warehouse determines query performance and analyst productivity. Star schemas (fact + dimension tables) are the standard for analytics. Snowflake schemas normalize dimensions further. Data vault handles complex, evolving source systems. Interviewers test whether you can design a schema for a given business domain.

Data Modeling Guide
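A star schema can be sketched in a few lines of DDL. The following uses Python's built-in SQLite purely for illustration; real warehouse designs add surrogate-key management, slowly changing dimensions, and more:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptive attributes, one row per entity.
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, day TEXT, month TEXT);
    -- Fact table: numeric measures plus foreign keys to the dimensions.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product,
        date_key    INTEGER REFERENCES dim_date,
        revenue     REAL
    );
""")
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                 [(1, "Widget", "Hardware"), (2, "Gizmo", "Hardware")])
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                 [(20240101, "2024-01-01", "2024-01")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(1, 20240101, 100.0), (2, 20240101, 50.0)])

# The typical analytics query: join the fact to a dimension and aggregate.
rows = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales f JOIN dim_product p USING (product_key)
    GROUP BY p.category
""").fetchall()
assert rows == [("Hardware", 150.0)]
```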

Data Warehousing

A data warehouse is a centralized store optimized for analytical queries. Columnar storage, MPP (massively parallel processing), and query optimization make warehouses fast for aggregation and joins. Snowflake, BigQuery, Redshift, and Databricks SQL are the major platforms. Know the differences between them.

Warehouse Design

Data Lakes and Lakehouses

Data lakes store raw data in object storage (S3, GCS) at low cost. Lakehouses add ACID transactions and query performance via table formats (Delta Lake, Iceberg, Hudi). The lakehouse architecture is replacing the lake + warehouse two-tier model at many companies.

Lakehouse Architecture

Orchestration

Orchestration tools schedule, monitor, and manage pipeline dependencies. Apache Airflow is the most common orchestrator in data engineering. Dagster, Prefect, and Mage are newer alternatives. Interviewers ask how you would handle retries, backfills, dependency management, and failure alerting.

Pipeline Architecture
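The retry and dependency mechanics orchestrators provide can be sketched in plain Python. This is not Airflow code, just an illustration of the two ideas; helpers like `run_dag` are invented for this example:

```python
import time

def run_with_retries(task, max_retries=3, delay=0.0):
    """Re-run a failing task up to max_retries times (akin to Airflow's retries)."""
    for attempt in range(1, max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the failure for alerting
            time.sleep(delay)

def run_dag(tasks, deps):
    """Run tasks in dependency order; deps maps task name -> upstream names."""
    done, order = set(), []
    def visit(name):
        if name in done:
            return
        for upstream in deps.get(name, []):
            visit(upstream)  # upstream tasks run first
        run_with_retries(tasks[name])
        done.add(name)
        order.append(name)
    for name in tasks:
        visit(name)
    return order

# A transient failure that succeeds on the third attempt.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"
assert run_with_retries(flaky) == "ok"

# A three-task pipeline with linear dependencies.
log = []
tasks = {"load": lambda: log.append("load"),
         "extract": lambda: log.append("extract"),
         "transform": lambda: log.append("transform")}
deps = {"transform": ["extract"], "load": ["transform"]}
order = run_dag(tasks, deps)
assert order == ["extract", "transform", "load"]
```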

Data Quality

Data quality covers six dimensions: accuracy, completeness, consistency, timeliness, uniqueness, and validity. Testing patterns include schema validation at ingestion, row count reconciliation, business rule assertions, and freshness monitoring. Tools like Great Expectations, dbt tests, and Soda automate these checks.

Data Quality Guide
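Three of those checks can be sketched as plain-Python assertions. In practice you would reach for dbt tests or Great Expectations; the schema, column names, and thresholds here are illustrative:

```python
from datetime import datetime, timedelta, timezone

EXPECTED_SCHEMA = {"id": int, "email": str, "loaded_at": datetime}

def check_schema(rows):
    """Schema validation at ingestion: expected columns with expected types."""
    return all(
        set(row) == set(EXPECTED_SCHEMA)
        and all(isinstance(row[col], typ) for col, typ in EXPECTED_SCHEMA.items())
        for row in rows
    )

def check_row_counts(source_count, target_count, tolerance=0.0):
    """Reconciliation: target row count within tolerance of the source."""
    return abs(source_count - target_count) <= tolerance * source_count

def check_freshness(rows, max_age=timedelta(hours=24)):
    """Freshness: the newest row arrived within the allowed window."""
    newest = max(row["loaded_at"] for row in rows)
    return datetime.now(timezone.utc) - newest <= max_age

now = datetime.now(timezone.utc)
rows = [{"id": 1, "email": "a@example.com", "loaded_at": now}]
assert check_schema(rows) and check_row_counts(1, 1) and check_freshness(rows)
```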

Data Governance

Data governance defines the rules for access control, data classification, PII handling, audit trails, and retention. Data engineers build the infrastructure that enforces those rules: RBAC, column-level masking, row-level security, query logging, and deletion pipelines for GDPR compliance.

Data Governance Guide
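One of those mechanisms, column-level masking, can be sketched in a few lines: non-privileged roles see a hash in place of the raw PII value. Real platforms enforce this inside the warehouse via masking policies; the role names and column list below are invented for illustration:

```python
import hashlib

PII_COLUMNS = {"email", "phone"}  # data classification: which columns are sensitive

def mask_row(row, role):
    """Column-level masking: non-privileged roles see hashed PII values."""
    if role == "admin":
        return dict(row)  # privileged role sees the raw row
    return {
        col: hashlib.sha256(str(val).encode()).hexdigest()[:12]
             if col in PII_COLUMNS else val
        for col, val in row.items()
    }

row = {"id": 7, "email": "a@example.com", "country": "DE"}
masked = mask_row(row, role="analyst")
assert masked["id"] == 7 and masked["country"] == "DE"   # non-PII passes through
assert masked["email"] != row["email"]                    # PII is masked
assert mask_row(row, role="admin") == row
```

Hashing rather than redacting keeps the masked column joinable, since the same input always maps to the same token.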

Data Observability

Observability monitors data systems for anomalies, freshness issues, and schema changes without predefined rules. It is the data equivalent of application monitoring. Tools like Monte Carlo, Bigeye, and Elementary track freshness, volume, distribution, and schema stability across your warehouse.

Data Observability Guide

Data Mesh

Data mesh decentralizes data ownership from a central team to domain teams. Each domain publishes data products with quality SLAs. A platform team provides the self-serve infrastructure. Federated governance sets global standards. It works best at large organizations where the central team has become a bottleneck.

Data Mesh Guide

OLAP vs. OLTP

OLTP (Online Transaction Processing) systems handle high-volume, low-latency reads and writes for operational applications. OLAP (Online Analytical Processing) systems handle complex aggregation queries on large datasets for analytics. Data engineers build pipelines that move data from OLTP sources to OLAP warehouses.

OLAP vs OLTP

Idempotency

An idempotent pipeline produces the same output regardless of how many times it runs with the same input. This is critical for reliability: retries, backfills, and recovery should not produce duplicates or corrupt data. Implement idempotency with upserts (MERGE), partition overwriting, and deduplication logic.

Idempotency Guide
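Here's what a key-based upsert looks like with SQLite's `INSERT ... ON CONFLICT` (its equivalent of MERGE, available since SQLite 3.24). Running the same load twice leaves the table unchanged; the table and batch below are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_revenue (day TEXT PRIMARY KEY, amount REAL)")

def load(rows):
    """Upsert keyed on the natural key, so reruns overwrite instead of duplicating."""
    with conn:
        conn.executemany("""
            INSERT INTO daily_revenue (day, amount) VALUES (?, ?)
            ON CONFLICT(day) DO UPDATE SET amount = excluded.amount
        """, rows)

batch = [("2024-01-01", 100.0), ("2024-01-02", 80.0)]
load(batch)
load(batch)  # a retry or backfill: same input, same final state, no duplicates
assert conn.execute("SELECT count(*) FROM daily_revenue").fetchone()[0] == 2
```

A naive `INSERT` would leave four rows after the rerun; the upsert is what makes the retry safe.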

ACID Properties

Atomicity, Consistency, Isolation, Durability. These four properties guarantee reliable transaction processing. Data engineers need to understand ACID for database operations, lakehouse table formats, and pipeline design. Interviewers ask about ACID in the context of data correctness and system reliability.

ACID Properties Guide
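Atomicity is the easiest property to demonstrate in code. In this SQLite sketch, a transfer that violates a constraint rolls back both of its updates together; the account data is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, "
             "balance REAL CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100.0), ("b", 0.0)])

def transfer(amount):
    """Atomicity: both updates commit together, or neither does."""
    try:
        with conn:  # opens a transaction; rolls back if an exception escapes
            conn.execute("UPDATE accounts SET balance = balance - ? "
                         "WHERE name = 'a'", (amount,))
            conn.execute("UPDATE accounts SET balance = balance + ? "
                         "WHERE name = 'b'", (amount,))
    except sqlite3.IntegrityError:
        pass  # CHECK constraint failed: the whole transfer was rolled back

transfer(40.0)   # succeeds: a=60, b=40
transfer(500.0)  # would drive a negative, so neither update survives
assert dict(conn.execute("SELECT * FROM accounts")) == {"a": 60.0, "b": 40.0}
```

Without the transaction, the failed transfer could debit one account without crediting the other, which is exactly the partial state ACID rules out.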

Data Lineage

Lineage tracks where data comes from, how it was transformed, and where it goes. It answers the question: if this column is wrong, what upstream source caused it? Tools like dbt, DataHub, and Marquez provide lineage automatically. Lineage is essential for debugging, compliance, and impact analysis.

Data Lineage Guide

Data Catalog

A data catalog is a centralized inventory of all datasets in an organization. It stores metadata: schemas, descriptions, owners, quality metrics, and usage statistics. Catalogs make data discoverable. Tools include Alation, Atlan, DataHub, and cloud-native options like AWS Glue Catalog.

Data Catalog Guide

Suggested Learning Path

If you are studying for interviews or building your data engineering foundation, work through these concepts in order. Each level builds on the previous one.

Level 1: Foundations

SQL (joins, aggregation, window functions, CTEs). Python fundamentals (data structures, file I/O, APIs). ETL vs ELT. Batch vs streaming. OLAP vs OLTP. ACID properties. These concepts show up in every data engineering interview at every level.
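For a feel of the SQL bar, here is a classic exercise combining a CTE and a window function (top seller per region), runnable against Python's bundled SQLite (3.25+ required for window functions); the sales data is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (rep TEXT, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("ann", "east", 300.0), ("bob", "east", 200.0),
    ("cat", "west", 250.0), ("dan", "west", 400.0),
])

# CTE + window function: rank reps by amount within each region,
# then keep only the top-ranked rep per region.
rows = conn.execute("""
    WITH ranked AS (
        SELECT rep, region,
               RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
        FROM sales
    )
    SELECT rep, region FROM ranked WHERE rnk = 1 ORDER BY region
""").fetchall()
assert rows == [("ann", "east"), ("dan", "west")]
```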

Level 2: Infrastructure

Data modeling (star schema, snowflake schema). Data warehousing (Snowflake, BigQuery, Redshift). Orchestration (Airflow). Data quality (testing, monitoring). Idempotency. These concepts are tested in mid-level interviews and system design rounds.

Level 3: Architecture

Data lakehouse. Data mesh. Data governance. Data observability. Data lineage. Data catalog. System design (end-to-end pipeline design with trade-offs). These concepts are tested in senior interviews and architecture discussions.

Data Engineering Concepts FAQ

What are the most important data engineering concepts to learn?
Start with the foundational patterns: ETL vs ELT (how data moves), batch vs streaming (when data moves), dimensional modeling (how data is structured for analytics), and SQL (how data is queried). Then move to infrastructure concepts: data warehousing, data lakes, orchestration, and data quality. Finally, learn architecture-level concepts: data mesh, data lakehouse, system design, and governance. This progression matches what interviews test at each level: fundamentals for junior roles, infrastructure for mid-level, and architecture for senior.
What is the difference between ETL and ELT?
ETL (Extract, Transform, Load) transforms data before loading it into the target system. ELT (Extract, Load, Transform) loads raw data first and transforms it inside the target system. The shift from ETL to ELT happened because modern cloud warehouses (Snowflake, BigQuery, Redshift) have enough compute power to handle transformation. ELT simplifies the pipeline (raw data is preserved in the warehouse), enables flexible transformation (you can re-transform without re-extracting), and leverages the warehouse engine for heavy computation.
How should I prepare for a data engineering interview?
Cover four areas. SQL: practice window functions, CTEs, joins, and aggregation at interview difficulty. Python: know data structures, file processing, API interaction, and testing. System design: practice designing pipelines, warehouses, and streaming systems with trade-off analysis. Domain knowledge: understand the concepts on this page (ETL/ELT, batch/streaming, data modeling, orchestration, data quality) well enough to discuss them in a conversation, not just define them.
What tools should a data engineer know?
The specific tools matter less than the concepts they implement, but interviewers do ask about tools. For SQL: PostgreSQL, BigQuery, or Snowflake. For orchestration: Apache Airflow (most common in interviews). For processing: Spark (batch), Flink or Kafka Streams (streaming). For transformation: dbt. For storage: S3 or GCS with Parquet or Iceberg. For containerization: Docker. Know at least one tool in each category well enough to discuss its strengths and limitations.

You Know the Words. Now Use Them.

Writing code forces concepts out of the abstract and into your fingertips. Start with one problem today.