Here's the truth no one tells you: you don't need to memorize every concept in data engineering. You need fluency in the ten that show up everywhere and a working vocabulary for the rest. This page gives you both. Start at the top and read downward. The foundational concepts come first because they're what interviewers probe when you stumble on the advanced ones. Get the basics solid and the senior topics get easier.
Source: DataDriven analysis of 1,042 verified data engineering interview rounds.
You'll notice the depth you need shifts with the role level. Juniors get asked for definitions and trade-offs. Seniors get asked how they'd implement the concept with specific tools and failure modes. Read the summary, follow the deep-dive link when the concept feels fuzzy, and come back when you're ready for the next one.
Extract, Transform, Load vs Extract, Load, Transform. ETL transforms data before loading; ELT loads raw data and transforms inside the warehouse. Modern stacks favor ELT because cloud warehouses handle the compute. Understanding when to use each pattern is a standard interview question.
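The difference is easiest to see in code. Below is a minimal ELT sketch using sqlite3 as a stand-in for a cloud warehouse; the table and column names are illustrative, not from any particular stack:

```python
import sqlite3

# ELT: load raw rows as-is; the transform happens inside the warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1250, "complete"), (2, 480, "cancelled"), (3, 990, "complete")],
)

# The transform step runs as SQL in the warehouse, not in the pipeline process.
conn.execute("""
    CREATE TABLE orders AS
    SELECT id, amount_cents / 100.0 AS amount_usd
    FROM raw_orders
    WHERE status = 'complete'
""")
rows = conn.execute("SELECT id, amount_usd FROM orders ORDER BY id").fetchall()
```

In ETL the filtering and unit conversion would run in the pipeline's own compute before loading; in ELT the raw table stays queryable, which makes reprocessing cheap.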
Pipeline Architecture →

Batch pipelines process data at scheduled intervals (hourly, daily). Streaming pipelines process data continuously as it arrives. The choice depends on latency requirements, data volume, and infrastructure complexity. Most production systems use batch with streaming for specific low-latency use cases.
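A toy sketch of the distinction: the same daily-revenue aggregation computed over a complete batch versus updated incrementally per event. The event data is invented for illustration.

```python
# Three events: (day, amount).
events = [("2024-01-01", 10), ("2024-01-01", 5), ("2024-01-02", 7)]

# Batch: process the whole interval's data at once on a schedule.
def batch_totals(events):
    totals = {}
    for day, amount in events:
        totals[day] = totals.get(day, 0) + amount
    return totals

# Streaming: keep running state and update it as each event arrives.
class StreamingTotals:
    def __init__(self):
        self.totals = {}

    def on_event(self, day, amount):
        self.totals[day] = self.totals.get(day, 0) + amount

stream = StreamingTotals()
for day, amount in events:
    stream.on_event(day, amount)
```

Both converge to the same totals; what differs is latency (streaming answers are fresh after every event) and operational complexity (streaming state must survive restarts).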
Pipeline Architecture →

How you structure tables in a warehouse determines query performance and analyst productivity. Star schemas (fact + dimension tables) are the standard for analytics. Snowflake schemas normalize dimensions further. Data vault handles complex, evolving source systems. Interviewers test whether you can design a schema for a given business domain.
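A minimal star-schema sketch, again with sqlite3 standing in for a warehouse and invented tables: one fact table of sales keyed to one product dimension, queried with the join-then-aggregate pattern that star schemas are designed for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Dimension table: descriptive attributes, one row per product.
conn.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT)")
# Fact table: measurable events, keyed to dimensions by ID.
conn.execute("CREATE TABLE fact_sales (product_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO dim_product VALUES (?, ?)", [(1, "books"), (2, "games")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", [(1, 20.0), (1, 5.0), (2, 30.0)])

# Typical analytics query: join fact to dimension, aggregate by an attribute.
result = conn.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product d ON f.product_id = d.product_id
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()
```

A snowflake schema would further split `dim_product` (e.g. category into its own table), trading simpler storage for an extra join in every query.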
Data Modeling Guide →

A data warehouse is a centralized store optimized for analytical queries. Columnar storage, MPP (massively parallel processing), and query optimization make warehouses fast for aggregation and joins. Snowflake, BigQuery, Redshift, and Databricks SQL are the major platforms. Know the differences between them.
Warehouse Design →

Data lakes store raw data in object storage (S3, GCS) at low cost. Lakehouses add ACID transactions and query performance via table formats (Delta Lake, Iceberg, Hudi). The lakehouse architecture is replacing the lake + warehouse two-tier model at many companies.
Lakehouse Architecture →

Orchestration tools schedule, monitor, and manage pipeline dependencies. Apache Airflow is the most common orchestrator in data engineering. Dagster, Prefect, and Mage are newer alternatives. Interviewers ask how you would handle retries, backfills, dependency management, and failure alerting.
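This is not Airflow code, just a minimal plain-Python sketch of one behavior orchestrators provide: re-running a failing task with exponential backoff before surfacing the failure. The `flaky_extract` task is invented for illustration.

```python
import time

def run_with_retries(task, max_retries=3, base_delay=0.01):
    """Re-run a failing task with exponential backoff, as an orchestrator would."""
    for attempt in range(1, max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the failure for alerting
            time.sleep(base_delay * 2 ** (attempt - 1))

# A task that fails twice with a transient error, then succeeds.
calls = {"n": 0}
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient source error")
    return "extracted"

outcome = run_with_retries(flaky_extract)
```

In a real orchestrator this policy is declarative (retry count, delay, alert target set per task), and the scheduler also handles the parts this sketch omits: dependencies between tasks and backfills over past intervals.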
Pipeline Architecture →

Data quality covers six dimensions: accuracy, completeness, consistency, timeliness, uniqueness, and validity. Testing patterns include schema validation at ingestion, row count reconciliation, business rule assertions, and freshness monitoring. Tools like Great Expectations, dbt tests, and Soda automate these checks.
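Two of those patterns reduce to very small checks. A hand-rolled sketch (the rows and expected counts are invented; tools like Great Expectations express the same idea declaratively):

```python
# Ingested rows, as a list of dicts.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
]

# Schema validation at ingestion: every row has exactly the expected columns.
expected_columns = {"id", "email"}
schema_ok = all(set(row) == expected_columns for row in rows)

# Row count reconciliation: the loaded count matches what the source reported.
source_reported_count = 2
counts_match = len(rows) == source_reported_count
```

The value of the tooling is less in the checks themselves than in running them on every load and alerting when one fails.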
Data Quality Guide →

Governance is the technical implementation of access control, data classification, PII handling, audit trails, and retention policies. Data engineers build the infrastructure that enforces governance: RBAC, column-level masking, row-level security, query logging, and deletion pipelines for GDPR compliance.
Data Governance Guide →

Observability monitors data systems for anomalies, freshness issues, and schema changes without predefined rules. It is the data equivalent of application monitoring. Tools like Monte Carlo, Bigeye, and Elementary track freshness, volume, distribution, and schema stability across your warehouse.
Data Observability Guide →

Data mesh decentralizes data ownership from a central team to domain teams. Each domain publishes data products with quality SLAs. A platform team provides the self-serve infrastructure. Federated governance sets global standards. It works best at large organizations where the central team has become a bottleneck.
Data Mesh Guide →

OLTP (Online Transaction Processing) systems handle high-volume, low-latency reads and writes for operational applications. OLAP (Online Analytical Processing) systems handle complex aggregation queries on large datasets for analytics. Data engineers build pipelines that move data from OLTP sources to OLAP warehouses.
OLAP vs OLTP →

An idempotent pipeline produces the same output regardless of how many times it runs with the same input. This is critical for reliability: retries, backfills, and recovery should not produce duplicates or corrupt data. Implement idempotency with upserts (MERGE), partition overwriting, and deduplication logic.
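The upsert approach in miniature, using sqlite3's `INSERT ... ON CONFLICT` (SQLite's equivalent of MERGE; the `users` table is invented). Running the load twice with the same input leaves the table in the same state:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

def load_users(rows):
    # Upsert keyed on the primary key: re-running with the same input
    # updates in place instead of inserting duplicates.
    conn.executemany(
        "INSERT INTO users VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
        rows,
    )

batch = [(1, "ada"), (2, "grace")]
load_users(batch)
load_users(batch)  # a retry or backfill: same input, same final state

count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

A plain `INSERT` would either fail on the second run or, without the key constraint, silently double every row, which is exactly the failure mode idempotency prevents.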
Idempotency Guide →

Atomicity, Consistency, Isolation, Durability. These four properties guarantee reliable transaction processing. Data engineers need to understand ACID for database operations, lakehouse table formats, and pipeline design. Interviewers ask about ACID in the context of data correctness and system reliability.
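Atomicity is the easiest of the four to demonstrate. In this sketch (sqlite3, invented `accounts` table), a transfer fails after the debit but before the credit; because both sides run in one transaction, the debit rolls back and no money vanishes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])
conn.commit()

try:
    with conn:  # opens a transaction; rolls back if the block raises
        conn.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
        # Simulate a failure before the matching credit is written.
        raise RuntimeError("crash mid-transfer")
except RuntimeError:
    pass

# The debit was rolled back: balances are unchanged.
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
```

Without the transaction, the crash would leave account 1 debited with nothing credited anywhere, which is the partial-write corruption ACID exists to rule out.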
ACID Properties Guide →

Lineage tracks where data comes from, how it was transformed, and where it goes. It answers the question: if this column is wrong, what upstream source caused it? Tools like dbt, DataHub, and Marquez provide lineage automatically. Lineage is essential for debugging, compliance, and impact analysis.
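Under the hood, lineage is a graph walk. A toy sketch with an invented column-dependency graph: given a broken report column, find every upstream column that feeds it.

```python
# Toy lineage graph: each column maps to the upstream columns it is built from.
lineage = {
    "report.revenue": ["orders.amount", "fx.rate"],
    "orders.amount": ["raw_orders.amount_cents"],
    "fx.rate": [],
    "raw_orders.amount_cents": [],
}

def upstream_sources(column):
    """Walk the graph to collect every column upstream of `column`."""
    seen = set()
    stack = [column]
    while stack:
        current = stack.pop()
        for parent in lineage.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

sources = upstream_sources("report.revenue")
```

Real tools build this graph automatically by parsing SQL or reading query logs; impact analysis is the same walk run in the opposite direction (find everything downstream of a changed column).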
Data Lineage Guide →

A data catalog is a centralized inventory of all datasets in an organization. It stores metadata: schemas, descriptions, owners, quality metrics, and usage statistics. Catalogs make data discoverable. Tools include Alation, Atlan, DataHub, and cloud-native options like AWS Glue Catalog.
Data Catalog Guide →If you are studying for interviews or building your data engineering foundation, work through these concepts in order. Each level builds on the previous one.
SQL (joins, aggregation, window functions, CTEs). Python fundamentals (data structures, file I/O, APIs). ETL vs ELT. Batch vs streaming. OLAP vs OLTP. ACID properties. These concepts show up in every data engineering interview at every level.
Data modeling (star schema, snowflake schema). Data warehousing (Snowflake, BigQuery, Redshift). Orchestration (Airflow). Data quality (testing, monitoring). Idempotency. These concepts are tested in mid-level interviews and system design rounds.
Data lakehouse. Data mesh. Data governance. Data observability. Data lineage. Data catalog. System design (end-to-end pipeline design with trade-offs). These concepts are tested in senior interviews and architecture discussions.
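The SQL fundamentals at the foundational level can be practiced anywhere SQLite runs. A minimal example combining a CTE with a window function (the `sales` table and values are illustrative): rank sales within each region by amount.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10), ("east", 30), ("west", 20)],
)

# CTE plus a window function: rank rows within each region by amount.
ranked = conn.execute("""
    WITH s AS (SELECT region, amount FROM sales)
    SELECT region, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM s
    ORDER BY region, rnk
""").fetchall()
```

Unlike a `GROUP BY`, the window function keeps every row and adds the per-partition ranking alongside it, which is why these show up constantly in interview SQL rounds.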
Writing code forces concepts out of the abstract and into your fingertips. Start with one problem today.