Ingestion Patterns: Intermediate
Incremental loads, change capture, and idempotent consumption decide whether ingestion scales
- Category: Pipeline Architecture
- Difficulty: intermediate
- Duration: 32 minutes
- Challenges: 0 hands-on challenges
Topics covered: Full, Incremental, and Bookmarks; CDC: The Three Families; Log-Based CDC: Mechanics and Costs; Idempotent At-Least-Once Ingest; Three Ways to Ingest from Postgres
Lesson Sections
- Full, Incremental, and Bookmarks (concepts: paFullVsIncremental, paBookmarkPattern)
Pull ingestion lives on a spectrum. At one end, every run reads the entire source table. At the other end, every run reads only the rows that changed since the last successful run. The second pattern requires a bookmark of some kind, persisted between runs, that defines what 'since the last run' means. Picking the right point on the spectrum and choosing the right bookmark is the difference between ingestion that scales and ingestion that turns the source database into the bottleneck.
When Each
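The bookmark pattern described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it uses SQLite in place of a real source, and the `orders` table, its `updated_at` column, and the `bookmarks` state table are all hypothetical names chosen for the example. It also assumes the source reliably maintains `updated_at` on every write, and it does not detect deleted rows.

```python
import sqlite3


def setup():
    """Create a stand-in source table and a separate bookmark store (both hypothetical)."""
    source = sqlite3.connect(":memory:")
    source.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, updated_at TEXT NOT NULL, status TEXT)"
    )
    state = sqlite3.connect(":memory:")
    state.execute(
        "CREATE TABLE bookmarks (table_name TEXT PRIMARY KEY, updated_at TEXT, id INTEGER)"
    )
    return source, state


def incremental_pull(source, state, batch_size=1000):
    """Read only rows changed since the stored bookmark, then advance the bookmark."""
    row = state.execute(
        "SELECT updated_at, id FROM bookmarks WHERE table_name = 'orders'"
    ).fetchone()
    # With no bookmark yet, the first run behaves like a full load.
    last_ts, last_id = row if row else ("", 0)

    # (updated_at, id) is the bookmark; id breaks ties between equal timestamps.
    rows = source.execute(
        "SELECT id, updated_at, status FROM orders "
        "WHERE updated_at > ? OR (updated_at = ? AND id > ?) "
        "ORDER BY updated_at, id LIMIT ?",
        (last_ts, last_ts, last_id, batch_size),
    ).fetchall()

    if rows:
        # A real job would write `rows` to the destination here and advance
        # the bookmark only after that write has succeeded.
        new_id, new_ts = rows[-1][0], rows[-1][1]
        state.execute(
            "INSERT OR REPLACE INTO bookmarks VALUES ('orders', ?, ?)",
            (new_ts, new_id),
        )
        state.commit()
    return rows
```

Because the bookmark moves only after a successful batch, a crashed run re-reads the same rows on the next attempt rather than skipping them, which is why the consumer side needs to tolerate repeats.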
- CDC: The Three Families (concepts: paCdc)
Change Data Capture, abbreviated CDC, is the discipline of capturing every insert, update, and delete that lands in an operational database, in order, and surfacing them for downstream consumers. CDC is what turns a transactional database into a streaming source without the application code knowing anything has changed. Three families of CDC dominate. They differ in where the capture happens and what it costs.
Family Comparison
Trigger-Based CDC
Trigger-based CDC is the oldest pattern. The DBA i
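As an illustration of capturing inserts, updates, and deletes in order without touching application code, here is a small trigger-based sketch. It uses SQLite so that it runs anywhere; the `customers` table, the `customers_changes` change table, and the single-letter operation codes are hypothetical choices for this example, not something the lesson prescribes.

```python
import sqlite3


def make_cdc_db():
    """Create a table plus triggers that record every change in commit order."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);

        -- Change table: seq gives consumers a total order to read from.
        CREATE TABLE customers_changes (
            seq   INTEGER PRIMARY KEY AUTOINCREMENT,
            op    TEXT NOT NULL,      -- 'I', 'U', or 'D'
            id    INTEGER NOT NULL,
            email TEXT
        );

        CREATE TRIGGER customers_ai AFTER INSERT ON customers BEGIN
            INSERT INTO customers_changes (op, id, email) VALUES ('I', NEW.id, NEW.email);
        END;
        CREATE TRIGGER customers_au AFTER UPDATE ON customers BEGIN
            INSERT INTO customers_changes (op, id, email) VALUES ('U', NEW.id, NEW.email);
        END;
        CREATE TRIGGER customers_ad AFTER DELETE ON customers BEGIN
            INSERT INTO customers_changes (op, id, email) VALUES ('D', OLD.id, NULL);
        END;
        """
    )
    return conn


def read_changes(conn, after_seq=0):
    """Return change rows with seq greater than after_seq, oldest first."""
    return conn.execute(
        "SELECT seq, op, id, email FROM customers_changes WHERE seq > ? ORDER BY seq",
        (after_seq,),
    ).fetchall()
```

The application only ever writes to `customers`; the triggers do the capture. The cost is visible in the sketch too: every source write becomes two writes inside the same transaction.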
- Log-Based CDC: Mechanics and Costs (concepts: paCdc, paSchemaEvolution)
Log-based CDC sounds free until the operational profile arrives. The mechanism is direct: the database is already writing every change to its log for crash recovery; tap the log, decode it, ship it downstream. The reality has costs. Replication slots can fill the disk. Schema changes upstream become Kafka topic problems. The CDC connector becomes a critical piece of infrastructure that must be operated like a database.
Debezium and AWS DMS at a Glance
The Replication Slot Problem
Postgres uses l
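One way to keep an eye on the "replication slots can fill the disk" risk is to measure how much log a slot is forcing the server to retain. The sketch below is a rough illustration under stated assumptions: it takes two Postgres log sequence numbers as strings in the usual `hi/lo` hexadecimal form (for example, the values returned by `pg_current_wal_lsn()` and the `restart_lsn` column of `pg_replication_slots`) and computes the byte distance between them. The function names and the 10 GiB threshold are hypothetical; fetching the values from a live server is left out.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a Postgres LSN such as '0/16B3748' to a 64-bit byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def retained_wal_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of log between the slot's restart point and the current write position."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)


def slot_at_risk(current_lsn: str, restart_lsn: str,
                 threshold_bytes: int = 10 * 1024**3) -> bool:
    """True when the slot holds back more log than the (assumed) alert threshold."""
    return retained_wal_bytes(current_lsn, restart_lsn) > threshold_bytes
```

Inside the database the same distance is available from `pg_wal_lsn_diff`; doing the arithmetic client-side is only useful when the two values arrive from a metrics pipeline as strings.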
- Idempotent At-Least-Once Ingest (concepts: paIdempotency, paDeduplication)
The general idempotent-write playbook is the subject of Lesson 5 (partition overwrite, MERGE on a business key, DELETE-then-INSERT). This section narrows the playbook to the ingestion seam, where at-least-once delivery from the source forces deduplication on a message key. Most ingestion systems offer at-least-once delivery. Kafka consumers reprocess after rebalancing. Webhook senders retry until they get a 2xx. SFTP partners re-upload the same file when their cron job retries. Pull jobs that cr
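Deduplication on a message key at the ingestion seam can be sketched as follows. This is a minimal example under assumptions of its own: SQLite stands in for the landing store, and the `seen_messages` and `orders_raw` tables, along with the shape of the message key, are hypothetical. The idea it illustrates is recording the key and writing the payload in the same transaction, so a redelivered message becomes a no-op.

```python
import sqlite3


def make_sink():
    """Create a landing table and a ledger of message keys already ingested."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE seen_messages (message_key TEXT PRIMARY KEY);
        CREATE TABLE orders_raw (order_id INTEGER, amount_cents INTEGER);
        """
    )
    return conn


def ingest(conn, message_key, order_id, amount_cents):
    """Write the payload once per message_key; return False for a duplicate delivery."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "INSERT OR IGNORE INTO seen_messages (message_key) VALUES (?)",
            (message_key,),
        )
        if cur.rowcount == 0:
            # Key already recorded: this is a retry or replay, so skip the write.
            return False
        conn.execute(
            "INSERT INTO orders_raw (order_id, amount_cents) VALUES (?, ?)",
            (order_id, amount_cents),
        )
        return True
```

Keeping the key ledger and the payload write in one transaction is the design choice that matters here: if they were committed separately, a crash between the two could either lose the row or let a duplicate through.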
- Three Ways to Ingest from Postgres (concepts: paFullVsIncremental, paCdc, paIdempotency)
The vocabulary becomes useful when it is applied to a single source seen three ways. Take a Postgres operational database with two tables of interest: customers (slowly changing dimension, ~2M rows) and orders (event-shaped, ~500M rows growing at ~10M per day). The downstream destination is Snowflake. The product team wants the customers table fresh within an hour and the orders table fresh within five minutes. Three legitimate strategies exist. Each is correct for some scale and some operational context. Stra
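A quick back-of-the-envelope calculation shows why the two tables pull toward different points on the spectrum. The sketch below uses only the figures quoted above and adds two simplifying assumptions of its own: new orders arrive at a uniform rate through the day, and a full reload reads every row once per run. The function names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def full_reload_rows_per_day(table_rows: int, freshness_minutes: int) -> int:
    """Rows read per day if every run re-reads the whole table."""
    runs_per_day = MINUTES_PER_DAY // freshness_minutes
    return table_rows * runs_per_day


def incremental_rows_per_run(new_rows_per_day: int, freshness_minutes: int) -> float:
    """Rows read per run if each run reads only what arrived since the last one."""
    return new_rows_per_day * freshness_minutes / MINUTES_PER_DAY


# customers: ~2M rows, fresh within an hour
customers_full = full_reload_rows_per_day(2_000_000, 60)

# orders: ~500M rows, ~10M new per day, fresh within five minutes
orders_full = full_reload_rows_per_day(500_000_000, 5)
orders_incremental = incremental_rows_per_run(10_000_000, 5)

print(f"customers, hourly full reload: {customers_full:,} rows/day")
print(f"orders, 5-minute full reload:  {orders_full:,} rows/day")
print(f"orders, 5-minute incremental:  {orders_incremental:,.0f} rows/run")
```

Under these assumptions, an hourly full reload of customers reads 48 million rows a day, while a five-minute full reload of orders would read 144 billion rows a day against roughly 35 thousand rows per run for an incremental read.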