A logistics company at 600 engineers had thirty pipelines pulling incrementally from the same Postgres operational database. Each ran every fifteen minutes against updated_at. The replica's IOPS doubled in a quarter. A platform engineer profiled the workload and found that thirty incremental queries scanning the same indexes thirty times every fifteen minutes was producing twelve hundred index scans per hour, half of which returned zero new rows. The fix was not faster ingestion. The fix was Debezium reading the Postgres write-ahead log once, publishing changes to a Kafka topic, and the thirty pipelines consuming from Kafka instead of querying the database. The replica's IOPS dropped 80 percent that week. The story is a fair summary of why the intermediate tier of ingestion exists. Pull ingestion stops scaling at some inflection point and the right answer is to invert the model.
Full, Incremental, and Bookmarks
Daily Life
Interviews
Choose between full and incremental loads, design a bookmark store, and write the bounded query that drives incremental pulls.
Pull ingestion lives on a spectrum. At one end, every run reads the entire source table. At the other end, every run reads only the rows that changed since the last successful run. The second pattern requires a bookmark of some kind, persisted between runs, that defines what 'since the last run' means. Picking the right point on the spectrum and choosing the right bookmark is the difference between ingestion that scales and ingestion that turns the source database into the bottleneck.
When Each Strategy Fits
Full loadIncremental by timestampIncremental by sequence idCDC stream
Full load
Reread the entire source
Truncate-and-replace. Best for tables under ~1M rows; reference and dimension tables.
Incremental by timestamp
WHERE updated_at > X
Cost scales with change rate. Requires a reliable updated_at; sensitive to clock skew.
Incremental by sequence id
WHERE id > last_id
Cleanest for append-only tables. Immune to clock skew. Cannot detect updates.
CDC stream
Read the WAL
Captures every insert, update, and delete. Strongest at scale. Operational cost is real.
The Bookmark Pattern
A bookmark is a small piece of state that records the position the last successful ingestion run reached. The bookmark is persisted in a metadata table the pipeline owns, not in the source. On every run, the pipeline reads the bookmark, queries the source for rows after that position, processes them, and on success advances the bookmark to the new maximum. The bookmark is the entire mechanism that separates incremental ingestion from full reload.
1
'CREATE TABLE ingestion_bookmarks (
source_name TEXT PRIMARY KEY,
bookmark_value TEXT NOT NULL,
bookmark_kind TEXT NOT NULL,
last_run_at TIMESTAMP NOT NULL,
last_row_count BIGINT,
last_status TEXT NOT NULL
)'
2
/* The bookmark store is a tiny metadata table the pipeline owns. */
3
/* ISO timestamp or sequence id, as text */
4
/* 'timestamp' | 'sequence' | 'change_version' */
Three columns define the contract: the source name (so the table holds bookmarks for many sources), the bookmark value (a string, kept as text so timestamps and integers share a column), and the bookmark kind (so the pipeline knows how to compare). The remaining columns are operational metadata that pay for themselves the first time someone needs to debug a stuck pipeline.
The Boundaries That Make Bookmarks Correct
Three boundaries every bookmark-driven pull must enforce:
▸Lower bound: WHERE bookmark_column >= :previous_value (inclusive on the left to avoid losing the boundary row)
▸Upper bound: AND bookmark_column < :run_started_at (exclusive, fixed at run launch, never wall clock mid-query)
▸Order: ORDER BY bookmark_column so the new bookmark is unambiguously the maximum value seen
▸Atomicity: advance the bookmark only after the rows are durably written downstream
▸Recoverability: keep the previous bookmark value so a failed run does not advance and the next run reprocesses
Choosing the Bookmark Column
Not every column is bookmark-quality. The column must be monotonic for the rows the pipeline cares about: an updated_at that some triggers fail to update is not safe. The column must be indexed: a sequential scan of a billion-row table on every run is not viable. The column must be source-controlled: a trigger-fired updated_at on the source is fine, an updated_at written by application code in some paths but not others is a trap. The audit takes thirty minutes per source and prevents months of silent data loss.
•Bookmark on updated_at
Captures inserts and updates
Subject to clock skew and trigger gaps
Universal: most tables already have it
Resolution issue: integer-second timestamps lose rows updated within the same second
•Bookmark on sequence id
Captures only inserts (or any monotonic position)
Immune to clock skew because it is not time-based
Cleanest for append-only event tables
Cannot detect updates: if a row is mutated, its id does not change
A Bookmark-Driven Pull in SQL
1
/* The pipeline runs this query, bound to the persisted bookmark value. */
2
SELECT
3
order_id,
4
customer_id,
5
total_cents,
6
status,
7
updated_at
8
FROMorders
9
WHEREupdated_at>='2026-04-25 01:00:00'
10
ANDupdated_at<'2026-04-25 01:15:00'
11
ORDERBYupdated_at,order_id
12
LIMIT5
Result
order_id
customer_id
total_cents
status
updated_at
1001
C-42
12500
paid
2026-04-25 01:01:14
1002
C-87
9900
paid
2026-04-25 01:03:55
1003
C-12
45000
refunded
2026-04-25 01:08:22
1004
C-55
780
paid
2026-04-25 01:11:09
1005
C-91
6700
pending
2026-04-25 01:14:48
The lower bound is the previous bookmark value. The upper bound is the run start time, fixed once at launch. The compound order (updated_at, order_id) guarantees stable pagination even when many rows share the same timestamp. After the rows write to the raw zone, the bookmark advances to the run start time. The next run picks up from there.
The bookmark is a single piece of state that turns full reload into incremental.
Advance the bookmark only after the downstream write is durable.
Audit the bookmark column for monotonicity, indexing, and source-control before relying on it.
TIP
Keep the previous bookmark value alongside the current one. Recovering from a buggy advance becomes a one-row UPDATE instead of a postmortem.
CDC: The Three Families
Daily Life
Interviews
Distinguish trigger-based, query-based, and log-based CDC and choose the right family for a given source.
Change Data Capture, abbreviated CDC, is the discipline of capturing every insert, update, and delete that lands in an operational database, in order, and surfacing them for downstream consumers. CDC is what turns a transactional database into a streaming source without the application code knowing anything has changed. Three families of CDC dominate. They differ in where the capture happens and what it costs.
Family Comparison
Family
How It Captures
Captures Deletes?
Operational Cost
Trigger-based
Database triggers fire on INSERT/UPDATE/DELETE and write to an audit table
Yes
High write amplification on the source; trigger maintenance burden
Query-based
Periodic SELECT WHERE updated_at > X (the bookmark pattern from section 0)
Only with explicit tombstone columns
Adds repeated scan load to the source; misses hard deletes
Log-based
Reads the database write-ahead log directly (WAL, binlog, oplog)
Yes (deletes appear in the log)
Adds a replication slot; minimal CPU at the source
Trigger-Based CDC
Trigger-based CDC is the oldest pattern. The DBA installs database triggers on every table of interest. Each trigger fires on INSERT, UPDATE, or DELETE and writes a row to a side table that records what happened. A pipeline polls the side table, drains it, and writes to downstream. The pattern works on any database that supports triggers, captures deletes naturally, and is straightforward to reason about.
Every business write to orders now triggers an additional write to orders_cdc. That doubling of write volume is the cost. On a high-throughput OLTP table the doubling can be the difference between fitting on the existing instance and needing a larger one. The trigger code is also code: bugs in the trigger become bugs in the source-of-truth write path, which DBAs hate for legitimate reasons.
When trigger-based CDC is the right choice:
▸The source database does not expose a usable replication log (older versions, hosted services without WAL access)
▸Write volume is low enough that doubling it is acceptable
▸Audit requirements demand a trigger-managed history table for compliance reasons
▸The team has DBA capacity to maintain triggers as schemas evolve
Query-Based CDC
Query-based CDC is the bookmark pattern from section 0, used as a CDC mechanism. The pipeline polls the source on a cadence and pulls rows whose updated_at advanced since the last poll. It is the simplest pattern and the most common in off-the-shelf connectors (Airbyte, Fivetran, Stitch). It captures inserts and updates well and captures deletes only if the source maintains a deleted_at column. Hard deletes (rows physically removed) are invisible to query-based CDC.
✓Query-Based Strengths
Works on any database that supports SELECT
Easy to set up; no DBA cooperation required
Bounded source-side cost (just queries)
Off-the-shelf connectors handle it
•Query-Based Weaknesses
Misses hard deletes entirely
Sensitive to clock skew and trigger-skipped updates
Polling cadence is the floor on freshness
Adds repeated scan load proportional to poll frequency
Log-Based CDC: The Strongest Family
Log-based CDC reads the database's write-ahead log directly. Postgres calls it the WAL. MySQL calls it the binlog. MongoDB calls it the oplog. The log records every committed change in commit order, including deletes. A log-based CDC tool subscribes to the log as a logical replica and emits each change as an event downstream. The pattern captures every change with no work on the application or trigger side, captures deletes naturally, and adds minimal CPU cost to the source.
The Debezium connector reads the WAL through a Postgres logical replication slot, decodes each transaction into row-level events, and publishes them to a Kafka topic per source table. Each event carries an op field ('c' for create, 'u' for update, 'd' for delete), an optional before image, and an after image. Downstream consumers see every change, in order, with deletes preserved. Log-based CDC is the only CDC family that captures deletes correctly without source-side cooperation, and that single property is enough to make it the default at scale.
Trigger-basedQuery-basedLog-based
Trigger-based
Triggers write to side tables
Captures deletes. Doubles write volume on the source. Trigger code becomes part of the write path.
Query-based
Polling on a watermark
Simplest. Works on any source. Misses hard deletes. Cadence floors freshness.
Log-based
Read the WAL or binlog
Captures every change including deletes. Sub-second latency. Requires Kafka, registry, and slot lag monitoring.
Trigger-based CDC trades source-side write amplification for any-database compatibility.
Query-based CDC misses hard deletes; that gap is permanent without an application-level tombstone.
Log-based CDC is the strongest family because the log is already the source of truth for the database.
TIP
Audit any in-house CDC implementation for delete handling. The most common silent data quality failure in operational mirrors is ignored hard deletes.
Log-Based CDC: Mechanics and Costs
Daily Life
Interviews
Operate a log-based CDC pipeline including initial snapshot, replication slot lag, and schema evolution.
Log-based CDC sounds free until the operational profile arrives. The mechanism is direct: the database is already writing every change to its log for crash recovery; tap the log, decode it, ship it downstream. The reality has costs. Replication slots can fill the disk. Schema changes upstream become Kafka topic problems. The CDC connector becomes a critical piece of infrastructure that must be operated like a database.
Debezium and AWS DMS at a Glance
Tool
How It Runs
Output
Where It Fits
Debezium
Kafka Connect plugin (or embedded library)
Per-table Kafka topic of CDC events
Companies running Kafka; need fine-grained control
AWS DMS
Managed AWS service
Streams to Kinesis, Kafka MSK, S3, or warehouse
AWS-heavy stacks; want managed-service tradeoff
GCP Datastream
Managed GCP service
Streams to BigQuery, Pub/Sub, or GCS
GCP-heavy stacks; BigQuery as primary destination
Azure Event Hubs CDC
Managed Azure service tied to SQL or Cosmos
Streams to Event Hubs
Azure SQL or Cosmos as the source
The Replication Slot Problem
Postgres uses logical replication slots to track which WAL position each consumer has reached. The slot guarantees the WAL is retained until the consumer confirms it has been read. That guarantee is also a liability. If the consumer falls behind or stops entirely, the WAL grows without bound on the source. A stuck Debezium connector has filled production database disks more than once at every company that runs CDC at scale. Operating CDC means monitoring slot lag the way databases monitor replication lag.
What can go wrong with a logical replication slot:
▸Connector restarts and never reattaches; slot grows the WAL until the disk fills
▸Schema change upstream that the connector cannot decode; events accumulate while the slot waits
▸Network partition between connector and database; slot pauses, WAL grows
▸Connector consumes slowly under load; slot lag climbs and replication lag climbs with it
▸Multiple slots on the same database compounding their retention requirements
Schema Evolution in CDC Pipelines
Source-side DDL is a CDC event too. When an upstream team adds a column to orders, Debezium notices, emits a schema change event, and the schema registry receives a new version. Downstream consumers reading from Kafka either tolerate the new field (forward compatibility) or break. The expand-contract pattern from the schema evolution lesson applies directly: additive changes flow through; destructive changes (drop, rename, type narrowing) require coordination across the producer, the connector, the registry, and every consumer.
✓Additive Schema Change
Add a new optional column
Connector emits new schema version
Existing consumers ignore the field; new consumers read it
No coordination required beyond announcement
•Breaking Schema Change
Drop a column, rename a column, narrow a type
Connector emits an incompatible schema version
Consumers fail to deserialize until updated
Requires producer-connector-consumer coordination
An Initial Snapshot Plus Streaming Tail
A new CDC subscription must reconcile two pieces of state. The history (rows that exist before the subscription started) and the tail (rows changing after the subscription started). Debezium's standard pattern is the snapshot. On startup, the connector locks the source briefly, records the current WAL position, snapshots all rows of every captured table, and then begins streaming from the recorded position. The two phases stitch together to produce a complete history with no gap and no overlap.
Log-based CDC is the strongest CDC family because it taps the database's existing source of truth.
Replication slots fill source disks when consumers stall; monitor slot lag.
Initial snapshot plus streaming tail is the standard onboarding pattern.
TIP
Treat the CDC connector as a piece of database infrastructure with its own SLO, not as another pipeline. The blast radius of a broken connector is the source database itself.
The general idempotent-write playbook is the subject of Lesson 5 (partition overwrite, MERGE on a business key, DELETE-then-INSERT). This section narrows the playbook to the ingestion seam, where at-least-once delivery from the source forces deduplication on a message key.
Most ingestion systems offer at-least-once delivery. Kafka consumers reprocess after rebalancing. Webhook senders retry until they get a 2xx. SFTP partners re-upload the same file when their cron job retries. Pull jobs that crash mid-write rerun the same window. The world the ingestion layer lives in is not exactly-once; it is at-least-once with deduplication required at the sink. Pipelines that pretend otherwise produce duplicate rows that take weeks to find.
Why At-Least-Once Is the Default
Scenario
Why Duplicates Appear
Frequency
Kafka consumer rebalance
Offset committed before write; rebalance reassigns the partition
Some rows written, bookmark not yet advanced; next run repeats the window
Whenever the underlying compute crashes
File drop re-upload
Partner cron retried after a network blip; same filename, same contents
Weekly to monthly per partner
CDC initial snapshot replay
Connector restarted before snapshot completed
On every connector restart during onboarding
The Dedupe Key
An idempotent ingestion sink deduplicates on a stable key extracted from the message itself. The key is whatever uniquely identifies the logical event, regardless of how many times the message is delivered. For Kafka events, the key is often a UUID generated by the producer and included in the event payload. For webhook events, vendors usually include an event id (Stripe's evt_..., Shopify's webhook_id). For file drops, the row's natural key plus the source file's identity is sufficient. For pull ingestion, the source primary key plus the bookmark window position works.
The unique constraint on event_id makes the table self-deduplicating. Two attempts to insert the same event_id produce one row. The pipeline does not have to know whether this is the first attempt or the fifth. The semantics shift from at-least-once delivery plus error-on-duplicate to at-least-once delivery plus silent dedupe. That shift is what makes the pipeline tolerate any number of retries.
Dedupe by Window vs Dedupe by Key
•Dedupe by Time Window
Group events by hour or day; within the window, GROUP BY a natural key
Loses correctness if duplicates span windows
Cheaper to implement when no global id exists
Acceptable for analytics aggregates, not for transactional facts
✓Dedupe by Stable Key
Every event has a unique id; sink enforces uniqueness on it
Correct under any retry pattern, any window
Requires the producer to generate the key (or extract it from upstream)
Standard for transactional events and CDC
Idempotent Consumption in Code
1
# Simulating at-least-once delivery and dedupe-by-key on the consumer side.
2
incoming=[
3
{"event_id":"e1","amount":100},
4
{"event_id":"e2","amount":250},
5
{"event_id":"e1","amount":100},# retry of e1
6
{"event_id":"e3","amount":75},
7
{"event_id":"e2","amount":250},# retry of e2
8
]
9
10
seen=set()
11
stored=[]
12
foreventinincoming:
13
ifevent["event_id"]inseen:
14
continue
15
seen.add(event["event_id"])
16
stored.append(event)
17
18
print(f"Received {len(incoming)} events")
19
print(f"Stored {len(stored)} unique events")
20
print(f"Total amount: {sum(e['amount'] for e in stored)}")
>>>Output
Received 5 events
Stored 3 unique events
Total amount: 425
Five events arrive, three unique events get stored, and the total amount reflects only the unique events. The dedupe set is in memory in the example; in a real ingestion sink the dedupe set is the unique constraint on the destination table, the keyed compaction in Kafka, or a separate dedupe table that the consumer consults before writing. The shape of the logic does not change.
Where Dedupe Belongs
Three valid places to deduplicate, in order of preference:
▸At the sink, via a unique constraint or MERGE on the dedupe key (cheapest, most reliable)
▸In the streaming engine, via keyed state and exactly-once sinks (Flink, Spark Structured Streaming)
▸Downstream during transformation, via ROW_NUMBER OVER PARTITION BY key (works but is expensive)
Pushing dedupe to the sink is the cheapest because the sink already needs an index on the key for query performance, and the unique constraint piggybacks on that index. Doing it in the streaming engine is correct but binds the pipeline to a specific engine. Doing it downstream is what every legacy pipeline ends up doing because nobody put the dedupe key in place at ingest time. The pattern is so common that 'add a dedupe step downstream' is a tax most analytics pipelines pay forever. Dedupe is not optional in at-least-once systems; the choice is where it happens, not whether it happens.
✓Do
Require every ingestion source to expose a stable event id
Enforce uniqueness on the dedupe key at the raw zone, not in transforms
Generate ids at the producer when sources do not provide them (UUID v4 plus content hash)
Persist the dedupe key alongside the payload so downstream can reuse it
✗Don't
Trust delivery semantics to be exactly-once at any scale
Build dedupe-by-window logic when a stable key is available
Use natural keys (email, phone) as dedupe keys when they can be edited upstream
Forget that file drops can be re-uploaded and need their own dedupe key per row
At-least-once is the default; deduplication at the sink is the answer.
A stable event id from the producer is the cheapest dedupe key available.
Dedupe pushed downstream costs more than dedupe at ingest, every time.
TIP
If the source does not provide an event id, generate one at the ingestion boundary using a content hash. The id is then deterministic across retries of the same logical event.
Three Ways to Ingest from Postgres
Daily Life
Interviews
Compare three Postgres ingestion strategies and pick the right one based on scale, deletes, latency, and team capacity.
Vocabulary becomes useful applied to a single source seen three ways. Take a Postgres operational database with two tables of interest: customers (slowly changing dimension, ~2M rows) and orders (event-shaped, ~500M rows growing at ~10M per day). The downstream destination is Snowflake. The product team wants the customers table fresh within an hour and the orders table fresh within five minutes. Three legitimate strategies exist. Each is correct for some scale and some operational context.
Strategy 1: Periodic Full Load + Incremental Pull
The first strategy treats the two tables differently. Customers is small enough that a nightly full load (truncate-and-replace) is acceptable; the table reloads every 24 hours and the freshness SLO is met. Orders is too large for full load, so incremental pull every five minutes uses the bookmark pattern with updated_at as the watermark. Both jobs run in Airflow. State lives in a metadata table.
Aspect
Customers (full load)
Orders (incremental pull)
Cadence
Once per day
Every 5 minutes
Source-side cost
One full table scan per night
One bookmark-bounded scan per 5 min, ~288 scans/day
Captures deletes?
Yes (full reload)
Only soft deletes via deleted_at
Failure recovery
Just rerun; idempotent by truncate
Bookmark unchanged on failure; next run reprocesses window
Strategy 2: Trigger-Based CDC
The second strategy installs triggers on both tables that write to side tables capturing every insert, update, and delete. A small worker drains the side tables every minute and writes the events to Snowflake. The pattern captures deletes naturally and provides a complete change history. The cost is doubled write volume on every business write to the source database, plus the operational burden of maintaining triggers as schemas evolve.
1
/* The drain worker pulls from the trigger-populated side table. */
2
/* Returns every change captured since the last drain. */
3
SELECT
4
op,
5
order_id,
6
captured_at
7
FROMorders_cdc
8
WHEREcaptured_at>='2026-04-25 02:00:00'
9
ORDERBYcaptured_at
10
LIMIT5
Result
op
order_id
captured_at
I
1001
2026-04-25 02:00:14
U
1001
2026-04-25 02:01:09
I
1002
2026-04-25 02:01:33
D
0998
2026-04-25 02:02:48
U
1002
2026-04-25 02:03:12
Strategy 2 captures deletes correctly. It also doubles the write volume on a 500M row table that already grows by 10M rows a day. The doubling is the deal-breaker for any high-throughput table. Strategy 2 is the right answer when the source database does not expose its WAL (older managed services), when audit requirements demand a trigger-managed history table for compliance reasons, or when the volume is low enough that doubling it is acceptable.
Strategy 3: Log-Based CDC with Debezium
The third strategy installs Debezium against the Postgres logical replication slot. Debezium reads the WAL, decodes inserts, updates, and deletes, and publishes them to Kafka topics (one per table). A Snowflake sink connector consumes from Kafka and writes to Snowflake using MERGE on the primary key. End-to-end latency is on the order of seconds. Source-side cost is minimal: the WAL is being written anyway. Captures deletes naturally.
Hits when Kafka or schema registry becomes a bottleneck
Right at scale
Up to ~10s of GB and minutes of latency
Up to medium write volume; legacy compatibility
Hundreds of GB to many TB; sub-minute latency
•Pick Strategy 1 when
Source is small to medium (under 100GB)
Hard deletes are rare or unimportant
Latency target is minutes, not seconds
Team has no Kafka or DBA capacity
✓Pick Strategy 3 when
Source is large (TBs) or growing fast
Deletes must be captured correctly
Latency target is seconds
Multiple downstream consumers will subscribe to the same change feed
Most companies start at Strategy 1 because the operational cost is the lowest and the scale rarely justifies more upfront. Companies that grow into Strategy 3 do so because the source-side load of repeated polling becomes the operational bottleneck, or because three or more downstream consumers want the same change feed and producing it once via Debezium is cheaper than producing it three times via JDBC. Strategy 2 is rarely the long-term answer; it is the bridge for sources that cannot expose their WAL.
Three legitimate strategies cover almost every Postgres ingestion case.
Strategy 1 wins at small scale because the operational cost is lowest.
Strategy 3 wins at scale because the source-side cost is constant per change, not per consumer.
TIP
Picking the right ingestion strategy is rarely picking the highest-throughput one. It is picking the one whose operational cost matches the team's capacity to operate it.
❯❯❯PUTTING IT ALL TOGETHER
> A growing fintech operates a Postgres operational database with three high-volume tables: accounts (10M rows), transactions (2B rows, 50M added daily), and ledger_entries (5B rows, 100M added daily). Three downstream consumers exist: a fraud detection model wanting sub-minute freshness, a finance reporting warehouse wanting hourly correctness with full delete capture, and a customer-facing API wanting eventual consistency for cached balances. The data engineering team has two engineers and no Kafka experience yet. The CTO asks: 'Design the ingestion layer.'
Three downstream consumers point at log-based CDC as the long-term shape. Producing the change feed once via Debezium and serving three consumers from Kafka is structurally cheaper than three independent polling pipelines hitting the source.
The two-engineer team has no Kafka experience. The right migration is staged. Phase one is Strategy 1 (incremental pull on transactions and ledger_entries, full load on accounts) so the consumers get something working in two weeks. Phase two introduces managed Kafka and Debezium when the team has hired or learned the operational capability.
Idempotent consumption is non-negotiable. Every event into the raw zone carries a dedupe key (event_id from the source primary key plus the WAL position) and the destination tables enforce uniqueness on it. This protects against connector restarts during the eventual Phase two cutover. (Builds on Lesson 5 idempotency.)
Schema evolution policy is set up front. Additive changes propagate; breaking changes go through expand-contract with explicit producer-connector-consumer coordination. Without the policy in place at Phase one, the first breaking change in Phase two becomes an incident. (Builds on Lesson 8 schema evolution.)
KEY TAKEAWAYS
Bookmarks turn full reload into incremental: lower bound from the previous run, upper bound from this run's start, ordered by the watermark column.
Three CDC families have different costs: trigger-based doubles writes, query-based misses hard deletes, log-based reads the WAL once and captures everything.
Log-based CDC needs operating: replication slot lag, schema registry, initial snapshot, and topic management are real infrastructure.
At-least-once is the default delivery model: deduplicate at the sink on a stable event id; that is the cheapest place for the dedupe to live.
Choose the strategy that matches operational capacity: Strategy 1 wins at small scale, Strategy 3 wins at scale, Strategy 2 is the bridge for sources that cannot expose their WAL.
Topics covered: Full, Incremental, and Bookmarks, CDC: The Three Families, Log-Based CDC: Mechanics and Costs, Idempotent At-Least-Once Ingest, Three Ways to Ingest from Postgres
Pull ingestion lives on a spectrum. At one end, every run reads the entire source table. At the other end, every run reads only the rows that changed since the last successful run. The second pattern requires a bookmark of some kind, persisted between runs, that defines what 'since the last run' means. Picking the right point on the spectrum and choosing the right bookmark is the difference between ingestion that scales and ingestion that turns the source database into the bottleneck. When Each
Change Data Capture, abbreviated CDC, is the discipline of capturing every insert, update, and delete that lands in an operational database, in order, and surfacing them for downstream consumers. CDC is what turns a transactional database into a streaming source without the application code knowing anything has changed. Three families of CDC dominate. They differ in where the capture happens and what it costs. Family Comparison Trigger-Based CDC Trigger-based CDC is the oldest pattern. The DBA i
Log-based CDC sounds free until the operational profile arrives. The mechanism is direct: the database is already writing every change to its log for crash recovery; tap the log, decode it, ship it downstream. The reality has costs. Replication slots can fill the disk. Schema changes upstream become Kafka topic problems. The CDC connector becomes a critical piece of infrastructure that must be operated like a database. Debezium and AWS DMS at a Glance The Replication Slot Problem Postgres uses l
The general idempotent-write playbook is the subject of Lesson 5 (partition overwrite, MERGE on a business key, DELETE-then-INSERT). This section narrows the playbook to the ingestion seam, where at-least-once delivery from the source forces deduplication on a message key. Most ingestion systems offer at-least-once delivery. Kafka consumers reprocess after rebalancing. Webhook senders retry until they get a 2xx. SFTP partners re-upload the same file when their cron job retries. Pull jobs that cr
Vocabulary becomes useful applied to a single source seen three ways. Take a Postgres operational database with two tables of interest: customers (slowly changing dimension, ~2M rows) and orders (event-shaped, ~500M rows growing at ~10M per day). The downstream destination is Snowflake. The product team wants the customers table fresh within an hour and the orders table fresh within five minutes. Three legitimate strategies exist. Each is correct for some scale and some operational context. Stra