Ingestion Patterns: Intermediate

A logistics company at 600 engineers had thirty pipelines pulling incrementally from the same Postgres operational database. Each ran every fifteen minutes against updated_at. The replica's IOPS doubled in a quarter. A platform engineer profiled the workload and found that thirty incremental queries scanning the same indexes thirty times every fifteen minutes was producing twelve hundred index scans per hour, half of which returned zero new rows. The fix was not faster ingestion. The fix was Debezium reading the Postgres write-ahead log once, publishing changes to a Kafka topic, and the thirty pipelines consuming from Kafka instead of querying the database. The replica's IOPS dropped 80 percent that week. The story is a fair summary of why the intermediate tier of ingestion exists. Pull ingestion stops scaling at some inflection point and the right answer is to invert the model.

What you will be able to do

Distinguish full loads, incremental loads, and bookmark patterns and choose between them

Recognize the three families of CDC (trigger-based, query-based, log-based) and their cost shapes

Apply idempotent consumption patterns to ingestion under at-least-once delivery

Full, Incremental, and Bookmarks

Daily Life

Interviews

Choose between full and incremental loads, design a bookmark store, and write the bounded query that drives incremental pulls.

Pull ingestion lives on a spectrum. At one end, every run reads the entire source table. At the other end, every run reads only the rows that changed since the last successful run. The second pattern requires a bookmark of some kind, persisted between runs, that defines what 'since the last run' means. Picking the right point on the spectrum and choosing the right bookmark is the difference between ingestion that scales and ingestion that turns the source database into the bottleneck.

When Each Strategy Fits

Full loadIncremental by timestampIncremental by sequence idCDC stream

Full load

Reread the entire source

Truncate-and-replace. Best for tables under ~1M rows; reference and dimension tables.

Incremental by timestamp

WHERE updated_at > X

Cost scales with change rate. Requires a reliable updated_at; sensitive to clock skew.

Incremental by sequence id

WHERE id > last_id

Cleanest for append-only tables. Immune to clock skew. Cannot detect updates.

CDC stream

Read the WAL

Captures every insert, update, and delete. Strongest at scale. Operational cost is real.

The Bookmark Pattern

A bookmark is a small piece of state that records the position the last successful ingestion run reached. The bookmark is persisted in a metadata table the pipeline owns, not in the source. On every run, the pipeline reads the bookmark, queries the source for rows after that position, processes them, and on success advances the bookmark to the new maximum. The bookmark is the entire mechanism that separates incremental ingestion from full reload.

	'CREATE TABLE ingestion_bookmarks ( source_name TEXT PRIMARY KEY, bookmark_value TEXT NOT NULL, bookmark_kind TEXT NOT NULL, last_run_at TIMESTAMP NOT NULL, last_row_count BIGINT, last_status TEXT NOT NULL )'
	/* The bookmark store is a tiny metadata table the pipeline owns. */
	/* ISO timestamp or sequence id, as text */
	/* 'timestamp' \| 'sequence' \| 'change_version' */

Three columns define the contract: the source name (so the table holds bookmarks for many sources), the bookmark value (a string, kept as text so timestamps and integers share a column), and the bookmark kind (so the pipeline knows how to compare). The remaining columns are operational metadata that pay for themselves the first time someone needs to debug a stuck pipeline.

The Boundaries That Make Bookmarks Correct

Three boundaries every bookmark-driven pull must enforce:

▸Lower bound: WHERE bookmark_column >= :previous_value (inclusive on the left to avoid losing the boundary row)
▸Upper bound: AND bookmark_column < :run_started_at (exclusive, fixed at run launch, never wall clock mid-query)
▸Order: ORDER BY bookmark_column so the new bookmark is unambiguously the maximum value seen
▸Atomicity: advance the bookmark only after the rows are durably written downstream
▸Recoverability: keep the previous bookmark value so a failed run does not advance and the next run reprocesses

Choosing the Bookmark Column

Not every column is bookmark-quality. The column must be monotonic for the rows the pipeline cares about: an updated_at that some triggers fail to update is not safe. The column must be indexed: a sequential scan of a billion-row table on every run is not viable. The column must be source-controlled: a trigger-fired updated_at on the source is fine, an updated_at written by application code in some paths but not others is a trap. The audit takes thirty minutes per source and prevents months of silent data loss.

•Bookmark on updated_at

Captures inserts and updates
Subject to clock skew and trigger gaps
Universal: most tables already have it
Resolution issue: integer-second timestamps lose rows updated within the same second

•Bookmark on sequence id

Captures only inserts (or any monotonic position)
Immune to clock skew because it is not time-based
Cleanest for append-only event tables
Cannot detect updates: if a row is mutated, its id does not change

A Bookmark-Driven Pull in SQL

	/* The pipeline runs this query, bound to the persisted bookmark value. */
	SELECT
	order_id,
	customer_id,
	total_cents,
	status,
	updated_at
	FROM orders
	WHERE updated_at >= '2026-04-25 01:00:00'
	AND updated_at < '2026-04-25 01:15:00'
	ORDER BY updated_at, order_id
	LIMIT 5

Result

order_id	customer_id	total_cents	status	updated_at
1001	C-42	12500	paid	2026-04-25 01:01:14
1002	C-87	9900	paid	2026-04-25 01:03:55
1003	C-12	45000	refunded	2026-04-25 01:08:22
1004	C-55	780	paid	2026-04-25 01:11:09
1005	C-91	6700	pending	2026-04-25 01:14:48

The lower bound is the previous bookmark value. The upper bound is the run start time, fixed once at launch. The compound order (updated_at, order_id) guarantees stable pagination even when many rows share the same timestamp. After the rows write to the raw zone, the bookmark advances to the run start time. The next run picks up from there.

The bookmark is a single piece of state that turns full reload into incremental.

Advance the bookmark only after the downstream write is durable.

Audit the bookmark column for monotonicity, indexing, and source-control before relying on it.

TIP

Keep the previous bookmark value alongside the current one. Recovering from a buggy advance becomes a one-row UPDATE instead of a postmortem.

pull: query DB

database

push: Kafka/webhook

queue

file drop: S3

file drop

ingestion layer

ingest

raw layer

lake

The shapes of ingestion: pull (query a source DB on a schedule), push (consume a queue or webhook), and file drop (pick up files from object storage). Each lands in the raw layer.

CDC: The Three Families

Daily Life

Interviews

Distinguish trigger-based, query-based, and log-based CDC and choose the right family for a given source.

Change Data Capture, abbreviated CDC, is the discipline of capturing every insert, update, and delete that lands in an operational database, in order, and surfacing them for downstream consumers. CDC is what turns a transactional database into a streaming source without the application code knowing anything has changed. Three families of CDC dominate. They differ in where the capture happens and what it costs.

Family Comparison

Family	How It Captures	Captures Deletes?	Operational Cost
Trigger-based	Database triggers fire on INSERT/UPDATE/DELETE and write to an audit table	Yes	High write amplification on the source; trigger maintenance burden
Query-based	Periodic SELECT WHERE updated_at > X (the bookmark pattern from section 0)	Only with explicit tombstone columns	Adds repeated scan load to the source; misses hard deletes
Log-based	Reads the database write-ahead log directly (WAL, binlog, oplog)	Yes (deletes appear in the log)	Adds a replication slot; minimal CPU at the source

Trigger-Based CDC

Trigger-based CDC is the oldest pattern. The DBA installs database triggers on every table of interest. Each trigger fires on INSERT, UPDATE, or DELETE and writes a row to a side table that records what happened. A pipeline polls the side table, drains it, and writes to downstream. The pattern works on any database that supports triggers, captures deletes naturally, and is straightforward to reason about.

	CREATE TABLE orders_cdc(cdc_id BIGSERIAL PRIMARY KEY, op CHAR(1) NOT NULL, captured_at TIMESTAMP DEFAULT now(), order_id BIGINT NOT NULL, row_snapshot JSONB) ; CREATE OR REPLACE FUNCTION capture_orders_changes() RETURNS trigger AS $ $ BEGIN IF TG_OP = 'DELETE' THEN INSERT INTO orders_cdc(op, order_id, row_snapshot) VALUES('D', OLD.order_id, row_to_json(OLD) :: jsonb) ; ELSIF TG_OP = 'UPDATE' THEN INSERT INTO orders_cdc(op, order_id, row_snapshot) VALUES('U', NEW.order_id, row_to_json(NEW) :: jsonb) ; ELSE INSERT INTO orders_cdc(op, order_id, row_snapshot) VALUES('I', NEW.order_id, row_to_json(NEW) :: jsonb) ; END IF ; RETURN NEW ; END ; $ $ LANGUAGE plpgsql ; CREATE TRIGGER orders_cdc_trigger AFTER INSERT OR UPDATE OR DELETE


	ON orders FOR EACH ROW EXECUTE FUNCTION capture_orders_changes() ;

Every business write to orders now triggers an additional write to orders_cdc. That doubling of write volume is the cost. On a high-throughput OLTP table the doubling can be the difference between fitting on the existing instance and needing a larger one. The trigger code is also code: bugs in the trigger become bugs in the source-of-truth write path, which DBAs hate for legitimate reasons.

When trigger-based CDC is the right choice:

▸The source database does not expose a usable replication log (older versions, hosted services without WAL access)
▸Write volume is low enough that doubling it is acceptable
▸Audit requirements demand a trigger-managed history table for compliance reasons
▸The team has DBA capacity to maintain triggers as schemas evolve

Query-Based CDC

Query-based CDC is the bookmark pattern from section 0, used as a CDC mechanism. The pipeline polls the source on a cadence and pulls rows whose updated_at advanced since the last poll. It is the simplest pattern and the most common in off-the-shelf connectors (Airbyte, Fivetran, Stitch). It captures inserts and updates well and captures deletes only if the source maintains a deleted_at column. Hard deletes (rows physically removed) are invisible to query-based CDC.

✓Query-Based Strengths

Works on any database that supports SELECT
Easy to set up; no DBA cooperation required
Bounded source-side cost (just queries)
Off-the-shelf connectors handle it

•Query-Based Weaknesses

Misses hard deletes entirely
Sensitive to clock skew and trigger-skipped updates
Polling cadence is the floor on freshness
Adds repeated scan load proportional to poll frequency

Log-Based CDC: The Strongest Family

Log-based CDC reads the database's write-ahead log directly. Postgres calls it the WAL. MySQL calls it the binlog. MongoDB calls it the oplog. The log records every committed change in commit order, including deletes. A log-based CDC tool subscribes to the log as a logical replica and emits each change as an event downstream. The pattern captures every change with no work on the application or trigger side, captures deletes naturally, and adds minimal CPU cost to the source.

	Postgres orders table
	\|
	(every commit appends to WAL)
	v
	Write-Ahead Log
	\|
	(Debezium logical replication slot reads in commit order)
	v
	Debezium connector
	\|
	(publishes events to topic)
	v
	Kafka topic: dbz.public.orders
	{op:'c', after:{order_id:1001, total:12500}}
	{op:'u', before:{...}, after:{...}}
	{op:'d', before:{order_id:1001, ...}}

The Debezium connector reads the WAL through a Postgres logical replication slot, decodes each transaction into row-level events, and publishes them to a Kafka topic per source table. Each event carries an op field ('c' for create, 'u' for update, 'd' for delete), an optional before image, and an after image. Downstream consumers see every change, in order, with deletes preserved. Log-based CDC is the only CDC family that captures deletes correctly without source-side cooperation, and that single property is enough to make it the default at scale.

Trigger-basedQuery-basedLog-based

Trigger-based

Triggers write to side tables

Captures deletes. Doubles write volume on the source. Trigger code becomes part of the write path.

Query-based

Polling on a watermark

Simplest. Works on any source. Misses hard deletes. Cadence floors freshness.

Log-based

Read the WAL or binlog

Captures every change including deletes. Sub-second latency. Requires Kafka, registry, and slot lag monitoring.

Trigger-based CDC trades source-side write amplification for any-database compatibility.

Query-based CDC misses hard deletes; that gap is permanent without an application-level tombstone.

Log-based CDC is the strongest family because the log is already the source of truth for the database.

TIP

Audit any in-house CDC implementation for delete handling. The most common silent data quality failure in operational mirrors is ignored hard deletes.

Log-Based CDC: Mechanics and Costs

Daily Life

Interviews

Operate a log-based CDC pipeline including initial snapshot, replication slot lag, and schema evolution.

Log-based CDC sounds free until the operational profile arrives. The mechanism is direct: the database is already writing every change to its log for crash recovery; tap the log, decode it, ship it downstream. The reality has costs. Replication slots can fill the disk. Schema changes upstream become Kafka topic problems. The CDC connector becomes a critical piece of infrastructure that must be operated like a database.

Debezium and AWS DMS at a Glance

Tool	How It Runs	Output	Where It Fits
Debezium	Kafka Connect plugin (or embedded library)	Per-table Kafka topic of CDC events	Companies running Kafka; need fine-grained control
AWS DMS	Managed AWS service	Streams to Kinesis, Kafka MSK, S3, or warehouse	AWS-heavy stacks; want managed-service tradeoff
GCP Datastream	Managed GCP service	Streams to BigQuery, Pub/Sub, or GCS	GCP-heavy stacks; BigQuery as primary destination
Azure Event Hubs CDC	Managed Azure service tied to SQL or Cosmos	Streams to Event Hubs	Azure SQL or Cosmos as the source

The Replication Slot Problem

Postgres uses logical replication slots to track which WAL position each consumer has reached. The slot guarantees the WAL is retained until the consumer confirms it has been read. That guarantee is also a liability. If the consumer falls behind or stops entirely, the WAL grows without bound on the source. A stuck Debezium connector has filled production database disks more than once at every company that runs CDC at scale. Operating CDC means monitoring slot lag the way databases monitor replication lag.

What can go wrong with a logical replication slot:

▸Connector restarts and never reattaches; slot grows the WAL until the disk fills
▸Schema change upstream that the connector cannot decode; events accumulate while the slot waits
▸Network partition between connector and database; slot pauses, WAL grows
▸Connector consumes slowly under load; slot lag climbs and replication lag climbs with it
▸Multiple slots on the same database compounding their retention requirements

Schema Evolution in CDC Pipelines

Source-side DDL is a CDC event too. When an upstream team adds a column to orders, Debezium notices, emits a schema change event, and the schema registry receives a new version. Downstream consumers reading from Kafka either tolerate the new field (forward compatibility) or break. The expand-contract pattern from the schema evolution lesson applies directly: additive changes flow through; destructive changes (drop, rename, type narrowing) require coordination across the producer, the connector, the registry, and every consumer.

✓Additive Schema Change

Add a new optional column
Connector emits new schema version
Existing consumers ignore the field; new consumers read it
No coordination required beyond announcement

•Breaking Schema Change

Drop a column, rename a column, narrow a type
Connector emits an incompatible schema version
Consumers fail to deserialize until updated
Requires producer-connector-consumer coordination

An Initial Snapshot Plus Streaming Tail

A new CDC subscription must reconcile two pieces of state. The history (rows that exist before the subscription started) and the tail (rows changing after the subscription started). Debezium's standard pattern is the snapshot. On startup, the connector locks the source briefly, records the current WAL position, snapshots all rows of every captured table, and then begins streaming from the recorded position. The two phases stitch together to produce a complete history with no gap and no overlap.

	Connector starts : 1. Acquire a brief read lock
	ON captured tables 2. Note WAL POSITION L 3. Release the lock 4.
	SELECT *
	FROM each captured TABLE ; emit each ROW AS 'r'(read) events 5. Begin streaming
	FROM WAL POSITION L ; emit 'c'(CREATE), 'u'(UPDATE), 'd'(DELETE) events Result : snapshot events + stream events = complete history WITH no gap, no overlap

Operational Realities

Concern	Cost	Mitigation
Slot lag monitoring	Required dashboards, alerts, on-call	Treat the connector like a tier-1 database
Schema registry operation	A new piece of infrastructure	Use Confluent or Apicurio managed registries
Initial snapshot duration	Hours for terabyte tables	Parallel snapshot or incremental snapshot extensions
Source DBA cooperation	Replication user, slot, WAL retention configuration	Negotiate SLAs with the OLTP team upfront
Topic management	One topic per captured table per source	Naming conventions; topic-creation automation

Log-based CDC is the strongest CDC family because it taps the database's existing source of truth.

Replication slots fill source disks when consumers stall; monitor slot lag.

Initial snapshot plus streaming tail is the standard onboarding pattern.

TIP

Treat the CDC connector as a piece of database infrastructure with its own SLO, not as another pipeline. The blast radius of a broken connector is the source database itself.

Idempotent At-Least-Once Ingest

Daily Life

Interviews

Apply idempotent consumption patterns to ingestion: dedupe-by-key, sink-side uniqueness, content-hash ids.

The general idempotent-write playbook is the subject of Lesson 5 (partition overwrite, MERGE on a business key, DELETE-then-INSERT). This section narrows the playbook to the ingestion seam, where at-least-once delivery from the source forces deduplication on a message key.

Most ingestion systems offer at-least-once delivery. Kafka consumers reprocess after rebalancing. Webhook senders retry until they get a 2xx. SFTP partners re-upload the same file when their cron job retries. Pull jobs that crash mid-write rerun the same window. The world the ingestion layer lives in is not exactly-once; it is at-least-once with deduplication required at the sink. Pipelines that pretend otherwise produce duplicate rows that take weeks to find.

Why At-Least-Once Is the Default

Scenario	Why Duplicates Appear	Frequency
Kafka consumer rebalance	Offset committed before write; rebalance reassigns the partition	Every deploy or scaling event
Webhook retry	Vendor sends the event; consumer 2xx response gets lost; vendor retries	Daily on any high-volume webhook
Pull job crash mid-batch	Some rows written, bookmark not yet advanced; next run repeats the window	Whenever the underlying compute crashes
File drop re-upload	Partner cron retried after a network blip; same filename, same contents	Weekly to monthly per partner
CDC initial snapshot replay	Connector restarted before snapshot completed	On every connector restart during onboarding

The Dedupe Key

An idempotent ingestion sink deduplicates on a stable key extracted from the message itself. The key is whatever uniquely identifies the logical event, regardless of how many times the message is delivered. For Kafka events, the key is often a UUID generated by the producer and included in the event payload. For webhook events, vendors usually include an event id (Stripe's evt_..., Shopify's webhook_id). For file drops, the row's natural key plus the source file's identity is sufficient. For pull ingestion, the source primary key plus the bookmark window position works.

	CREATE TABLE raw.order_events(event_id UUID PRIMARY KEY, source TEXT NOT NULL, event_type TEXT NOT NULL, payload JSONB NOT NULL, ingested_at TIMESTAMP DEFAULT now()) ; INSERT INTO raw.order_events(event_id, source, event_type, payload) VALUES(: event_id, : source, : event_type, : payload)


	ON CONFLICT(event_id) DO NOTHING ;

The unique constraint on event_id makes the table self-deduplicating. Two attempts to insert the same event_id produce one row. The pipeline does not have to know whether this is the first attempt or the fifth. The semantics shift from at-least-once delivery plus error-on-duplicate to at-least-once delivery plus silent dedupe. That shift is what makes the pipeline tolerate any number of retries.

Dedupe by Window vs Dedupe by Key

•Dedupe by Time Window

Group events by hour or day; within the window, GROUP BY a natural key
Loses correctness if duplicates span windows
Cheaper to implement when no global id exists
Acceptable for analytics aggregates, not for transactional facts

✓Dedupe by Stable Key

Every event has a unique id; sink enforces uniqueness on it
Correct under any retry pattern, any window
Requires the producer to generate the key (or extract it from upstream)
Standard for transactional events and CDC

Idempotent Consumption in Code

	# Simulating at-least-once delivery and dedupe-by-key on the consumer side.
	incoming = [
	{"event_id": "e1", "amount": 100},
	{"event_id": "e2", "amount": 250},
	{"event_id": "e1", "amount": 100}, # retry of e1
	{"event_id": "e3", "amount": 75},
	{"event_id": "e2", "amount": 250}, # retry of e2
	]

	seen = set()
	stored = []
	for event in incoming:
	if event["event_id"] in seen:
	continue
	seen.add(event["event_id"])
	stored.append(event)

	print(f"Received {len(incoming)} events")
	print(f"Stored {len(stored)} unique events")
	print(f"Total amount: {sum(e['amount'] for e in stored)}")

>>>Output

Received 5 events

Stored 3 unique events

Total amount: 425

Five events arrive, three unique events get stored, and the total amount reflects only the unique events. The dedupe set is in memory in the example; in a real ingestion sink the dedupe set is the unique constraint on the destination table, the keyed compaction in Kafka, or a separate dedupe table that the consumer consults before writing. The shape of the logic does not change.

Where Dedupe Belongs

Three valid places to deduplicate, in order of preference:

▸At the sink, via a unique constraint or MERGE on the dedupe key (cheapest, most reliable)
▸In the streaming engine, via keyed state and exactly-once sinks (Flink, Spark Structured Streaming)
▸Downstream during transformation, via ROW_NUMBER OVER PARTITION BY key (works but is expensive)

Pushing dedupe to the sink is the cheapest because the sink already needs an index on the key for query performance, and the unique constraint piggybacks on that index. Doing it in the streaming engine is correct but binds the pipeline to a specific engine. Doing it downstream is what every legacy pipeline ends up doing because nobody put the dedupe key in place at ingest time. The pattern is so common that 'add a dedupe step downstream' is a tax most analytics pipelines pay forever. Dedupe is not optional in at-least-once systems; the choice is where it happens, not whether it happens.

✓Do

Require every ingestion source to expose a stable event id
Enforce uniqueness on the dedupe key at the raw zone, not in transforms
Generate ids at the producer when sources do not provide them (UUID v4 plus content hash)
Persist the dedupe key alongside the payload so downstream can reuse it

✗Don't

Trust delivery semantics to be exactly-once at any scale
Build dedupe-by-window logic when a stable key is available
Use natural keys (email, phone) as dedupe keys when they can be edited upstream
Forget that file drops can be re-uploaded and need their own dedupe key per row

At-least-once is the default; deduplication at the sink is the answer.

A stable event id from the producer is the cheapest dedupe key available.

Dedupe pushed downstream costs more than dedupe at ingest, every time.

TIP

If the source does not provide an event id, generate one at the ingestion boundary using a content hash. The id is then deterministic across retries of the same logical event.

Three Ways to Ingest from Postgres

Daily Life

Interviews

Compare three Postgres ingestion strategies and pick the right one based on scale, deletes, latency, and team capacity.

Vocabulary becomes useful applied to a single source seen three ways. Take a Postgres operational database with two tables of interest: customers (slowly changing dimension, ~2M rows) and orders (event-shaped, ~500M rows growing at ~10M per day). The downstream destination is Snowflake. The product team wants the customers table fresh within an hour and the orders table fresh within five minutes. Three legitimate strategies exist. Each is correct for some scale and some operational context.

Strategy 1: Periodic Full Load + Incremental Pull

The first strategy treats the two tables differently. Customers is small enough that a nightly full load (truncate-and-replace) is acceptable; the table reloads every 24 hours and the freshness SLO is met. Orders is too large for full load, so incremental pull every five minutes uses the bookmark pattern with updated_at as the watermark. Both jobs run in Airflow. State lives in a metadata table.

Aspect	Customers (full load)	Orders (incremental pull)
Cadence	Once per day	Every 5 minutes
Source-side cost	One full table scan per night	One bookmark-bounded scan per 5 min, ~288 scans/day
Captures deletes?	Yes (full reload)	Only soft deletes via deleted_at
Failure recovery	Just rerun; idempotent by truncate	Bookmark unchanged on failure; next run reprocesses window

Strategy 2: Trigger-Based CDC

The second strategy installs triggers on both tables that write to side tables capturing every insert, update, and delete. A small worker drains the side tables every minute and writes the events to Snowflake. The pattern captures deletes naturally and provides a complete change history. The cost is doubled write volume on every business write to the source database, plus the operational burden of maintaining triggers as schemas evolve.

	/* The drain worker pulls from the trigger-populated side table. */
	/* Returns every change captured since the last drain. */
	SELECT
	op,
	order_id,
	captured_at
	FROM orders_cdc
	WHERE captured_at >= '2026-04-25 02:00:00'
	ORDER BY captured_at
	LIMIT 5

Result

op	order_id	captured_at
I	1001	2026-04-25 02:00:14
U	1001	2026-04-25 02:01:09
I	1002	2026-04-25 02:01:33
D	0998	2026-04-25 02:02:48
U	1002	2026-04-25 02:03:12

Strategy 2 captures deletes correctly. It also doubles the write volume on a 500M row table that already grows by 10M rows a day. The doubling is the deal-breaker for any high-throughput table. Strategy 2 is the right answer when the source database does not expose its WAL (older managed services), when audit requirements demand a trigger-managed history table for compliance reasons, or when the volume is low enough that doubling it is acceptable.

Strategy 3: Log-Based CDC with Debezium

The third strategy installs Debezium against the Postgres logical replication slot. Debezium reads the WAL, decodes inserts, updates, and deletes, and publishes them to Kafka topics (one per table). A Snowflake sink connector consumes from Kafka and writes to Snowflake using MERGE on the primary key. End-to-end latency is on the order of seconds. Source-side cost is minimal: the WAL is being written anyway. Captures deletes naturally.

Postgres orders / customers |(logical replication slot,  single read of the WAL) v Debezium connector(Kafka Connect) | topics : dbz.public.customers, dbz.public.orders v SCHEMA registry(Avro / JSON SCHEMA) v Snowflake Kafka connector(MERGE ON PK) v Snowflake : raw.customers, raw.orders + change history v Curated layer : customers_current, orders_fact

Side-by-Side Tradeoffs

Dimension	Strategy 1 (full + incremental)	Strategy 2 (trigger CDC)	Strategy 3 (log CDC)
End-to-end latency	5 min for orders, 24h for customers	1 min for both	Seconds for both
Captures hard deletes	No	Yes	Yes
Source-side cost	Repeated index scans	Doubled write volume	Minimal (just WAL)
Operational complexity	Low (Airflow + JDBC)	Medium (DBA + drain worker)	High (Kafka + Debezium + registry)
Cost ceiling	Hits when scan load saturates the replica	Hits when write doubling saturates the primary	Hits when Kafka or schema registry becomes a bottleneck
Right at scale	Up to ~10s of GB and minutes of latency	Up to medium write volume; legacy compatibility	Hundreds of GB to many TB; sub-minute latency

•Pick Strategy 1 when

Source is small to medium (under 100GB)
Hard deletes are rare or unimportant
Latency target is minutes, not seconds
Team has no Kafka or DBA capacity

✓Pick Strategy 3 when

Source is large (TBs) or growing fast
Deletes must be captured correctly
Latency target is seconds
Multiple downstream consumers will subscribe to the same change feed

Most companies start at Strategy 1 because the operational cost is the lowest and the scale rarely justifies more upfront. Companies that grow into Strategy 3 do so because the source-side load of repeated polling becomes the operational bottleneck, or because three or more downstream consumers want the same change feed and producing it once via Debezium is cheaper than producing it three times via JDBC. Strategy 2 is rarely the long-term answer; it is the bridge for sources that cannot expose their WAL.

Three legitimate strategies cover almost every Postgres ingestion case.

Strategy 1 wins at small scale because the operational cost is lowest.

Strategy 3 wins at scale because the source-side cost is constant per change, not per consumer.

TIP

Picking the right ingestion strategy is rarely picking the highest-throughput one. It is picking the one whose operational cost matches the team's capacity to operate it.

❯❯❯PUTTING IT ALL TOGETHER

> A growing fintech operates a Postgres operational database with three high-volume tables: accounts (10M rows), transactions (2B rows, 50M added daily), and ledger_entries (5B rows, 100M added daily). Three downstream consumers exist: a fraud detection model wanting sub-minute freshness, a finance reporting warehouse wanting hourly correctness with full delete capture, and a customer-facing API wanting eventual consistency for cached balances. The data engineering team has two engineers and no Kafka experience yet. The CTO asks: 'Design the ingestion layer.'

Three downstream consumers point at log-based CDC as the long-term shape. Producing the change feed once via Debezium and serving three consumers from Kafka is structurally cheaper than three independent polling pipelines hitting the source.

The two-engineer team has no Kafka experience. The right migration is staged. Phase one is Strategy 1 (incremental pull on transactions and ledger_entries, full load on accounts) so the consumers get something working in two weeks. Phase two introduces managed Kafka and Debezium when the team has hired or learned the operational capability.

Idempotent consumption is non-negotiable. Every event into the raw zone carries a dedupe key (event_id from the source primary key plus the WAL position) and the destination tables enforce uniqueness on it. This protects against connector restarts during the eventual Phase two cutover. (Builds on Lesson 5 idempotency.)

Schema evolution policy is set up front. Additive changes propagate; breaking changes go through expand-contract with explicit producer-connector-consumer coordination. Without the policy in place at Phase one, the first breaking change in Phase two becomes an incident. (Builds on Lesson 8 schema evolution.)

KEY TAKEAWAYS

Bookmarks turn full reload into incremental: lower bound from the previous run, upper bound from this run's start, ordered by the watermark column.

Three CDC families have different costs: trigger-based doubles writes, query-based misses hard deletes, log-based reads the WAL once and captures everything.

Log-based CDC needs operating: replication slot lag, schema registry, initial snapshot, and topic management are real infrastructure.

At-least-once is the default delivery model: deduplicate at the sink on a stable event id; that is the cheapest place for the dedupe to live.

Choose the strategy that matches operational capacity: Strategy 1 wins at small scale, Strategy 3 wins at scale, Strategy 2 is the bridge for sources that cannot expose their WAL.

Incremental loads, change capture, and idempotent consumption decide whether ingestion scales

Category: Pipeline Architecture
Difficulty: intermediate
Duration: 32 minutes
Challenges: 0 hands-on challenges

Topics covered: Full, Incremental, and Bookmarks, CDC: The Three Families, Log-Based CDC: Mechanics and Costs, Idempotent At-Least-Once Ingest, Three Ways to Ingest from Postgres

Lesson Sections

Full, Incremental, and Bookmarks (concepts: paFullVsIncremental)
Pull ingestion lives on a spectrum. At one end, every run reads the entire source table. At the other end, every run reads only the rows that changed since the last successful run. The second pattern requires a bookmark of some kind, persisted between runs, that defines what 'since the last run' means. Picking the right point on the spectrum and choosing the right bookmark is the difference between ingestion that scales and ingestion that turns the source database into the bottleneck. When Each
CDC: The Three Families (concepts: paCdc)
Change Data Capture, abbreviated CDC, is the discipline of capturing every insert, update, and delete that lands in an operational database, in order, and surfacing them for downstream consumers. CDC is what turns a transactional database into a streaming source without the application code knowing anything has changed. Three families of CDC dominate. They differ in where the capture happens and what it costs. Family Comparison Trigger-Based CDC Trigger-based CDC is the oldest pattern. The DBA i
Log-Based CDC: Mechanics and Costs (concepts: paCdc)
Log-based CDC sounds free until the operational profile arrives. The mechanism is direct: the database is already writing every change to its log for crash recovery; tap the log, decode it, ship it downstream. The reality has costs. Replication slots can fill the disk. Schema changes upstream become Kafka topic problems. The CDC connector becomes a critical piece of infrastructure that must be operated like a database. Debezium and AWS DMS at a Glance The Replication Slot Problem Postgres uses l
Idempotent At-Least-Once Ingest (concepts: paIdempotency)
The general idempotent-write playbook is the subject of Lesson 5 (partition overwrite, MERGE on a business key, DELETE-then-INSERT). This section narrows the playbook to the ingestion seam, where at-least-once delivery from the source forces deduplication on a message key. Most ingestion systems offer at-least-once delivery. Kafka consumers reprocess after rebalancing. Webhook senders retry until they get a 2xx. SFTP partners re-upload the same file when their cron job retries. Pull jobs that cr
Three Ways to Ingest from Postgres (concepts: paCdc)
Vocabulary becomes useful applied to a single source seen three ways. Take a Postgres operational database with two tables of interest: customers (slowly changing dimension, ~2M rows) and orders (event-shaped, ~500M rows growing at ~10M per day). The downstream destination is Snowflake. The product team wants the customers table fresh within an hour and the orders table fresh within five minutes. Three legitimate strategies exist. Each is correct for some scale and some operational context. Stra