Ingestion Patterns: Beginner

An e-commerce company at Series C scale ran twelve ingestion jobs against twelve different sources. Eight pulled from operational databases on a 15 minute cadence. Two consumed from Kafka. One read CSV files dropped hourly into an SFTP server by a logistics partner. One polled a marketing automation API. Over the course of one quarter, every late-night page traced back to ingestion: a Postgres replica that fell behind, a Kafka consumer that lost its offset, an SFTP partner that switched encodings without warning, an API vendor that quietly halved its rate limit. The transforms downstream were boring and reliable. The ingestion layer absorbed every external surprise the company could not control. Ingestion is the boundary where the pipeline meets the world, and the world does not care about the pipeline.

What you will be able to do

Recognize the four shapes ingestion takes: pull, push, file drop, and API

Identify the dominant failure modes for each shape and the system characteristics that pick between them

Apply the right ingestion shape to a new source by matching the shape to the source's behavior

The Four Shapes of Ingestion

Daily Life

Interviews

Recognize the four ingestion shapes and pick the right one for a new source.

Every byte that enters a pipeline arrives through one of four shapes. The shape is determined by who initiates the transfer and what kind of artifact is being transferred. Naming the four shapes is the first move because every later concern, from scheduling to error handling to schema validation, is shaped by the choice. A pipeline that ingests from a Postgres database is structurally different from a pipeline that consumes from a Kafka topic, and pretending the difference does not matter is the most common reason ingestion code rots.

The Four Shapes Side by Side

Shape	Who Initiates	Typical Source	Typical Cadence
Pull	The pipeline (on a schedule)	Operational database, internal data store	Every N minutes or hours
Push	The source (continuously)	Kafka topic, Kinesis stream, webhook callback	Continuous; events arrive when produced
File drop	The source (on its own schedule)	S3 prefix, SFTP directory, partner upload	Hourly or daily, often unpredictable
API	The pipeline (with rate limits)	Stripe, Salesforce, HubSpot, internal REST	Polled on a schedule, paginated

Two dimensions sort the shapes. The first dimension is who decides when the data moves: the pipeline pulling on a schedule, or the source pushing as data is produced. The second dimension is whether the data is delivered as discrete events or as a bulk artifact like a row set or a file. Pull and API ingestion are pipeline-initiated; push and file drop ingestion are source-initiated. Push delivers events; the others typically deliver row sets or files. The four shapes cover almost every real ingestion case.

PullPushFile dropAPI

Pull

Pipeline initiates, source responds

The pipeline runs a query against a database every N minutes. The source has no idea when it will be queried. JDBC, SELECT WHERE updated_at > X.

Push

Source initiates, pipeline reacts

Kafka, Kinesis, webhooks (HTTP callbacks the source fires when something happens). The source emits events; the pipeline subscribes. Cadence is whatever the source produces, not what the pipeline asked for.

File drop

Source places a file in a known location

Partners and batch upstream systems write CSV, JSON, or Parquet to S3 or SFTP on their own schedule. The pipeline watches for new files.

API

Pipeline calls an HTTP endpoint

Stripe, Salesforce, HubSpot. The pipeline polls REST or GraphQL endpoints, paginates results, and respects rate limits.

Why the Shape Matters

Every operational concern downstream depends on the ingestion shape. Pull ingestion has to track what was pulled last so the next pull is incremental. Push ingestion has to track an offset or a sequence number so consumption can resume after a crash. File drop ingestion has to remember which files have been processed so a partner re-uploading does not cause duplicate processing. API ingestion has to handle pagination, rate limits, and the occasional 503 from a vendor that is not the pipeline's vendor. The shapes do not have a winner; they have different costs.

Shape	Main State to Track	Most Common Failure
Pull	Last successful watermark (timestamp or sequence id)	Watermark drift: rows that updated outside the bound get missed
Push	Consumer group offset or stream cursor	Offset reset: consumer rewinds and reprocesses or skips
File drop	Set of file paths already processed	Same filename re-uploaded with different contents
API	Pagination cursor, rate limit budget, auth token expiry	Vendor returns 429 or 5xx mid-pagination

The four ingestion shapes summed up:

▸Pull is on the pipeline's schedule, with a watermark to know how far the last run got
▸Push is on the source's schedule, with an offset to know how far the consumer got
▸File drops are on the source's schedule, with a manifest of what has been processed
▸API ingestion is pipeline-initiated but rate-limited by the vendor; pagination plus retries plus auth

Mixing Shapes in One Pipeline

A real pipeline almost always uses more than one shape. A typical analytical pipeline might pull customer dimensions from a Postgres replica nightly, consume click events from Kafka continuously, accept a daily file drop of partner inventory from an S3 prefix, and poll the Stripe API every fifteen minutes for new charges. Each source forces the shape that fits it; the pipeline accommodates all four. Trying to force every source into the same ingestion pattern is a category error and is the reason in-house ingestion code becomes unmaintainable.

Postgres customers Kafka clickstream S3 partner inventory Stripe API charges | v transforms downstream

Four sources, four shapes, one raw zone. The raw zone exists precisely so that every downstream transform sees a uniform interface, regardless of how the data arrived.

Pull and API ingestion are pipeline-initiated; the pipeline decides when to ask.

Push and file drop ingestion are source-initiated; the source decides when to deliver.

Each shape forces a different piece of state to be tracked: watermark, offset, manifest, or pagination cursor.

TIP

Before writing any ingestion code, name which of the four shapes the source forces. The shape determines half the design.

pull: query DB

database

push: Kafka/webhook

queue

file drop: S3

file drop

ingestion layer

ingest

raw layer

lake

The shapes of ingestion: pull (query a source DB on a schedule), push (consume a queue or webhook), and file drop (pick up files from object storage). Each lands in the raw layer.

Pull Ingestion from a Database

Daily Life

Interviews

Implement a pull-ingestion job with a high-water mark and choose between full and incremental loads.

Pull ingestion is the workhorse of analytical data engineering. A scheduled job opens a JDBC or SQLAlchemy connection to an operational database, runs a SELECT, writes the results to a raw landing zone, and exits. The pattern is decades old, well understood, and still the right answer for a large fraction of ingestion problems. The mechanics are simple. The trap is that simple mechanics tempt engineers to ignore the questions that decide whether the simple mechanics are correct.

The Two Flavors: Full and Incremental

•Full Load

Reads every row in the source table on every run
Simplest to implement and reason about
Cost scales with table size, not change rate
Acceptable for small dimensions; quickly painful for fact tables

✓Incremental Load

Reads only rows changed since the last successful run
Requires a reliable change-tracking column (updated_at, modified_at, sequence id)
Cost scales with change rate, not table size
Standard for any source larger than a few million rows

The choice between full and incremental is the first design decision in pull ingestion. A 10000 row reference table is easy: read it whole every time, replace the destination, move on. A 10 billion row events table is impossible to read whole every run. Somewhere between those extremes the pipeline must learn to read only what changed. That learning shows up as the high-water mark.

The High-Water Mark

A high-water mark is a single value the pipeline persists between runs. Its job is to record how far the last successful run got, so the next run can pick up from there. The most common high-water mark is a timestamp column on the source table. The next run queries WHERE updated_at > last_watermark. After a successful run, the watermark advances to the maximum updated_at seen during the run.

	SELECT
	order_id,
	customer_id,
	total_cents,
	updated_at
	FROM orders
	WHERE updated_at >= : last_watermark AND updated_at < : this_run_started_at
	ORDER BY updated_at ;

The query has three pieces. The lower bound is the previous watermark. The upper bound is the start of this run, fixed at run launch so the boundary is stable even if the query takes ten minutes. The ORDER BY is what makes the boundary meaningful: rows are pulled in watermark order, and the new watermark is the maximum value seen. Skipping any of these three pieces is how watermark drift creeps in.

What can go wrong with a watermark:

▸Source clock skew: rows committed with timestamps slightly in the past relative to the watermark
▸Long-running transactions: a row updated at 2:01am may not be visible to a query at 2:05am
▸Updates without a touched updated_at: triggers or batch jobs that bypass the column
▸Boundary resolution: integer-second watermarks miss rows updated within the same second

A Pull Job in Python, End to End

	# A minimal pull-ingestion script. Reads watermark, runs query, writes file, advances watermark.
	import datetime

	# Pretend the watermark store returns the last watermark for this source.
	def read_watermark(source):
	return datetime.datetime(2026, 4, 24, 23, 0, 0)

	# Pretend the source has these rows since the last watermark.
	def pull_rows_since(watermark, run_started_at):
	rows = [
	{"order_id": 1001, "updated_at": datetime.datetime(2026, 4, 24, 23, 15)},
	{"order_id": 1002, "updated_at": datetime.datetime(2026, 4, 25, 0, 32)},
	{"order_id": 1003, "updated_at": datetime.datetime(2026, 4, 25, 1, 11)},
	]
	return [r for r in rows if r["updated_at"] >= watermark and r["updated_at"] < run_started_at]

	run_started_at = datetime.datetime(2026, 4, 25, 2, 0, 0)
	watermark = read_watermark("orders")
	rows = pull_rows_since(watermark, run_started_at)

	print(f"Pulled {len(rows)} rows since {watermark}")
	new_watermark = max((r["updated_at"] for r in rows), default=watermark)
	print(f"New watermark: {new_watermark}")

>>>Output

Pulled 3 rows since 2026-04-24 23:00:00

New watermark: 2026-04-25 01:11:00

The script has all the moving parts of a real pull-ingestion job in twenty lines. It reads the watermark from a state store, runs the bounded query, processes the rows, and persists the new watermark. A production version replaces the in-memory functions with a JDBC connection and a metadata table, adds error handling, batches the writes to the raw zone, and emits metrics. The skeleton does not change.

Production Pull Ingestion Tools

Tool	What It Does	When It Fits
Airbyte	Open-source connectors for hundreds of sources; manages watermarks per stream	Standard SaaS connectors; small-to-medium teams
Fivetran	Managed JDBC and SaaS ingestion with automatic schema drift handling	Buy-not-build for non-differentiated sources
Singer / Meltano	Open-source taps and targets; tap-postgres pulls incrementally with bookmarks	When the catalog of needed connectors is small and customizable
Custom Python or Spark	Hand-rolled extract job with a metadata table for watermarks	Internal sources, complex business logic, no off-the-shelf connector

✓Do

Persist a per-source watermark in a metadata table the pipeline owns
Use the run start time as the upper bound so the boundary is stable
Pull from a read replica when one exists, never from the primary write node
Track row counts pulled per run so a sudden drop is visible

✗Don't

Use the current wall clock as the upper bound mid-query (rows can shift)
Trust source-side updated_at columns without sampling them for skew
Pull at faster than 5 minute cadence without a managed CDC story (change data capture, covered in the intermediate tier)
Forget to handle the bootstrapping case where no watermark exists yet

Push from Queues and Webhooks

Daily Life

Interviews

Build a push-ingestion consumer that commits offsets safely and routes webhook events through a queue.

Push ingestion inverts the control flow. The source produces events at whatever rate suits it. The pipeline subscribes to those events and consumes them as they arrive. Kafka is the canonical push source for internal event streams. Webhooks are the canonical push source for SaaS vendors. Kinesis, Pub/Sub, and Pulsar fit the same shape. The pattern is fundamentally different from pull because the pipeline does not control the cadence.

Kafka and Friends

A Kafka topic is an append-only log. Producers write events to the end. Consumers read from a position they track called an offset. The broker retains events for some configured period (commonly 7 days). A consumer group coordinates multiple consumers reading the same topic so each event is processed exactly once across the group. The model is the foundation of most modern event-driven ingestion.

Concept	Plain Definition	Why It Matters
Topic	Named, partitioned, append-only log of events	The unit of subscription
Partition	An ordered subset of a topic; events within a partition are ordered	Parallelism and ordering boundary
Offset	Sequential position within a partition the consumer has read up to	How resume-after-crash works
Consumer group	Set of consumers cooperating to read a topic without overlap	Horizontal scaling without duplicating work
Retention	Time or size limit after which events are deleted	Determines how far back replay can go

A Minimal Kafka Consumer

	from kafka import KafkaConsumer

	consumer = KafkaConsumer(
	"order_events",
	bootstrap_servers=["kafka:9092"],
	group_id="raw_zone_writer",
	enable_auto_commit=False, # Commit only after successful write
	auto_offset_reset="earliest",
	)

	for message in consumer:
	write_to_raw_zone(message.value)
	consumer.commit()

The shape is straightforward. The consumer joins a group, the broker assigns it partitions, and the loop processes events as they arrive. The critical detail is the order of operations inside the loop: write first, then commit the offset. Reversing the order produces a class of bugs called message loss, where a crash between commit and write loses the in-flight event forever. Writing first and committing after creates the opposite class of bugs called duplicate delivery, which is the right class of bugs to have because they can be deduplicated downstream.

The two cardinal sins of push ingestion:

▸Committing before writing: a crash loses messages and the loss is silent
▸Holding messages in memory across many cycles: a restart loses the buffer
▸Treating the offset as a side effect: it is the contract with the broker, not a metric
▸Sharing one consumer group across environments: dev consumes prod events

Webhooks: Push for Vendors

Webhooks are the SaaS equivalent of Kafka. A vendor like Stripe or Shopify offers to call a URL on the consumer's side when an event of interest occurs. The consumer side runs a small HTTP server that receives the POST, verifies a signature, and writes the payload to a queue or directly to a raw zone. Webhooks are push: the vendor decides when to call. The pipeline reacts.

	from flask import Flask, request, abort
	import hmac, hashlib

	app = Flask(__name__)
	WEBHOOK_SECRET = b"..."

	@app.post("/stripe")
	def stripe_webhook():
	sig = request.headers.get("Stripe-Signature", "")
	expected = hmac.new(WEBHOOK_SECRET, request.data, hashlib.sha256).hexdigest()
	if not hmac.compare_digest(sig, expected):
	abort(401)
	enqueue_for_processing(request.json)
	return "", 204

The handler does three things: verifies the request really came from the vendor, enqueues the event onto a durable queue (SQS, Kafka, even a Postgres table), and returns 204 immediately. The actual processing happens asynchronously off the queue. Doing real work inside the webhook handler is a recipe for vendor-side timeouts and delivery failures. The vendor wants a fast acknowledgement; the pipeline wants durability. The queue between the two satisfies both.

•Pull Ingestion

Pipeline decides when to ask
State is a watermark in a metadata table
Cadence is a knob the pipeline controls
Failure mode: missed window, drift, slow source

•Push Ingestion

Source decides when to deliver
State is an offset or sequence number
Cadence is whatever the source produces
Failure mode: lost offset, slow consumer, runaway producer

When to Reach for Push Ingestion

Push ingestion is the right answer when downstream consumers care about freshness measured in seconds rather than minutes, when the source is already producing events to a broker, or when the data shape is naturally event-shaped (clicks, payments, sensor readings). It is the wrong answer when the source is a transactional database whose authoritative state is in tables; in that case CDC (covered in the intermediate tier) is the right way to bridge the database into a push stream.

Push ingestion is consumer-driven on the wire but source-driven in cadence.

Always write before committing the offset. The reverse loses messages silently.

Webhooks need a queue between the HTTP handler and the actual processing.

TIP

Treat the offset (or webhook delivery confirmation) as the most precious piece of state in a push pipeline. It is the only thing standing between the pipeline and lost events.

File Drop Ingestion

Daily Life

Interviews

Implement file drop ingestion with a manifest, sentinel handling, and defensive validation.

File drop ingestion is the lowest-tech and most common shape in cross-company integrations. A partner system writes a file to a known location at a known cadence, and the pipeline picks it up. The location is usually an S3 prefix, a GCS bucket, an Azure Blob container, or an SFTP directory. The cadence is whatever the partner promised, which is often whatever the partner happens to do. Senior engineers respect file drops because they handle the messy real world, and they distrust file drops because the messy real world keeps showing up in production.

The Mechanics of a File Drop

Step	What Happens	Where State Lives
Discovery	Pipeline lists the prefix to find new files	Manifest of already-processed file paths
Validation	Filename pattern, size, encoding, header row checked	Validation rules in pipeline config
Read	File is parsed into rows or records	Parser config (CSV dialect, schema, compression)
Land	Rows are written into the raw zone, partitioned by ingest date	Raw zone (S3 prefix or table)
Mark processed	File path added to the manifest so it is skipped next run	Manifest table or move-to-processed prefix

The Manifest Pattern

The single most important piece of state in file drop ingestion is the manifest: the record of which files have already been processed. Without it, the pipeline reprocesses every file every run. With a buggy one, the pipeline reprocesses some files twice (duplicates) or skips some forever (silent loss). Two implementations dominate. The first writes the path of every processed file to a table and consults it on every run. The second moves processed files to a separate prefix (raw/processed/) so the search prefix only contains unprocessed files.

	# Simulating a manifest-based file drop processor.
	processed_manifest = {"inventory/processed/2026-04-24-partA.csv"}

	inbox = [
	"inventory/inbox/2026-04-25-partA.csv",
	"inventory/inbox/2026-04-25-partB.csv",
	"inventory/inbox/2026-04-24-partA.csv", # already processed yesterday
	"inventory/inbox/.uploading-2026-04-25-partC.csv", # half-written sentinel
	]

	for key in inbox:
	if key.startswith("inventory/inbox/.uploading"):
	continue
	moved_key = key.replace("inbox/", "processed/")
	if moved_key in processed_manifest:
	continue
	print(f"Processing: {key}")
	processed_manifest.add(moved_key)

	print(f"Manifest size after run: {len(processed_manifest)}")

>>>Output

Processing: inventory/inbox/2026-04-25-partA.csv

Processing: inventory/inbox/2026-04-25-partB.csv

Manifest size after run: 3

The simulation walks the inbox, skips sentinels, consults the manifest, and processes only the unseen files. The atomic move is the manifest in production: lifting the file from inbox/ to processed/ records the work was done. A separate manifest table works equally well and is sometimes preferable when files cannot be moved (read-only buckets, partner-controlled keys).

Surprises File Drops Bring to Production

What partner file drops do that breaks pipelines:

▸Re-upload yesterday's file with the same name but different contents
▸Switch from UTF-8 to Windows-1252 in the middle of a quarter, no notice
▸Drop a file that is half-written when the pipeline starts reading
▸Add a column without notice (additive changes are usually fine)
▸Remove a column without notice (breaking changes that fail loudly is the better outcome)
▸Send a 0-byte sentinel file as a 'no records today' marker the pipeline does not recognize
▸Skip a daily drop entirely because their SFTP cron failed and nobody noticed

Each surprise is a story from someone's production. None are exotic. File drop ingestion has to be defensive about all of them. A two-line size check (skip files smaller than 100 bytes; alert if the file is more than 5x the historical average) catches most weirdness before the parser does. A header-row check catches schema drift. An encoding sniff catches the Windows-1252 surprise. The parser is not a place to be optimistic.

The Half-Written File Problem

S3 uploads are not atomic from the consumer's point of view by default. A partner writing a 2 GB file uploads in chunks; a pipeline that lists the prefix in the middle of the upload sees a partial file. The conventional fix is a sentinel: the partner uploads the data file as my_drop.csv.uploading and renames to my_drop.csv after the upload completes. The pipeline ignores .uploading files. Modern systems use S3 multipart upload completion events instead, but the sentinel pattern is older, simpler, and still common in SFTP and cross-cloud transfers.

Partner uploads : my_drop.csv.uploading(IN progress,  pipeline ignores) Partner finishes : renames to my_drop.csv(now visible to pipeline) Pipeline lists prefix : sees my_drop.csv, processes it, moves to processed /

The manifest (or move-to-processed) is the durable state that prevents duplicate processing.

Validate filename, size, and encoding before parsing. The parser is not the validator.

Use a sentinel pattern (.uploading suffix) to avoid reading half-written files.

TIP

Assume every partner file drop will at some point be a 0-byte sentinel, an encoding flip, a re-uploaded duplicate, and a half-written file. Defensive parsing is not pessimism; it is realism.

API Ingestion: The Bug Magnet

Daily Life

Interviews

Build an API ingestion job that paginates, respects rate limits, and tolerates the failure modes vendors create.

API ingestion is what a SaaS connector does. Stripe, Salesforce, HubSpot, Google Ads, Shopify, and a long tail of vendors expose REST or GraphQL endpoints that return paginated results. Pipelines call those endpoints, paginate, and write the results to a raw zone. The pattern is conceptually identical to pull ingestion from a database, but the operational profile is dramatically worse because every concern is now mediated by HTTP and a vendor-controlled rate limit.

The Four Concerns of API Ingestion

PaginationRate limitsAuthReliability

Pagination

Walking the result set

Page numbers, cursors, or time ranges. Stop when the cursor is null or the next page is empty.

Rate limits

Vendor cap on requests

Per-second, per-minute, or per-hour budgets. Smooth rate; back off on 429; do not burst.

Auth

Tokens that expire

OAuth refresh, API keys, signed requests. Refresh proactively, before mid-pagination expiry.

Reliability

5xx and timeouts from vendors

Retry with jitter and exponential backoff. Vendor outages outlast the pipeline's patience by default.

Concern	What It Is	Typical Mistake
Pagination	Walking through pages of results to retrieve the full set	Stopping at the first empty page without checking the cursor
Rate limits	Vendor cap on requests per second, minute, or hour	Burning the budget in five minutes and being throttled all hour
Auth	OAuth tokens, API keys, refresh flows	Letting a token expire mid-pagination with no refresh path
Reliability	Vendor returns 5xx, 429, or simply hangs	Retrying without backoff and amplifying the vendor's outage

Pagination Styles

Three pagination styles cover almost every API. Page-based pagination uses page=1, page=2 query parameters and is the easiest to reason about; it suffers from drift if records are added during pagination. Cursor-based pagination returns an opaque cursor in each response that the next request submits; this is the strongest pattern because the cursor encodes the position. Time-based pagination uses since and until parameters; it works for append-only feeds and breaks for sources where records can be updated.

•Page-Based Pagination

page=1, page=2, page=3 query parameters
Drift risk: new records during pagination shift the page contents
Easy to reason about, easy to retry a single page
Common in older REST APIs

✓Cursor-Based Pagination

next_cursor returned in each response, submitted on next request
Stable across additions because cursor encodes a precise position
Harder to retry mid-pagination without storing the cursor history
Standard in modern APIs (Stripe, Shopify, GraphQL connections)

A Paginated, Rate-Limited Pull

	# A simulation of a paginated, rate-limited API pull. The mock 'api' returns three pages.
	import time

	def api_call(cursor):
	pages = {
	None: {"items": [{"id": 1}, {"id": 2}], "next_cursor": "c2"},
	"c2": {"items": [{"id": 3}, {"id": 4}], "next_cursor": "c3"},
	"c3": {"items": [{"id": 5}], "next_cursor": None},
	}
	return pages[cursor]

	MIN_INTERVAL_SEC = 0.05 # respect a hypothetical rate limit
	last_call = 0.0
	cursor = None
	all_items = []

	while True:
	elapsed = time.monotonic() - last_call
	if elapsed < MIN_INTERVAL_SEC:
	time.sleep(MIN_INTERVAL_SEC - elapsed)
	response = api_call(cursor)
	last_call = time.monotonic()
	all_items.extend(response["items"])
	cursor = response["next_cursor"]
	if cursor is None:
	break

	print(f"Pulled {len(all_items)} items across paginated calls")
	print(f"Items: {[i['id'] for i in all_items]}")

>>>Output

Pulled 5 items across paginated calls

Items: [1, 2, 3, 4, 5]

The loop has the rhythm of every production API client. Wait long enough to respect the rate limit. Make the call. Append items. Advance the cursor. Stop when the cursor is null. Real implementations add retry-with-backoff on 429 and 5xx, refresh tokens before they expire, parallelize across non-overlapping cursor ranges where the API allows it, and snapshot intermediate state so a crash mid-pagination does not lose progress.

Rate Limits and Why They Bite

Rate limit failure modes that show up in real pipelines:

▸Burning the daily quota in the first hour and being locked out for the rest of the day
▸Triggering a global vendor-side circuit breaker because retries amplified a transient 5xx
▸Sharing the rate limit across multiple internal teams without coordination
▸Hitting a per-endpoint limit nobody documented because it is implementation-specific
▸Building bursty traffic at the start of the hour that the vendor rate-limits as a single hour-window

The fix is unglamorous. Build a rate-limit-aware client that respects the published budget, smooths the request rate across the window rather than bursting, backs off exponentially on 429 with jitter, and refreshes tokens proactively rather than reactively. None of this is hard, and all of it is annoying enough that off-the-shelf connectors (Airbyte, Fivetran, Stitch) earn their cost by handling these details across hundreds of vendors so each individual team does not have to.

Why API Ingestion Is the Bug Magnet

API ingestion concentrates more failure surface than any other shape. The vendor controls the rate limit and changes it without notice. The vendor changes the schema without notice. The vendor returns 5xx during their incidents and the pipeline must tolerate them. The vendor rotates auth credentials and pipelines that did not implement refresh flows break overnight. The vendor adds new pagination semantics in v3 of the API and the v2 client silently misses records. None of this is the pipeline's fault, and all of it lands in the pipeline's on-call rotation.

API Surprise	Real Example	Mitigation
Rate limit halved overnight	Twitter API in 2023	Adaptive backoff; quota monitoring
Schema field type change	Stripe occasionally widens enum sets	Schema-on-read; permissive parsing
Pagination cursor expiration	Salesforce bulk API cursors expire after 24h	Bound run length below cursor TTL
Endpoint deprecation	Google Ads sunsets v14 every year	Track API version pinning per source

Three of the four ingestion shapes have failure modes the pipeline can prevent. API ingestion has failure modes the pipeline can only absorb. The defensive posture compounds with every API source added.

✓Do

Use cursor-based pagination when the API offers it
Smooth request rate across the window; do not burst
Persist intermediate cursors so a mid-pagination crash does not restart from page one
Refresh OAuth tokens proactively, before they expire

✗Don't

Retry 429 responses without exponential backoff and jitter
Assume the API schema is fixed; assume it is changing right now
Hold a paginated cursor in memory across a long-running run
Build a custom connector for a commodity SaaS source when off-the-shelf works

❯❯❯PUTTING IT ALL TOGETHER

> A new data engineer joins a company that already has Postgres operational data, a Kafka clickstream, a partner sending hourly inventory CSVs to S3, and a Stripe billing source. The team lead asks: 'Sketch the ingestion layer. Where does each source land, what state does each ingestion job track, and what is the most likely first incident on each one?'

Postgres maps to pull ingestion. State tracked is the high-water mark per table (likely updated_at). First incident is watermark drift caused by source-side clock skew or untouched updated_at columns.

Kafka clickstream maps to push ingestion. State tracked is the consumer group offset. First incident is lost messages caused by committing offsets before writing to the raw zone.

Hourly inventory CSV maps to file drop ingestion. State tracked is the manifest of processed files. First incident is a half-written file processed mid-upload, or a partner re-uploading the same filename with new contents.

Stripe billing maps to API ingestion. State tracked is the pagination cursor and the OAuth token expiration. First incident is rate-limit exhaustion or a token refresh failure mid-pagination. All four shapes land in one raw zone, partitioned by source and ingest date, so downstream transforms see a uniform interface.

KEY TAKEAWAYS

Four shapes cover ingestion: pull, push, file drop, API. Each forces a different piece of state and a different failure profile.

Pull ingestion runs on a watermark: the lower bound is the previous run, the upper bound is the run start time, and the order is the watermark column.

Push ingestion writes before committing: the offset is the contract with the broker. Commit-before-write loses messages on crash.

File drops need a manifest and a sentinel: the manifest prevents duplicate processing; the sentinel pattern prevents reading a half-written file.

API ingestion concentrates failure surface: rate limits, pagination, auth, and vendor reliability. Off-the-shelf connectors earn their cost on commodity sources.

Data enters the pipeline through four shapes; the shape decides every later concern

Category: Pipeline Architecture
Difficulty: beginner
Duration: 25 minutes
Challenges: 0 hands-on challenges

Topics covered: The Four Shapes of Ingestion, Pull Ingestion from a Database, Push from Queues and Webhooks, File Drop Ingestion, API Ingestion: The Bug Magnet

Lesson Sections

The Four Shapes of Ingestion (concepts: paBatchVsStreaming)
Every byte that enters a pipeline arrives through one of four shapes. The shape is determined by who initiates the transfer and what kind of artifact is being transferred. Naming the four shapes is the first move because every later concern, from scheduling to error handling to schema validation, is shaped by the choice. A pipeline that ingests from a Postgres database is structurally different from a pipeline that consumes from a Kafka topic, and pretending the difference does not matter is the
Pull Ingestion from a Database (concepts: paFullVsIncremental)
Pull ingestion is the workhorse of analytical data engineering. A scheduled job opens a JDBC or SQLAlchemy connection to an operational database, runs a SELECT, writes the results to a raw landing zone, and exits. The pattern is decades old, well understood, and still the right answer for a large fraction of ingestion problems. The mechanics are simple. The trap is that simple mechanics tempt engineers to ignore the questions that decide whether the simple mechanics are correct. The Two Flavors:
Push from Queues and Webhooks (concepts: paStreamProcessing)
Push ingestion inverts the control flow. The source produces events at whatever rate suits it. The pipeline subscribes to those events and consumes them as they arrive. Kafka is the canonical push source for internal event streams. Webhooks are the canonical push source for SaaS vendors. Kinesis, Pub/Sub, and Pulsar fit the same shape. The pattern is fundamentally different from pull because the pipeline does not control the cadence. Kafka and Friends A Kafka topic is an append-only log. Produce
File Drop Ingestion (concepts: paFileIngestion)
File drop ingestion is the lowest-tech and most common shape in cross-company integrations. A partner system writes a file to a known location at a known cadence, and the pipeline picks it up. The location is usually an S3 prefix, a GCS bucket, an Azure Blob container, or an SFTP directory. The cadence is whatever the partner promised, which is often whatever the partner happens to do. Senior engineers respect file drops because they handle the messy real world, and they distrust file drops beca
API Ingestion: The Bug Magnet (concepts: paApiIngestion)
API ingestion is what a SaaS connector does. Stripe, Salesforce, HubSpot, Google Ads, Shopify, and a long tail of vendors expose REST or GraphQL endpoints that return paginated results. Pipelines call those endpoints, paginate, and write the results to a raw zone. The pattern is conceptually identical to pull ingestion from a database, but the operational profile is dramatically worse because every concern is now mediated by HTTP and a vendor-controlled rate limit. The Four Concerns of API Inges