Ingestion Patterns: Beginner

An e-commerce company at Series C scale ran twelve ingestion jobs against twelve different sources. Eight pulled from operational databases on a 15-minute cadence. Two consumed from Kafka. One read CSV files dropped hourly into an SFTP server by a logistics partner. One polled a marketing automation API. Over the course of one quarter, every late-night page traced back to ingestion: a Postgres replica that fell behind, a Kafka consumer that lost its offset, an SFTP partner that switched encodings without warning, an API vendor that quietly halved its rate limit. The transforms downstream were boring and reliable. The ingestion layer absorbed every external surprise the company could not control. Ingestion is the boundary where the pipeline meets the world, and the world does not care about the pipeline.

The Four Shapes of Ingestion

Recognize the four ingestion shapes and pick the right one for a new source.

Every byte that enters a pipeline arrives through one of four shapes. The shape is determined by who initiates the transfer and what kind of artifact is being transferred. Naming the four shapes is the first move because every later concern, from scheduling to error handling to schema validation, is shaped by the choice. A pipeline that ingests from a Postgres database is structurally different from a pipeline that consumes from a Kafka topic, and pretending the difference does not matter is the most common reason ingestion code rots.

The Four Shapes Side by Side

| Shape | Who Initiates | Typical Source | Typical Cadence |
| --- | --- | --- | --- |
| Pull | The pipeline (on a schedule) | Operational database, internal data store | Every N minutes or hours |
| Push | The source (continuously) | Kafka topic, Kinesis stream, webhook callback | Continuous; events arrive when produced |
| File drop | The source (on its own schedule) | S3 prefix, SFTP directory, partner upload | Hourly or daily, often unpredictable |
| API | The pipeline (with rate limits) | Stripe, Salesforce, HubSpot, internal REST | Polled on a schedule, paginated |

Two dimensions sort the shapes. The first dimension is who decides when the data moves: the pipeline pulling on a schedule, or the source pushing as data is produced. The second dimension is whether the data is delivered as discrete events or as a bulk artifact like a row set or a file. Pull and API ingestion are pipeline-initiated; push and file drop ingestion are source-initiated. Push delivers events; the others typically deliver row sets or files. The four shapes cover almost every real ingestion case.
  • Pull: pipeline initiates, source responds. The pipeline runs a query against a database every N minutes. The source has no idea when it will be queried. Typical mechanics: JDBC, SELECT ... WHERE updated_at > X.
  • Push: source initiates, pipeline reacts. Kafka, Kinesis, webhooks (HTTP callbacks the source fires when something happens). The source emits events; the pipeline subscribes. Cadence is whatever the source produces, not what the pipeline asked for.
  • File drop: source places a file in a known location. Partners and batch upstream systems write CSV, JSON, or Parquet to S3 or SFTP on their own schedule. The pipeline watches for new files.
  • API: pipeline calls an HTTP endpoint. Stripe, Salesforce, HubSpot. The pipeline polls REST or GraphQL endpoints, paginates results, and respects rate limits.

Why the Shape Matters

Every operational concern downstream depends on the ingestion shape. Pull ingestion has to track what was pulled last so the next pull is incremental. Push ingestion has to track an offset or a sequence number so consumption can resume after a crash. File drop ingestion has to remember which files have been processed so a partner re-uploading does not cause duplicate processing. API ingestion has to handle pagination, rate limits, and the occasional 503 from a vendor the pipeline does not control. The shapes do not have a winner; they have different costs.

| Shape | Main State to Track | Most Common Failure |
| --- | --- | --- |
| Pull | Last successful watermark (timestamp or sequence id) | Watermark drift: rows that updated outside the bound get missed |
| Push | Consumer group offset or stream cursor | Offset reset: consumer rewinds and reprocesses or skips |
| File drop | Set of file paths already processed | Same filename re-uploaded with different contents |
| API | Pagination cursor, rate limit budget, auth token expiry | Vendor returns 429 or 5xx mid-pagination |

The four ingestion shapes summed up:
  • Pull is on the pipeline's schedule, with a watermark to know how far the last run got
  • Push is on the source's schedule, with an offset to know how far the consumer got
  • File drops are on the source's schedule, with a manifest of what has been processed
  • API ingestion is pipeline-initiated but rate-limited by the vendor; pagination plus retries plus auth

Mixing Shapes in One Pipeline

A real pipeline almost always uses more than one shape. A typical analytical pipeline might pull customer dimensions from a Postgres replica nightly, consume click events from Kafka continuously, accept a daily file drop of partner inventory from an S3 prefix, and poll the Stripe API every fifteen minutes for new charges. Each source forces the shape that fits it; the pipeline accommodates all four. Trying to force every source into the same ingestion pattern is a category error and is the reason in-house ingestion code becomes unmaintainable.
```
Postgres customers    Kafka clickstream    S3 partner inventory    Stripe API charges
                                      |
                                      v
                            transforms downstream
```

Four sources, four shapes, one raw zone. The raw zone exists precisely so that every downstream transform sees a uniform interface, regardless of how the data arrived.

  • Pull and API ingestion are pipeline-initiated; the pipeline decides when to ask.
  • Push and file drop ingestion are source-initiated; the source decides when to deliver.
  • Each shape forces a different piece of state to be tracked: watermark, offset, manifest, or pagination cursor.
TIP: Before writing any ingestion code, name which of the four shapes the source forces. The shape determines half the design.
Figure: The shapes of ingestion: pull (query a source DB on a schedule), push (consume a queue or webhook), and file drop (pick up files from object storage). Each passes through the ingestion layer and lands in the raw layer.

Pull Ingestion from a Database

Implement a pull-ingestion job with a high-water mark and choose between full and incremental loads.

Pull ingestion is the workhorse of analytical data engineering. A scheduled job opens a JDBC or SQLAlchemy connection to an operational database, runs a SELECT, writes the results to a raw landing zone, and exits. The pattern is decades old, well understood, and still the right answer for a large fraction of ingestion problems. The mechanics are simple. The trap is that simple mechanics tempt engineers to ignore the questions that decide whether the simple mechanics are correct.

The Two Flavors: Full and Incremental

Full Load
  • Reads every row in the source table on every run
  • Simplest to implement and reason about
  • Cost scales with table size, not change rate
  • Acceptable for small dimensions; quickly painful for fact tables
Incremental Load
  • Reads only rows changed since the last successful run
  • Requires a reliable change-tracking column (updated_at, modified_at, sequence id)
  • Cost scales with change rate, not table size
  • Standard for any source larger than a few million rows
The choice between full and incremental is the first design decision in pull ingestion. A 10,000-row reference table is easy: read it whole every time, replace the destination, move on. A 10-billion-row events table is impossible to read whole every run. Somewhere between those extremes the pipeline must learn to read only what changed, and the mechanism for that is the high-water mark.

The High-Water Mark

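A high-water mark is a single value the pipeline persists between runs. Its job is to record how far the last successful run got, so the next run can pick up from there. The most common high-water mark is a timestamp column on the source table. The next run queries WHERE updated_at >= last_watermark, bounded above by the run start time. After a successful run, the watermark advances to the maximum updated_at seen during the run. The inclusive lower bound re-reads any row sitting exactly on the watermark, which produces a duplicate that can be removed downstream rather than a row that is silently missed.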
```sql
SELECT
    order_id,
    customer_id,
    total_cents,
    updated_at
FROM orders
WHERE updated_at >= :last_watermark
  AND updated_at < :this_run_started_at
ORDER BY updated_at;
```
The query has three pieces. The lower bound is the previous watermark. The upper bound is the start of this run, fixed at run launch so the boundary is stable even if the query takes ten minutes. The ORDER BY is what makes the boundary meaningful: rows are pulled in watermark order, and the new watermark is the maximum value seen. Skipping any of these three pieces is how watermark drift creeps in.
What can go wrong with a watermark:
  • Source clock skew: rows committed with timestamps slightly in the past relative to the watermark
  • Long-running transactions: a row updated at 2:01am may not be visible to a query at 2:05am
  • Updates without a touched updated_at: triggers or batch jobs that bypass the column
  • Boundary resolution: integer-second watermarks miss rows updated within the same second
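
The boundary-resolution failure is easy to reproduce on paper. The sketch below compares a strict `>` lower bound with an inclusive `>=` bound when two rows share the same whole-second timestamp. The row data and the `pull` helper are made up for illustration and are not part of any real library.

```python
import datetime as dt

# Hypothetical source rows; updated_at is truncated to whole seconds.
T = dt.datetime(2026, 4, 25, 1, 0, 0)
rows_visible_at_run_1 = [{"order_id": 1, "updated_at": T}]
# A second row lands in the same second, after run 1 has already read the table.
rows_visible_at_run_2 = rows_visible_at_run_1 + [{"order_id": 2, "updated_at": T}]


def pull(rows, watermark, inclusive):
    """Return ids of rows at-or-after (inclusive) or strictly after the watermark."""
    if inclusive:
        return [r["order_id"] for r in rows if r["updated_at"] >= watermark]
    return [r["order_id"] for r in rows if r["updated_at"] > watermark]


# Run 1 saw order 1 and advanced the watermark to its updated_at.
watermark = max(r["updated_at"] for r in rows_visible_at_run_1)

strict = pull(rows_visible_at_run_2, watermark, inclusive=False)
inclusive = pull(rows_visible_at_run_2, watermark, inclusive=True)

print(f"strict >     : {strict}")     # [] - order 2 is never pulled
print(f"inclusive >= : {inclusive}")  # [1, 2] - order 1 is re-pulled, order 2 is caught
```

The inclusive bound trades a missed row for a duplicate, and a duplicate can be removed downstream by keying on `order_id`.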

A Pull Job in Python, End to End

```python
# A minimal pull-ingestion script. Reads watermark, runs query, writes file, advances watermark.
import datetime


# Pretend the watermark store returns the last watermark for this source.
def read_watermark(source):
    return datetime.datetime(2026, 4, 24, 23, 0, 0)


# Pretend the source has these rows since the last watermark.
def pull_rows_since(watermark, run_started_at):
    rows = [
        {"order_id": 1001, "updated_at": datetime.datetime(2026, 4, 24, 23, 15)},
        {"order_id": 1002, "updated_at": datetime.datetime(2026, 4, 25, 0, 32)},
        {"order_id": 1003, "updated_at": datetime.datetime(2026, 4, 25, 1, 11)},
    ]
    return [r for r in rows if r["updated_at"] >= watermark and r["updated_at"] < run_started_at]


run_started_at = datetime.datetime(2026, 4, 25, 2, 0, 0)
watermark = read_watermark("orders")
rows = pull_rows_since(watermark, run_started_at)

print(f"Pulled {len(rows)} rows since {watermark}")
new_watermark = max((r["updated_at"] for r in rows), default=watermark)
print(f"New watermark: {new_watermark}")
```

Output:

```
Pulled 3 rows since 2026-04-24 23:00:00
New watermark: 2026-04-25 01:11:00
```
The script has all the moving parts of a real pull-ingestion job in about twenty lines. It reads the watermark from a state store, runs the bounded query, processes the rows, and computes the new watermark, which a real job would persist. A production version replaces the in-memory functions with a JDBC connection and a metadata table, adds error handling, batches the writes to the raw zone, and emits metrics. The skeleton does not change.
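
To make the persistence step concrete, here is a minimal sketch that keeps the watermark in a metadata table and handles the bootstrapping case where no watermark exists yet. It uses an in-memory SQLite database standing in for both the source and the metadata store. The table name `ingest_watermarks`, the `EPOCH` bootstrap value, and the `run_pull` helper are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the source and the metadata store
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, updated_at TEXT);
    CREATE TABLE ingest_watermarks (source TEXT PRIMARY KEY, watermark TEXT);
    INSERT INTO orders VALUES
        (1001, '2026-04-24 23:15:00'),
        (1002, '2026-04-25 00:32:00'),
        (1003, '2026-04-25 02:30:00');
""")

EPOCH = "1970-01-01 00:00:00"  # hypothetical bootstrap value when no watermark exists


def run_pull(conn, source, run_started_at):
    row = conn.execute(
        "SELECT watermark FROM ingest_watermarks WHERE source = ?", (source,)
    ).fetchone()
    watermark = row[0] if row else EPOCH  # first run: read everything up to the bound
    rows = conn.execute(
        "SELECT order_id, updated_at FROM orders "
        "WHERE updated_at >= ? AND updated_at < ? ORDER BY updated_at",
        (watermark, run_started_at),
    ).fetchall()
    # A real job would write rows to the raw zone here, before advancing the watermark.
    new_watermark = max((r[1] for r in rows), default=watermark)
    with conn:  # commit the new watermark only after the write has succeeded
        conn.execute(
            "INSERT OR REPLACE INTO ingest_watermarks (source, watermark) VALUES (?, ?)",
            (source, new_watermark),
        )
    return rows, new_watermark


first_rows, first_wm = run_pull(conn, "orders", "2026-04-25 02:00:00")
second_rows, second_wm = run_pull(conn, "orders", "2026-04-25 03:00:00")
print([r[0] for r in first_rows], first_wm)    # [1001, 1002] 2026-04-25 00:32:00
print([r[0] for r in second_rows], second_wm)  # [1002, 1003] 2026-04-25 02:30:00
```

Order 1002 appears in both runs because the lower bound is inclusive; the raw zone or the first downstream transform is expected to deduplicate on the key.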

Production Pull Ingestion Tools

| Tool | What It Does | When It Fits |
| --- | --- | --- |
| Airbyte | Open-source connectors for hundreds of sources; manages watermarks per stream | Standard SaaS connectors; small-to-medium teams |
| Fivetran | Managed JDBC and SaaS ingestion with automatic schema drift handling | Buy-not-build for non-differentiated sources |
| Singer / Meltano | Open-source taps and targets; tap-postgres pulls incrementally with bookmarks | When the catalog of needed connectors is small and customizable |
| Custom Python or Spark | Hand-rolled extract job with a metadata table for watermarks | Internal sources, complex business logic, no off-the-shelf connector |

Do
  • Persist a per-source watermark in a metadata table the pipeline owns
  • Use the run start time as the upper bound so the boundary is stable
  • Pull from a read replica when one exists, never from the primary write node
  • Track row counts pulled per run so a sudden drop is visible
Don't
  • Use the current wall clock as the upper bound mid-query (rows can shift)
  • Trust source-side updated_at columns without sampling them for skew
  • Pull faster than a 5-minute cadence without a managed CDC story (change data capture, covered in the intermediate tier)
  • Forget to handle the bootstrapping case where no watermark exists yet

Push from Queues and Webhooks

Build a push-ingestion consumer that commits offsets safely and routes webhook events through a queue.

Push ingestion inverts the control flow. The source produces events at whatever rate suits it. The pipeline subscribes to those events and consumes them as they arrive. Kafka is the canonical push source for internal event streams. Webhooks are the canonical push source for SaaS vendors. Kinesis, Pub/Sub, and Pulsar fit the same shape. The pattern is fundamentally different from pull because the pipeline does not control the cadence.

Kafka and Friends

A Kafka topic is an append-only log. Producers write events to the end. Consumers read from a position they track called an offset. The broker retains events for some configured period (commonly 7 days). A consumer group coordinates multiple consumers reading the same topic so each partition is read by only one member of the group at a time, and each event is handled by one consumer rather than by all of them. Delivery is still at-least-once unless the consumer takes extra care, which is why commit order matters. The model is the foundation of most modern event-driven ingestion.

| Concept | Plain Definition | Why It Matters |
| --- | --- | --- |
| Topic | Named, partitioned, append-only log of events | The unit of subscription |
| Partition | An ordered subset of a topic; events within a partition are ordered | Parallelism and ordering boundary |
| Offset | Sequential position within a partition the consumer has read up to | How resume-after-crash works |
| Consumer group | Set of consumers cooperating to read a topic without overlap | Horizontal scaling without duplicating work |
| Retention | Time or size limit after which events are deleted | Determines how far back replay can go |

A Minimal Kafka Consumer

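```python
from kafka import KafkaConsumer  # kafka-python client


def write_to_raw_zone(payload: bytes) -> None:
    """Placeholder: durably write one event to the raw zone."""
    raise NotImplementedError


consumer = KafkaConsumer(
    "order_events",
    bootstrap_servers=["kafka:9092"],
    group_id="raw_zone_writer",
    enable_auto_commit=False,  # Commit only after successful write
    auto_offset_reset="earliest",
)

for message in consumer:
    write_to_raw_zone(message.value)  # write first...
    consumer.commit()                 # ...then commit the offset
```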
The shape is straightforward. The consumer joins a group, the broker assigns it partitions, and the loop processes events as they arrive. The critical detail is the order of operations inside the loop: write first, then commit the offset. Reversing the order produces a class of bugs called message loss, where a crash between commit and write loses the in-flight event forever. Writing first and committing after creates the opposite class of bugs called duplicate delivery, which is the right class of bugs to have because they can be deduplicated downstream.
The cardinal sins of push ingestion:
  • Committing before writing: a crash loses messages and the loss is silent
  • Holding messages in memory across many cycles: a restart loses the buffer
  • Treating the offset as a side effect: it is the contract with the broker, not a metric
  • Sharing one consumer group across environments: dev consumes prod events
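
The difference between the two orderings can be shown without a broker. The following simulation is a sketch under simplifying assumptions: a single consumer, a list standing in for the partition, one crash injected at a chosen offset, and a restart that resumes from the last committed offset. The `consume` function is hypothetical and exists only for this illustration.

```python
def consume(events, commit_first, crash_at):
    """Simulate a consumer that crashes once, restarts, and resumes from the committed offset."""
    raw_zone = []
    committed = 0      # next offset to read after a restart
    crashed = False
    while committed < len(events):
        offset = committed
        try:
            if commit_first:
                committed = offset + 1
                if offset == crash_at and not crashed:
                    crashed = True
                    raise RuntimeError("crash between commit and write")
                raw_zone.append(events[offset])
            else:
                raw_zone.append(events[offset])
                if offset == crash_at and not crashed:
                    crashed = True
                    raise RuntimeError("crash between write and commit")
                committed = offset + 1
        except RuntimeError:
            continue  # restart: the loop resumes from the last committed offset
    return raw_zone


events = ["e0", "e1", "e2", "e3"]
lossy = consume(events, commit_first=True, crash_at=2)
duplicated = consume(events, commit_first=False, crash_at=2)

print(lossy)       # ['e0', 'e1', 'e3'] - e2 is gone and nothing records the loss
print(duplicated)  # ['e0', 'e1', 'e2', 'e2', 'e3'] - e2 arrives twice
```

The duplicated run still contains every event, so a downstream deduplication step can recover the exact stream. The lossy run cannot be repaired from the raw zone alone.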

Webhooks: Push for Vendors

Webhooks are the SaaS equivalent of Kafka. A vendor like Stripe or Shopify offers to call a URL on the consumer's side when an event of interest occurs. The consumer side runs a small HTTP server that receives the POST, verifies a signature, and writes the payload to a queue or directly to a raw zone. Webhooks are push: the vendor decides when to call. The pipeline reacts.
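```python
from flask import Flask, request, abort
import hashlib
import hmac

app = Flask(__name__)
WEBHOOK_SECRET = b"..."


def enqueue_for_processing(event):
    """Placeholder: put the event on a durable queue (SQS, Kafka, a table)."""
    raise NotImplementedError


@app.post("/stripe")
def stripe_webhook():
    # Stripe-Signature has the form "t=<timestamp>,v1=<hex signature>[,...]".
    header = request.headers.get("Stripe-Signature", "")
    parts = dict(p.split("=", 1) for p in header.split(",") if "=" in p)
    timestamp, sig = parts.get("t", ""), parts.get("v1", "")
    # The signed payload is "<timestamp>.<raw request body>".
    signed_payload = timestamp.encode() + b"." + request.get_data()
    expected = hmac.new(WEBHOOK_SECRET, signed_payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        abort(401)
    enqueue_for_processing(request.get_json())
    return "", 204
```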
The handler does three things: verifies the request really came from the vendor, enqueues the event onto a durable queue (SQS, Kafka, even a Postgres table), and returns 204 immediately. The actual processing happens asynchronously off the queue. Doing real work inside the webhook handler is a recipe for vendor-side timeouts and delivery failures. The vendor wants a fast acknowledgement; the pipeline wants durability. The queue between the two satisfies both.
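
One way to picture the queue between handler and processing is a table with a uniqueness constraint. The sketch below uses an in-memory SQLite table as a stand-in for a durable queue. It assumes each event carries a vendor-assigned `id` field that can serve as a dedupe key; the table name `webhook_queue` and the `drain_queue` worker are hypothetical.

```python
import json
import sqlite3

queue_db = sqlite3.connect(":memory:")  # a real deployment would use a durable store
queue_db.execute(
    "CREATE TABLE webhook_queue ("
    " event_id TEXT PRIMARY KEY,"  # assumed vendor-assigned id, used as dedupe key
    " payload TEXT NOT NULL,"
    " processed INTEGER NOT NULL DEFAULT 0)"
)


def enqueue_for_processing(event):
    """Called by the HTTP handler: store the event and return quickly."""
    with queue_db:
        queue_db.execute(
            "INSERT OR IGNORE INTO webhook_queue (event_id, payload) VALUES (?, ?)",
            (event["id"], json.dumps(event)),
        )


def drain_queue(handle):
    """Called by a separate worker: process pending events, then mark them done."""
    pending = queue_db.execute(
        "SELECT event_id, payload FROM webhook_queue WHERE processed = 0 ORDER BY rowid"
    ).fetchall()
    for event_id, payload in pending:
        handle(json.loads(payload))  # do the real work first...
        with queue_db:               # ...then record that it was done
            queue_db.execute(
                "UPDATE webhook_queue SET processed = 1 WHERE event_id = ?", (event_id,)
            )
    return len(pending)


enqueue_for_processing({"id": "evt_1", "type": "charge.succeeded"})
enqueue_for_processing({"id": "evt_1", "type": "charge.succeeded"})  # vendor retry
enqueue_for_processing({"id": "evt_2", "type": "charge.refunded"})

seen = []
processed_count = drain_queue(seen.append)
print(processed_count, [e["id"] for e in seen])  # 2 ['evt_1', 'evt_2']
```

Because the handler only inserts a row, it can acknowledge in milliseconds, and a vendor retry of the same event collapses into a no-op.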
Pull Ingestion
  • Pipeline decides when to ask
  • State is a watermark in a metadata table
  • Cadence is a knob the pipeline controls
  • Failure mode: missed window, drift, slow source
Push Ingestion
  • Source decides when to deliver
  • State is an offset or sequence number
  • Cadence is whatever the source produces
  • Failure mode: lost offset, slow consumer, runaway producer

When to Reach for Push Ingestion

Push ingestion is the right answer when downstream consumers care about freshness measured in seconds rather than minutes, when the source is already producing events to a broker, or when the data shape is naturally event-shaped (clicks, payments, sensor readings). It is the wrong answer when the source is a transactional database whose authoritative state is in tables; in that case CDC (covered in the intermediate tier) is the right way to bridge the database into a push stream.
  • Push ingestion is consumer-driven on the wire but source-driven in cadence.
  • Always write before committing the offset. The reverse loses messages silently.
  • Webhooks need a queue between the HTTP handler and the actual processing.
TIP: Treat the offset (or webhook delivery confirmation) as the most precious piece of state in a push pipeline. It is the only thing standing between the pipeline and lost events.

File Drop Ingestion

Implement file drop ingestion with a manifest, sentinel handling, and defensive validation.

File drop ingestion is the lowest-tech and most common shape in cross-company integrations. A partner system writes a file to a known location at a known cadence, and the pipeline picks it up. The location is usually an S3 prefix, a GCS bucket, an Azure Blob container, or an SFTP directory. The cadence is whatever the partner promised, which is often whatever the partner happens to do. Senior engineers respect file drops because they handle the messy real world, and they distrust file drops because the messy real world keeps showing up in production.

The Mechanics of a File Drop

| Step | What Happens | Where State Lives |
| --- | --- | --- |
| Discovery | Pipeline lists the prefix to find new files | Manifest of already-processed file paths |
| Validation | Filename pattern, size, encoding, header row checked | Validation rules in pipeline config |
| Read | File is parsed into rows or records | Parser config (CSV dialect, schema, compression) |
| Land | Rows are written into the raw zone, partitioned by ingest date | Raw zone (S3 prefix or table) |
| Mark processed | File path added to the manifest so it is skipped next run | Manifest table or move-to-processed prefix |

The Manifest Pattern

The single most important piece of state in file drop ingestion is the manifest: the record of which files have already been processed. Without it, the pipeline reprocesses every file every run. With a buggy one, the pipeline reprocesses some files twice (duplicates) or skips some forever (silent loss). Two implementations dominate. The first writes the path of every processed file to a table and consults it on every run. The second moves processed files to a separate prefix (raw/processed/) so the search prefix only contains unprocessed files.
```python
# Simulating a manifest-based file drop processor.
processed_manifest = {"inventory/processed/2026-04-24-partA.csv"}

inbox = [
    "inventory/inbox/2026-04-25-partA.csv",
    "inventory/inbox/2026-04-25-partB.csv",
    "inventory/inbox/2026-04-24-partA.csv",  # already processed yesterday
    "inventory/inbox/.uploading-2026-04-25-partC.csv",  # half-written sentinel
]

for key in inbox:
    if key.startswith("inventory/inbox/.uploading"):
        continue
    moved_key = key.replace("inbox/", "processed/")
    if moved_key in processed_manifest:
        continue
    print(f"Processing: {key}")
    processed_manifest.add(moved_key)

print(f"Manifest size after run: {len(processed_manifest)}")
```

Output:

```
Processing: inventory/inbox/2026-04-25-partA.csv
Processing: inventory/inbox/2026-04-25-partB.csv
Manifest size after run: 3
```
The simulation walks the inbox, skips sentinels, consults the manifest, and processes only the unseen files. In production the move itself can serve as the manifest: lifting the file from inbox/ to processed/ records that the work was done. On object stores where a move is a copy followed by a delete, the job has to tolerate a crash between the two steps. A separate manifest table works equally well and is sometimes preferable when files cannot be moved (read-only buckets, partner-controlled keys).
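
A manifest keyed only on the path cannot tell when a partner re-uploads a file under the same name with different contents. One possible extension, sketched here as an assumption rather than a standard recipe, is to store a content hash next to each path. The `manifest` dictionary and the `should_process` helper are hypothetical names.

```python
import hashlib

# Hypothetical manifest: file path -> SHA-256 of the contents last processed.
manifest = {}


def should_process(path, contents):
    """Return True when the path is new or its contents changed since last time."""
    digest = hashlib.sha256(contents).hexdigest()
    if manifest.get(path) == digest:
        return False
    # Simplification: a real job would record the digest only after landing the rows.
    manifest[path] = digest
    return True


decisions = [
    should_process("inventory/2026-04-24-partA.csv", b"sku,qty\nA1,5\n"),
    should_process("inventory/2026-04-24-partA.csv", b"sku,qty\nA1,5\n"),  # same bytes again
    should_process("inventory/2026-04-24-partA.csv", b"sku,qty\nA1,7\n"),  # re-upload, new contents
]
print(decisions)  # [True, False, True]
```

Whether a changed re-upload should replace or append to the earlier load is a business decision; the hash only makes the change visible.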

Surprises File Drops Bring to Production

What partner file drops do that breaks pipelines:
  • Re-upload yesterday's file with the same name but different contents
  • Switch from UTF-8 to Windows-1252 in the middle of a quarter, no notice
  • Drop a file that is half-written when the pipeline starts reading
  • Add a column without notice (additive changes are usually fine)
  • Remove a column without notice (a breaking change that fails loudly is the better outcome)
  • Send a 0-byte sentinel file as a 'no records today' marker the pipeline does not recognize
  • Skip a daily drop entirely because their SFTP cron failed and nobody noticed
Each surprise is a story from someone's production. None are exotic. File drop ingestion has to be defensive about all of them. A two-line size check (skip files smaller than 100 bytes; alert if the file is more than 5x the historical average) catches most weirdness before the parser does. A header-row check catches schema drift. An encoding sniff catches the Windows-1252 surprise. The parser is not a place to be optimistic.
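
The three checks described above (size, header row, encoding) can be sketched as one function that runs before any rows are trusted. The expected header, the 100-byte floor, and the `validate_drop` helper are illustrative assumptions; real thresholds would come from the partner contract and the file's history.

```python
import csv
import io

EXPECTED_HEADER = ["sku", "warehouse", "quantity"]  # hypothetical partner schema
MIN_BYTES = 100


def validate_drop(raw, historical_avg_bytes):
    """Return (rows, problems); rows is None when the file should not be parsed."""
    problems = []
    if len(raw) < MIN_BYTES:
        return None, [f"too small ({len(raw)} bytes); skipping"]
    if len(raw) > 5 * historical_avg_bytes:
        problems.append("more than 5x the historical average size")
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to Windows-1252, but flag it so someone notices the switch.
        text = raw.decode("cp1252", errors="replace")
        problems.append("not valid UTF-8; decoded as Windows-1252")
    reader = csv.reader(io.StringIO(text))
    header = next(reader, [])
    if header != EXPECTED_HEADER:
        return None, problems + [f"unexpected header: {header}"]
    return list(reader), problems


# A file the partner silently started sending in Windows-1252.
payload = ("sku,warehouse,quantity\n" + "A1,caf\u00e9-east,5\n" * 10).encode("cp1252")
rows, problems = validate_drop(payload, historical_avg_bytes=200)
empty_rows, empty_problems = validate_drop(b"", historical_avg_bytes=200)

print(len(rows), problems)
print(empty_rows, empty_problems)
```

Returning the list of problems alongside the rows lets the job land what it can while still alerting on anything unusual.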

The Half-Written File Problem

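File uploads are not always atomic from the consumer's point of view. On SFTP and many shared file systems, a partner writing a 2 GB file streams it in chunks; a pipeline that lists the directory in the middle of the upload sees a partial file. S3 itself exposes an object only once the PUT or multipart upload has completed, but a drop made of several files, or a transfer staged through another system, can still be observed half-finished. The conventional fix is a sentinel: the partner uploads the data file as my_drop.csv.uploading and renames to my_drop.csv after the upload completes. The pipeline ignores .uploading files. On S3, pipelines can instead react to object-created event notifications, which fire once the upload has completed, but the sentinel pattern is older, simpler, and still common in SFTP and cross-cloud transfers.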
```
Partner uploads:        my_drop.csv.uploading   (in progress, pipeline ignores)
Partner finishes:       renames to my_drop.csv  (now visible to pipeline)
Pipeline lists prefix:  sees my_drop.csv, processes it, moves to processed/
```
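
The same sequence can be acted out on a local directory. This sketch uses a temporary directory in place of the shared drop location and `os.replace` for the rename; the `list_ready_files` helper is a hypothetical name, and an SFTP or object-store client would supply its own rename call.

```python
import os
import tempfile

inbox = tempfile.mkdtemp()  # stands in for the shared drop directory


def list_ready_files(directory):
    """Pipeline side: ignore anything still carrying the .uploading suffix."""
    return sorted(f for f in os.listdir(directory) if not f.endswith(".uploading"))


# Partner side: write under the sentinel name first.
tmp_path = os.path.join(inbox, "my_drop.csv.uploading")
with open(tmp_path, "w", encoding="utf-8") as f:
    f.write("sku,qty\n")
    during_upload = list_ready_files(inbox)  # the pipeline happens to list mid-upload
    f.write("A1,5\n")

# Rename only after the file is complete and closed.
os.replace(tmp_path, os.path.join(inbox, "my_drop.csv"))
after_upload = list_ready_files(inbox)

print(during_upload)  # []
print(after_upload)   # ['my_drop.csv']
```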
  • The manifest (or move-to-processed) is the durable state that prevents duplicate processing.
  • Validate filename, size, and encoding before parsing. The parser is not the validator.
  • Use a sentinel pattern (.uploading suffix) to avoid reading half-written files.
TIP: Assume every partner file drop will at some point be a 0-byte sentinel, an encoding flip, a re-uploaded duplicate, and a half-written file. Defensive parsing is not pessimism; it is realism.

API Ingestion: The Bug Magnet

Build an API ingestion job that paginates, respects rate limits, and tolerates the failure modes vendors create.

API ingestion is what a SaaS connector does. Stripe, Salesforce, HubSpot, Google Ads, Shopify, and a long tail of vendors expose REST or GraphQL endpoints that return paginated results. Pipelines call those endpoints, paginate, and write the results to a raw zone. The pattern is conceptually identical to pull ingestion from a database, but the operational profile is dramatically worse because every concern is now mediated by HTTP and a vendor-controlled rate limit.

The Four Concerns of API Ingestion

  • Pagination: walking the result set. Page numbers, cursors, or time ranges. Stop when the cursor is null or the next page is empty.
  • Rate limits: vendor cap on requests. Per-second, per-minute, or per-hour budgets. Smooth rate; back off on 429; do not burst.
  • Auth: tokens that expire. OAuth refresh, API keys, signed requests. Refresh proactively, before mid-pagination expiry.
  • Reliability: 5xx and timeouts from vendors. Retry with jitter and exponential backoff. Vendor outages outlast the pipeline's patience by default.

| Concern | What It Is | Typical Mistake |
| --- | --- | --- |
| Pagination | Walking through pages of results to retrieve the full set | Stopping at the first empty page without checking the cursor |
| Rate limits | Vendor cap on requests per second, minute, or hour | Burning the budget in five minutes and being throttled all hour |
| Auth | OAuth tokens, API keys, refresh flows | Letting a token expire mid-pagination with no refresh path |
| Reliability | Vendor returns 5xx, 429, or simply hangs | Retrying without backoff and amplifying the vendor's outage |

Pagination Styles

Three pagination styles cover almost every API. Page-based pagination uses page=1, page=2 query parameters and is the easiest to reason about; it suffers from drift if records are added during pagination. Cursor-based pagination returns an opaque cursor in each response that the next request submits; this is the strongest pattern because the cursor encodes the position. Time-based pagination uses since and until parameters; it works for append-only feeds and breaks for sources where records can be updated.
Page-Based Pagination
  • page=1, page=2, page=3 query parameters
  • Drift risk: new records during pagination shift the page contents
  • Easy to reason about, easy to retry a single page
  • Common in older REST APIs
Cursor-Based Pagination
  • next_cursor returned in each response, submitted on next request
  • Stable across additions because cursor encodes a precise position
  • Harder to retry mid-pagination without storing the cursor history
  • Standard in modern APIs (Stripe, Shopify, GraphQL connections)
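
The drift risk is easiest to see with a tiny in-memory feed. In this sketch both endpoints are hypothetical: `fetch_page` slices by page number, `fetch_after` returns records older than a cursor id, and records are listed newest first.

```python
def fetch_page(records, page, page_size=2):
    """Hypothetical page-based endpoint: newest records first."""
    start = (page - 1) * page_size
    return records[start:start + page_size]


def fetch_after(records, cursor, page_size=2):
    """Hypothetical cursor-based endpoint: records with ids below the cursor."""
    older = [r for r in records if cursor is None or r < cursor]
    return older[:page_size]


records = [5, 4, 3, 2, 1]  # ids, newest first

page_1 = fetch_page(records, 1)            # [5, 4]
cursor_page_1 = fetch_after(records, None)  # [5, 4]

records.insert(0, 6)  # a new record arrives mid-pagination

page_2 = fetch_page(records, 2)                         # window shifted by one
cursor_page_2 = fetch_after(records, cursor_page_1[-1])  # resumes strictly after id 4

print(page_1 + page_2)                # [5, 4, 4, 3] - id 4 is read twice
print(cursor_page_1 + cursor_page_2)  # [5, 4, 3, 2] - no overlap
```

An insertion produces a duplicate under page-based pagination; a deletion during pagination would shift the window the other way and skip a record instead.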

A Paginated, Rate-Limited Pull

```python
# A simulation of a paginated, rate-limited API pull. The mock 'api' returns three pages.
import time


def api_call(cursor):
    pages = {
        None: {"items": [{"id": 1}, {"id": 2}], "next_cursor": "c2"},
        "c2": {"items": [{"id": 3}, {"id": 4}], "next_cursor": "c3"},
        "c3": {"items": [{"id": 5}], "next_cursor": None},
    }
    return pages[cursor]


MIN_INTERVAL_SEC = 0.05  # respect a hypothetical rate limit
last_call = 0.0
cursor = None
all_items = []

while True:
    elapsed = time.monotonic() - last_call
    if elapsed < MIN_INTERVAL_SEC:
        time.sleep(MIN_INTERVAL_SEC - elapsed)
    response = api_call(cursor)
    last_call = time.monotonic()
    all_items.extend(response["items"])
    cursor = response["next_cursor"]
    if cursor is None:
        break

print(f"Pulled {len(all_items)} items across paginated calls")
print(f"Items: {[i['id'] for i in all_items]}")
```

Output:

```
Pulled 5 items across paginated calls
Items: [1, 2, 3, 4, 5]
```
The loop has the rhythm of every production API client. Wait long enough to respect the rate limit. Make the call. Append items. Advance the cursor. Stop when the cursor is null. Real implementations add retry-with-backoff on 429 and 5xx, refresh tokens before they expire, parallelize across non-overlapping cursor ranges where the API allows it, and snapshot intermediate state so a crash mid-pagination does not lose progress.
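
Snapshotting intermediate state can be as small as writing the next cursor to a file after each page is landed. The sketch below reuses the mock page layout from the simulation and injects one crash; the checkpoint file location and the `load_cursor`, `save_cursor`, and `run` helpers are assumptions for illustration.

```python
import json
import os
import tempfile

PAGES = {  # same mock API shape as the simulation above
    None: {"items": [1, 2], "next_cursor": "c2"},
    "c2": {"items": [3, 4], "next_cursor": "c3"},
    "c3": {"items": [5], "next_cursor": None},
}
CHECKPOINT = os.path.join(tempfile.mkdtemp(), "cursor.json")  # hypothetical state file


def load_cursor():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["cursor"]
    return None


def save_cursor(cursor):
    with open(CHECKPOINT, "w") as f:
        json.dump({"cursor": cursor}, f)


def run(landed, crash_after_pages=None):
    cursor, pages_done = load_cursor(), 0
    while True:
        response = PAGES[cursor]
        landed.extend(response["items"])  # land the page first
        cursor = response["next_cursor"]
        if cursor is None:
            if os.path.exists(CHECKPOINT):
                os.remove(CHECKPOINT)  # pagination finished; clear the checkpoint
            return
        save_cursor(cursor)  # then checkpoint where the next call should start
        pages_done += 1
        if pages_done == crash_after_pages:
            raise RuntimeError("simulated crash")


landed = []
try:
    run(landed, crash_after_pages=1)
except RuntimeError:
    pass
resumed_from = load_cursor()
run(landed)  # restarts from the checkpoint, not from page one

print(resumed_from, landed)  # c2 [1, 2, 3, 4, 5]
```

The ordering mirrors the write-then-commit rule for offsets: a crash between landing a page and saving the cursor re-pulls one page rather than skipping it.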

Rate Limits and Why They Bite

Rate limit failure modes that show up in real pipelines:
  • Burning the daily quota in the first hour and being locked out for the rest of the day
  • Triggering a global vendor-side circuit breaker because retries amplified a transient 5xx
  • Sharing the rate limit across multiple internal teams without coordination
  • Hitting a per-endpoint limit nobody documented because it is implementation-specific
  • Bursting at the start of the hour against a vendor that meters requests per hour-long window
The fix is unglamorous. Build a rate-limit-aware client that respects the published budget, smooths the request rate across the window rather than bursting, backs off exponentially on 429 with jitter, and refreshes tokens proactively rather than reactively. None of this is hard, and all of it is annoying enough that off-the-shelf connectors (Airbyte, Fivetran, Stitch) earn their cost by handling these details across hundreds of vendors so each individual team does not have to.
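
A minimal sketch of the backoff piece follows, assuming a vendor client that returns a status code and a body. `FlakyAPI` is a made-up stand-in that fails twice before succeeding, and `call_with_backoff` uses the "full jitter" variant, where each sleep is a random amount up to an exponentially growing cap. The delays are kept tiny so the example runs quickly.

```python
import random
import time


class FlakyAPI:
    """Hypothetical vendor that returns 429, then 503, then succeeds."""

    def __init__(self):
        self.statuses = [429, 503, 200]
        self.calls = 0

    def get(self):
        status = self.statuses[min(self.calls, len(self.statuses) - 1)]
        self.calls += 1
        body = {"items": [1, 2, 3]} if status == 200 else None
        return status, body


RETRYABLE = {429, 500, 502, 503, 504}


def call_with_backoff(api, max_attempts=5, base_delay=0.01, max_delay=1.0):
    delays = []
    for attempt in range(max_attempts):
        status, body = api.get()
        if status == 200:
            return body, delays
        if status not in RETRYABLE:
            raise RuntimeError(f"non-retryable status {status}")
        # Full jitter: sleep a random amount up to the exponential cap.
        cap = min(max_delay, base_delay * (2 ** attempt))
        delay = random.uniform(0, cap)
        delays.append(delay)
        time.sleep(delay)
    raise RuntimeError("retries exhausted")


api = FlakyAPI()
body, delays = call_with_backoff(api)
print(api.calls, body)
```

The jitter keeps many clients from retrying in lockstep after a shared failure. A production client would typically also honor a Retry-After header when the vendor sends one.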

Why API Ingestion Is the Bug Magnet

API ingestion concentrates more failure surface than any other shape. The vendor controls the rate limit and changes it without notice. The vendor changes the schema without notice. The vendor returns 5xx during their incidents and the pipeline must tolerate them. The vendor rotates auth credentials and pipelines that did not implement refresh flows break overnight. The vendor adds new pagination semantics in v3 of the API and the v2 client silently misses records. None of this is the pipeline's fault, and all of it lands in the pipeline's on-call rotation.

| API Surprise | Real Example | Mitigation |
| --- | --- | --- |
| Rate limit halved overnight | Twitter API in 2023 | Adaptive backoff; quota monitoring |
| Schema field type change | Stripe occasionally widens enum sets | Schema-on-read; permissive parsing |
| Pagination cursor expiration | Salesforce bulk API cursors expire after 24h | Bound run length below cursor TTL |
| Endpoint deprecation | Google Ads regularly sunsets older API versions (for example, v14) | Track API version pinning per source |

Three of the four ingestion shapes have failure modes the pipeline can prevent. API ingestion has failure modes the pipeline can only absorb. The defensive posture compounds with every API source added.

Do
  • Use cursor-based pagination when the API offers it
  • Smooth request rate across the window; do not burst
  • Persist intermediate cursors so a mid-pagination crash does not restart from page one
  • Refresh OAuth tokens proactively, before they expire
Don't
  • Retry 429 responses without exponential backoff and jitter
  • Assume the API schema is fixed; assume it is changing right now
  • Hold a pagination cursor only in memory across a long run
  • Build a custom connector for a commodity SaaS source when off-the-shelf works
PUTTING IT ALL TOGETHER

> A new data engineer joins a company that already has Postgres operational data, a Kafka clickstream, a partner sending hourly inventory CSVs to S3, and a Stripe billing source. The team lead asks: 'Sketch the ingestion layer. Where does each source land, what state does each ingestion job track, and what is the most likely first incident on each one?'

Postgres maps to pull ingestion. State tracked is the high-water mark per table (likely updated_at). First incident is watermark drift caused by source-side clock skew or untouched updated_at columns.
Kafka clickstream maps to push ingestion. State tracked is the consumer group offset. First incident is lost messages caused by committing offsets before writing to the raw zone.
Hourly inventory CSV maps to file drop ingestion. State tracked is the manifest of processed files. First incident is a half-written file processed mid-upload, or a partner re-uploading the same filename with new contents.
Stripe billing maps to API ingestion. State tracked is the pagination cursor and the state of the API credential (key rotation or token expiry). First incident is rate-limit exhaustion or a credential failure mid-pagination.
All four shapes land in one raw zone, partitioned by source and ingest date, so downstream transforms see a uniform interface.
KEY TAKEAWAYS
Four shapes cover ingestion: pull, push, file drop, API. Each forces a different piece of state and a different failure profile.
Pull ingestion runs on a watermark: the lower bound is the previous watermark, the upper bound is the run start time, and the order is the watermark column.
Push ingestion writes before committing: the offset is the contract with the broker. Commit-before-write loses messages on crash.
File drops need a manifest and a sentinel: the manifest prevents duplicate processing; the sentinel pattern prevents reading a half-written file.
API ingestion concentrates failure surface: rate limits, pagination, auth, and vendor reliability. Off-the-shelf connectors earn their cost on commodity sources.

Data enters the pipeline through four shapes; the shape decides every later concern

Category: Pipeline Architecture
Difficulty: beginner
Duration: 25 minutes
Challenges: 0 hands-on challenges

Topics covered: The Four Shapes of Ingestion, Pull Ingestion from a Database, Push from Queues and Webhooks, File Drop Ingestion, API Ingestion: The Bug Magnet
