Ingestion Patterns: Beginner

Data enters the pipeline through four shapes; the shape decides every later concern

Category: Pipeline Architecture
Difficulty: beginner
Duration: 25 minutes
Challenges: 0 hands-on challenges

Topics covered: The Four Shapes of Ingestion, Pull Ingestion from a Database, Push from Queues and Webhooks, File Drop Ingestion, API Ingestion: The Bug Magnet

Lesson Sections

  1. The Four Shapes of Ingestion (concepts: paIngestionShapes)

    Every byte that enters a pipeline arrives through one of four shapes. The shape is determined by who initiates the transfer and what kind of artifact is being transferred. Naming the four shapes is the first move because every later concern, from scheduling to error handling to schema validation, is shaped by the choice. A pipeline that ingests from a Postgres database is structurally different from a pipeline that consumes from a Kafka topic, and pretending the difference does not matter is the

  2. Pull Ingestion from a Database (concepts: paPullIngestion, paFullVsIncremental)

    Pull ingestion is the workhorse of analytical data engineering. A scheduled job opens a JDBC or SQLAlchemy connection to an operational database, runs a SELECT, writes the results to a raw landing zone, and exits. The pattern is decades old, well understood, and still the right answer for a large fraction of ingestion problems. The mechanics are simple. The trap is that simple mechanics tempt engineers to ignore the questions that decide whether the simple mechanics are correct. The Two Flavors:
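    As a rough illustration of the scheduled-SELECT pattern described above, the sketch below contrasts a full pull with an incremental pull. It uses an in-memory SQLite database as a stand-in for the operational source. The `orders` table, its `updated_at` column, and the watermark approach are assumptions made for the example, not details from the lesson.

```python
import sqlite3


def make_source():
    """Build an in-memory SQLite database standing in for the operational source."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table; a real source would already exist.
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [
            (1, 10.0, "2024-01-01T00:00:00"),
            (2, 25.5, "2024-01-02T00:00:00"),
            (3, 7.25, "2024-01-03T00:00:00"),
        ],
    )
    conn.commit()
    return conn


def full_pull(conn):
    """Full pull: read every row on every run."""
    return conn.execute(
        "SELECT id, amount, updated_at FROM orders ORDER BY id"
    ).fetchall()


def incremental_pull(conn, watermark):
    """Incremental pull: read only rows changed after the last watermark.

    Returns the rows and the new watermark to persist for the next run.
    Assumes updated_at is an ISO-8601 string, so string order is time order.
    """
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at, id",
        (watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark
```

    In this sketch the caller is responsible for storing the returned watermark between runs; where that state lives is left open.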

  3. Push from Queues and Webhooks (concepts: paPushIngestion, paEventPlatforms)

    Push ingestion inverts the control flow. The source produces events at whatever rate suits it. The pipeline subscribes to those events and consumes them as they arrive. Kafka is the canonical push source for internal event streams. Webhooks are the canonical push source for SaaS vendors. Kinesis, Pub/Sub, and Pulsar fit the same shape. The pattern is fundamentally different from pull because the pipeline does not control the cadence.

    Kafka and Friends

    A Kafka topic is an append-only log. Produce
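    To make the append-only log idea concrete, here is a toy in-memory model. It is not the Kafka client API; `AppendOnlyLog` and `Consumer` are hypothetical names invented for this sketch. It only shows the shape: producers append at their own pace, and a consumer tracks its position with an offset.

```python
class AppendOnlyLog:
    """Toy stand-in for a topic partition: records are only ever appended."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record to the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset, max_records=100):
        """Return up to max_records records starting at the given offset."""
        return self._records[offset:offset + max_records]


class Consumer:
    """Toy consumer that remembers how far it has read."""

    def __init__(self, log):
        self.log = log
        self.offset = 0  # next offset to read

    def poll(self, max_records=100):
        """Fetch whatever has arrived since the last poll and advance the offset."""
        batch = self.log.read_from(self.offset, max_records)
        self.offset += len(batch)
        return batch
```

    The consumer never asks the producer for data on a schedule; it only reads what has already been appended, which is the inversion of control the section describes.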

  4. File Drop Ingestion (concepts: paFileIngestion)

    File drop ingestion is the lowest-tech and most common shape in cross-company integrations. A partner system writes a file to a known location at a known cadence, and the pipeline picks it up. The location is usually an S3 prefix, a GCS bucket, an Azure Blob container, or an SFTP directory. The cadence is whatever the partner promised, which is often whatever the partner happens to do. Senior engineers respect file drops because they handle the messy real world, and they distrust file drops beca
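    A minimal sketch of the pick-up step, assuming the drop location is a local directory and that a file name identifies a delivery. Both are simplifications chosen for the example; `pick_up_new_files` and the `processed` set are hypothetical, and a real pipeline would persist that state somewhere durable.

```python
import tempfile
from pathlib import Path


def pick_up_new_files(drop_dir, processed):
    """Return files in drop_dir not seen before, and mark them as processed.

    `processed` is a set of file names handled by earlier runs.
    """
    new_files = sorted(
        p for p in Path(drop_dir).iterdir()
        if p.is_file() and p.name not in processed
    )
    processed.update(p.name for p in new_files)
    return new_files


if __name__ == "__main__":
    # Simulate a partner dropping one file between two pipeline runs.
    with tempfile.TemporaryDirectory() as drop_dir:
        seen = set()
        Path(drop_dir, "orders_day1.csv").write_text("id,amount\n1,10.0\n")
        print([p.name for p in pick_up_new_files(drop_dir, seen)])
        print([p.name for p in pick_up_new_files(drop_dir, seen)])
```

    Tracking by name alone would miss a partner overwriting a file in place; the sketch ignores that case.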

  5. API Ingestion: The Bug Magnet (concepts: paApiIngestion)

    API ingestion is what a SaaS connector does. Stripe, Salesforce, HubSpot, Google Ads, Shopify, and a long tail of vendors expose REST or GraphQL endpoints that return paginated results. Pipelines call those endpoints, paginate, and write the results to a raw zone. The pattern is conceptually identical to pull ingestion from a database, but the operational profile is dramatically worse because every concern is now mediated by HTTP and a vendor-controlled rate limit. The Four Concerns of API Inges
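    The call-and-paginate loop with a vendor rate limit might look roughly like the sketch below. It does not target any real vendor API: `fetch_page` is a hypothetical callable that takes a cursor and returns `(items, next_cursor)`, and `RateLimited` is a hypothetical exception a client wrapper would raise on an HTTP 429 response. Exponential backoff is one common choice here, not something the lesson prescribes.

```python
import time


class RateLimited(Exception):
    """Hypothetical error raised by a client wrapper on an HTTP 429 response."""


def fetch_all(fetch_page, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Follow cursors until the last page, backing off when rate limited.

    fetch_page(cursor) must return (items, next_cursor); next_cursor is None
    on the final page. `sleep` is injectable so tests need not actually wait.
    """
    results = []
    cursor = None  # None requests the first page
    while True:
        for attempt in range(max_retries + 1):
            try:
                items, next_cursor = fetch_page(cursor)
                break
            except RateLimited:
                if attempt == max_retries:
                    raise  # give up after the last allowed retry
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
        results.extend(items)
        if next_cursor is None:
            return results
        cursor = next_cursor
```

    Passing the page fetcher in as a function keeps the pagination and retry logic separate from the HTTP details, which vary by vendor.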