The Ingestion Layer
Concepts covered: Change Data Capture (CDC)
When the interviewer asks how data gets into your pipeline, they are probing for a real choice, not a list. The three source patterns (file drops, API pulls, and Change Data Capture) have different reliability, latency, and cost profiles. Name which one you chose and why before they have to ask.
File-Based Ingestion
The simplest and most common pattern: a source system drops files (CSV, JSON, Parquet) into cloud storage (S3, GCS, ADLS). Your pipeline picks them up on a schedule. This is the default for vendor data feeds, data exports from SaaS tools, and any system where you don't control the source. It's batch by nature: latency is measured in hours, not seconds.
API-Based Ingestion
When the source is a SaaS API (Salesforce, Stripe, HubSpot), you pull data via REST or GraphQL endpoints
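To make the file-drop pattern described above concrete, here is a minimal sketch of one scheduled pickup run. It rests on several assumptions that are not part of the lesson: a local directory stands in for the cloud storage bucket, only CSV and JSON files are handled, and a hypothetical JSON manifest of already-ingested file names is used so that a rerun does not load the same file twice. The function names (read_records, ingest_new_files) are illustrative only.

```python
import csv
import json
from pathlib import Path

SUPPORTED = {".csv", ".json"}  # assumption: Parquet is left out to stay stdlib-only


def read_records(path: Path) -> list[dict]:
    """Parse one dropped file into a list of row dicts."""
    if path.suffix == ".csv":
        with path.open(newline="") as f:
            return list(csv.DictReader(f))
    if path.suffix == ".json":
        # assumption: a JSON drop is a single array of objects
        return json.loads(path.read_text())
    raise ValueError(f"unsupported file type: {path.name}")


def ingest_new_files(landing_dir: Path, manifest_path: Path) -> dict[str, int]:
    """One scheduled run: load every supported file not yet in the manifest.

    Returns {file name: row count} for the files picked up in this run.
    """
    # Hypothetical manifest: a JSON list of file names ingested by earlier runs.
    if manifest_path.exists():
        seen = set(json.loads(manifest_path.read_text()))
    else:
        seen = set()

    loaded: dict[str, int] = {}
    for path in sorted(landing_dir.iterdir()):
        if not path.is_file() or path.suffix not in SUPPORTED:
            continue
        if path.name in seen:
            continue  # already ingested; skipping keeps reruns idempotent
        rows = read_records(path)
        loaded[path.name] = len(rows)  # a real pipeline would write rows downstream
        seen.add(path.name)

    manifest_path.write_text(json.dumps(sorted(seen)))
    return loaded
```

A scheduler (cron, Airflow, or similar) would call ingest_new_files once per interval, which is where the hours-level latency comes from. Keep the manifest outside the landing directory so it is not mistaken for a data file.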
About This Interactive Section
This section is part of the Design a Pipeline: Intermediate lesson on DataDriven, a free data engineering interview prep platform.