How Data Moves: Beginner

'Walk me through how data moves in your pipeline.' This question opens 80% of pipeline architecture interviews. The interviewer is not looking for a textbook definition. They want to see if you can explain batch vs streaming, file vs API ingestion, and most importantly, defend your choice of one over the other. Here is exactly how to answer.

Batch Processing

Answer batch processing questions with confidence

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "How would you process yesterday's data?"
  • "Walk me through a daily ETL job."
  • "When would you NOT use streaming?"

What They Want to Hear

'Batch processing collects data over a period and processes it all at once on a schedule. I would use it here because the consumers check dashboards once a day, so paying for real-time freshness would be waste.' That is the answer. Batch = scheduled, chunked, predictable. Say it in one sentence, then prove you understand the tradeoffs.

What to Whiteboard

Source Database (accumulates records all day)
  → [scheduled trigger] Extract (read all new rows since last run)
  → [full dataset] Transform (clean, deduplicate, aggregate)
  → [write results] Data Warehouse (dashboard-ready tables)

The Curveball Follow-ups

After your initial answer, expect these probes

  • "What if the job takes longer than the schedule interval?" You need job-level locking or a queue so that a new run cannot start while the previous one is still going. Two overlapping runs can double-process rows or corrupt the output.
  • "What about data that changes after your batch runs?" That is the freshness tradeoff. If it matters, consider micro-batch (5-minute windows) before jumping to streaming.
  • "How do you handle the job failing halfway?" Idempotent writes. Use MERGE (upsert) or overwrite the whole partition so that re-running the job produces the same result.
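The idempotent-write answer can be sketched in a few lines. This is a minimal illustration, assuming an in-memory dict stands in for a partitioned warehouse table; `run_daily_batch`, `warehouse`, and the row fields are hypothetical names, not taken from any specific tool.

```python
from collections import defaultdict


def run_daily_batch(source_rows, run_date, warehouse):
    """Rebuild one date partition from scratch (hypothetical helper).

    The partition for run_date is replaced, never appended to,
    so re-running after a mid-job failure yields the same result.
    """
    # Extract: only the rows that belong to this run's date.
    rows = [r for r in source_rows if r["order_date"] == run_date]

    # Transform: deduplicate on order_id (last write wins), then aggregate.
    unique = {r["order_id"]: r for r in rows}
    totals = defaultdict(float)
    for r in unique.values():
        totals[r["region"]] += r["amount"]

    # Load: overwrite the whole partition in one assignment.
    warehouse[run_date] = dict(totals)
    return warehouse[run_date]


source = [
    {"order_id": 1, "order_date": "2024-01-01", "region": "EU", "amount": 10.0},
    {"order_id": 1, "order_date": "2024-01-01", "region": "EU", "amount": 10.0},  # duplicate
    {"order_id": 2, "order_date": "2024-01-01", "region": "US", "amount": 5.0},
    {"order_id": 3, "order_date": "2024-01-02", "region": "US", "amount": 7.0},
]
warehouse = {}
first = run_daily_batch(source, "2024-01-01", warehouse)
second = run_daily_batch(source, "2024-01-01", warehouse)  # simulated retry
```

Because the load step replaces the partition instead of appending, the retry leaves the warehouse in the same state as a single clean run. An append-based load would have doubled the totals.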

KEY TAKEAWAYS

  • Say: 'Batch collects and processes data on a schedule. It is the default unless real-time freshness has clear business value.'
  • The one tradeoff that matters: batch is simpler and cheaper, but data is always slightly stale.
  • Always ask the interviewer: 'How fresh does the consumer actually need the data?'

Stream Processing


Answer streaming questions and justify the cost

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "What if we need real-time data?"
  • "How would you handle events as they happen?"
  • "Tell me about Kafka."

What They Want to Hear

'Streaming processes each event as it arrives, continuously. I would use it when stale data directly costs money or breaks the user experience, like fraud detection or live driver tracking.' Then name the tools: Kafka is the message broker that holds events. Flink or Spark Structured Streaming is the processor that acts on them. You do not need to know how to configure these tools. You need to know what role each plays.
When to Say Streaming
  • Blocking a fraudulent transaction in real-time
  • Updating a driver's location on a map
  • Triggering an alert when a metric spikes
  • Sub-second freshness directly prevents revenue loss
When to Say Batch
  • Daily sales dashboard nobody checks until 9 AM
  • ML model that retrains weekly
  • Monthly compliance reports
  • Data consumers tolerate hours of staleness

The Tools to Name-Drop

Tool | Role | One-Liner for Interviews
Apache Kafka | Message broker | Holds events in durable topics; producers write, consumers read
Apache Flink | Stream processor | Processes streams with exactly-once guarantees
Spark Structured Streaming | Micro-batch processor | Processes in small time windows; simpler than true streaming
Amazon Kinesis | Managed streaming | Kafka-like but fully managed on AWS

The Curveball Follow-ups

After your initial answer, expect these probes

  • "Streaming is more complex. What makes it harder?" Ordering (events arrive out of sequence), state management (tracking across events), and failure recovery (what if the processor crashes mid-stream).
  • "How much more does streaming cost?" As a rule of thumb, 3-5x more than batch for the same throughput, though the exact multiple depends on the workload. Compute runs 24/7 instead of spinning up and shutting down.
  • "Can you combine batch and streaming?" Yes. Micro-batch (Spark Structured Streaming) processes in small windows. Lambda architecture runs both in parallel.
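To make the difference between per-event processing and windowed (micro-batch) processing concrete, here is a toy sketch in plain Python. It assumes events are dicts with `card` and `ts` (seconds) fields and uses a made-up rule of three transactions within 60 seconds; `VelocityCheck` and `tumbling_windows` are hypothetical names. It does not model Kafka, Flink, or Spark, only the roles they play.

```python
from collections import defaultdict, deque


class VelocityCheck:
    """Toy stateful stream processor (illustrative rule, not a real fraud model)."""

    def __init__(self, max_events=3, window_seconds=60):
        self.max_events = max_events
        self.window_seconds = window_seconds
        # State: recent event timestamps per card.
        self.recent = defaultdict(deque)

    def process(self, event):
        """Handle one event as it arrives; return True if it should be flagged."""
        times = self.recent[event["card"]]
        times.append(event["ts"])
        # Evict timestamps that fell out of the window.
        while times and times[0] <= event["ts"] - self.window_seconds:
            times.popleft()
        return len(times) >= self.max_events


def tumbling_windows(events, size_seconds):
    """Micro-batch view: group events into fixed windows by event time."""
    windows = defaultdict(list)
    for e in events:
        windows[e["ts"] // size_seconds * size_seconds].append(e)
    return dict(windows)


events = [
    {"card": "A", "ts": 0},
    {"card": "A", "ts": 10},
    {"card": "B", "ts": 15},
    {"card": "A", "ts": 20},
    {"card": "A", "ts": 200},
]
checker = VelocityCheck()
flags = [checker.process(e) for e in events]
windows = tumbling_windows(events, 60)
```

The per-event path can flag the third rapid transaction on card A the moment it arrives, while the windowed path only sees it once the window is collected. Note that this sketch assumes events arrive in timestamp order and keeps state in memory, which are exactly the ordering, state, and recovery problems that make real streaming harder.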

KEY TAKEAWAYS

  • Say: 'Streaming processes events as they arrive. Use it when stale data costs money.'
  • Kafka = message broker, Flink = stream processor. Know the distinction.
  • Streaming is 3-5x more expensive than batch. Always justify the cost with a business reason.

File Ingestion


Answer file ingestion questions with production awareness

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "How does data get into your pipeline?"
  • "A vendor sends us CSV files daily..."
  • "What file formats do you work with?"

What They Want to Hear

'Files are the most common ingestion method. A vendor drops a CSV on S3, an event notification triggers the pipeline, and we validate before ingesting.' That is the baseline answer. Then show depth by naming formats and their tradeoffs: CSV is universal but slow at scale. Parquet is columnar and compressed, and can be 10-30x faster for analytics queries that read only a few columns. JSON is flexible but verbose.

What to Whiteboard

External Source (vendor, partner, internal system)
  → [file drop] Landing Zone (S3 bucket or SFTP folder)
  → [event trigger] Validate (schema check, row count, checksums)
  → [passed checks] Ingest to Pipeline (move to processing layer)

Format Cheat Sheet

Format | Say This in the Interview
CSV | Universal but no schema enforcement, no types, terrible at scale
JSON | Flexible for nested data but verbose and slow to parse
Parquet | Columnar, compressed, predicate pushdown. The answer to 'why Parquet?'
Avro | Schema embedded in the file, good for streaming with schema registry

The Curveball Follow-ups

After your initial answer, expect these probes

  • "The vendor sends bad data. How do you handle it?" Validate in the landing zone before ingesting. Check row counts, schema match, and file completeness. Reject and alert on failure.
  • "How do you handle duplicate file deliveries?" Idempotent ingestion: track file names or checksums, skip files already processed.
  • "Why not just use an API instead of files?" Files are simpler, more reliable for large volumes, and universally supported. APIs are better for fresher data but add failure modes.
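The validation and duplicate-delivery answers can be combined into one small gate in front of the pipeline. This is a sketch under stated assumptions: the two-column schema, the `expected_rows` value (imagined as coming from a vendor manifest), and the function name `validate_and_ingest` are all hypothetical, and a set in memory stands in for a durable record of processed files.

```python
import csv
import hashlib
import io

EXPECTED_COLUMNS = ["order_id", "amount"]  # assumed schema for illustration


def validate_and_ingest(file_bytes, expected_rows, seen_checksums):
    """Return (status, rows). Status is 'ingested', 'duplicate', or 'rejected'."""
    checksum = hashlib.sha256(file_bytes).hexdigest()
    if checksum in seen_checksums:
        return "duplicate", []  # same file delivered twice: skip it

    reader = csv.DictReader(io.StringIO(file_bytes.decode("utf-8")))
    if reader.fieldnames != EXPECTED_COLUMNS:
        return "rejected", []  # schema mismatch
    rows = list(reader)
    if len(rows) != expected_rows:
        return "rejected", []  # incomplete or padded file

    seen_checksums.add(checksum)  # record only after all checks pass
    return "ingested", rows


seen = set()
good = b"order_id,amount\n1,10.50\n2,3.00\n"
bad = b"id,total\n1,10.50\n"

r1 = validate_and_ingest(good, expected_rows=2, seen_checksums=seen)
r2 = validate_and_ingest(good, expected_rows=2, seen_checksums=seen)  # redelivery
r3 = validate_and_ingest(bad, expected_rows=1, seen_checksums=seen)
```

Keying on a content checksum instead of the file name means a renamed redelivery is still caught. The parsed `amount` values come back as strings, which is the "no types" weakness of CSV in practice.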

KEY TAKEAWAYS

  • Say: 'Files land in a cloud bucket, an event triggers the pipeline, and we validate before ingesting.'
  • Know the Parquet answer cold: columnar, compressed, predicate pushdown.
  • Always mention validation: schema check, row count, duplicate detection.

API Ingestion


Answer API ingestion questions like a production engineer

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "How do you pull data from a third-party service?"
  • "What about rate limits?"
  • "REST vs webhook: when would you use each?"

What They Want to Hear

'We use pull-based REST APIs for scheduled data extraction and push-based webhooks for near-real-time event delivery.' Then immediately mention the two things that separate a production answer from a toy answer: rate limits and idempotency.

Pull vs Push

Pull (REST Polling)
  • Pipeline calls the API on a schedule
  • You control when data is fetched
  • Must handle pagination and rate limits
  • Example: CRM data, analytics exports

Push (Webhooks)
  • Source sends data to your endpoint
  • Near real-time delivery
  • Must handle retries and ordering
  • Example: Stripe payment events

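Both sides of the pull/push split can be sketched without a real HTTP client. The code below is an illustration only: `fetch_page` is a hypothetical callable that maps a cursor to a page of the form `{"items": [...], "next": cursor_or_None}`, and the webhook event shape with an `id` field is an assumption, not any vendor's actual payload.

```python
def fetch_all(fetch_page):
    """Pull side: follow cursors until the API reports no next page."""
    items, cursor = [], None
    while True:
        page = fetch_page(cursor)
        items.extend(page["items"])
        cursor = page["next"]
        if cursor is None:
            return items


def handle_webhook(event, processed_ids, sink):
    """Push side: senders retry deliveries, so deduplicate on the event id."""
    if event["id"] in processed_ids:
        return False  # already handled: acknowledge and ignore
    processed_ids.add(event["id"])
    sink.append(event)
    return True


# Fake three-page API standing in for a real client.
PAGES = {
    None: {"items": [1, 2], "next": "p2"},
    "p2": {"items": [3, 4], "next": "p3"},
    "p3": {"items": [5], "next": None},
}
records = fetch_all(PAGES.get)

processed, sink = set(), []
results = [
    handle_webhook({"id": "evt_1", "type": "payment"}, processed, sink),
    handle_webhook({"id": "evt_1", "type": "payment"}, processed, sink),  # retry
    handle_webhook({"id": "evt_2", "type": "refund"}, processed, sink),
]
```

The pull loop owns the schedule and the pagination; the push handler owns nothing about timing and therefore has to be safe against repeated deliveries.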
The Curveball Follow-ups

After your initial answer, expect these probes

  • "What happens when you hit the rate limit?" Exponential backoff with jitter. Track remaining quota from response headers. Never hard-loop against a rate limit.
  • "How do you backfill historical data from an API?" Request a bulk export from the vendor for history. Use the API for incremental daily pulls going forward. Paginated backfills through a rate-limited API take days.
  • "What if the API schema changes without warning?" Store the raw API response before transforming. If you transform during ingestion and the schema changes, you no longer have the original payloads, so you cannot reprocess history under the new schema.
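Exponential backoff with jitter is short enough to write on a whiteboard. The sketch below uses the "full jitter" variant (sleep a random amount up to an exponentially growing ceiling). `RateLimited`, `call_with_backoff`, and the fake API are hypothetical; a real client would raise on HTTP 429 and could also read remaining-quota headers, which this example does not model.

```python
import random
import time


class RateLimited(Exception):
    """Raised by the (hypothetical) client when the API answers HTTP 429."""


def call_with_backoff(call, max_attempts=5, base=1.0, cap=60.0,
                      sleep=time.sleep, rng=random.random):
    """Retry call() on RateLimited using exponential backoff with full jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # The delay ceiling doubles each attempt, up to cap;
            # jitter picks a random point below that ceiling.
            ceiling = min(cap, base * 2 ** attempt)
            sleep(rng() * ceiling)


# Demo with a fake API that rejects the first two calls.
attempts = {"n": 0}


def flaky_api():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimited()
    return {"status": "ok"}


delays = []
result = call_with_backoff(flaky_api, sleep=delays.append, rng=lambda: 0.5)
```

Injecting `sleep` and `rng` keeps the demo instant and deterministic; in production the defaults (`time.sleep`, `random.random`) apply. The jitter matters because many clients retrying on the same fixed schedule would hit the API again at the same moment.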

TIP

The two terms that impress interviewers on API questions: rate limits and idempotency. Always mention both.

KEY TAKEAWAYS

  • Say: 'Pull for scheduled extraction, push for real-time events. Always handle rate limits and store raw responses.'
  • Rate limits + pagination + retries: the three API challenges you must name.
  • Raw response storage is non-negotiable. Transform after ingestion, never during.

Batch vs Streaming


Nail the batch vs streaming decision question

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "Should this pipeline be batch or streaming?"
  • "Defend your choice."
  • "What latency does the business actually need?"

The #1 Pipeline Interview Question

This question is designed to test your judgment, not your knowledge. The interviewer describes a scenario and wants to see you reason through the decision, not recite definitions. Here is the framework that works every time:

Step 1: Ask 'If this data is 1 hour old, does anyone lose money or make a bad decision?' If no, batch.
Step 2: If yes, ask 'Does a 5-minute delay cause the same problem?' If 5 minutes is fine, micro-batch. If sub-minute matters, true streaming.
Step 3: State the tradeoff. 'Streaming costs 3-5x more in compute and engineering time. The business value of freshness needs to justify that.'
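The first two steps of the framework reduce to two yes/no questions, which can be written as a tiny function. This is only a sketch of the reasoning; the function name and the boolean answers assigned to each scenario are illustrative assumptions.

```python
def choose_processing_mode(hour_old_hurts, five_minutes_old_hurts):
    """Encode the two freshness questions from the framework above."""
    if not hour_old_hurts:
        return "batch"  # Step 1: nobody is harmed by hour-old data
    if not five_minutes_old_hurts:
        return "micro-batch"  # Step 2: a 5-minute delay is acceptable
    return "streaming"  # sub-minute freshness actually matters


scenarios = {
    "daily sales dashboard": (False, False),
    "inventory updates": (True, False),
    "fraud detection": (True, True),
}
answers = {name: choose_processing_mode(*flags) for name, flags in scenarios.items()}
```

The function deliberately has no cost parameter: Step 3, stating the tradeoff, is the part you say out loud after the function has given you the answer.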

Your Decision Framework

Say Batch When
  • Consumers check hourly or daily
  • Processing needs complex joins/aggs
  • Cost matters more than freshness
  • Nobody loses money from stale data

Say Streaming When
  • Sub-minute freshness prevents revenue loss
  • Actions trigger from individual events
  • The source is already event-based
  • Stale data breaks the user experience

Practice Scenarios

Scenario | Your Answer | Why
Daily sales dashboard | Batch | Checked once per morning; hourly is overkill
Fraud detection | Streaming | Must block transactions before they clear
Inventory updates | Micro-batch | 5-min freshness prevents overselling
ML training data | Batch | Models retrain daily, not per-event
Live driver tracking | Streaming | Stale locations break the app

The Curveball Follow-ups

After your initial answer, expect these probes

  • "What if the PM says 'we need real-time' but nobody checks the dashboard in real time?" Push back respectfully. 'Real-time sounds good but costs 3-5x more. Let me check what freshness the actual consumers need.'
  • "Can you start with batch and migrate to streaming later?" Yes, and this is the right default. Start batch, prove the business case for freshness, then add streaming for the specific use cases that justify it.
  • "What about micro-batch?" Spark Structured Streaming: processes in small time windows (seconds to minutes). 90% of 'we need real-time' use cases are actually micro-batch.
Do
  • Ask 'what latency does the consumer need?' before choosing
  • Default to batch and justify streaming with a business case
  • Consider micro-batch as the pragmatic middle ground
Don't
  • Say streaming because it sounds more impressive
  • Build streaming when nobody checks the data in real-time
  • Say 'it depends' without providing the decision framework

KEY TAKEAWAYS

  • Lead with: 'My first question would be: how fresh does the consumer actually need this data?'
  • Batch is the default. Streaming requires a business case that justifies 3-5x cost.
  • Micro-batch (5-minute windows) solves 90% of 'we need real-time' requests.

Nail the batch vs streaming question and defend your choice

Category: Pipeline Architecture
Difficulty: Beginner
Duration: 20 minutes
Challenges: 0 hands-on challenges

Topics covered: Batch Processing, Stream Processing, File Ingestion, API Ingestion, Batch vs Streaming