How Data Moves: Beginner

'Walk me through how data moves in your pipeline.' This question opens 80% of pipeline architecture interviews. The interviewer is not looking for a textbook definition. They want to see if you can explain batch vs streaming, file vs API ingestion, and most importantly, defend your choice of one over the other. Here is exactly how to answer.

What you will be able to do

Answer 'batch or streaming?' with a confident one-liner and defend it

Explain file and API ingestion like you have built both

Never get caught saying 'it depends' without a framework

Batch Processing

Answer batch processing questions with confidence

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"How would you process yesterday's data?"
▸"Walk me through a daily ETL job."
▸"When would you NOT use streaming?"

What They Want to Hear

'Batch processing collects data over a period and processes it all at once on a schedule. I would use it here because the consumers check dashboards once a day, so paying for real-time freshness would be waste.' That is the answer. Batch = scheduled, chunked, predictable. Say it in one sentence, then prove you understand the tradeoffs.

What to Whiteboard

[Diagram: Source Database → Extract → Transform → Data Warehouse]

The Curveball Follow-ups

After your initial answer, expect these probes

▸"What if the job takes longer than the schedule interval?" You need job-level locking or a queue so that only one run is active at a time. Two overlapping runs can double-write or corrupt the output.
▸"What about data that changes after your batch runs?" That is the freshness tradeoff. If it matters, consider micro-batch (5-minute windows) before jumping to streaming.
▸"How do you handle the job failing halfway?" Idempotent writes. Use MERGE (upsert) or replace the whole partition, so re-running the job produces the same result.
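If the interviewer asks you to make the idempotent-write answer concrete, a small sketch like the one below is usually enough. It uses Python's built-in sqlite3 as a stand-in for the warehouse; the daily_sales table, its columns, and the load_partition helper are made up for illustration. The idea is to delete and reload one day's partition inside a single transaction, so a rerun after a failure cannot double-count.

```python
import sqlite3

def load_partition(conn, run_date, rows):
    """Replace one day's partition in a single transaction so reruns are safe."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("DELETE FROM daily_sales WHERE sale_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO daily_sales (sale_date, store, amount) VALUES (?, ?, ?)",
            [(run_date, store, amount) for store, amount in rows],
        )

# Hypothetical table standing in for a warehouse fact table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (sale_date TEXT, store TEXT, amount REAL)")

rows = [("north", 120.0), ("south", 80.5)]
load_partition(conn, "2024-01-01", rows)
load_partition(conn, "2024-01-01", rows)  # rerun, e.g. after a mid-job failure

count = conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
print(count)  # 2, not 4: the rerun replaced the partition instead of appending
```

A real warehouse would more likely use its own MERGE or partition-overwrite statement, but the shape of the answer is the same: the unit you rewrite is the unit you can safely rerun.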

KEY TAKEAWAYS

Say: 'Batch collects and processes data on a schedule. It is the default unless real-time freshness has clear business value.'

The one tradeoff that matters: batch is simpler and cheaper, but data is always slightly stale

Always ask the interviewer: 'How fresh does the consumer actually need the data?'

Stream Processing

Answer streaming questions and justify the cost

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"What if we need real-time data?"
▸"How would you handle events as they happen?"
▸"Tell me about Kafka."

What They Want to Hear

'Streaming processes each event as it arrives, continuously. I would use it when stale data directly costs money or breaks the user experience, like fraud detection or live driver tracking.' Then name the tools: Kafka is the message broker that holds events. Flink or Spark Streaming is the processor that acts on them. You do not need to know how to configure these tools. You need to know what role each plays.

•When to Say Streaming

Blocking a fraudulent transaction in real-time
Updating a driver's location on a map
Triggering an alert when a metric spikes
Sub-second freshness directly prevents revenue loss

•When to Say Batch

Daily sales dashboard nobody checks until 9 AM
ML model that retrains weekly
Monthly compliance reports
Data consumers tolerate hours of staleness

The Tools to Name-Drop

Tool	Role	One-Liner for Interviews
Apache Kafka	Message broker	Holds events in durable topics; producers write, consumers read
Apache Flink	Stream processor	Processes streams with exactly-once guarantees
Spark Streaming	Micro-batch processor	Processes in small time windows; simpler than true streaming
AWS Kinesis	Managed streaming	Kafka-like but fully managed on AWS

The Curveball Follow-ups

After your initial answer, expect these probes

▸"Streaming is more complex. What makes it harder?" Ordering (events arrive out of sequence), state management (tracking across events), and failure recovery (what if the processor crashes mid-stream).
▸"How much more does streaming cost?" As a rule of thumb, 3-5x more than batch for the same throughput, because compute runs 24/7 instead of spinning up and shutting down.
▸"Can you combine batch and streaming?" Yes. Micro-batch (Spark Structured Streaming) processes in small windows. Lambda architecture runs a batch layer and a streaming layer in parallel and serves results from both.
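To show you understand both the ordering problem and micro-batch windows, you can sketch how events are grouped by when they happened rather than when they arrived. The example below is a plain-Python illustration, not how Flink or Spark implement it; the event shape (event_time, amount) and the helper names are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timezone

WINDOW_SECONDS = 300  # 5-minute windows

def window_start(event_time: datetime) -> datetime:
    """Floor an event timestamp to the start of its 5-minute window."""
    epoch = int(event_time.timestamp())
    return datetime.fromtimestamp(epoch - epoch % WINDOW_SECONDS, tz=timezone.utc)

def aggregate(events):
    """Sum amounts per window using event time, not arrival order."""
    totals = defaultdict(float)
    for event in events:
        totals[window_start(event["event_time"])] += event["amount"]
    return dict(totals)

def ts(minute, second=0):
    """Helper for building UTC timestamps on a fixed example day."""
    return datetime(2024, 1, 1, 12, minute, second, tzinfo=timezone.utc)

# The 12:03 event arrives after the 12:06 event but still lands in the 12:00 window.
events = [
    {"event_time": ts(1), "amount": 10.0},
    {"event_time": ts(6), "amount": 5.0},
    {"event_time": ts(3), "amount": 2.5},
]
totals = aggregate(events)
print(totals[ts(0)], totals[ts(5)])  # 12.5 5.0
```

What this sketch leaves out is the hard part of real streaming: deciding how long to wait for late events before a window is considered final.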

KEY TAKEAWAYS

Say: 'Streaming processes events as they arrive. Use it when stale data costs money.'

Kafka = message broker, Flink = stream processor. Know the distinction.

Streaming is 3-5x more expensive than batch. Always justify the cost with a business reason.

File Ingestion

Answer file ingestion questions with production awareness

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"How does data get into your pipeline?"
▸"A vendor sends us CSV files daily..."
▸"What file formats do you work with?"

What They Want to Hear

'Files are the most common ingestion method. A vendor drops a CSV on S3, an event notification triggers the pipeline, and we validate before ingesting.' That is the baseline answer. Then show depth by naming formats and their tradeoffs: CSV is universal but slow at scale. Parquet is columnar and compressed, and is often 10-30x faster for analytical queries that read only a few columns. JSON is flexible but verbose.

What to Whiteboard

[Diagram: External Source → Landing Zone → Validate → Ingest to Pipeline]

Format Cheat Sheet

Format	Say This in the Interview
CSV	Universal but no schema enforcement, no types, terrible at scale
JSON	Flexible for nested data but verbose and slow to parse
Parquet	Columnar, compressed, predicate pushdown. The answer to 'why Parquet?'
Avro	Schema embedded in the file, good for streaming with schema registry

The Curveball Follow-ups

After your initial answer, expect these probes

▸"The vendor sends bad data. How do you handle it?" Validate in the landing zone before ingesting. Check row counts, schema match, and file completeness. Reject and alert on failure.
▸"How do you handle duplicate file deliveries?" Idempotent ingestion: track file names or checksums, skip files already processed.
▸"Why not just use an API instead of files?" Files are simpler, more reliable for large volumes, and universally supported. APIs are better for fresher data but add failure modes.
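The landing-zone validation from the first follow-up can be sketched in a few lines. This is a minimal example using only the standard library; the three-column schema and the expected_rows argument (imagine it comes from a vendor manifest) are assumptions for illustration.

```python
import csv
import io

EXPECTED_COLUMNS = ["order_id", "customer_id", "amount"]  # hypothetical schema

def validate_csv(text, expected_rows=None):
    """Return a list of problems; an empty list means the file can be ingested."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header != EXPECTED_COLUMNS:
        return [f"schema mismatch: got {header}"]

    problems = []
    row_count = 0
    for line_no, row in enumerate(reader, start=2):
        row_count += 1
        if len(row) != len(EXPECTED_COLUMNS):
            problems.append(f"line {line_no}: expected 3 fields, got {len(row)}")
            continue
        try:
            float(row[2])  # amount must parse as a number
        except ValueError:
            problems.append(f"line {line_no}: amount {row[2]!r} is not numeric")

    # Completeness check: compare against the count the vendor says they sent.
    if expected_rows is not None and row_count != expected_rows:
        problems.append(f"row count {row_count} != expected {expected_rows}")
    return problems

good = "order_id,customer_id,amount\n1,a,9.99\n2,b,5.00\n"
bad = "order_id,customer_id,amount\n1,a,oops\n2,b\n"

print(validate_csv(good, expected_rows=2))  # []
for problem in validate_csv(bad, expected_rows=3):
    print(problem)
```

In a pipeline, a non-empty result would move the file to a quarantine prefix and fire an alert rather than letting a partial load through.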

KEY TAKEAWAYS

Say: 'Files land in a cloud bucket, an event triggers the pipeline, and we validate before ingesting.'

Know the Parquet answer cold: columnar, compressed, predicate pushdown

Always mention validation: schema check, row count, duplicate detection
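Duplicate detection is the piece candidates most often wave away, so it helps to be able to sketch it. The example below tracks SHA-256 checksums of file contents; FileLedger is a hypothetical in-memory stand-in for what would normally be a small metadata table.

```python
import hashlib

class FileLedger:
    """In-memory stand-in for a table of already-processed file checksums."""

    def __init__(self):
        self._seen = {}  # checksum -> first file name seen with those bytes

    def should_process(self, name: str, content: bytes) -> bool:
        digest = hashlib.sha256(content).hexdigest()
        if digest in self._seen:
            return False  # same bytes already ingested, possibly under another name
        self._seen[digest] = name
        return True

ledger = FileLedger()
payload = b"order_id,amount\n1,9.99\n"

first = ledger.should_process("orders_2024-01-01.csv", payload)
resend = ledger.should_process("orders_2024-01-01_v2.csv", payload)
changed = ledger.should_process("orders_2024-01-01.csv", b"order_id,amount\n1,10.99\n")

print(first, resend, changed)  # True False True
```

Checksums catch a resend under a new name, which file-name tracking alone would miss. In production you would record the checksum only after the load succeeds, ideally in the same transaction as the load.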

API Ingestion

Answer API ingestion questions like a production engineer

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"How do you pull data from a third-party service?"
▸"What about rate limits?"
▸"REST vs webhook: when would you use each?"

What They Want to Hear

'We use pull-based REST APIs for scheduled data extraction and push-based webhooks for near-real-time event delivery.' Then immediately mention the two things that separate a production answer from a toy answer: rate limits and idempotency.

•Pull (REST Polling)

Pipeline calls the API on a schedule
You control when data is fetched
Must handle pagination and rate limits
Example: CRM data, analytics exports

•Push (Webhooks)

Source sends data to your endpoint
Near real-time delivery
Must handle retries and ordering
Example: Stripe payment events
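Handling webhook retries comes down to idempotency: the sender may deliver the same event more than once, so the receiver has to apply it only once. Here is a toy sketch of that idea. The payload shape and the assumption that every event carries a unique id field are illustrative, not any particular vendor's schema.

```python
import json

class WebhookReceiver:
    """Toy receiver: acknowledges every delivery but applies each event once."""

    def __init__(self):
        self.processed_ids = set()  # would be a durable store in production
        self.applied = []

    def handle(self, body: str) -> int:
        event = json.loads(body)
        event_id = event["id"]  # assumes the sender includes a unique event ID
        if event_id in self.processed_ids:
            return 200  # duplicate retry: acknowledge so the sender stops resending
        self.applied.append(event)
        self.processed_ids.add(event_id)
        return 200

receiver = WebhookReceiver()
body = json.dumps({"id": "evt_1", "type": "payment.received", "amount": 500})

receiver.handle(body)
receiver.handle(body)  # the sender retried the same event
print(len(receiver.applied))  # 1
```

This covers retries only. Ordering needs a separate answer, such as comparing an event timestamp or sequence number against the last one applied.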

The Curveball Follow-ups

After your initial answer, expect these probes

▸"What happens when you hit the rate limit?" Exponential backoff with jitter. Track remaining quota from response headers. Never hard-loop against a rate limit.
▸"How do you backfill historical data from an API?" Request a bulk export from the vendor for history. Use the API for incremental daily pulls going forward. Paginated backfills through a rate-limited API take days.
▸"What if the API schema changes without warning?" Store the raw API response before transforming. If you transform during ingestion and the schema changes, you no longer have the original payloads to reprocess.
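"Exponential backoff with jitter" is easy to say and worth being able to write. The sketch below retries a caller-supplied fetch function; RateLimited is a made-up exception standing in for an HTTP 429 response, and the sleep and rng parameters exist only so the behavior can be tested without waiting.

```python
import random
import time

class RateLimited(Exception):
    """Raised by the caller's fetch function on an HTTP 429-style response."""

def call_with_backoff(fetch, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep, rng=random.random):
    """Retry fetch() with full-jitter exponential backoff on RateLimited."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the scheduler
            cap = min(max_delay, base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
            sleep(rng() * cap)  # wait a random time in [0, cap)

# Usage: a fake endpoint that is rate limited twice, then succeeds.
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimited()
    return {"page": 1, "items": []}

result = call_with_backoff(flaky_fetch, sleep=lambda seconds: None)
print(calls["n"], result)  # 3 {'page': 1, 'items': []}
```

The jitter matters when many workers hit the same limit: without it they all retry at the same moment. If the API returns a remaining-quota or retry-after header, honoring that value is better than guessing.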

TIP

The two terms that impress interviewers on API questions: rate limits and idempotency. Always mention both.

KEY TAKEAWAYS

Say: 'Pull for scheduled extraction, push for real-time events. Always handle rate limits and store raw responses.'

Rate limits + pagination + retries: the three API challenges you must name

Raw response storage is non-negotiable. Transform after ingestion, never during.
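The store-raw-then-transform pattern can be shown as two separate steps. In this sketch a temporary directory stands in for the raw zone of a data lake, and the response shape (a top-level data list with id and email fields) is an assumption for illustration.

```python
import json
import tempfile
from pathlib import Path

def store_raw(raw_dir: Path, name: str, payload: bytes) -> Path:
    """Write the API response bytes unchanged, before any parsing."""
    path = raw_dir / f"{name}.json"
    path.write_bytes(payload)
    return path

def transform(path: Path) -> list:
    """Build modeled rows from the stored raw file; can be rerun at any time."""
    records = json.loads(path.read_bytes())["data"]
    return [{"id": r["id"], "email": r.get("email")} for r in records]

# Hypothetical response body from a CRM-style API.
response = b'{"data": [{"id": 1, "email": "a@example.com", "plan": "pro"}]}'

with tempfile.TemporaryDirectory() as tmp:
    raw_path = store_raw(Path(tmp), "contacts_page_1", response)
    rows = transform(raw_path)

print(rows)  # [{'id': 1, 'email': 'a@example.com'}]
```

Because the raw file still contains the plan field that the transform ignored, a later change to the model can be backfilled from storage instead of re-pulling through a rate-limited API.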

Batch vs Streaming

Nail the batch vs streaming decision question

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"Should this pipeline be batch or streaming?"
▸"Defend your choice."
▸"What latency does the business actually need?"

The #1 Pipeline Interview Question

This question is designed to test your judgment, not your knowledge. The interviewer describes a scenario and wants to see you reason through the decision, not recite definitions. Here is the framework that works every time:

Step 1: Ask 'If this data is 1 hour old, does anyone lose money or make a bad decision?' If no, batch.

Step 2: If yes, ask 'Does a 5-minute delay cause the same problem?' If 5 minutes is fine, micro-batch. If sub-minute matters, true streaming.

Step 3: State the tradeoff. 'Streaming costs 3-5x more in compute and engineering time. The business value of freshness needs to justify that.'
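The first two questions in the framework form a small decision tree, and writing it out as code is a good way to memorize the order. The function and parameter names below are made up for illustration; the logic is just the two freshness questions asked in sequence.

```python
def choose_processing_mode(hour_old_hurts: bool, five_min_delay_hurts: bool) -> str:
    """Apply the two freshness questions in order and return the mode to propose."""
    if not hour_old_hurts:
        return "batch"  # nobody loses money on hour-old data
    if not five_min_delay_hurts:
        return "micro-batch"  # an hour is too stale, five minutes is fine
    return "streaming"  # even a five-minute delay causes the problem

# A daily sales dashboard: hour-old data is fine.
print(choose_processing_mode(hour_old_hurts=False, five_min_delay_hurts=False))  # batch

# Inventory updates: an hour risks overselling, five minutes does not.
print(choose_processing_mode(hour_old_hurts=True, five_min_delay_hurts=False))  # micro-batch

# Fraud detection: the transaction must be blocked before it clears.
print(choose_processing_mode(hour_old_hurts=True, five_min_delay_hurts=True))  # streaming
```

The function only picks a mode. The cost tradeoff is the part you still have to say out loud.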

•Say Batch When

Consumers check hourly or daily
Processing needs complex joins/aggs
Cost matters more than freshness
Nobody loses money from stale data

•Say Streaming When

Sub-minute freshness prevents revenue loss
Actions trigger from individual events
The source is already event-based
Stale data breaks the user experience

Practice Scenarios

Scenario	Your Answer	Why
Daily sales dashboard	Batch	Checked once per morning; hourly is overkill
Fraud detection	Streaming	Must block transactions before they clear
Inventory updates	Micro-batch	5-min freshness prevents overselling
ML training data	Batch	Models retrain on a schedule (daily or weekly), not per-event
Live driver tracking	Streaming	Stale locations break the app

The Curveball Follow-ups

After your initial answer, expect these probes

▸"What if the PM says 'we need real-time' but nobody checks the dashboard in real time?" Push back respectfully. 'Real-time sounds good but costs 3-5x more. Let me check what freshness the actual consumers need.'
▸"Can you start with batch and migrate to streaming later?" Yes, and this is the right default. Start batch, prove the business case for freshness, then add streaming for the specific use cases that justify it.
▸"What about micro-batch?" Spark Structured Streaming: processes in small time windows (seconds to minutes). 90% of 'we need real-time' use cases are actually micro-batch.

✓Do

Ask 'what latency does the consumer need?' before choosing
Default to batch and justify streaming with a business case
Consider micro-batch as the pragmatic middle ground

✗Don't

Say streaming because it sounds more impressive
Build streaming when nobody checks the data in real-time
Say 'it depends' without providing the decision framework

KEY TAKEAWAYS

Lead with: 'My first question would be: how fresh does the consumer actually need this data?'

Batch is the default. Streaming requires a business case that justifies 3-5x cost.

Micro-batch (5-minute windows) solves 90% of 'we need real-time' requests


Category: Pipeline Architecture
Difficulty: beginner
Duration: 20 minutes
Challenges: 0 hands-on challenges

Topics covered: Batch Processing, Stream Processing, File Ingestion, API Ingestion, Batch vs Streaming
