How Data Moves: Beginner
'Walk me through how data moves in your pipeline.' This question opens 80% of pipeline architecture interviews. The interviewer is not looking for a textbook definition. They want to see whether you can explain batch vs streaming and file vs API ingestion and, most importantly, defend your choice of one over the other. Here is how to answer.
Batch Processing
Answer batch processing questions with confidence
When you hear these in an interview, this is the concept being tested
- "How would you process yesterday's data?"
- "Walk me through a daily ETL job."
- "When would you NOT use streaming?"
What They Want to Hear
After your initial answer, expect these probes
- "What if the job takes longer than the schedule interval?" You need job-level locking or a queue so that only one run is active at a time. Two overlapping runs can corrupt the data.
- "What about data that changes after your batch runs?" That is the freshness tradeoff. If it matters, consider micro-batch (5-minute windows) before jumping to streaming.
- "How do you handle the job failing halfway?" Use idempotent writes: a MERGE or a partition REPLACE, so that re-running the job produces the same result.
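To make the idempotent-write probe concrete, here is a minimal sketch of a partition replace, using Python's built-in `sqlite3` as a stand-in for a warehouse. The `daily_sales` table, its columns, and the `load_partition` helper are hypothetical names chosen for illustration. The delete and insert run in one transaction, so a retry after a failure should leave the same rows as a single clean run.

```python
import sqlite3


def load_partition(conn, day, rows):
    """Replace one day's partition atomically so re-runs give the same result."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("DELETE FROM daily_sales WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_sales (day, order_id, amount) VALUES (?, ?, ?)",
            [(day, order_id, amount) for order_id, amount in rows],
        )


# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT, order_id TEXT, amount REAL)")

rows = [("o-1", 10.0), ("o-2", 25.5)]
load_partition(conn, "2024-01-01", rows)
load_partition(conn, "2024-01-01", rows)  # simulated retry after a failure

count = conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
total = conn.execute("SELECT SUM(amount) FROM daily_sales").fetchone()[0]
```

A plain append would have doubled the rows on the second call. Replacing the whole partition is what makes the retry safe.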
Stream Processing
Answer streaming questions and justify the cost
When you hear these in an interview, this is the concept being tested
- "What if we need real-time data?"
- "How would you handle events as they happen?"
- "Tell me about Kafka."
What They Want to Hear
When streaming is worth it
- Blocking a fraudulent transaction in real-time
- Updating a driver's location on a map
- Triggering an alert when a metric spikes
- Sub-second freshness directly prevents revenue loss
When batch is enough
- Daily sales dashboard nobody checks until 9 AM
- ML model that retrains weekly
- Monthly compliance reports
- Data consumers tolerate hours of staleness
The Tools to Name-Drop
| Tool | Role | One-Liner for Interviews |
|---|---|---|
| Apache Kafka | Message broker | Holds events in durable topics; producers write, consumers read |
| Apache Flink | Stream processor | Processes streams with exactly-once guarantees |
| Spark Streaming | Micro-batch processor | Processes in small time windows; simpler than true streaming |
| AWS Kinesis | Managed streaming | Kafka-like but fully managed on AWS |
After your initial answer, expect these probes
- "Streaming is more complex. What makes it harder?" Ordering (events arrive out of sequence), state management (tracking across events), and failure recovery (what happens if the processor crashes mid-stream).
- "How much more does streaming cost?" As a rule of thumb, 3-5x more than batch for the same throughput, because compute runs 24/7 instead of spinning up and shutting down.
- "Can you combine batch and streaming?" Yes. Micro-batch (Spark Structured Streaming) processes data in small windows. Lambda architecture runs a batch path and a streaming path in parallel.
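The ordering and windowing points above can be illustrated with a small sketch. It assigns events to fixed 5-minute (tumbling) windows by event time, so the result does not depend on arrival order. The event tuples and function names are hypothetical, and the sketch leaves out what real stream processors add on top, such as watermarks for late data and durable state.

```python
from collections import defaultdict


def window_start(event_ts, window_seconds):
    """Return the start of the tumbling window that contains event_ts."""
    return event_ts - (event_ts % window_seconds)


def count_by_window(events, window_seconds=300):
    """Count events per window using event time, not arrival order."""
    counts = defaultdict(int)
    for event_ts, _payload in events:
        counts[window_start(event_ts, window_seconds)] += 1
    return dict(counts)


# Hypothetical events as (event_time_in_seconds, payload).
# They arrive out of sequence on purpose.
events = [(610, "c"), (10, "a"), (299, "b"), (300, "d")]
counts = count_by_window(events)
```

Because each event carries its own timestamp, the late arrivals at 10 and 299 seconds still land in the first window.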
File Ingestion
Answer file ingestion questions with production awareness
When you hear these in an interview, this is the concept being tested
- "How does data get into your pipeline?"
- "A vendor sends us CSV files daily..."
- "What file formats do you work with?"
What They Want to Hear
Format Cheat Sheet
| Format | Say This in the Interview |
|---|---|
| CSV | Universal but no schema enforcement, no types, terrible at scale |
| JSON | Flexible for nested data but verbose and slow to parse |
| Parquet | Columnar, compressed, predicate pushdown. The answer to 'why Parquet?' |
| Avro | Schema embedded in the file, good for streaming with schema registry |
After your initial answer, expect these probes
- "The vendor sends bad data. How do you handle it?" Validate in the landing zone before ingesting. Check row counts, schema match, and file completeness. Reject the file and alert on failure.
- "How do you handle duplicate file deliveries?" Idempotent ingestion: track file names or checksums, and skip files already processed.
- "Why not just use an API instead of files?" Files are simpler, more reliable for large volumes, and universally supported. APIs deliver fresher data but add failure modes.
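A minimal sketch of the validate-then-deduplicate flow described in these answers, using only the standard library. The expected header, the `expected_rows` argument (assumed to come from a vendor manifest or control file), and the in-memory `seen` set are all hypothetical. A real pipeline would persist the checksums somewhere durable.

```python
import csv
import hashlib
import io

EXPECTED_HEADER = ["order_id", "amount"]  # hypothetical vendor schema


def checksum(data: bytes) -> str:
    """Fingerprint the file contents so a re-delivery is detectable."""
    return hashlib.sha256(data).hexdigest()


def validate(data: bytes, expected_rows: int) -> list:
    """Return a list of problems; an empty list means the file passed."""
    rows = list(csv.reader(io.StringIO(data.decode("utf-8"))))
    problems = []
    if not rows or rows[0] != EXPECTED_HEADER:
        problems.append("schema mismatch")
    if len(rows) - 1 != expected_rows:
        problems.append("row count mismatch")
    return problems


def ingest(data: bytes, expected_rows: int, seen: set) -> str:
    """Skip duplicates, reject invalid files, and record accepted checksums."""
    digest = checksum(data)
    if digest in seen:
        return "skipped"
    if validate(data, expected_rows):
        return "rejected"
    seen.add(digest)
    return "ingested"


seen = set()
good = b"order_id,amount\no-1,10.0\no-2,25.5\n"
bad = b"id,total\no-1,10.0\n"
results = [ingest(good, 2, seen), ingest(good, 2, seen), ingest(bad, 1, seen)]
```

Keying on a content checksum rather than the file name also catches the case where a vendor re-sends the same data under a new name.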
API Ingestion
Answer API ingestion questions like a production engineer
When you hear these in an interview, this is the concept being tested
- "How do you pull data from a third-party service?"
- "What about rate limits?"
- "REST vs webhook: when would you use each?"
What They Want to Hear
After your initial answer, expect these probes
- "What happens when you hit the rate limit?" Exponential backoff with jitter. Track remaining quota from response headers. Never hard-loop against a rate limit.
- "How do you backfill historical data from an API?" Request a bulk export from the vendor for history. Use the API for incremental daily pulls going forward. Paginated backfills through a rate-limited API can take days.
- "What if the API schema changes without warning?" Store the raw API response before transforming. If you transform on ingestion and the schema changes, you no longer have your historical data in its original format.
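The "exponential backoff with jitter" answer can be sketched as follows. This is one common variant ("full jitter", where each wait is a random fraction of an exponentially growing ceiling), not the only one. The `RateLimited` exception and the `fetch` callable are hypothetical stand-ins for whatever HTTP client signals a 429 response. `sleep` and `rng` are injectable so the example runs instantly.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error the caller's client raises on HTTP 429."""


def backoff_delays(max_retries, base=1.0, cap=60.0, rng=random.random):
    """Yield 'full jitter' delays: rng() * min(cap, base * 2**attempt)."""
    for attempt in range(max_retries):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_backoff(fetch, max_retries=5, sleep=time.sleep, **kwargs):
    """Call fetch(); on RateLimited, wait and retry up to max_retries times."""
    for delay in backoff_delays(max_retries, **kwargs):
        try:
            return fetch()
        except RateLimited:
            sleep(delay)
    return fetch()  # final attempt; let the error propagate


# Usage with a fake endpoint that is rate limited twice, then succeeds.
attempts = {"n": 0}


def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimited()
    return "ok"


waits = []
# rng is pinned to 1.0 here only to make the recorded waits predictable.
result = call_with_backoff(flaky, sleep=waits.append, rng=lambda: 1.0)
```

The cap keeps a long outage from producing hour-long waits, and the jitter keeps many clients from retrying at the same instant.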
Batch vs Streaming
Nail the batch vs streaming decision question
When you hear these in an interview, this is the concept being tested
- "Should this pipeline be batch or streaming?"
- "Defend your choice."
- "What latency does the business actually need?"
The #1 Pipeline Interview Question
Practice Scenarios
| Scenario | Your Answer | Why |
|---|---|---|
| Daily sales dashboard | Batch | Checked once per morning; hourly is overkill |
| Fraud detection | Streaming | Must block transactions before they clear |
| Inventory updates | Micro-batch | 5-min freshness prevents overselling |
| ML training data | Batch | Models retrain daily, not per-event |
| Live driver tracking | Streaming | Stale locations break the app |
After your initial answer, expect these probes
- "What if the PM says 'we need real-time' but nobody checks the dashboard in real time?" Push back respectfully. 'Real-time sounds good but costs 3-5x more. Let me check what freshness the actual consumers need.'
- "Can you start with batch and migrate to streaming later?" Yes, and this is the right default. Start with batch, prove the business case for freshness, then add streaming for the specific use cases that justify it.
- "What about micro-batch?" Spark Structured Streaming processes data in small time windows (seconds to minutes). 90% of 'we need real-time' use cases are actually micro-batch.
Do
- Ask 'what latency does the consumer need?' before choosing
- Default to batch and justify streaming with a business case
- Consider micro-batch as the pragmatic middle ground
Don't
- Say streaming because it sounds more impressive
- Build streaming when nobody checks the data in real-time
- Say 'it depends' without providing the decision framework
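The latency-first decision this section recommends can be written down as a tiny function. The thresholds below are an assumption drawn from the lesson's own numbers (about an hour of staleness for batch, about five minutes for micro-batch), and `choose_pipeline` is a hypothetical name, so treat it as a memory aid rather than a rule.

```python
def choose_pipeline(max_staleness_seconds):
    """Map the staleness a consumer can tolerate to a processing mode."""
    if max_staleness_seconds >= 3600:  # an hour-old answer hurts nobody
        return "batch"
    if max_staleness_seconds >= 300:  # a 5-minute delay is acceptable
        return "micro-batch"
    return "streaming"  # anything tighter needs per-event processing


# Hypothetical tolerances for three of the practice scenarios.
scenarios = {
    "Daily sales dashboard": 24 * 3600,
    "Inventory updates": 300,
    "Fraud detection": 1,
}
choices = {name: choose_pipeline(s) for name, s in scenarios.items()}
```

The point of the sketch is that the input is a business tolerance, not a technology preference. In an interview, the reasoning matters more than the exact cutoffs.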
Nail the batch vs streaming question and defend your choice
- Category: Pipeline Architecture
- Difficulty: beginner
- Duration: 20 minutes
- Challenges: 0 hands-on challenges
Topics covered: Batch Processing, Stream Processing, File Ingestion, API Ingestion, Batch vs Streaming
Lesson Sections
- Batch Processing (concepts: paBatchProcessing)
What They Want to Hear: 'Batch processing collects data over a period and processes it all at once on a schedule. I would use it here because the consumers check dashboards once a day, so paying for real-time freshness would be waste.' That is the answer. Batch = scheduled, chunked, predictable. Say it in one sentence, then prove you understand the tradeoffs.
- Stream Processing (concepts: paStreamProcessing)
What They Want to Hear: 'Streaming processes each event as it arrives, continuously. I would use it when stale data directly costs money or breaks the user experience, like fraud detection or live driver tracking.' Then name the tools: Kafka is the message broker that holds events. Flink or Spark Streaming is the processor that acts on them. You do not need to know how to configure these tools. You need to know what role each plays.
- File Ingestion (concepts: paFileIngestion)
What They Want to Hear: 'Files are the most common ingestion method. A vendor drops a CSV on S3, an event notification triggers the pipeline, and we validate before ingesting.' That is the baseline answer. Then show depth by naming formats and their tradeoffs: CSV is universal but slow at scale. Parquet is columnar, compressed, and 10-30x faster for analytics. JSON is flexible but verbose.
- API Ingestion (concepts: paApiIngestion)
What They Want to Hear: 'We use pull-based REST APIs for scheduled data extraction and push-based webhooks for near-real-time event delivery.' Then immediately mention the two things that separate a production answer from a toy answer: rate limits and idempotency.
- Batch vs Streaming (concepts: paBatchVsStreaming)
The #1 Pipeline Interview Question: This question is designed to test your judgment, not your knowledge. The interviewer describes a scenario and wants to see you reason through the decision, not recite definitions. Here is the framework that works every time:
Step 1: Ask 'If this data is 1 hour old, does anyone lose money or make a bad decision?' If no, batch.
Step 2: If yes, ask 'Does a 5-minute delay cause the same problem?' If 5 minutes is fine, micro-batch. If sub-minute matters, true stream