Pipeline Architecture
System design appears in only 2.8% of DE interview rounds, but when it does, batch vs. streaming is the defining question. Most pipelines should be batch; streaming adds real value only when low latency is a hard requirement. SQL and Python dominate the vast majority of DE interview loops, so system design is rare. Getting it wrong, however, signals that you cannot reason about trade-offs.
Batch: The Reliable Workhorse
Collect data over a time window (hourly, daily, weekly), then process it all at once. The pipeline runs on a schedule, produces output, and stops. Think: "run every night at 2 AM."
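To make the batch model concrete, here is a minimal sketch in plain Python. The window logic, the event shape, and the `process_batch` name are all illustrative assumptions, not any framework's API:

```python
from datetime import datetime, timedelta

def process_batch(events, window_start, window_end):
    """Process every event in one window at once, then stop."""
    in_window = [e for e in events if window_start <= e["ts"] < window_end]
    # One aggregation over the whole window -- the defining batch trait.
    return {"count": len(in_window),
            "total": sum(e["amount"] for e in in_window)}

# Simulate the nightly 2 AM run over yesterday's full day of data.
run_time = datetime(2024, 1, 2, 2, 0)
window_start = (run_time - timedelta(days=1)).replace(hour=0, minute=0)
window_end = window_start + timedelta(days=1)

events = [
    {"ts": datetime(2024, 1, 1, 12, 0), "amount": 10.0},
    {"ts": datetime(2024, 1, 1, 18, 30), "amount": 5.0},
    {"ts": datetime(2024, 1, 2, 1, 0), "amount": 99.0},  # after the window
]
print(process_batch(events, window_start, window_end))
# {'count': 2, 'total': 15.0}
```

Everything about this shape is easy to operate: the job has a clear start, a clear end, and a single output to inspect.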
Streaming: The Speed Specialist
Process each event as it arrives, continuously. The pipeline is always running, consuming from an event stream. Think: "process every click the moment it happens."
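The streaming model, reduced to its essence: an always-on consumer loop. A Python generator stands in for a real broker subscription here; the names are illustrative:

```python
def event_stream(events):
    """Stand-in for a broker subscription (e.g. a Kafka topic).
    In production this source never ends."""
    yield from events

def run_consumer(stream, handle):
    """Always-on loop: act on each event the moment it arrives."""
    for event in stream:
        handle(event)  # end-to-end latency is bounded by this call

clicks = []
run_consumer(event_stream([{"click": 1}, {"click": 2}]), clicks.append)
print(clicks)
# [{'click': 1}, {'click': 2}]
```

The per-event loop is what buys the low latency, and it is also what makes the system harder to operate: there is no natural stopping point at which to inspect state.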
Latency
Batch: Minutes to hours. Data is collected over a window (hourly, daily), then processed all at once. Most analytics workloads are fine with this delay.
Streaming: Seconds to milliseconds. Events are processed as they arrive. Necessary when stale data has real consequences.

Complexity
Batch: Lower. Easier to reason about, test, and debug. A batch job either succeeds or fails. Retries are straightforward: re-run the whole batch.
Streaming: Higher. You deal with late-arriving events, ordering guarantees, exactly-once semantics, and backpressure. More failure modes, harder to test.

Cost
Batch: Generally cheaper. Compute spins up, processes the batch, and shuts down. You pay for what you use. Spot instances work well for batch jobs.
Streaming: Generally more expensive. Consumers must run continuously, even during low-traffic periods. Infrastructure is always on.

Tooling
Batch: Well-established: Airflow, dbt, Spark (batch mode), cron jobs, cloud-native schedulers. Large talent pool. Plenty of documentation.
Streaming: More specialized: Kafka, Flink, Spark Streaming, Kinesis, Pub/Sub. Smaller talent pool. Operational overhead is higher.

Debugging
Batch: Easier. You can inspect the input, re-run the job, and compare outputs. Logs are scoped to a single run. Failed batches are isolated.
Streaming: Harder. Issues may be intermittent, tied to event ordering, or caused by late data. Reproducing a bug often requires replaying events from a specific offset.

Typical use cases
Batch: Daily/hourly reports, data warehouse refreshes, ML training pipelines, backfills, any workload where hours-old data is acceptable.
Streaming: Fraud detection, real-time dashboards, alerting, session tracking, any workload where data must be acted on within seconds.

Verdict
Batch: The right choice most of the time.
Streaming: When latency is a hard constraint.
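One reason batch retries stay simple is worth sketching: if each run overwrites its output partition, re-running a failed batch can never double-count. The dict-backed store and function names below are illustrative, not a specific tool's API:

```python
def write_partition(store, partition_key, rows):
    """Idempotent batch write: replace the whole partition.
    A retry produces exactly the same state as the first attempt."""
    store[partition_key] = list(rows)
    return store

store = {}
write_partition(store, "2024-01-01", [{"user": "a", "orders": 3}])
# The job failed downstream? Just re-run it -- no duplicates appear.
write_partition(store, "2024-01-01", [{"user": "a", "orders": 3}])
print(store)
# {'2024-01-01': [{'user': 'a', 'orders': 3}]}
```

Streaming systems have to earn the same guarantee with offsets, checkpoints, and exactly-once semantics; batch gets it almost for free from the overwrite pattern.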
System design is rare in DE loops, but it carries outsized weight when it appears. You'll get a scenario and need to choose batch or streaming with clear reasoning. The strongest signal is knowing when streaming is unnecessary.
A daily report: Batch. The report is consumed once per day. Hourly or daily batch processing is simpler, cheaper, and sufficient. Proposing streaming here signals you don't consider cost or complexity.
Fraud detection: Streaming. A fraudulent transaction must be flagged before the charge completes. Batch processing with even a 1-hour delay means thousands of fraudulent charges go through. Low latency is a hard requirement.
A search index: Depends on the requirements. If search results can be a few hours stale (a product catalog), batch is fine. If users expect to see their own content immediately after posting (a social feed), you need streaming or near-real-time micro-batching.
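The micro-batching middle ground can be sketched in a few lines: group the incoming stream into small batches instead of processing one event at a time. The count-based cut below is an illustrative simplification:

```python
def micro_batches(events, batch_size):
    """Group a stream into small batches for near-real-time processing."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the partial batch at end of input
        yield batch

print(list(micro_batches([1, 2, 3, 4, 5], batch_size=2)))
# [[1, 2], [3, 4], [5]]
```

Real micro-batch engines such as Spark Structured Streaming cut batches by time interval rather than by count, but the trade-off is the same: accept a small, bounded delay in exchange for batch-style simplicity.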
ML training: Batch. Model training runs on historical data. You collect a training dataset, train the model, evaluate it, and deploy. Streaming adds no value here. However, model serving (inference) might need low latency, and that is a separate pipeline.
The classic mistake is proposing streaming when batch is sufficient. It signals that you optimize for technical complexity over practical trade-offs. Experienced engineers default to the simplest solution that meets the latency requirement. If the data can be hours old, batch wins on every other dimension: cost, complexity, debuggability, and reliability.
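The default-to-batch rule is simple enough to state as code. The 60-second threshold below is an illustrative assumption, not a universal constant; the point is that the decision hinges on one input, the maximum staleness the consumer can tolerate:

```python
def choose_architecture(max_staleness_seconds):
    """Default to batch; pick streaming only when low latency
    is a hard requirement (threshold here is an assumption)."""
    return "streaming" if max_staleness_seconds < 60 else "batch"

print(choose_architecture(24 * 3600))  # daily report -> 'batch'
print(choose_architecture(1))          # fraud check  -> 'streaming'
```

In an interview, stating the rule and then probing the actual staleness requirement is the behavior that distinguishes candidates.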
SQL and Python are the two most-tested skills in DE interviews. Every challenge runs real code against real data.