Data Engineering Interview Prep
Pipeline questions test operational depth: debugging production failures, designing backfill strategies, handling schema drift in running systems, and building incremental pipelines that actually work. These are code-level and implementation-level questions, not whiteboard architecture.
For high-level architecture and trade-off analysis, see our system design guide. This page covers the operational side: 18 questions across 6 topics that test whether you have built and maintained real pipelines.
Your daily pipeline failed at 3am. The on-call page fired. Interviewers use this scenario to test whether you have real operational experience or just textbook knowledge.
Walk through your runbook from the moment you get the page to the moment data is flowing again.
A pipeline succeeded but loaded zero rows. Downstream dashboards show blanks. How do you diagnose whether the issue is in your pipeline or in the source?
A transformation that has worked for 6 months suddenly produces wrong results. Nothing in your code changed. What do you investigate?
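For the zero-rows scenario, the first diagnostic is usually mechanical: compare source and target row counts for the load window to see which side dropped the data. A minimal sketch (table, column, and function names here are hypothetical, and SQLite stands in for whatever warehouse you use):

```python
import sqlite3


def diagnose_zero_rows(conn, source_table, target_table, date_col, run_date):
    """Return which side of the pipeline lost the rows for run_date."""
    src = conn.execute(
        f"SELECT COUNT(*) FROM {source_table} WHERE {date_col} = ?", (run_date,)
    ).fetchone()[0]
    tgt = conn.execute(
        f"SELECT COUNT(*) FROM {target_table} WHERE {date_col} = ?", (run_date,)
    ).fetchone()[0]
    if src == 0:
        return "source"    # upstream produced nothing: escalate to the source team
    if tgt == 0:
        return "pipeline"  # source has data but the load dropped it: debug your side
    return "ok"
```

If the source count is zero, the conversation moves upstream; if the source has rows and the target does not, the bug is yours, and the next step is diffing the extract filter and the load logs for that run date.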
Scheduling, dependency management, and task granularity. Interviewers test whether you can translate business requirements into a DAG that is debuggable, restartable, and does not cascade failures.
You have 20 tables that need to be refreshed daily with complex dependencies between them. How do you model this as a DAG? What granularity do you choose for each task?
A task in the middle of a 15-step pipeline fails. Walk through your recovery strategy. How do you restart from the failure point without re-running upstream tasks?
What are the trade-offs between time-based scheduling and event-driven triggers? Give a concrete example where each is the right choice.
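The dependency-modeling question above boils down to a topological sort: each table runs only after everything it reads from. A sketch using Python's standard-library `graphlib` (the table names are invented for illustration; in practice the mapping would come from your orchestrator or metadata store):

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each table lists the tables it reads from.
deps = {
    "dim_users": set(),
    "dim_products": set(),
    "fct_orders": {"dim_users", "dim_products"},
    "agg_daily_revenue": {"fct_orders"},
}

# static_order() yields every table after all of its predecessors,
# so refreshing in this order never reads a stale input.
run_order = list(TopologicalSorter(deps).static_order())
```

Granularity follows from the same map: one task per table keeps failures isolated and restartable, while lumping several tables into one task means a single bad table forces you to re-run its healthy neighbors.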
Production pipelines encounter transient API failures, malformed records, rate limits, and upstream schema changes. Interviewers want to see that you build systems that degrade gracefully.
How do you handle a source API that intermittently returns 500 errors? Walk through your retry strategy including backoff, jitter, and max attempts.
A pipeline processes 10M records and 50 fail parsing. Should the pipeline halt, skip, or quarantine? Defend your choice for a specific business context.
Describe how you would implement a dead letter queue for records that fail validation. How do you reprocess them after fixing the issue?
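A retry answer with specific numbers might look like the following sketch: exponential backoff capped at a maximum delay, with full jitter so that many clients retrying at once do not hammer the API in lockstep. `TransientError` is a placeholder for whatever exception your HTTP client raises on a 500:

```python
import random
import time


class TransientError(Exception):
    """Stand-in for a retryable failure, e.g. an HTTP 500."""


def fetch_with_retry(fetch, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Call fetch(), retrying transient failures with capped exponential
    backoff and full jitter. Re-raises after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except TransientError:
            if attempt == max_attempts:
                raise
            # Full jitter: sleep a random amount up to the exponential cap.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, cap))
```

The same shape extends to the dead letter question: the `except` branch for a *non*-retryable failure writes the record plus the error to a quarantine table instead of sleeping, and reprocessing is just replaying that table through the fixed pipeline.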
Backfilling is reprocessing historical data after a bug fix, schema change, or new requirement. It sounds simple but is one of the hardest operational problems in data engineering.
You discover a transformation bug that has been producing incorrect results for 3 months. Walk through your backfill plan, including how you avoid disrupting live pipelines.
How would you backfill a pipeline that depends on an external API with rate limits? What if the API does not support historical queries?
Explain the difference between full backfill and incremental backfill. When would you choose each, and what are the risks of each approach?
Schemas change without warning. Columns get added, types get modified, fields get deprecated. These questions test whether your pipelines handle drift without silent data loss.
A source system adds a new column without warning. How does your pipeline detect this, and what does it do? Walk through the code-level handling.
How do you handle a breaking schema change (like renaming a column) in a pipeline that serves multiple downstream teams? What is your migration strategy?
Describe a data quality gate you would place between ingestion and transformation. What specific checks does it run, and what happens when a check fails?
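Code-level drift handling usually means diffing each incoming record's fields against a declared contract: an added field is non-breaking (log it, land it raw), a missing field is breaking (halt before you silently load nulls). A sketch, with a made-up contract for illustration:

```python
# Hypothetical schema contract for an orders feed.
EXPECTED_FIELDS = {"order_id", "amount", "created_at"}


def check_schema(record, expected=EXPECTED_FIELDS):
    """Classify schema drift in one record against the contract.

    Returns (action, fields): 'halt' on missing required fields,
    'warn' on unexpected new fields, 'ok' otherwise.
    """
    incoming = set(record)
    missing = expected - incoming   # breaking: a field we depend on is gone
    added = incoming - expected     # non-breaking: capture it, alert the team
    if missing:
        return ("halt", sorted(missing))
    if added:
        return ("warn", sorted(added))
    return ("ok", [])
```

A quality gate between ingestion and transformation is this check plus row-level rules (null rates, value ranges, referential integrity) run over the batch, with `halt` stopping the load and `warn` routing to an alert rather than blocking.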
Full table scans do not scale. Interviewers test whether you can build pipelines that process only new or changed data, track state correctly, and handle late-arriving records.
How do you implement incremental ingestion from a source that does not have a reliable updated_at column? What are your options?
Your pipeline uses a high-water mark to track progress. What happens if records arrive with timestamps older than your watermark? How do you handle late-arriving data?
Describe how you would convert a full-refresh pipeline to incremental. What are the risks during the migration, and how do you validate correctness?
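The watermark question has a standard shape: pull everything newer than the watermark *minus a lookback window*, then upsert, so late-arriving rows inside the window are caught and the overlap does not duplicate. A sketch over in-memory tuples (the two-hour lookback is an assumption about how late this source can be):

```python
from datetime import datetime, timedelta

LOOKBACK = timedelta(hours=2)  # assumption: rows can arrive up to 2h late


def pull_incremental(rows, state):
    """Select rows newer than (watermark - lookback) and advance the watermark.

    rows:  iterable of (record_id, updated_at) tuples from the source.
    state: dict persisting the high-water mark between runs.
    The caller must merge the batch on primary key (upsert), because the
    lookback overlap intentionally re-reads recently loaded rows.
    """
    start = state["watermark"] - LOOKBACK
    batch = [r for r in rows if r[1] >= start]          # overlap catches late rows
    if batch:
        state["watermark"] = max(r[1] for r in batch)   # advance the mark
    return batch
```

Records older than the lookback window still slip through, which is why the honest senior answer pairs this with a periodic reconciliation job (or CDC) rather than claiming the watermark alone is airtight.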
Most candidates can describe what a pipeline does. Few can walk through what they would actually do when one breaks at 3am. Interviewers use operational questions to test the gap between textbook knowledge and production experience.
Debugging is the real test. When a pipeline fails, where do you look first? Logs, row counts, upstream source health, schema diffs? Your debugging instincts reveal more about your experience than your architecture diagrams.
Backfill plans expose your depth. If you find a bug that corrupted 90 days of data, can you reprocess without affecting live pipelines? Without exceeding API rate limits? Without producing duplicate downstream records? Most candidates have never thought through this end to end.
Error handling is where code quality shows. Does your pipeline halt on the first bad record, or does it quarantine failures and keep processing? Can you explain your retry strategy with specific numbers (backoff intervals, max attempts, jitter)? Vague answers get zero credit.
Incremental processing is expected at senior levels. Full table refreshes are the junior approach. Senior candidates explain watermarks, change data capture, and how they handle late-arriving records that land after the watermark has advanced.
Pipeline questions test your SQL, Python, and system thinking together. Practice each skill with real execution.