Data Engineering Interview Prep

Data Pipeline Interview Questions

Pipeline questions test operational depth: debugging production failures, designing backfill strategies, handling schema drift in running systems, and building incremental pipelines that actually work. These are code-level and implementation-level questions, not whiteboard architecture.

For high-level architecture and trade-off analysis, see our system design guide. This page covers the operational side: 18 questions across 6 topics that test whether you have built and maintained real pipelines.

1. Debugging Production Failures
Difficulty: Hard · Frequency: Very High

Your daily pipeline failed at 3am. The on-call page fired. Interviewers use this scenario to test whether you have real operational experience or just textbook knowledge.

Q1

Your daily pipeline failed at 3am. Walk through your runbook from the moment you get the page to the moment data is flowing again.

Q2

A pipeline succeeded but loaded zero rows. Downstream dashboards show blanks. How do you diagnose whether the issue is in your pipeline or in the source?

Q3

A transformation that has worked for 6 months suddenly produces wrong results. Nothing in your code changed. What do you investigate?

2. Orchestration and DAG Design
Difficulty: Medium-Hard · Frequency: High

Scheduling, dependency management, and task granularity. Interviewers test whether you can translate business requirements into a DAG that is debuggable, restartable, and does not cascade failures.

Q1

You have 20 tables that need to be refreshed daily with complex dependencies between them. How do you model this as a DAG? What granularity do you choose for each task?
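One way to think through this question before picking an orchestrator: express the tables as a dependency map (one task per table keeps failures isolated and restarts granular) and let a topological sort produce a valid execution order. A minimal sketch using Python's standard-library `graphlib`; the table names are hypothetical:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: task -> set of upstream tasks.
# One task per table means a single table's failure blocks only
# its descendants, and a restart re-runs only that table.
DEPS = {
    "stg_orders": set(),
    "stg_customers": set(),
    "dim_customers": {"stg_customers"},
    "fct_orders": {"stg_orders", "dim_customers"},
}

def run_order(deps):
    """Return one valid execution order; any orchestrator computes
    something equivalent from the same dependency information."""
    return list(TopologicalSorter(deps).static_order())
```

Real orchestrators add parallelism within each level of the sort, but the dependency map itself is the part the interviewer wants you to get right.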

Q2

A task in the middle of a 15-step pipeline fails. Walk through your recovery strategy. How do you restart from the failure point without re-running upstream tasks?

Q3

What are the trade-offs between time-based scheduling and event-driven triggers? Give a concrete example where each is the right choice.

3. Error Handling and Retry Logic
Difficulty: Medium · Frequency: High

Production pipelines encounter transient API failures, malformed records, rate limits, and upstream schema changes. Interviewers want to see that you build systems that degrade gracefully.

Q1

How do you handle a source API that intermittently returns 500 errors? Walk through your retry strategy including backoff, jitter, and max attempts.
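A strong answer names concrete numbers. A minimal sketch of capped exponential backoff with full jitter; the `TransientError` class and the specific defaults (5 attempts, 1s base, 30s cap) are illustrative choices, not fixed rules:

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a retryable failure, e.g. an HTTP 5xx from the source."""

def fetch_with_retry(fetch, max_attempts=5, base_delay=1.0, max_delay=30.0,
                     sleep=time.sleep):
    """Retry a flaky call with capped exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the failure to the orchestrator
            # Exponential backoff capped at max_delay; full jitter spreads
            # retries out so many failing tasks don't hammer the API in sync.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** (attempt - 1)))
            sleep(delay)
```

The jitter detail is what separates practiced answers: without it, every task that failed at the same moment retries at the same moment.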

Q2

A pipeline processes 10M records and 50 fail parsing. Should the pipeline halt, skip, or quarantine? Defend your choice for a specific business context.

Q3

Describe how you would implement a dead letter queue for records that fail validation. How do you reprocess them after fixing the issue?
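A sketch of the pattern using plain lists as stand-ins for the sink table and the DLQ table; the `amount` validation rule is a hypothetical example. The key ideas are quarantining the original payload with its error, and replaying the DLQ through the same validation path after the fix:

```python
import json
from datetime import datetime, timezone

def validate(record):
    """Hypothetical validation rule: every record needs a positive amount."""
    if record.get("amount", 0) <= 0:
        raise ValueError("amount must be positive")
    return record

def process(records, sink, dlq):
    """Load valid records; quarantine failures with enough context to replay."""
    for record in records:
        try:
            sink.append(validate(record))
        except ValueError as err:
            dlq.append({
                "payload": json.dumps(record),   # original record, untouched
                "error": str(err),               # why it failed
                "failed_at": datetime.now(timezone.utc).isoformat(),
            })

def reprocess(dlq, sink):
    """After a fix, replay the DLQ through the same validation path."""
    still_failing = []
    process((json.loads(item["payload"]) for item in dlq), sink, still_failing)
    dlq[:] = still_failing  # anything that fails again stays quarantined
```

Storing the untouched payload matters: if your DLQ only keeps a parsed or truncated copy, you cannot faithfully replay it.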

4. Backfill and Reprocessing
Difficulty: Hard · Frequency: High

Backfilling is reprocessing historical data after a bug fix, schema change, or new requirement. It sounds simple but is one of the hardest operational problems in data engineering.

Q1

You discover a transformation bug that has been producing incorrect results for 3 months. Walk through your backfill plan, including how you avoid disrupting live pipelines.
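One common shape for this answer: split the 3 months into small, independently re-runnable date windows, write corrected output to a shadow table during off-peak hours, validate, then swap. A minimal sketch of the windowing step; the 7-day batch size is an illustrative choice:

```python
from datetime import date, timedelta

def backfill_windows(start, end, batch_days=7):
    """Split a long backfill into small, independently re-runnable windows.

    Each window should be an idempotent unit of work: re-running it
    overwrites that date partition rather than appending duplicates,
    so a failure mid-backfill costs one window, not the whole run.
    """
    current = start
    while current <= end:
        window_end = min(current + timedelta(days=batch_days - 1), end)
        yield (current, window_end)
        current = window_end + timedelta(days=1)
```

Small windows also give you natural throttle points, which is how you keep the backfill from starving the live daily pipeline of warehouse capacity.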

Q2

How would you backfill a pipeline that depends on an external API with rate limits? What if the API does not support historical queries?

Q3

Explain the difference between full backfill and incremental backfill. When would you choose each, and what are the risks of each approach?

5. Schema Evolution in Running Pipelines
Difficulty: Medium-Hard · Frequency: Medium

Schemas change without warning. Columns get added, types get modified, fields get deprecated. These questions test whether your pipelines handle drift without silent data loss.

Q1

A source system adds a new column without warning. How does your pipeline detect this, and what does it do? Walk through the code-level handling.
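A minimal sketch of the detection step, assuming a hypothetical expected-column contract. A common policy: missing columns are fatal, new columns are tolerated but reported so they can be added to the contract deliberately rather than silently dropped:

```python
EXPECTED_COLUMNS = {"order_id", "customer_id", "amount"}  # hypothetical contract

def check_schema(batch_columns, expected=EXPECTED_COLUMNS):
    """Compare an incoming batch's columns against the expected contract."""
    incoming = set(batch_columns)
    missing = expected - incoming
    if missing:
        # A missing column means downstream transformations will break
        # or silently produce nulls: halt and alert.
        raise ValueError(f"missing columns: {sorted(missing)}")
    added = incoming - expected
    return sorted(added)  # surface additions for alerting, don't fail the run
```

The policy split is the interesting part of the answer: additive changes are usually safe to land-and-alert, while removals and renames demand a hard stop.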

Q2

How do you handle a breaking schema change (like renaming a column) in a pipeline that serves multiple downstream teams? What is your migration strategy?

Q3

Describe a data quality gate you would place between ingestion and transformation. What specific checks does it run, and what happens when a check fails?
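A sketch of what such a gate might check; the specific checks (non-empty batch, row count within a tolerance band of the previous load, no null primary keys) and the 50–200% thresholds are illustrative assumptions:

```python
def quality_gate(rows, previous_count):
    """Run checks between ingestion and transformation; fail loudly on breach."""
    failures = []
    if not rows:
        failures.append("empty batch")
    elif previous_count and not (0.5 * previous_count <= len(rows) <= 2 * previous_count):
        # Volume anomalies often mean a partial extract or a duplicated load.
        failures.append(
            f"row count {len(rows)} outside [50%, 200%] of previous {previous_count}")
    if any(r.get("id") is None for r in rows):
        failures.append("null primary key")
    if failures:
        # Halting here keeps bad data out of every downstream table;
        # the alternative is quarantining and loading the clean remainder.
        raise RuntimeError("quality gate failed: " + "; ".join(failures))
    return len(rows)
```

When a check fails, the decision (halt everything vs. quarantine and continue) is the part the interviewer wants you to defend, not the check itself.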

6. Incremental Processing and State Management
Difficulty: Medium-Hard · Frequency: Medium

Full table scans do not scale. Interviewers test whether you can build pipelines that process only new or changed data, track state correctly, and handle late-arriving records.

Q1

How do you implement incremental ingestion from a source that does not have a reliable updated_at column? What are your options?
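One of the options worth naming here is content hashing: fingerprint each row and compare against the fingerprints from the last run. A minimal sketch, assuming rows are dicts with an `id` key and the fingerprint state is persisted between runs (shown here as a plain dict):

```python
import hashlib
import json

def row_fingerprint(row):
    """Stable hash of a row's full contents (key order normalized)."""
    canonical = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode()).hexdigest()

def detect_changes(source_rows, known_fingerprints):
    """Return rows whose content hash is new or changed since the last run,
    updating the fingerprint state in place."""
    changed = []
    for row in source_rows:
        fp = row_fingerprint(row)
        if known_fingerprints.get(row["id"]) != fp:
            changed.append(row)
            known_fingerprints[row["id"]] = fp
    return changed
```

The trade-off to mention: hashing still requires reading every source row, so it saves downstream work but not extract cost; log-based change data capture is the option that avoids the full read.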

Q2

Your pipeline uses a high-water mark to track progress. What happens if records arrive with timestamps older than your watermark? How do you handle late-arriving data?
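One common answer pattern: re-read a fixed lookback period behind the saved watermark on every run, and make the load an idempotent upsert so the re-read rows don't duplicate. A minimal sketch with a dict standing in for the target table; the 2-hour lookback is an illustrative choice that should match your source's observed lateness:

```python
from datetime import datetime, timedelta

def incremental_window(watermark, lookback=timedelta(hours=2)):
    """Start of the ingestion window for this run.

    Re-reading a lookback period behind the watermark picks up
    late-arriving records, at the cost of reprocessing some rows.
    """
    return watermark - lookback

def upsert(target, rows, key="id"):
    """Idempotent load: replace-by-key, so re-read rows don't duplicate."""
    for row in rows:
        target[row[key]] = row
```

Records later than the lookback window still slip through, which is why this pattern is usually paired with a periodic reconciliation job or an event-time watermark rather than treated as complete on its own.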

Q3

Describe how you would convert a full-refresh pipeline to incremental. What are the risks during the migration, and how do you validate correctness?

What Separates Pipeline Builders from Pipeline Describers

Most candidates can describe what a pipeline does. Few can walk through what they would actually do when one breaks at 3am. Interviewers use operational questions to test the gap between textbook knowledge and production experience.

Debugging is the real test. When a pipeline fails, where do you look first? Logs, row counts, upstream source health, schema diffs? Your debugging instincts reveal more about your experience than your architecture diagrams.

Backfill plans expose your depth. If you find a bug that corrupted 90 days of data, can you reprocess without affecting live pipelines? Without exceeding API rate limits? Without producing duplicate downstream records? Most candidates have never thought through this end to end.

Error handling is where code quality shows. Does your pipeline halt on the first bad record, or does it quarantine failures and keep processing? Can you explain your retry strategy with specific numbers (backoff intervals, max attempts, jitter)? Vague answers get zero credit.

Incremental processing is expected at senior levels. Full table refreshes are the junior approach. Senior candidates explain watermarks, change data capture, and how they handle late-arriving records that land after the watermark has advanced.

Data Pipeline Interview FAQ

How are pipeline questions different from system design questions?
System design questions ask you to architect a system on a whiteboard: pick components, discuss trade-offs between consistency and availability, estimate capacity. Pipeline questions ask you to build and operate: write the DAG, handle the 3am failure, backfill the corrupted data, evolve the schema without downtime. System design is about the diagram. Pipeline questions are about the code and the runbook.
What tools should I know for pipeline interview questions?
You should be familiar with at least one orchestrator (Airflow, Dagster, or Prefect), one streaming framework (Kafka, Flink, or Spark Streaming), and one warehouse (Snowflake, BigQuery, or Redshift). More important than the tool itself: can you explain your retry configuration, your backfill parameterization, and your schema validation logic? Interviewers want implementation detail, not product names.
How do I practice pipeline questions without production experience?
Build a small pipeline end to end. Ingest a public API (weather data, stock prices, GitHub events) into a local database on a schedule. Then break it on purpose: change the schema, introduce bad records, simulate API downtime. Practice diagnosing and fixing each failure. The debugging experience is what interviewers are testing for.
Why do interviewers focus so much on failure scenarios?
Because production pipelines fail constantly. Networks drop, APIs change, disks fill up, upstream schemas drift. An engineer who only designs for the happy path will build brittle systems. Interviewers ask about failures to assess whether you have operational experience and whether you think defensively about data quality.

Practice the Skills Pipelines Require

Pipeline questions test your SQL, Python, and system thinking together. Practice each skill with real execution.