Making It Repeatable: Beginner

'How do you orchestrate your pipelines?' is where the interview shifts from data concepts to engineering practices. The interviewer wants to hear about DAGs and Airflow, dependency management, safe retries, environments, and CI/CD. These answers prove you can ship and maintain production systems.

What you will be able to do

Explain DAGs and orchestration without overcomplicating it

Answer 'How do you make retries safe?' with the idempotency pattern

Walk through dev/staging/prod like someone who has deployed to all three

DAG Orchestration

Answer orchestration questions with DAG fundamentals

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"How do you orchestrate your pipelines?"
▸"What is a DAG?"
▸"Why not just use cron?"

What They Want to Hear

'A DAG (Directed Acyclic Graph) defines tasks and their dependencies. Airflow is the most widely used orchestrator. Each task is a node, and edges define ordering. If Task B depends on Task A, Airflow will not start B until A succeeds.' Then the key insight: 'Cron can schedule a job, but it cannot manage dependencies between jobs, retry failed tasks, or show you the state of your entire pipeline at a glance. That is why Airflow exists.'

What to Whiteboard: Extract -> Transform -> Load -> Quality Checks

The Curveball Follow-ups

After your initial answer, expect these probes

▸"What happens when a task in the middle fails?" Downstream tasks do not run. Airflow marks them as 'upstream_failed.' The on-call fixes the cause and clears the failed task. Airflow re-runs that task and everything downstream of it, not the tasks that already succeeded.
▸"What other orchestrators exist?" Dagster (more opinionated, better testing), Prefect (Python-native, cloud-first), Mage (newer, built for ELT). Airflow is the most common, but Dagster is gaining fast.
▸"What is the DAG of DAGs problem?" When you have 200+ DAGs and they depend on each other across team boundaries. Cross-DAG dependencies are harder than within-DAG dependencies.
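The ordering and failure behavior described above can be sketched in plain Python. This is a toy runner built on the standard library's graphlib, not Airflow's scheduler. The task names and the run_dag helper are hypothetical, chosen to match the Extract, Transform, Load, Quality Checks example.

```python
from graphlib import TopologicalSorter


def run_dag(dependencies, tasks):
    """Run tasks in dependency order and return a dict of task name -> state."""
    states = {}
    # static_order() yields a task only after all of its upstream tasks.
    for name in TopologicalSorter(dependencies).static_order():
        upstream = dependencies.get(name, ())
        if any(states[u] != "success" for u in upstream):
            # Same idea as Airflow's upstream_failed state: the task never starts.
            states[name] = "upstream_failed"
            continue
        try:
            tasks[name]()
            states[name] = "success"
        except Exception:
            states[name] = "failed"
    return states


def failing_transform():
    raise RuntimeError("transform broke")


# Each key maps a task to the set of tasks it depends on (hypothetical names).
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "quality_checks": {"load"},
}
tasks = {
    "extract": lambda: None,
    "transform": failing_transform,
    "load": lambda: None,
    "quality_checks": lambda: None,
}

states = run_dag(dependencies, tasks)
for name, state in states.items():
    print(f"{name}: {state}")
```

The point to make on a whiteboard is the same one the sketch makes: a failure in the middle stops everything downstream, and the work upstream of the failure is already done and does not need to be repeated.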

KEY TAKEAWAYS

Say: 'Airflow orchestrates tasks as a DAG: define dependencies, retry on failure, track state.'

Why not cron: no dependency management, no retry logic, no visibility into pipeline state

Name-drop alternatives: Dagster (better testing), Prefect (Python-native)

Task Dependencies

Answer dependency management questions

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"How do you handle dependencies between pipelines?"
▸"What if Pipeline B needs data from Pipeline A?"
▸"Sensor vs trigger: when do you use each?"

What They Want to Hear

'Within a DAG, dependencies are edges: Task A runs before Task B. Across DAGs, it gets harder. Sensors poll for a signal (a file exists, a partition is populated). Event triggers fire when data is ready (Airflow datasets since v2.4). The tradeoff: sensors waste compute by polling, event triggers are more efficient but require the producer to publish a signal.'
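To make the cost of polling concrete, here is a minimal sketch of sensor-style waiting. It is not Airflow's sensor implementation. The function name, the 60-second interval, and the 600-second timeout are assumptions for illustration, and time is simulated so the example runs instantly.

```python
import time


def poll_until_ready(is_ready, poke_interval_s, timeout_s, sleep=time.sleep):
    """Sensor-style wait: call is_ready() every poke_interval_s seconds.

    Returns the number of checks made. Raises TimeoutError once another
    wait would exceed timeout_s.
    """
    waited = 0.0
    pokes = 0
    while True:
        pokes += 1
        if is_ready():
            return pokes
        if waited + poke_interval_s > timeout_s:
            raise TimeoutError(f"not ready after {waited:.0f}s and {pokes} checks")
        sleep(poke_interval_s)
        waited += poke_interval_s


# Simulated upstream partition that only appears on the fourth check.
checks = iter([False, False, False, True])
pokes = poll_until_ready(
    lambda: next(checks),
    poke_interval_s=60,
    timeout_s=600,
    sleep=lambda seconds: None,  # skip real sleeping in this demo
)
print(f"ready after {pokes} checks")
```

The first three checks do no useful work, which is the overhead an event trigger avoids. The timeout is what turns a late producer into an explicit failure instead of a consumer that waits forever.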

Cross-DAG Dependency: Producer DAG -> Publish Readiness -> Consumer DAG Waits -> Consumer Processes

The Curveball Follow-ups

After your initial answer, expect these probes

▸"What if the producer DAG is late?" The sensor times out. You need an SLA: if the data is not ready by X time, alert the producer team and serve stale data to the consumer.
▸"How do you avoid a web of cross-DAG dependencies?" Limit cross-DAG dependencies to published interfaces (tables with SLAs). Teams should depend on tables, not on other teams' DAGs directly.
▸"What are Airflow datasets?" Data-aware scheduling introduced in Airflow 2.4 (renamed 'assets' in Airflow 3). A producer DAG declares that it updates a dataset. Consumer DAGs trigger automatically when that dataset is updated. No polling needed.
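The producer/consumer contract behind data-aware scheduling can be sketched without Airflow. The DatasetRegistry class, the dataset URI, and the consumer names below are all hypothetical. This shows the shape of the idea, not Airflow's API.

```python
class DatasetRegistry:
    """Toy event bus: consumers subscribe to a dataset, producers publish updates."""

    def __init__(self):
        self._consumers = {}

    def subscribe(self, dataset_uri, trigger):
        """Register a callable to run whenever dataset_uri is updated."""
        self._consumers.setdefault(dataset_uri, []).append(trigger)

    def publish_update(self, dataset_uri):
        """Called by the producer after a successful write.

        Triggers every subscribed consumer once and returns how many ran.
        """
        consumers = self._consumers.get(dataset_uri, [])
        for trigger in consumers:
            trigger(dataset_uri)
        return len(consumers)


registry = DatasetRegistry()
runs = []

# Two consumer pipelines depend on the table, not on the producer's DAG.
registry.subscribe("warehouse/orders_daily", lambda uri: runs.append(("revenue_report", uri)))
registry.subscribe("warehouse/orders_daily", lambda uri: runs.append(("churn_model", uri)))

# The producer finishes its load and announces that the table is fresh.
triggered = registry.publish_update("warehouse/orders_daily")
print(triggered, runs)
```

Nothing runs until the producer publishes, so no consumer burns cycles checking. The tradeoff from the answer above is visible too: if the producer never calls publish_update, the consumers never start.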

KEY TAKEAWAYS

Say: 'Within DAG: edges. Across DAGs: sensors or event triggers. Depend on tables, not on other teams' DAGs.'

Sensors poll (waste compute). Event triggers react (efficient but need producer cooperation).

The pro tip: 'Airflow datasets (v2.4+) eliminate polling for cross-DAG dependencies.'

Safe Retries

Answer retry safety questions with production patterns

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"What happens when a pipeline fails?"
▸"How do you make retries safe?"
▸"What is exponential backoff?"

What They Want to Hear

'Two things make retries safe: idempotent writes (running twice produces the same result) and exponential backoff (wait longer between each retry). I classify failures as transient (network timeout, API rate limit: retry with backoff) or permanent (schema mismatch, bad data: alert human, do not retry).'
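The classification and backoff described above can be sketched as a small retry wrapper. The exception classes, function names, and default values are assumptions for illustration. In practice an orchestrator's built-in retry settings would usually do this job.

```python
import random
import time


class TransientError(Exception):
    """Worth retrying: network timeout, API rate limit."""


class PermanentError(Exception):
    """Not worth retrying: schema mismatch, bad data."""


def backoff_delays(base_s, retries, jitter_s=0.0, rng=random):
    """Delays that double each retry (base, 2x, 4x, ...) plus random jitter."""
    return [base_s * 2 ** attempt + rng.uniform(0, jitter_s) for attempt in range(retries)]


def run_with_retries(task, retries=3, base_s=1.0, jitter_s=0.5, sleep=time.sleep):
    """Run task(), retrying only transient failures with exponential backoff."""
    delays = backoff_delays(base_s, retries, jitter_s)
    for attempt in range(retries + 1):
        try:
            return task()
        except PermanentError:
            raise  # retrying cannot fix this, so surface it to a human
        except TransientError:
            if attempt == retries:
                raise  # out of retries, so the failure reaches the on-call
            sleep(delays[attempt])


calls = {"n": 0}


def flaky_task():
    """Fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("network timeout")
    return "ok"


slept = []
result = run_with_retries(flaky_task, sleep=slept.append)  # record waits, do not sleep
print(result, calls["n"], slept)
```

A permanent failure exits on the first attempt, which is the behavior you want: three retries of a schema mismatch only delay the alert.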

Retry Logic: Task Runs -> Success, Transient Failure, or Permanent Failure

The Curveball Follow-ups

After your initial answer, expect these probes

▸"How many times do you retry?" 3 retries for transient failures, with the wait growing each time (for example 1 min, 5 min, 30 min). After that, alert the on-call.
▸"What is exponential backoff?" Double the wait after each retry: 1s, 2s, 4s, 8s. Add random jitter to avoid a thundering herd (every client retrying at the same moment).
▸"What if the pipeline is not idempotent and you retry?" You get duplicate data. This is why idempotency is the FIRST thing to build. Without it, retries are dangerous.
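The duplicate-data problem is easy to demonstrate. The sketch below uses an in-memory SQLite table as a stand-in for a warehouse. The table name, columns, and both load functions are hypothetical. Overwriting a partition inside a transaction is one common way to make a write idempotent, not the only one.

```python
import sqlite3

INSERT_SQL = "INSERT INTO daily_sales (run_date, store, amount) VALUES (?, ?, ?)"


def append_load(conn, run_date, rows):
    """Not idempotent: every run adds the same rows again."""
    conn.executemany(INSERT_SQL, [(run_date, store, amount) for store, amount in rows])
    conn.commit()


def overwrite_partition_load(conn, run_date, rows):
    """Idempotent: replace the whole run_date partition in one transaction."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("DELETE FROM daily_sales WHERE run_date = ?", (run_date,))
        conn.executemany(INSERT_SQL, [(run_date, store, amount) for store, amount in rows])


def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (run_date TEXT, store TEXT, amount REAL)")
rows = [("north", 120.0), ("south", 80.0)]

# Run the idempotent load twice, as a retry would.
overwrite_partition_load(conn, "2024-01-01", rows)
overwrite_partition_load(conn, "2024-01-01", rows)
count_after_idempotent_retry = row_count(conn)

# Run the append load twice for a different date.
append_load(conn, "2024-01-02", rows)
append_load(conn, "2024-01-02", rows)
count_after_append_retry = row_count(conn)

print(count_after_idempotent_retry, count_after_append_retry)
```

The retried overwrite leaves two rows for its date. The retried append leaves four for its date, and nothing raises an error to tell you.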

KEY TAKEAWAYS

Say: 'Idempotent writes + exponential backoff + classifying transient vs permanent failures.'

Transient: retry with backoff. Permanent: alert human, stop retrying.

Without idempotency, retries create duplicate data. Always build idempotency first.

Dev/Staging/Prod

Walk through environment strategy for pipelines

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"What environments do you use?"
▸"How do you test pipeline changes before production?"
▸"What is 'thin staging'?"

What They Want to Hear

'Three environments: dev (fast iteration, sample data), staging (production-like infrastructure, recent data snapshot), prod (real data, real consumers). Staging catches issues that dev misses: permission errors, data volume differences, infrastructure differences. A common cost optimization is thin staging: production-like infrastructure but only the last 7 days of data instead of the full history.'

Deployment Path: Dev -> Staging -> Prod

The Curveball Follow-ups

After your initial answer, expect these probes

▸"Is staging worth the cost?" Yes for data pipelines. Data bugs are silent: they do not throw errors, they produce wrong numbers. Staging catches these before consumers see them.
▸"What is thin staging?" Same infrastructure as prod but with a subset of data (last 7 days instead of 5 years). This cuts staging storage cost by roughly 95% while still catching infrastructure and permission issues. Problems that only appear at full production volume can still slip through.
▸"How do you promote changes from staging to prod?" Same CI/CD pipeline: merge to main, automated tests in staging, deploy to prod if all pass. Never manual promotion.
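One way to express the three environments is a small config that controls how much data each one loads. Everything here is an assumption for illustration: the EnvConfig fields, the dev values, and the load_window helper. Only the 7-day staging window and the 5-year history come from the example above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class EnvConfig:
    name: str
    lookback_days: Optional[int]  # None means full history
    sample_fraction: float        # share of rows to keep


ENVIRONMENTS = {
    "dev": EnvConfig("dev", lookback_days=1, sample_fraction=0.01),
    "staging": EnvConfig("staging", lookback_days=7, sample_fraction=1.0),  # thin staging
    "prod": EnvConfig("prod", lookback_days=None, sample_fraction=1.0),
}


def load_window(env, today, history_start):
    """Return the (start, end) date range a pipeline should load in this environment."""
    cfg = ENVIRONMENTS[env]
    if cfg.lookback_days is None:
        return history_start, today
    start = max(history_start, today - timedelta(days=cfg.lookback_days))
    return start, today


today = date(2024, 6, 30)
history_start = date(2019, 6, 30)

staging_window = load_window("staging", today, history_start)
prod_window = load_window("prod", today, history_start)

# Seven days out of roughly five years of history, by day count.
thin_fraction = 7 / (5 * 365)
print(staging_window, prod_window, round(thin_fraction, 4))
```

The same pipeline code runs everywhere and only the config changes, which is what lets a CI/CD pipeline promote a change without manual edits. By day count the staging window is well under 1% of the history, which is also why tests that depend on scale can pass there and still fail in prod.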

KEY TAKEAWAYS

Say: 'Dev for iteration, staging for validation, prod for real data. Thin staging cuts staging storage cost by roughly 95%.'

Data bugs are silent. Staging catches wrong numbers before consumers see them.

Never manual promotion. CI/CD pipeline handles dev -> staging -> prod.

CI/CD for Data

Answer CI/CD for data pipelines questions

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"Do you have CI/CD for your pipelines?"
▸"What tests run before a pipeline change ships?"
▸"How do you deploy pipeline changes safely?"

What They Want to Hear

'Every PR triggers: linting, unit tests, schema validation (fast, seconds). On merge to main: integration tests in staging (minutes). Before release: data diff on a staging subset (optional, hours). On deploy: canary rollout if the change affects critical pipelines.' The key insight: fast checks on every push, slow checks before release. Never skip the fast checks.
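Schema validation is one of the fast checks, and a minimal version fits in a few lines. The expected schema, column names, and schema_errors helper below are hypothetical. Real projects often use a dedicated validation library instead.

```python
# Hypothetical contract for an orders table: column name -> Python type.
EXPECTED_SCHEMA = {"order_id": int, "customer_id": int, "amount": float}


def schema_errors(rows, expected):
    """Return a list of human-readable schema problems (empty list means valid)."""
    errors = []
    for i, row in enumerate(rows):
        for col in sorted(expected.keys() - row.keys()):
            errors.append(f"row {i}: missing column {col}")
        for col, typ in expected.items():
            if col in row and not isinstance(row[col], typ):
                actual = type(row[col]).__name__
                errors.append(f"row {i}: {col} expected {typ.__name__}, got {actual}")
    return errors


good_rows = [{"order_id": 1, "customer_id": 7, "amount": 19.99}]
bad_rows = [{"order_id": "1", "amount": 5.0}]

good_result = schema_errors(good_rows, EXPECTED_SCHEMA)
bad_result = schema_errors(bad_rows, EXPECTED_SCHEMA)
print(good_result)
print(bad_result)
```

A check like this runs in milliseconds against a small fixture, so it belongs in the PR stage. In a CI job it would typically sit inside a unit test that fails when the returned list is not empty.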

CI/CD Pipeline: Pull Request -> Lint + Unit Test -> Integration Test -> Deploy to Prod

The Curveball Follow-ups

After your initial answer, expect these probes

▸"What if a test passes in staging but fails in prod?" Data volume differences. Staging has 7 days of data, prod has 5 years. The test might pass on small data but fail at scale. This is why data diff on a prod-like subset matters for critical changes.
▸"How do you roll back a bad deployment?" Revert the PR and redeploy. If the pipeline is idempotent, re-running the old code on the affected partitions overwrites the bad output with the correct output.
▸"What about schema migrations?" Schema changes need their own deployment process: backwards-compatible changes first (add column), then code change (use new column), then cleanup (remove old column). Never deploy a breaking schema change and a code change in the same release.
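The three-step migration order can be walked through with an in-memory SQLite table. The table, the column names, and the two reader functions are hypothetical stand-ins for the old and new releases. The cleanup step is left as a comment because it should only run once no deployed code reads the old column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, full_name TEXT)")
conn.execute("INSERT INTO customers (id, full_name) VALUES (1, 'Ada Lovelace')")


def old_reader(conn):
    """Code from the current release: only knows about full_name."""
    return conn.execute("SELECT id, full_name FROM customers ORDER BY id").fetchall()


def new_reader(conn):
    """Code from the next release: prefers display_name, falls back to full_name."""
    return conn.execute(
        "SELECT id, COALESCE(display_name, full_name) FROM customers ORDER BY id"
    ).fetchall()


before = old_reader(conn)

# Step 1: backwards-compatible schema change. Adding a nullable column
# does not break any query that the old code runs.
conn.execute("ALTER TABLE customers ADD COLUMN display_name TEXT")
after_expand = old_reader(conn)

# Backfill the new column so the new code has data to read.
conn.execute("UPDATE customers SET display_name = full_name WHERE display_name IS NULL")

# Step 2: deploy the code change that uses the new column.
after_code_change = new_reader(conn)

# Step 3 (not run here): drop full_name in a later release, after
# confirming that nothing still reads it.

print(before, after_expand, after_code_change)
```

Because the old reader returns the same rows before and after step 1, either release can run against the expanded schema. That is what makes each step independently reversible.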

KEY TAKEAWAYS

Say: 'Lint and unit test on every PR (seconds). Integration test on merge (minutes). Deploy if all pass.'

Fast checks on every push, slow checks before release. Never skip the fast checks.

Schema changes: backwards-compatible first, code change second, cleanup third

Answer orchestration, retries, environments, and CI/CD questions

Category: Pipeline Architecture
Difficulty: beginner
Duration: 20 minutes
Challenges: 0 hands-on challenges

Topics covered: DAG Orchestration, Task Dependencies, Safe Retries, Dev/Staging/Prod, CI/CD for Data
