Making It Repeatable: Beginner

'How do you orchestrate your pipelines?' is where the interview shifts from data concepts to engineering practices. The interviewer wants to hear about DAGs and Airflow, dependency management, safe retries, environments, and CI/CD. These answers prove you can ship and maintain production systems.

DAG Orchestration

Answer orchestration questions with DAG fundamentals

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "How do you orchestrate your pipelines?"
  • "What is a DAG?"
  • "Why not just use cron?"

What They Want to Hear

'A DAG (Directed Acyclic Graph) defines tasks and their dependencies. Airflow is the most widely used orchestrator. Each task is a node, and edges define ordering. If Task B depends on Task A, by default Airflow will not start B until A succeeds.' Then the key insight: 'Cron can schedule a job, but it cannot manage dependencies between jobs, retry failed tasks, or show you the state of your entire pipeline at a glance. That is why Airflow exists.'

What to Whiteboard

  • Extract: pull data from source
  • Transform: clean, deduplicate
  • Load: write to warehouse
  • Quality Checks: validate output

Flow: Extract -> Transform -> Load -> Quality Checks, with each arrow labeled "on success".

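The whiteboard flow can also be sketched in a few lines of plain Python. This is a toy scheduler built on the standard library's `graphlib`, not Airflow's API; the function `run_dag` and the state names are assumptions chosen to mirror the behavior described above, where a task starts only after its upstream tasks succeed.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (its upstream tasks).
DEPENDENCIES = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "quality_checks": {"load"},
}


def run_dag(dependencies, task_fns):
    """Run tasks in dependency order and return a task -> state mapping."""
    states = {}
    for task in TopologicalSorter(dependencies).static_order():
        if any(states[up] != "success" for up in dependencies[task]):
            # An upstream task did not succeed, so this task never starts.
            states[task] = "upstream_failed"
            continue
        try:
            task_fns[task]()
            states[task] = "success"
        except Exception:
            states[task] = "failed"
    return states


def ok():
    """Stand-in for a task that succeeds."""


def broken():
    """Stand-in for a task that fails."""
    raise RuntimeError("transform failed")


all_ok = run_dag(DEPENDENCIES, {name: ok for name in DEPENDENCIES})
task_fns = {name: ok for name in DEPENDENCIES}
task_fns["transform"] = broken
with_failure = run_dag(DEPENDENCIES, task_fns)
```

With `transform` failing, `extract` ends as `success`, `transform` as `failed`, and both `load` and `quality_checks` as `upstream_failed`. This is the state picture that cron cannot give you.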
The Curveball Follow-ups

After your initial answer, expect these probes

  • "What happens when a task in the middle fails?" Downstream tasks do not run. Airflow marks them 'upstream_failed'. Once the on-call engineer fixes the cause and clears the failed task, Airflow re-runs it and continues from that point, not from the beginning.
  • "What other orchestrators exist?" Dagster (more opinionated, better testing), Prefect (Python-native, cloud-first), Mage (newer, built for ELT). Airflow is the most common, but Dagster is gaining adoption.
  • "What is the DAG of DAGs problem?" It appears when you have 200+ DAGs that depend on each other across team boundaries. Cross-DAG dependencies are harder to manage than within-DAG dependencies.
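To make the first probe concrete, here is a minimal sketch of "pick up from the failed task, not from the beginning". It is plain Python rather than Airflow code; `tasks_to_rerun` and the recorded states are hypothetical, and the idea is simply to skip every task that already succeeded.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
DEPENDENCIES = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "quality_checks": {"load"},
}


def tasks_to_rerun(dependencies, states):
    """Return the tasks that still need to run, in dependency order."""
    order = TopologicalSorter(dependencies).static_order()
    return [task for task in order if states.get(task) != "success"]


# States recorded by the failed run (assumed for illustration).
previous_run = {
    "extract": "success",
    "transform": "failed",
    "load": "upstream_failed",
    "quality_checks": "upstream_failed",
}

rerun_plan = tasks_to_rerun(DEPENDENCIES, previous_run)
```

Here `rerun_plan` is `["transform", "load", "quality_checks"]`: the extract step is not repeated.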
KEY TAKEAWAYS
Say: 'Airflow orchestrates tasks as a DAG: define dependencies, retry on failure, track state.'
Why not cron: no dependency management, no retry logic, no visibility into pipeline state
Name-drop alternatives: Dagster (better testing), Prefect (Python-native)

Task Dependencies

Answer dependency management questions

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "How do you handle dependencies between pipelines?"
  • "What if Pipeline B needs data from Pipeline A?"
  • "Sensor vs trigger: when do you use each?"

What They Want to Hear

'Within a DAG, dependencies are edges: Task A runs before Task B. Across DAGs, it gets harder. Sensors poll for a signal (a file exists, a partition is populated). Event triggers fire when data is ready (Airflow datasets since v2.4). The tradeoff: a polling sensor holds a worker slot while it waits unless it runs in reschedule or deferrable mode, while event triggers are more efficient but require the producer to publish a signal.'

Cross-DAG Dependency

  • Producer DAG: Team A writes orders_silver
  • Publish Readiness: write marker file or update metadata
  • Consumer DAG Waits: sensor checks for readiness signal
  • Consumer Processes: Team B reads orders_silver

Flow: Producer DAG -> Publish Readiness (on success) -> Consumer DAG Waits (signal) -> Consumer Processes (signal received).

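The sensor half of this flow can be sketched with the standard library alone. This is not Airflow's sensor API: `wait_for_marker`, the `_SUCCESS` file name, and the tiny intervals are assumptions for illustration. The point is that polling needs both an interval and a timeout.

```python
import tempfile
import time
from pathlib import Path


def wait_for_marker(marker: Path, poke_interval: float, timeout: float) -> bool:
    """Poll until the marker exists; return False if the timeout expires first."""
    deadline = time.monotonic() + timeout
    while True:
        if marker.exists():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poke_interval)


with tempfile.TemporaryDirectory() as tmp:
    marker = Path(tmp) / "orders_silver" / "_SUCCESS"

    # Producer has not published yet: the sensor gives up after the timeout.
    timed_out = not wait_for_marker(marker, poke_interval=0.01, timeout=0.05)

    # Producer finishes writing orders_silver and publishes the readiness marker.
    marker.parent.mkdir(parents=True)
    marker.touch()
    ready = wait_for_marker(marker, poke_interval=0.01, timeout=0.05)
```

Every loop iteration is work spent only on waiting, which is the compute cost the tradeoff refers to.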
The Curveball Follow-ups

After your initial answer, expect these probes

  • "What if the producer DAG is late?" The sensor times out and the consumer task fails. You need an SLA: if the data is not ready by an agreed time, alert the producer team and serve stale data to the consumer.
  • "How do you avoid a web of cross-DAG dependencies?" Limit cross-DAG dependencies to published interfaces (tables with SLAs). Teams should depend on tables, not on other teams' DAGs directly.
  • "What are Airflow datasets?" Data-aware scheduling introduced in Airflow 2.4 (the feature is called assets in Airflow 3). A producer DAG declares that it updates a dataset. Consumer DAGs trigger automatically when that dataset is updated. No polling needed.
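The event-driven alternative can be illustrated with a toy registry. This is a conceptual sketch, not Airflow's dataset API; `DatasetRegistry`, `subscribe`, `publish_update`, and the consumer name are all hypothetical. Consumers run only when a producer reports an update, so nothing polls in between.

```python
from collections import defaultdict


class DatasetRegistry:
    """Toy event bus: consumers subscribe to a dataset, producers publish updates."""

    def __init__(self):
        self._consumers = defaultdict(list)
        self.triggered = []  # log of (consumer, dataset) runs, in trigger order

    def subscribe(self, dataset, consumer):
        self._consumers[dataset].append(consumer)

    def publish_update(self, dataset):
        # Called by the producer after a successful write.
        for consumer in self._consumers[dataset]:
            self.triggered.append((consumer, dataset))


registry = DatasetRegistry()
registry.subscribe("orders_silver", "team_b_orders_report")

registry.publish_update("customers_silver")  # no subscribers, nothing triggers
registry.publish_update("orders_silver")     # triggers Team B's consumer
```

The design cost is visible in the last two lines: the consumer depends on the producer actually calling `publish_update`, which is the "producer cooperation" mentioned above.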

Key Takeaways

  • Say: 'Within DAG: edges. Across DAGs: sensors or event triggers. Depend on tables, not on other teams' DAGs.'
  • Sensors poll (waste compute). Event triggers react (efficient but need producer cooperation).
  • The pro tip: 'Airflow datasets (v2.4+) eliminate polling for cross-DAG dependencies.'

Safe Retries

Answer retry safety questions with production patterns

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "What happens when a pipeline fails?"
  • "How do you make retries safe?"
  • "What is exponential backoff?"

What They Want to Hear

'Two things make retries safe: idempotent writes (running twice produces the same result) and exponential backoff (wait longer between each retry). I classify failures as transient (network timeout, API rate limit: retry with backoff) or permanent (schema mismatch, bad data: alert human, do not retry).'
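
Retry Logic

  • Task Runs: attempt execution
  • Success: proceed to next task
  • Transient Failure: retry with backoff
  • Permanent Failure: alert human, stop retrying

Flow: Task Runs -> Success (completes); Task Runs -> Transient Failure (timeout/rate limit) -> Task Runs (retry after delay); Task Runs -> Permanent Failure (schema error/bad data).
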
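The retry flow can be written as a small loop. This is a sketch under assumptions: the exception classes, `run_with_retries`, and the `"alert_human"` outcome are hypothetical names, and the `sleep` function is injected so the example runs instantly while recording the delays it would have waited.

```python
class TransientError(Exception):
    """Failure that may succeed on retry (network timeout, API rate limit)."""


class PermanentError(Exception):
    """Failure that will not fix itself (schema mismatch, bad data)."""


def run_with_retries(task, max_retries, sleep, base_delay=1.0):
    """Run task, retrying only transient failures. Returns (outcome, attempts)."""
    attempts = 0
    while True:
        attempts += 1
        try:
            task()
            return "success", attempts
        except PermanentError:
            # Retrying cannot help: stop and hand over to a human.
            return "alert_human", attempts
        except TransientError:
            if attempts > max_retries:
                return "alert_human", attempts
            # Exponential backoff: 1x, 2x, 4x, ... the base delay.
            sleep(base_delay * 2 ** (attempts - 1))


calls = {"n": 0}


def flaky():
    """Times out twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("timeout")


def bad_schema():
    raise PermanentError("schema mismatch")


delays = []
flaky_result = run_with_retries(flaky, max_retries=3, sleep=delays.append)
bad_result = run_with_retries(bad_schema, max_retries=3, sleep=delays.append)
```

The flaky task succeeds on its third attempt after waiting 1 and 2 time units, while the schema failure stops after a single attempt with no wait.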
The Curveball Follow-ups

After your initial answer, expect these probes

  • "How many times do you retry?" 3 retries for transient failures with exponential backoff (for example 1 min, 2 min, 4 min). After that, alert the on-call.
  • "What is exponential backoff?" Double the wait between each retry: 1s, 2s, 4s, 8s. Add random jitter to avoid thundering herd (all retries firing at the same time).
  • "What if the pipeline is not idempotent and you retry?" You get duplicate data. This is why idempotency is the FIRST thing to build. Without it, retries are dangerous.
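The backoff schedule from the second probe can be computed directly. The sketch below uses "full jitter" (a random delay between zero and the exponential ceiling), which is one common variant rather than the only one; `backoff_ceiling`, `backoff_delay`, and the 60-second cap are assumptions.

```python
import random


def backoff_ceiling(retry, base=1.0, cap=60.0):
    """Upper bound for retry number `retry` (1-based): doubles each time, capped."""
    return min(cap, base * 2 ** (retry - 1))


def backoff_delay(retry, rng, base=1.0, cap=60.0):
    """Pick a random delay in [0, ceiling] so that many clients do not retry in lockstep."""
    return rng.uniform(0, backoff_ceiling(retry, base, cap))


ceilings = [backoff_ceiling(n) for n in range(1, 5)]

rng = random.Random(42)  # seeded only to make the example repeatable
delays = [backoff_delay(n, rng) for n in range(1, 5)]
```

`ceilings` is `[1.0, 2.0, 4.0, 8.0]`, matching the 1s, 2s, 4s, 8s sequence above. Each entry in `delays` falls somewhere below its ceiling, which spreads retries out and avoids the thundering herd.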

Key Takeaways

  • Say: 'Idempotent writes + exponential backoff + classifying transient vs permanent failures.'
  • Transient: retry with backoff. Permanent: alert human, stop retrying.
  • Without idempotency, retries create duplicate data. Always build idempotency first.
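One common way to get an idempotent write is to replace a whole partition inside a single transaction. The sketch below uses SQLite from the standard library. The `orders` table, its columns, and `load_partition` are hypothetical, and delete-then-insert is only one of several patterns (MERGE or partition overwrite are others).

```python
import sqlite3


def load_partition(conn, run_date, rows):
    """Replace the partition for run_date in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM orders WHERE order_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO orders (order_date, order_id, amount) VALUES (?, ?, ?)",
            [(run_date, order_id, amount) for order_id, amount in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, order_id INTEGER, amount REAL)")

rows = [(1, 10.0), (2, 25.5)]
load_partition(conn, "2024-01-01", rows)
load_partition(conn, "2024-01-01", rows)  # simulated retry of the same run

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

After the retry, `row_count` is still 2. A plain `INSERT` without the `DELETE` would have left 4 rows, which is the duplicate-data failure the takeaway warns about.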

Dev/Staging/Prod

Walk through environment strategy for pipelines

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "What environments do you use?"
  • "How do you test pipeline changes before production?"
  • "What is 'thin staging'?"

What They Want to Hear

'Three environments: dev (fast iteration, sample data), staging (production-like infrastructure, recent data snapshot), prod (real data, real consumers). Staging catches issues that dev misses: permission errors, data volume differences, infrastructure differences. A common cost optimization is thin staging: production-like infrastructure but only the last 7 days of data instead of the full history.'

Deployment Path

  • Dev: local or cloud sandbox, sample data
  • Staging: prod-like infra, recent data snapshot
  • Prod: real data, real consumers

Flow: Dev -> Staging (PR merged) -> Prod (tests pass).

The Curveball Follow-ups

After your initial answer, expect these probes

  • "Is staging worth the cost?" Yes for data pipelines. Many data bugs are silent: they do not throw errors, they produce wrong numbers. Staging catches these before consumers see them.
  • "What is thin staging?" Same infrastructure as prod but with a subset of data (last 7 days instead of 5 years). This can cut staging storage cost by 95% or more while still catching infrastructure and permission issues. Problems that only appear at full production volume can still slip through.
  • "How do you promote changes from staging to prod?" Same CI/CD pipeline: merge to main, automated tests in staging, deploy to prod if all pass. Never manual promotion.
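The thin-staging numbers can be checked with simple arithmetic. The sketch assumes data is spread evenly over five years of daily partitions, which real tables rarely are; `thin_staging_window` and `retained_fraction` are hypothetical helpers.

```python
from datetime import date, timedelta


def thin_staging_window(today, days=7):
    """Return the inclusive (start, end) dates of the staging snapshot."""
    return today - timedelta(days=days - 1), today


def retained_fraction(window_days, history_days):
    """Share of the full history that the staging snapshot keeps."""
    return window_days / history_days


start, end = thin_staging_window(date(2024, 6, 30))

# 7 days out of roughly 5 years of daily partitions.
fraction = retained_fraction(7, 5 * 365)
saving = 1 - fraction
```

Under the even-distribution assumption, the snapshot keeps well under 1% of the history, so the 95% saving quoted above is a conservative figure. Tables that are not partitioned by date, such as small dimension tables copied in full, pull the real saving down.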

Key Takeaways

  • Say: 'Dev for iteration, staging for validation, prod for real data. Thin staging can cut staging storage cost by about 95%.'
  • Data bugs are often silent. Staging catches wrong numbers before consumers see them.
  • Never manual promotion. CI/CD pipeline handles dev -> staging -> prod.

CI/CD for Data

Answer CI/CD for data pipelines questions

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "Do you have CI/CD for your pipelines?"
  • "What tests run before a pipeline change ships?"
  • "How do you deploy pipeline changes safely?"

What They Want to Hear

'Every PR triggers: linting, unit tests, schema validation (fast, seconds). On merge to main: integration tests in staging (minutes). Before release: data diff on a staging subset (optional, hours). On deploy: canary rollout if the change affects critical pipelines.' The key insight: fast checks on every push, slow checks before release. Never skip the fast checks.

CI/CD Pipeline

  • Pull Request: developer opens PR
  • Lint + Unit Test: seconds, every PR
  • Integration Test: minutes, on merge
  • Deploy to Prod: automated if tests pass

Flow: Pull Request -> Lint + Unit Test (automatic) -> Integration Test (PR merged) -> Deploy to Prod (tests pass).

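As an illustration of a check fast enough to run on every PR, here is a minimal schema validation function of the kind a unit test could call. `EXPECTED_SCHEMA`, `schema_errors`, and the column names are hypothetical; real projects often use a dedicated validation library instead.

```python
EXPECTED_SCHEMA = {"order_id": int, "order_date": str, "amount": float}


def schema_errors(row, expected=EXPECTED_SCHEMA):
    """Return a list of schema problems for one row (empty if the row is valid)."""
    errors = []
    for column, expected_type in expected.items():
        if column not in row:
            errors.append(f"missing column: {column}")
        elif not isinstance(row[column], expected_type):
            actual = type(row[column]).__name__
            errors.append(f"{column}: expected {expected_type.__name__}, got {actual}")
    for column in sorted(row.keys() - expected.keys()):
        errors.append(f"unexpected column: {column}")
    return errors


good_row = {"order_id": 42, "order_date": "2024-01-01", "amount": 19.99}
bad_row = {"order_id": "42", "order_date": "2024-01-01"}

good_errors = schema_errors(good_row)
bad_errors = schema_errors(bad_row)
```

`good_errors` is empty, while `bad_errors` reports the string-typed `order_id` and the missing `amount` column. A check like this needs no database or cluster, which is why it belongs in the seconds-long stage.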
The Curveball Follow-ups

After your initial answer, expect these probes

  • "What if a test passes in staging but fails in prod?" The usual cause is data volume differences. Staging has 7 days of data, prod has 5 years. The test might pass on small data but fail at scale. This is why data diff on a prod-like subset matters for critical changes.
  • "How do you roll back a bad deployment?" Revert the PR and redeploy. If the pipeline is idempotent, re-running the old code on the same data overwrites the bad output with the correct output.
  • "What about schema migrations?" Schema changes need their own deployment process: backwards-compatible changes first (add column), then code change (use new column), then cleanup (remove old column). Never deploy a breaking schema change and a code change in the same release.
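The three-step schema migration can be sketched with SQLite. The `orders` table, the `amount_cents` column, the backfill statement, and both reader functions are assumptions for illustration. What the sketch shows is that adding a column first leaves the old code working.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.execute("INSERT INTO orders VALUES (1, 10.0)")


def old_reader(conn):
    """Code that is already deployed: knows only the old column."""
    return conn.execute("SELECT order_id, amount FROM orders").fetchall()


# Release 1 (schema only): backwards-compatible change, add and backfill the column.
conn.execute("ALTER TABLE orders ADD COLUMN amount_cents INTEGER")
conn.execute("UPDATE orders SET amount_cents = CAST(ROUND(amount * 100) AS INTEGER)")
old_rows_after_migration = old_reader(conn)  # old code still works unchanged


# Release 2 (code only): switch readers to the new column.
def new_reader(conn):
    return conn.execute("SELECT order_id, amount_cents FROM orders").fetchall()


new_rows = new_reader(conn)

# Release 3 (cleanup): drop the old `amount` column once nothing reads it.
# Not executed here, because it is the one step that cannot be undone by a revert.
```

Because each release is compatible with the one before it, reverting release 2 never leaves code pointing at a column that does not exist.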

Key Takeaways

  • Say: 'Lint and unit test on every PR (seconds). Integration test on merge (minutes). Deploy if all pass.'
  • Fast checks on every push, slow checks before release. Never skip the fast checks.
  • Schema changes: backwards-compatible first, code change second, cleanup third.

Answer orchestration, retries, environments, and CI/CD questions

Category: Pipeline Architecture
Difficulty: Beginner
Duration: 20 minutes
Challenges: 0 hands-on challenges

Topics covered: DAG Orchestration, Task Dependencies, Safe Retries, Dev/Staging/Prod, CI/CD for Data
