DataDriven

© 2026 DataDriven

Making It Repeatable

Answer orchestration, retries, environments, and CI/CD questions

Category: Pipeline Architecture
Difficulty: beginner
Duration: 20 minutes
Challenges: 0 hands-on challenges

Topics covered: DAG Orchestration, Task Dependencies, Safe Retries, Dev/Staging/Prod, CI/CD for Data

Lesson Sections

  1. DAG Orchestration (concepts: paDagOrchestration)

    What They Want to Hear: 'A DAG (Directed Acyclic Graph) defines tasks and their dependencies. Each task is a node, and each edge defines ordering. Airflow is one of the most widely used orchestrators. If Task B depends on Task A, Airflow will not start B until A succeeds (under the default trigger rule).' Then the key insight: 'Cron can schedule a job, but it cannot manage dependencies between jobs, retry failed tasks, or show you the state of your entire pipeline at a glance. That is the gap Airflow fills.'
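
    To make the ordering rule concrete, here is a minimal sketch using only Python's standard library, not Airflow itself. The task names, the `run_dag` helper, and the status strings are hypothetical and chosen for illustration; a real orchestrator adds scheduling, retries, and a UI on top of the same idea.

```python
from graphlib import TopologicalSorter

# Hypothetical three-task pipeline: each key maps to the tasks it depends on.
DEPENDENCIES = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def run_dag(dependencies, actions):
    """Run tasks in dependency order and return a status per task."""
    status = {}
    # static_order() yields every task after all of its upstream tasks.
    for task in TopologicalSorter(dependencies).static_order():
        upstream = dependencies.get(task, ())
        if any(status[u] != "success" for u in upstream):
            status[task] = "upstream_failed"  # the task is never started
            continue
        try:
            actions[task]()
            status[task] = "success"
        except Exception:
            status[task] = "failed"
    return status


def failing_transform():
    raise RuntimeError("simulated failure")


statuses = run_dag(
    DEPENDENCIES,
    {"extract": lambda: None, "transform": failing_transform, "load": lambda: None},
)
print(statuses)
```

    In this run `transform` fails, so `load` is marked `upstream_failed` without ever starting. Plain cron would have launched `load` on schedule regardless.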

  2. Task Dependencies (concepts: paDependencyMgmt)

    What They Want to Hear: 'Within a DAG, dependencies are edges: Task A runs before Task B. Across DAGs, it gets harder. Sensors poll for a signal (a file exists, a partition is populated). Event triggers fire when data is ready (Airflow datasets since v2.4, renamed assets in Airflow 3). The tradeoff: sensors spend compute on polling, and in Airflow's default poke mode a sensor holds a worker slot while it waits. Event triggers are more efficient but require the producer to publish a signal.'
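
    The two styles can be contrasted in a few lines of plain Python. This is a sketch of the pattern, not Airflow's sensor or dataset API: `wait_for`, `DatasetEvents`, and the `orders_daily` name are hypothetical, and the `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.

```python
import time


def wait_for(condition, poke_interval=60.0, timeout=3600.0,
             sleep=time.sleep, clock=time.monotonic):
    """Sensor style: poll condition() until it is true or the timeout passes.

    Returns the number of polls, which is the cost the consumer pays.
    """
    deadline = clock() + timeout
    pokes = 0
    while True:
        pokes += 1
        if condition():
            return pokes
        if clock() >= deadline:
            raise TimeoutError(f"condition not met after {pokes} pokes")
        sleep(poke_interval)


class DatasetEvents:
    """Event style: consumers subscribe once, the producer publishes a signal."""

    def __init__(self):
        self._subscribers = {}

    def subscribe(self, dataset, callback):
        self._subscribers.setdefault(dataset, []).append(callback)

    def publish(self, dataset):
        # No polling: downstream work starts only when the producer says so.
        for callback in self._subscribers.get(dataset, []):
            callback(dataset)


# Sensor: the partition "appears" on the third poll.
answers = iter([False, False, True])
pokes_used = wait_for(lambda: next(answers), poke_interval=0, sleep=lambda s: None)

# Event: the consumer does nothing until the producer publishes.
triggered = []
events = DatasetEvents()
events.subscribe("orders_daily", triggered.append)
events.publish("orders_daily")
```

    The sensor works even when the producer knows nothing about its consumers, which is why it remains common. The event version only works if the producer is changed to call `publish`.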

  3. Safe Retries (concepts: paRetryHandling)

    What They Want to Hear: 'Two things make retries safe: idempotent writes (running the task twice produces the same result as running it once) and exponential backoff (waiting longer between each retry). I classify failures as transient or permanent. Transient failures (network timeout, API rate limit) get retried with backoff. Permanent failures (schema mismatch, bad data) alert a human and are not retried.'
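
    A minimal sketch of both ideas follows. The exception classes, `run_with_retries`, and `write_partition` are hypothetical names for illustration, and the in-memory dict stands in for a real table. Production code usually also adds random jitter to the delay, which is omitted here for clarity.

```python
import time


class TransientError(Exception):
    """Worth retrying: network timeout, API rate limit."""


class PermanentError(Exception):
    """Not worth retrying: schema mismatch, bad data."""


def run_with_retries(task, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry transient failures with exponential backoff: 1s, 2s, 4s, ..."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except PermanentError:
            raise  # retrying cannot help; surface it so a human is alerted
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted
            sleep(base_delay * 2 ** (attempt - 1))


def write_partition(store, partition, rows):
    """Idempotent write: replace the whole partition instead of appending."""
    store[partition] = list(rows)


# A retried load leaves the same data as a single successful load.
store = {}
write_partition(store, "2026-01-01", [{"order_id": 1}])
write_partition(store, "2026-01-01", [{"order_id": 1}])
```

    An append-based write would have left two copies of the row after the second call, which is exactly the duplicate that makes an unsafe retry expensive.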

  4. Dev/Staging/Prod (concepts: paEnvironmentMgmt)

    What They Want to Hear: 'Three environments: dev (fast iteration, sample data), staging (production-like infrastructure, recent data snapshot), prod (real data, real consumers). Staging catches issues that dev misses: permission errors, data volume differences, infrastructure differences. A common cost optimization is thin staging: production-like infrastructure, but only the last 7 days of data instead of the full history.'
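
    One way to express this is a single pipeline codebase driven by per-environment settings. The sketch below is an assumption about how such a config might look: `EnvConfig`, its fields, and the dev values (1 day, 1% sample) are hypothetical, while the 7-day staging window comes from the thin-staging description above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class EnvConfig:
    name: str
    lookback_days: Optional[int]  # None means full history
    sample_fraction: float        # share of rows to read


ENVIRONMENTS = {
    # Dev values are illustrative assumptions: small and fast.
    "dev": EnvConfig("dev", lookback_days=1, sample_fraction=0.01),
    # Thin staging: all rows, but only the last 7 days.
    "staging": EnvConfig("staging", lookback_days=7, sample_fraction=1.0),
    "prod": EnvConfig("prod", lookback_days=None, sample_fraction=1.0),
}


def start_date(env: str, today: date) -> Optional[date]:
    """Earliest date the pipeline should read in this environment."""
    config = ENVIRONMENTS[env]
    if config.lookback_days is None:
        return None  # no lower bound: read everything
    return today - timedelta(days=config.lookback_days)


staging_start = start_date("staging", date(2026, 1, 8))
prod_start = start_date("prod", date(2026, 1, 8))
```

    Keeping the differences in data rather than in code means the logic that runs in staging is the same logic that runs in prod, which is the point of having staging at all.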

  5. CI/CD for Data (concepts: paCiCd)

    What They Want to Hear: 'Every PR triggers linting, unit tests, and schema validation (fast, seconds). On merge to main: integration tests in staging (minutes). Before release: an optional data diff on a staging subset (hours). On deploy: canary rollout if the change affects critical pipelines.' The key insight: run fast checks on every push and slow checks before release. Never skip the fast checks.
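
    As an example of a check cheap enough to run on every PR, here is a sketch of schema validation against a small fixture. `EXPECTED_SCHEMA`, its columns, and `schema_errors` are hypothetical; real projects often use a dedicated validation library instead.

```python
# Hypothetical contract for an orders table: column name -> Python type.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "created_at": str}


def schema_errors(rows, schema):
    """Return human-readable schema violations; an empty list means pass."""
    errors = []
    for i, row in enumerate(rows):
        # Report missing columns first, in a stable order.
        for column in sorted(schema.keys() - row.keys()):
            errors.append(f"row {i}: missing column {column!r}")
        # Then report columns that are present but have the wrong type.
        for column, expected in schema.items():
            if column in row and not isinstance(row[column], expected):
                actual = type(row[column]).__name__
                errors.append(
                    f"row {i}: {column!r} expected {expected.__name__}, got {actual}"
                )
    return errors


good_rows = [{"order_id": 1, "amount": 9.5, "created_at": "2026-01-01"}]
bad_rows = [{"order_id": "7", "amount": 1.5}]

good_result = schema_errors(good_rows, EXPECTED_SCHEMA)
bad_result = schema_errors(bad_rows, EXPECTED_SCHEMA)
```

    A CI job would fail the PR whenever the returned list is non-empty. Because it runs on a tiny fixture with no external systems, it stays in the "seconds" tier.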
