Making It Repeatable: Beginner
'How do you orchestrate your pipelines?' is where the interview shifts from data concepts to engineering practices. The interviewer wants to hear about DAGs and Airflow, dependency management, safe retries, environments, and CI/CD. These answers prove you can ship and maintain production systems.
DAG Orchestration
Answer orchestration questions with DAG fundamentals
When you hear these in an interview, this is the concept being tested
- ▸"How do you orchestrate your pipelines?"
- ▸"What is a DAG?"
- ▸"Why not just use cron?"
What They Want to Hear
After your initial answer, expect these probes
- ▸"What happens when a task in the middle fails?" Downstream tasks do not run; Airflow marks them as upstream_failed. The on-call engineer fixes the cause and re-runs the failed task. Airflow resumes from that task, not from the beginning of the DAG.
- ▸"What other orchestrators exist?" Dagster (more opinionated, better testing), Prefect (Python-native, cloud-first), Mage (newer, built for ELT). Airflow is the most common, but Dagster is gaining fast.
- ▸"What is the DAG of DAGs problem?" It shows up when you have 200+ DAGs that depend on each other across team boundaries. Cross-DAG dependencies are harder to manage than within-DAG dependencies.
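To make the dependency and failure behavior concrete, here is a minimal sketch in plain Python. It is not Airflow's implementation: the task names, `DEPENDENCIES`, and `run_pipeline` are hypothetical, and the standard-library `graphlib` module stands in for the scheduler. It shows tasks running in dependency order, downstream tasks being marked upstream_failed, and a re-run that skips tasks that already succeeded.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
DEPENDENCIES = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def run_pipeline(dependencies, task_fns, previous_states=None):
    """Run tasks in dependency order and return a state for every task.

    A task runs only if all of its upstream tasks succeeded; otherwise it is
    marked "upstream_failed" and skipped. Tasks that succeeded in a previous
    run (passed via `previous_states`) are not executed again.
    """
    previous_states = previous_states or {}
    states = {}
    for task in TopologicalSorter(dependencies).static_order():
        if previous_states.get(task) == "success":
            states[task] = "success"  # already done, do not re-run
            continue
        if any(states[up] != "success" for up in dependencies[task]):
            states[task] = "upstream_failed"
            continue
        try:
            task_fns[task]()
            states[task] = "success"
        except Exception:
            states[task] = "failed"
    return states
```

In an interview, the point to draw from this is that the graph, not the clock, decides what runs next, which is exactly what cron cannot express.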
Task Dependencies
Answer dependency management questions
When you hear these in an interview, this is the concept being tested
- ▸"How do you handle dependencies between pipelines?"
- ▸"What if Pipeline B needs data from Pipeline A?"
- ▸"Sensor vs trigger: when do you use each?"
What They Want to Hear
After your initial answer, expect these probes
- ▸"What if the producer DAG is late?" The sensor times out. You need an SLA: if the data is not ready by X time, alert the producer team and serve stale data to the consumer.
- ▸"How do you avoid a web of cross-DAG dependencies?" Limit cross-DAG dependencies to published interfaces (tables with SLAs). Teams should depend on tables, not on other teams' DAGs directly.
- ▸"What are Airflow datasets?" Data-aware scheduling introduced in Airflow 2.4. A producer DAG declares it updates a dataset. Consumer DAGs trigger automatically when that dataset is updated. No polling needed.
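The sensor pattern and its timeout can be sketched without any orchestrator. The function below is an illustration only: `wait_for` and `SensorTimeout` are hypothetical names, and the `poke_interval` and `timeout` parameters simply borrow the vocabulary of Airflow sensors. It polls a condition until it is true or the deadline passes, which also shows where the polling cost comes from.

```python
import time


class SensorTimeout(Exception):
    """Raised when the awaited signal does not appear before the deadline."""


def wait_for(condition, poke_interval=60.0, timeout=3600.0,
             sleep=time.sleep, clock=time.monotonic):
    """Poll `condition` until it returns True or `timeout` seconds pass.

    Returns the number of checks performed. `sleep` and `clock` are
    injectable so the loop can be tested without real waiting.
    """
    deadline = clock() + timeout
    pokes = 0
    while True:
        pokes += 1
        if condition():
            return pokes
        # Give up if the next check would land after the deadline.
        if clock() + poke_interval > deadline:
            raise SensorTimeout(f"signal not ready after {pokes} checks")
        sleep(poke_interval)
```

A caller would catch `SensorTimeout` and apply the SLA policy: alert the producer team and fall back to the last available data. An event trigger removes the loop entirely, because the producer announces when the data is ready.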
Safe Retries
Answer retry safety questions with production patterns
When you hear these in an interview, this is the concept being tested
- ▸"What happens when a pipeline fails?"
- ▸"How do you make retries safe?"
- ▸"What is exponential backoff?"
What They Want to Hear
After your initial answer, expect these probes
- ▸"How many times do you retry?" 3 retries for transient failures, with the wait growing after each attempt (for example 1 min, 5 min, 30 min). After that, alert the on-call.
- ▸"What is exponential backoff?" Multiply the wait by a fixed factor after each failed attempt; with a factor of 2 that is 1s, 2s, 4s, 8s. Add random jitter to avoid a thundering herd (many clients retrying at the same moment).
- ▸"What if the pipeline is not idempotent and you retry?" You get duplicate data. This is why idempotency is the FIRST thing to build. Without it, retries are dangerous.
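A minimal sketch of the retry policy described above, assuming failures are already classified by exception type. `TransientError`, `PermanentError`, `backoff_delays`, and `run_with_retries` are hypothetical names for illustration, and the default base delay of one second is an assumption, not a recommendation.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


class PermanentError(Exception):
    """Hypothetical marker for failures a retry cannot fix (schema mismatch, bad data)."""


def backoff_delays(retries, base=1.0, factor=2.0, max_jitter=0.0, rng=random.random):
    """Return the wait before each retry: base, base*factor, base*factor**2, ...

    Each delay gets up to `max_jitter` seconds of random jitter added.
    """
    return [base * factor ** i + rng() * max_jitter for i in range(retries)]


def run_with_retries(task, retries=3, base=1.0, max_jitter=0.5, sleep=time.sleep):
    """Call `task`, retrying transient failures with exponential backoff.

    Permanent failures are not caught, so they propagate immediately. After
    the last retry the transient error is re-raised so the caller can alert
    the on-call.
    """
    delays = backoff_delays(retries, base=base, max_jitter=max_jitter)
    for attempt in range(retries + 1):
        try:
            return task()
        except TransientError:
            if attempt == retries:
                raise
            sleep(delays[attempt])
```

Note that this wrapper is only safe around an idempotent task. If the task appends rows on every call, each retry adds duplicates.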
Dev/Staging/Prod
Walk through environment strategy for pipelines
When you hear these in an interview, this is the concept being tested
- ▸"What environments do you use?"
- ▸"How do you test pipeline changes before production?"
- ▸"What is 'thin staging'?"
What They Want to Hear
After your initial answer, expect these probes
- ▸"Is staging worth the cost?" Yes for data pipelines. Data bugs are silent: they do not throw errors, they produce wrong numbers. Staging catches these before consumers see them.
- ▸"What is thin staging?" Same infrastructure as prod but with a subset of data (the last 7 days instead of 5 years). This can cut storage cost by roughly 95% while still catching infrastructure issues and many data issues. Failures that only appear at full production volume can still slip through.
- ▸"How do you promote changes from staging to prod?" Same CI/CD pipeline: merge to main, automated tests in staging, deploy to prod if all pass. Never manual promotion.
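One way to keep the three environments on the same code path is to drive the differences from configuration. The sketch below is an assumption about how that could look: `EnvConfig`, the schema names, and the one-day window for dev are hypothetical, while the 7-day staging window follows the thin-staging example above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class EnvConfig:
    name: str
    schema: str                  # hypothetical target schema name
    history_days: Optional[int]  # None means full history


# Hypothetical settings; only the 7-day staging window comes from the lesson.
ENVIRONMENTS = {
    "dev": EnvConfig("dev", "analytics_dev", history_days=1),
    "staging": EnvConfig("staging", "analytics_staging", history_days=7),
    "prod": EnvConfig("prod", "analytics", history_days=None),
}


def extract_start_date(env: str, today: date) -> Optional[date]:
    """Return the earliest date to load, or None for no lower bound."""
    config = ENVIRONMENTS[env]
    if config.history_days is None:
        return None
    return today - timedelta(days=config.history_days)
```

Because the pipeline code is identical in every environment, a change that passes in staging differs from prod only in data volume and target schema, which narrows where a prod-only failure can come from.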
CI/CD for Data
Answer CI/CD for data pipelines questions
When you hear these in an interview, this is the concept being tested
- ▸"Do you have CI/CD for your pipelines?"
- ▸"What tests run before a pipeline change ships?"
- ▸"How do you deploy pipeline changes safely?"
What They Want to Hear
After your initial answer, expect these probes
- ▸"What if a test passes in staging but fails in prod?" Data volume differences. Staging has 7 days of data, prod has 5 years. The test might pass on small data but fail at scale. This is why data diff on a prod-like subset matters for critical changes.
- ▸"How do you roll back a bad deployment?" Revert the PR and redeploy. If the pipeline is idempotent, the old code re-runs on the same data and produces the correct output.
- ▸"What about schema migrations?" Schema changes need their own deployment process: backwards-compatible changes first (add column), then code change (use new column), then cleanup (remove old column). Never deploy a breaking schema change and a code change in the same release.
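The schema rule above can be enforced as a fast check on every PR. The function below is a sketch under simple assumptions: schemas are plain `{column: type}` mappings, and `breaking_changes` is a hypothetical helper rather than part of any tool. It treats added columns as safe and flags removed columns and type changes.

```python
def breaking_changes(old_schema, new_schema):
    """Compare two {column: type} mappings and list backward-incompatible changes.

    Added columns are treated as safe. Removed columns and changed types are
    reported, because existing readers may still depend on them.
    """
    problems = []
    for column, old_type in old_schema.items():
        if column not in new_schema:
            problems.append(f"removed column: {column}")
        elif new_schema[column] != old_type:
            problems.append(f"type change: {column} {old_type} -> {new_schema[column]}")
    return problems
```

A CI job could fail the build whenever the returned list is non-empty, which forces the cleanup step (removing the old column) into its own, later release.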
Answer orchestration, retries, environments, and CI/CD questions
- Category: Pipeline Architecture
- Difficulty: beginner
- Duration: 20 minutes
- Challenges: 0 hands-on challenges
Topics covered: DAG Orchestration, Task Dependencies, Safe Retries, Dev/Staging/Prod, CI/CD for Data
Lesson Sections
- DAG Orchestration (concepts: paDagOrchestration)
What They Want to Hear: 'A DAG (Directed Acyclic Graph) defines tasks and their dependencies. Airflow is the most widely used orchestrator. Each task is a node, and edges define ordering. If Task B depends on Task A, Airflow will not start B until A succeeds.' Then the key insight: 'Cron can schedule a job, but it cannot manage dependencies between jobs, retry failed tasks, or show you the state of your entire pipeline at a glance. That is why Airflow exists.'
- Task Dependencies (concepts: paDependencyMgmt)
What They Want to Hear: 'Within a DAG, dependencies are edges: Task A runs before Task B. Across DAGs, it gets harder. Sensors poll for a signal (a file exists, a partition is populated). Event triggers fire when data is ready (Airflow datasets since v2.4). The tradeoff: sensors waste compute by polling, event triggers are more efficient but require the producer to publish a signal.'
- Safe Retries (concepts: paRetryHandling)
What They Want to Hear: 'Two things make retries safe: idempotent writes (running twice produces the same result) and exponential backoff (wait longer between each retry). I classify failures as transient (network timeout, API rate limit: retry with backoff) or permanent (schema mismatch, bad data: alert human, do not retry).'
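An idempotent write is easiest to see with a partition overwrite. The sketch below uses the standard-library `sqlite3` module as a stand-in for a warehouse; the `daily_sales` table, its columns, and `load_partition` are hypothetical. Deleting the partition and inserting it again in one transaction means a retry leaves the table unchanged.

```python
import sqlite3


def load_partition(conn, day, rows):
    """Replace all rows for `day` in one transaction.

    Delete-then-insert makes the load idempotent: re-running it for the same
    day leaves the table in the same state instead of appending duplicates.
    """
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM daily_sales WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_sales (day, store, amount) VALUES (?, ?, ?)",
            [(day, store, amount) for store, amount in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT, store TEXT, amount REAL)")
rows = [("north", 120.0), ("south", 80.0)]
load_partition(conn, "2024-01-01", rows)
load_partition(conn, "2024-01-01", rows)  # simulated retry of the same load
count = conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
```

With a plain `INSERT` and no delete, the simulated retry would have doubled the row count, which is the duplicate-data failure that makes non-idempotent retries dangerous.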
- Dev/Staging/Prod (concepts: paEnvironmentMgmt)
What They Want to Hear: 'Three environments: dev (fast iteration, sample data), staging (production-like infrastructure, recent data snapshot), prod (real data, real consumers). Staging catches issues that dev misses: permission errors, data volume differences, infrastructure differences. A common cost optimization is thin staging: production-like infrastructure but only the last 7 days of data instead of the full history.'
- CI/CD for Data (concepts: paCiCd)
What They Want to Hear: 'Every PR triggers: linting, unit tests, schema validation (fast, seconds). On merge to main: integration tests in staging (minutes). Before release: data diff on a staging subset (optional, hours). On deploy: canary rollout if the change affects critical pipelines.' The key insight: fast checks on every push, slow checks before release. Never skip the fast checks.
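The data diff step can be illustrated with a small comparison of pipeline output before and after a change. This is a sketch only: `data_diff` is a hypothetical helper that works on in-memory lists of row dictionaries, whereas a real diff would usually run as queries inside the warehouse.

```python
def data_diff(before, after, key="id"):
    """Compare two lists of row dicts by `key`.

    Returns the keys that were added, removed, or whose values changed.
    """
    old = {row[key]: row for row in before}
    new = {row[key]: row for row in after}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

A reviewer would compare the result against what the change was meant to do: a refactor should produce an empty diff, while a bug fix should change only the rows it targets.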