DataDriven

A daily orders pipeline runs its heavy 12-hour SQL aggregation inside the Airflow scheduler itself

A medium Pipeline Design interview practice problem on DataDriven. Write and execute real pipeline design code with instant grading.

Domain: Pipeline Design
Difficulty: medium

Problem

A daily orders pipeline runs its heavy 12-hour SQL aggregation inside the Airflow scheduler itself, using a PythonOperator that executes the SQL in-process. The aggregation is starving the scheduler: other DAGs sit waiting, the orchestrator UI lags, and a single slow task degrades visibility for every other pipeline on the same Airflow instance.

The section is explicit that the orchestrator owns four responsibilities (scheduling, dependency resolution, retries, and visibility) and delegates the actual transform work to a worker. It names a Snowflake warehouse, a Spark cluster, or a Python container as the worker categories.

Replace the in-process Airflow PythonOperator transform with a delegated worker transform. Its name must state what aggregation it runs, and its tech_label must be one of the accepted worker labels: Snowflake, BigQuery, Spark, PySpark, Databricks, or Python. Wire the Postgres source into the new worker transform and the new worker transform into the Snowflake daily_orders mart; the Morning dashboard reads from the mart.

Keep the Airflow orchestrator node on the canvas. It continues to own when the work runs, the order it runs in, the per-task retry policy, and the on-call UI.
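The target topology can be sketched as a small graph model in plain Python. This is only an illustration under stated assumptions: the `Node`, `Pipeline`, `build_pipeline`, and `validate` names, the `kind` values, and the node names are hypothetical and are not DataDriven's canvas or grading schema. Only the accepted tech_label values and the source, transform, mart, dashboard wiring come from the problem statement.

```python
from dataclasses import dataclass, field

# Accepted worker labels, as listed in the problem statement.
WORKER_TECH_LABELS = {"Snowflake", "BigQuery", "Spark", "PySpark", "Databricks", "Python"}


@dataclass(frozen=True)
class Node:
    """Hypothetical canvas node; not the real grader's schema."""
    name: str
    kind: str        # "source", "transform", "mart", "dashboard", or "orchestrator"
    tech_label: str


@dataclass
class Pipeline:
    nodes: dict = field(default_factory=dict)   # name -> Node
    edges: set = field(default_factory=set)     # (upstream name, downstream name)

    def add(self, node):
        self.nodes[node.name] = node
        return node

    def connect(self, upstream, downstream):
        self.edges.add((upstream.name, downstream.name))


def build_pipeline(transform_label):
    """Build the canvas with the transform running on `transform_label`."""
    p = Pipeline()
    source = p.add(Node("Postgres orders", "source", "Postgres"))
    worker = p.add(Node("Aggregate daily orders", "transform", transform_label))
    mart = p.add(Node("daily_orders mart", "mart", "Snowflake"))
    dashboard = p.add(Node("Morning dashboard", "dashboard", "Dashboard"))
    # The orchestrator stays on the canvas but carries no data edges:
    # it decides when and in what order work runs, it does not run the SQL.
    p.add(Node("Airflow orchestrator", "orchestrator", "Airflow"))
    p.connect(source, worker)
    p.connect(worker, mart)
    p.connect(mart, dashboard)
    return p


def validate(pipeline):
    """Return a list of problems; an empty list means the design passes."""
    problems = []
    nodes = pipeline.nodes
    if not any(n.kind == "orchestrator" for n in nodes.values()):
        problems.append("orchestrator node is missing")
    transforms = [n for n in nodes.values() if n.kind == "transform"]
    if not transforms:
        problems.append("no transform node")
    for t in transforms:
        if t.tech_label not in WORKER_TECH_LABELS:
            problems.append(f"{t.name}: tech_label {t.tech_label!r} is not a worker category")
        upstream = {nodes[u].kind for (u, d) in pipeline.edges if d == t.name}
        downstream = {nodes[d].kind for (u, d) in pipeline.edges if u == t.name}
        if "source" not in upstream:
            problems.append(f"{t.name}: not wired from a source")
        if "mart" not in downstream:
            problems.append(f"{t.name}: not wired into a mart")
    return problems


if __name__ == "__main__":
    print(validate(build_pipeline("Snowflake")))  # delegated worker transform
    print(validate(build_pipeline("Airflow")))    # in-process transform
```

In this sketch the only difference between the failing and passing designs is where the transform runs: labelling it `Airflow` models the in-process PythonOperator, while any of the accepted worker labels models delegation. The orchestrator node is present in both.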

