A medium-difficulty Pipeline Design mock interview question on DataDriven. Practice with AI-powered feedback, real code execution, and a hire/no-hire decision.
- Domain: Pipeline Design
- Difficulty: medium
Interview Prompt
A startup data team has six cron jobs glued together: pull from Postgres, pull from Stripe, clean orders, clean payments, join the two, publish a fact table. Last week the Postgres pull ran two hours long and the dashboard showed yesterday's numbers. Apply the entire L4 beginner tier on this canvas:
- (b-s0) Replace the cron chain with an orchestrator that owns dependency resolution.
- (b-s1) Build a DAG with explicit edges (extract_orders, extract_payments, clean_orders, clean_payments, join_orders_payments, publish_fact).
- (b-s2) Delegate the heavy compute to a worker engine (dbt, Spark, PySpark, Databricks, Snowflake, or BigQuery) instead of running it in-process inside the orchestrator.
- (b-s3) Pick one orchestrator (Airflow, Dagster, or Prefect) appropriate for this small new build.
- (b-s4) Wire the 6-task chain under the orchestrator with a daily schedule and a retry policy.
Add a warehouse storage destination (Snowflake, BigQuery, Redshift, or Databricks) for the published fact table. The dashboard reads from the warehouse, which the orchestrator-managed pipeline keeps up to date.
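To make the dependency-resolution idea in the prompt concrete, here is a minimal sketch that uses only Python's standard library, not Airflow, Dagster, or Prefect. It assumes the six task names from the prompt; the edge layout, the `run_with_retries` and `run_dag` helpers, and the no-op task bodies are illustrative assumptions rather than any orchestrator's API. The point it shows: a task starts only after its upstream tasks have finished, so a slow Postgres pull delays the publish step instead of letting it run on stale data.

```python
from graphlib import TopologicalSorter

# Explicit edges: each task maps to the set of tasks it depends on.
# (Assumed layout based on the task names in the prompt.)
DAG = {
    "extract_orders": set(),
    "extract_payments": set(),
    "clean_orders": {"extract_orders"},
    "clean_payments": {"extract_payments"},
    "join_orders_payments": {"clean_orders", "clean_payments"},
    "publish_fact": {"join_orders_payments"},
}


def run_with_retries(fn, retries=2):
    """Call fn up to retries + 1 times and re-raise the last failure."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise


def run_dag(dag, tasks, retries=2):
    """Run each task only after all of its upstream tasks have succeeded."""
    order = []
    for name in TopologicalSorter(dag).static_order():
        run_with_retries(tasks[name], retries)
        order.append(name)
    return order


# Hypothetical no-op task bodies. In a real build each one would hand the
# heavy compute to the worker engine instead of doing it in this process.
tasks = {name: (lambda: None) for name in DAG}
execution_order = run_dag(DAG, tasks)
print(execution_order)
```

A real orchestrator adds what this sketch leaves out: the daily schedule, parallel execution of the two independent extract branches, persisted run state, and alerting. The dependency map and the retry wrapper correspond to the DAG edges and retry policy the prompt asks for.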
How This Interview Works
- Read the vague prompt (just like a real interview)
- Ask clarifying questions to the AI interviewer
- Write your pipeline design solution with real code execution
- Get instant feedback and a hire/no-hire decision