The universal pipeline architecture question
Decomposing a pipeline into five layers is expected; it's table stakes. To score 'strong hire,' you need to go deeper. The key signal is recognizing that the prompt is underspecified on purpose, and using that ambiguity to demonstrate platform thinking.

From Pipeline to Platform

The shift from good to great is the shift from 'I'd build a pipeline for this use case' to 'I'd build a platform that serves this use case and the next ten.' When the interviewer says 'design a pipeline for clickstream data,' treat the single pipeline as one instance of the platform you actually describe.
Ingestion isn't about choosing between Kafka and file drops. It's about designing a schema-aware, multi-source ingestion framework that handles schema evolution, late-arriving data, and cross-source reconciliation without manual intervention.

Schema Registry: The Non-Negotiable

The conventional wisdom is 'use Avro with a schema registry.' That's correct but incomplete. The deeper question is: what's your schema evolution policy? Backward-compatible changes (adding nullable columns) should auto-promote through the pipeline; breaking changes should be caught at the registry and routed to human review.
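The evolution policy can be sketched as a small compatibility check. This is a hypothetical, registry-agnostic illustration (the schema shape and function name are assumptions, not a specific registry's API): old columns must survive with unchanged types, and new columns must be nullable.

```python
# Hypothetical sketch of a schema-evolution policy check. The dict-based
# schema shape is an illustrative stand-in for a real registry's format.

def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    """Backward compatible = every old column survives with the same type,
    and any newly added column is nullable, so readers of old data still work."""
    old_cols = {c["name"]: c for c in old_schema["columns"]}
    new_cols = {c["name"]: c for c in new_schema["columns"]}

    for name, col in old_cols.items():
        if name not in new_cols:                    # dropped column: breaking
            return False
        if new_cols[name]["type"] != col["type"]:   # type change: breaking
            return False

    for name, col in new_cols.items():
        if name not in old_cols and not col.get("nullable", False):
            return False                            # new NOT NULL column: breaking
    return True

v1 = {"columns": [{"name": "user_id", "type": "string"}]}
v2 = {"columns": [{"name": "user_id", "type": "string"},
                  {"name": "utm_source", "type": "string", "nullable": True}]}
v3 = {"columns": [{"name": "user_id", "type": "int"}]}

print(is_backward_compatible(v1, v2))  # adding a nullable column -> True
print(is_backward_compatible(v1, v3))  # type change -> False
```

Wiring a check like this into ingestion is what turns "we use a schema registry" into a policy: compatible changes flow through untouched, everything else stops at the gate.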
Justifying ELT over ETL and describing medallion tiers is expected. To score 'strong hire,' the interviewer expects you to articulate when ELT breaks down, how to handle transforms that span multiple data sources, and the cost implications of your compute choices at scale.

When ELT Breaks Down

The conventional wisdom is ELT for everything. In practice, there are cases where transform-before-load is correct. PII scrubbing: you don't want raw PII landing in the warehouse, even in bronze, if your warehouse-side access controls can't be trusted to contain it.
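A transform-before-load scrub step might look like the sketch below. Field names, the salt, and the drop/hash split are all illustrative assumptions; the point is that hashing keeps join keys usable while nothing raw ever reaches bronze.

```python
# Hypothetical sketch of PII scrubbing applied *before* load, so raw PII
# never lands in bronze. Field lists and the salt are illustrative.
import hashlib

PII_HASH_FIELDS = {"email", "ip_address"}   # pseudonymize: joins still work
PII_DROP_FIELDS = {"full_name"}             # no analytical value: drop outright
SALT = b"per-environment-secret"            # assumption: managed in a vault

def scrub(event: dict) -> dict:
    clean = {}
    for key, value in event.items():
        if key in PII_DROP_FIELDS:
            continue                          # never loaded at all
        if key in PII_HASH_FIELDS:
            clean[key] = hashlib.sha256(SALT + str(value).encode()).hexdigest()
        else:
            clean[key] = value                # non-PII passes through untouched
    return clean

event = {"user_id": 42, "email": "a@b.com", "full_name": "Ada", "page": "/home"}
print(scrub(event)["page"])  # non-PII fields are unchanged -> /home
```

The trade-off to name in the interview: this reintroduces a transform stage upstream of the warehouse, so it should be reserved for cases like PII where the cost of loading raw data is a compliance risk, not just a modeling inconvenience.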
The serving layer isn't just 'analysts query a gold table.' It's a platform concern: how do you serve 15 teams with different latency requirements, access controls, and cost budgets from the same underlying data?

Multi-Tenancy and Cost Attribution

The hardest serving problem isn't performance; it's economics. When 15 teams query the same Snowflake warehouse, who pays? If the marketing team runs a 4-hour query that scans 50TB, does the data platform team eat the cost? The strong answer is: separate compute per team (for example, one virtual warehouse each), so cost attribution falls out of the architecture instead of requiring after-the-fact tagging.
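Even with separate compute, you need a chargeback report. A minimal sketch, assuming a query log that records the owning team and bytes scanned (the $-per-TB rate and log shape are illustrative, not any vendor's actual pricing):

```python
# Hypothetical chargeback sketch: attribute warehouse spend back to teams
# from a query log. Rate and log schema are illustrative assumptions.
from collections import defaultdict

DOLLARS_PER_TB_SCANNED = 5.0   # stand-in for an on-demand pricing rate

query_log = [
    {"team": "marketing", "tb_scanned": 50.0},   # the 4-hour, 50TB scan
    {"team": "marketing", "tb_scanned": 1.2},
    {"team": "finance",   "tb_scanned": 0.4},
]

def chargeback(log):
    bill = defaultdict(float)
    for q in log:
        bill[q["team"]] += q["tb_scanned"] * DOLLARS_PER_TB_SCANNED
    return dict(bill)

print(chargeback(query_log))  # marketing pays for its own 50TB scan
```

The design point: because each team's queries are tagged at submission time, the 50TB scan shows up on marketing's bill, not the platform team's.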
The meta layer isn't just 'use Airflow for orchestration and add some data quality checks.' The meta layer is the difference between a pipeline and a platform. You're designing orchestration that handles cross-pipeline dependencies, quality frameworks that prevent bad data from ever reaching consumers, and cost monitoring that catches $50K/month runaway queries before they hit the bill.

Cross-Pipeline Orchestration

Real platforms have hundreds of DAGs with cross-DAG dependencies. The marketing pipeline can't start until the core clickstream pipeline publishes its outputs, and no one should be coordinating that by hand.
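At platform scale, those cross-DAG dependencies form a graph the orchestrator has to resolve. A minimal sketch with Python's standard-library `graphlib` (pipeline names are hypothetical; a real platform would derive the edges from declared dataset dependencies, not a hard-coded dict):

```python
# Hypothetical sketch: resolve cross-pipeline dependencies into a valid run
# order. Pipeline names are illustrative; in practice the graph would be
# built from each pipeline's declared input/output datasets.
from graphlib import TopologicalSorter

# pipeline -> set of upstream pipelines it must wait for
deps = {
    "clickstream_bronze":    set(),
    "crm_load":              set(),
    "sessionization":        {"clickstream_bronze"},
    "marketing_attribution": {"sessionization", "crm_load"},
}

order = list(TopologicalSorter(deps).static_order())
print(order)  # every pipeline appears after all of its upstreams
```

`TopologicalSorter` also raises `CycleError` on circular dependencies, which is exactly the failure mode you want surfaced at deploy time rather than discovered as two DAGs deadlocking in production.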