Embedding Pipelines Are Just ETL With Higher Stakes
Every data engineer who's managed batch ingestion already understands embedding orchestration patterns. The concepts are identical: batch sizing, idempotent writes, handling upstream changes, managing refresh schedules. The stakes are just higher because mistakes cost more to fix.
Re-embedding 50M documents requires a weekend migration window. Embedding model providers announce deprecation with 90-day timelines, and teams scramble with no eval suite, no rollback path, and no existing tooling at scale. If you've survived a warehouse migration, you know this flavor of panic. Same energy, different vectors.
The tools changed. The failure modes didn't. Schema drift became embedding drift. Late-arriving data became stale vectors. Upstream contract violations became model deprecation notices with 90-day timelines.
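# Idempotent batch embedding upsert
# Keys on doc_id to prevent duplicates during retries
from datetime import datetime, timezone

from qdrant_client.models import PointStruct

# MODEL_VERSION is assumed to be a module-level constant defined elsewhere.
# `collection` is assumed to be a wrapper exposing upsert(points=...).

def embed_and_upsert(documents, collection, embed_model, batch_size=1000):
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        # One encode call per batch; assumes it returns array-like vectors
        vectors = embed_model.encode([doc["text"] for doc in batch])
        points = [
            PointStruct(
                id=doc["doc_id"],  # stable ID: a retry overwrites instead of duplicating
                vector=vec.tolist(),
                payload={
                    "source": doc["source"],
                    "embed_model_version": MODEL_VERSION,
                    "embedded_at": datetime.now(timezone.utc).isoformat(),
                },
            )
            for doc, vec in zip(batch, vectors)
        ]
        collection.upsert(points=points)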
Cost optimization is in scope for these interviews now. The spread across vector database providers is staggering: cost per billion queries ranges from $84 to $7,088 across common configurations on a 10M-document corpus. Embedding 10M documents at 500 tokens each costs $100 with OpenAI's small model versus $650 with the large one. If you've ever argued that storage is cheap and engineer time is expensive, the same logic applies here. Interviewers want to hear you reason about embedding economics the same way you'd reason about warehouse compute costs.
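The embedding figures above are simple arithmetic, and it helps to be able to reproduce them on a whiteboard. Here is a minimal sketch. The per-million-token prices are assumptions back-derived from the $100 and $650 totals, not quoted from a price sheet, and `embedding_cost` is a hypothetical helper name.

```python
# Assumed prices in USD per 1M tokens, back-derived from the totals in the text.
# Check the provider's current pricing before relying on these.
PRICE_PER_1M_TOKENS = {"small": 0.02, "large": 0.13}


def embedding_cost(num_docs, tokens_per_doc, price_per_1m_tokens):
    """Estimate the one-time cost of embedding a corpus, in USD."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens / 1_000_000 * price_per_1m_tokens


# 10M documents at 500 tokens each is 5B tokens in total.
small_cost = embedding_cost(10_000_000, 500, PRICE_PER_1M_TOKENS["small"])
large_cost = embedding_cost(10_000_000, 500, PRICE_PER_1M_TOKENS["large"])

print(f"small: ${small_cost:,.0f}  large: ${large_cost:,.0f}")
```

The same function answers the follow-up question interviewers tend to ask: what a full re-embed costs each time the model changes.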
Refresh pattern design is where candidates who've only built batch pipelines get tripped up. Schedule-based refresh, trigger-on-content-update, and TTL-based expiration each trade off re-embedding cost, freshness SLA, and query latency differently. Change Data Capture with event-driven architecture for real-time embedding synchronization is expected knowledge. You need to explain how to keep vectors fresh when source data changes without re-embedding the entire corpus. If you've worked with idempotent pipeline patterns, these concepts map directly.
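One way to avoid re-embedding the whole corpus is to compare a content hash for each change event against what was stored at the last embed. The sketch below assumes a simplified event shape (`op`, `doc_id`, `text`) and a `stored_state` mapping of `doc_id` to the last hash and model version. Both are hypothetical stand-ins for whatever your CDC tool and metadata store actually provide.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of the document text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_refresh(change_events, stored_state, current_model_version):
    """Split change events into documents to re-embed and IDs to delete.

    A document is skipped only if both its text hash and the model version
    match what was recorded when it was last embedded.
    """
    to_embed, to_delete = [], []
    for event in change_events:
        doc_id = event["doc_id"]
        if event["op"] == "delete":
            to_delete.append(doc_id)
            continue
        new_hash = content_hash(event["text"])
        previous = stored_state.get(doc_id)
        unchanged = (
            previous is not None
            and previous["content_hash"] == new_hash
            and previous["embed_model_version"] == current_model_version
        )
        if not unchanged:
            to_embed.append(
                {"doc_id": doc_id, "text": event["text"], "content_hash": new_hash}
            )
    return to_embed, to_delete


# Usage: only the edited and the new document need new vectors.
stored = {
    1: {"content_hash": content_hash("old text"), "embed_model_version": "v1"},
    2: {"content_hash": content_hash("same text"), "embed_model_version": "v1"},
}
events = [
    {"op": "update", "doc_id": 1, "text": "new text"},
    {"op": "update", "doc_id": 2, "text": "same text"},
    {"op": "insert", "doc_id": 3, "text": "brand new"},
    {"op": "delete", "doc_id": 4},
]
to_embed, to_delete = plan_refresh(events, stored, "v1")
```

Including the model version in the comparison means a model upgrade marks every document stale, so the same code path handles a deprecation-driven re-embed as events arrive.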