Pipeline Operations: Beginner

A pipeline that runs once is a script; one that survives Monday morning is operated

Category: Pipeline Architecture
Difficulty: beginner
Duration: 25 minutes
Challenges: 0 hands-on challenges

Topics covered: Script vs Operable Pipeline; Logs, Metrics, and Traces; Day-One Monitoring; Alerting That Stays Useful; A First Runbook

Lesson Sections

  1. Script vs Operable Pipeline (concepts: paOperability)

    A working script is a piece of code that produces the right answer when nothing goes wrong. An operable pipeline is a piece of code that a person who has never read its source can run, watch, debug, and recover at three in the morning, six months after it was written. The two are not on the same axis. A script can be technically excellent and operationally useless. A pipeline can have ugly code and survive years of production because it tells operators what is happening. The bar

  2. Logs, Metrics, and Traces (concepts: paObservabilitySignals)

    Three classes of signal show up in every observability discussion: logs, metrics, and traces. The vocabulary matters because each one answers a different question and has different storage and cost characteristics. Mixing them up produces dashboards that cost too much, alerts that fire on the wrong condition, and debugging sessions that bog down because the right signal is missing. The three are sometimes called the three pillars of observability. The framing comes out of the SRE community at Go
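
As a rough illustration of how the three signals differ, the sketch below emits one of each from a single pipeline step using only the Python standard library. The names `metrics`, `spans`, `span`, and `load_batch` are hypothetical stand-ins, not part of the lesson or of any real metrics or tracing client: the log records a discrete event with detail, the counters are numeric aggregates, and the span records how long one named step took within a run.

```python
import logging
import time
from collections import Counter
from contextlib import contextmanager

logger = logging.getLogger("pipeline")   # logs: discrete events with detail
metrics = Counter()                      # metrics: stand-in for a metrics client
spans = []                               # traces: stand-in for a trace exporter


@contextmanager
def span(name, trace_id):
    """Record how long a named step took, tagged with the run's trace id."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "seconds": time.perf_counter() - start,
        })


def load_batch(rows, trace_id):
    """Hypothetical pipeline step that emits all three kinds of signal."""
    with span("load_batch", trace_id):
        loaded = 0
        for row in rows:
            if row.get("id") is None:
                # Log: says exactly which row was skipped and why.
                logger.warning("skipping row without id: %r", row)
                # Metric: counts how often that happens.
                metrics["rows_skipped"] += 1
                continue
            loaded += 1
        metrics["rows_loaded"] += loaded
    return loaded
```

Calling `load_batch([{"id": 1}, {"id": None}], trace_id="run-1")` would write one warning to the log, update two counters, and append one span. In a real deployment each signal would go to its own backend, which is where the different storage and cost characteristics come from.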

  3. Day-One Monitoring (concepts: paDayOneMonitoring, paFreshnessCheck, paVolumeCheck)

    A new pipeline does not need fifty monitors. It needs three. Did it run, did it succeed, and was the output the right size. Those three monitors catch most of the failure modes that show up in the first month. Adding more monitors before those three exist is premature optimization; adding fewer leaves blind spots that consumers will discover before the pipeline does.

    The Three Day-One Monitors

    Did It Run

    The simplest monitor is also the most embarrassing one to forget. A pipeline scheduled for 2
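
The three day-one monitors named in this section (did it run, did it succeed, was the output the right size) can be sketched as a single check over the latest run. This is a minimal sketch under stated assumptions: `RunRecord`, its fields, and the threshold parameters are hypothetical, and a real pipeline would read them from its scheduler or metadata store.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class RunRecord:
    """Hypothetical summary of the most recent run; field names are illustrative."""
    finished_at: Optional[datetime]  # None if no run has ever been recorded
    succeeded: bool
    row_count: int


def day_one_checks(run: RunRecord, now: datetime, max_age: timedelta,
                   min_rows: int, max_rows: int) -> List[str]:
    """Return a list of problems; an empty list means all three checks passed."""
    problems = []

    # 1. Did it run? (freshness) A stale or missing run makes the other checks moot.
    if run.finished_at is None or now - run.finished_at > max_age:
        problems.append("did-it-run: no run finished within the expected window")
        return problems

    # 2. Did it succeed?
    if not run.succeeded:
        problems.append("did-it-succeed: latest run reported failure")
    # 3. Was the output the right size? (volume) Only meaningful for a successful run.
    elif not (min_rows <= run.row_count <= max_rows):
        problems.append(
            f"volume: {run.row_count} rows outside [{min_rows}, {max_rows}]"
        )
    return problems
```

The freshness check runs first on purpose: a pipeline that never started produces no failure and no output, so the other two checks would have nothing to report.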

  4. Alerting That Stays Useful (concepts: paAlertingTiers, paAlarmFatigue)

    An alert is a request for human attention. Every alert that fires is a withdrawal from the on-call engineer's attention budget. A pipeline that pages on every minor anomaly bankrupts its on-call within weeks; the engineers stop reading the channel and the next real outage is missed. The discipline is to ration alerts so that the ones that fire are the ones that need a human to act now. The economics are stark: an engineer who responds to twenty pages a week treats the twenty-first as another rou
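
One way to ration alerts is to route each one to a tier and reserve paging for the tier that needs a human to act now. The sketch below is illustrative only: the three tier names and the two routing questions are assumptions made for this example, not a scheme the lesson prescribes.

```python
from enum import Enum


class Tier(Enum):
    """Hypothetical alert tiers, ordered from most to least interruptive."""
    PAGE = "page"      # wakes the on-call engineer
    TICKET = "ticket"  # handled during working hours
    LOG = "log"        # recorded for later review, nobody is notified


def route_alert(needs_human_now: bool, needs_human_eventually: bool) -> Tier:
    """Spend the on-call attention budget only when a human must act immediately."""
    if needs_human_now:
        return Tier.PAGE
    if needs_human_eventually:
        return Tier.TICKET
    return Tier.LOG
```

Under this scheme, a minor anomaly that resolves on its own maps to `Tier.LOG` and never reaches the on-call channel.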

  5. A First Runbook (concepts: paRunbooks, paIncidentResponse)

    A runbook is a document that tells an on-call engineer what to do when a specific alert fires. It is not architecture documentation. It is not design rationale. It is a checklist tuned for the moment when something is wrong, the on-call has been paged, and the question is what to check first. A good runbook can be followed by an engineer who has never seen the pipeline before. A bad runbook is a wiki page that says 'contact Eric.' The shape of a useful runbook is closer to an emergency-room inta