Pipeline Operations: Beginner

An e-commerce company with 80 engineers had a Python script that pulled orders from MySQL every night at 2am, transformed them, and loaded the results into Snowflake. The script had been written eighteen months earlier by an engineer who had since changed teams. On a Tuesday in October, the script ran for forty seconds and exited with a zero status. Nobody noticed. The dashboard the script fed kept showing data, but the data stopped advancing. Five days later the head of finance asked why October revenue had been flat since the third. The script had silently started skipping rows because of a timezone change in the source.

The cost of finding the bug was fifteen engineering days. The cost of preventing it would have been a single freshness check, which would have fired the first morning the table did not advance. The difference between a script and a pipeline is the existence of those checks, the alerts they raise, and the runbook that explains what to do when they fire.

This lesson is about the smallest set of operational practices that turn the first outcome into the second. None of the practices is exotic. All of them are skipped routinely in early-stage data work, and skipping them is the most common reason a pipeline that ships fast ends up costing more than the chart it produced.

What you will be able to do

Recognize the operational gap between a script that works and a pipeline that can be run as a service

Distinguish logs, metrics, and traces and identify what each one answers

Choose a small day-one monitoring set and an alert routing policy that does not get ignored

Script vs Operable Pipeline

Recognize the operational gap between a working script and a pipeline that can be run as a service.

A working script is a piece of code that produces the right answer when nothing goes wrong. An operable pipeline is a piece of code that someone can run, watch, debug, and recover from at three in the morning, six months after it was written, by a person who has never read its source. The two are not on the same axis. A script can be technically excellent and operationally useless. A pipeline can have ugly code and survive years of production because it tells operators what is happening. The bar for operability is set not by the original author but by the worst-case responder: a tired engineer who has never seen the pipeline before, has limited context on the surrounding system, and has perhaps fifteen minutes before the consequences become visible to consumers. Code that holds up under that bar is operable code. Code that does not is, by definition, a script.

Five Things a Script Does Not Have

Operational Property	What a Script Lacks	What an Operable Pipeline Provides
Run identity	No record of which run produced which output	A run_id stamped on every artifact and every log line
Visibility	Standard out scrolls past and disappears	Structured logs and metrics flow to a durable store
Failure signal	A nonzero exit code that nobody is watching	Alerts routed to a channel where someone is paged
Recovery	A human reads the source code and guesses	A runbook names the symptom and the response
Idempotent retries	Re-running corrupts state or duplicates rows	Re-running produces the same result as the first run
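
The last row of the table is the easiest to leave abstract. A minimal sketch of an idempotent write, assuming a SQLite table named daily_orders_agg keyed by run_date (the table, columns, and sample rows are illustrative, not taken from the pipeline described here): delete the partition and insert it again inside one transaction, so a retry replaces rows instead of duplicating them.

```python
import sqlite3

def write_partition(conn, run_date, run_id, rows):
    """Replace the run_date partition atomically; safe to call more than once."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("DELETE FROM daily_orders_agg WHERE run_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO daily_orders_agg (run_date, run_id, customer, total) "
            "VALUES (?, ?, ?, ?)",
            [(run_date, run_id, customer, total) for customer, total in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_orders_agg (run_date TEXT, run_id TEXT, customer TEXT, total REAL)"
)
rows = [("acme", 120.0), ("globex", 75.5)]  # hypothetical aggregated output

write_partition(conn, "2024-10-03", "run-1", rows)
write_partition(conn, "2024-10-03", "run-2", rows)  # retry for the same date

row_count = conn.execute("SELECT COUNT(*) FROM daily_orders_agg").fetchone()[0]
run_ids = {r[0] for r in conn.execute("SELECT run_id FROM daily_orders_agg")}
```

After the retry the table still holds two rows, and both carry the run_id of the run that wrote last, which is what lets a downstream reader audit which run produced the output.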

The Operability Gap, Made Concrete

Consider two implementations of the same daily orders aggregation. One is a Python file that loads orders, groups them, and writes the result to a table. The other does all of that and also emits a structured log line at start and at finish, row count metrics for what it pulled and what it wrote, and the run's duration. A long-running job would add a heartbeat that increments every minute it runs. The first works. The second can be operated. The cost of writing the second is roughly thirty extra lines. The benefit shows up the first time the pipeline behaves strangely and an on-call engineer needs to understand what happened without reading the source.

	# A script: runs, exits, leaves no trace
	def daily_orders_agg(run_date):
	    rows = pull_orders(run_date)
	    agg = aggregate(rows)
	    write(agg, run_date)

	# An operable pipeline: same logic, plus the operational shell.
	# pull_orders, aggregate, write, and metric stand in for the pipeline's own
	# helpers; log is a structured logger (for example structlog.get_logger()).
	import time

	def daily_orders_agg_v2(run_date, run_id):
	    started = time.monotonic()
	    log.info('start', run_id=run_id, run_date=str(run_date))
	    rows = pull_orders(run_date)
	    metric('rows_pulled', len(rows), tags={'pipeline': 'orders_agg'})
	    agg = aggregate(rows)
	    write(agg, run_date, run_id=run_id)
	    # Emit rows_written only after the write has succeeded
	    metric('rows_written', len(agg), tags={'pipeline': 'orders_agg'})
	    log.info('done', run_id=run_id, duration_s=time.monotonic() - started)

What changed in the operable version:

▸Every log line carries the run_id, which threads logs back to a single execution
▸Two metrics turn 'did anything happen' into a number on a dashboard
▸The start and done lines bound the run; their absence is itself a signal
▸The output write carries the run_id so a downstream reader can audit which run produced what

The Three Audiences for an Operable Pipeline

An operable pipeline produces evidence for three different audiences. Engineers debugging a failure want detailed logs, ideally indexed by run_id. Operators watching the system want metrics: counts, durations, error rates, dashboard tiles. Auditors and consumers want traces of which run wrote which output, with timestamps and provenance. The same pipeline, well instrumented, serves all three audiences without making any of them dig through the others' material.

•Script Mindset

Success is defined as 'it ran without an exception'
Output is the only artifact that matters
Logs are stdout, examined when something goes wrong
Re-running is dangerous because state is unpredictable

✓Pipeline Mindset

Success is defined as 'it ran, succeeded, and produced output of the right shape'
Output and metadata are both first-class artifacts
Structured logs and metrics flow continuously to a durable store
Re-running is safe by construction; idempotency is built in

Operability is a separate property from correctness; a correct script can still be unoperable.

Run identity, visibility, and recovery instructions are the smallest operability set.

Operable pipelines emit evidence aimed at three audiences: engineers, operators, auditors.

TIP

Before adding a new pipeline to production, list what an on-call engineer would see if it failed at 3am. If the answer is 'a stack trace in stdout,' the pipeline is not yet operable.

The shift from script to pipeline tracks a shift in mindset. A script's author optimizes for the moment the code is written: short, clever, expressive. A pipeline's author optimizes for the moments years later when the code is debugged, modified, and eventually retired by people the original author has never met. Those audiences want different things. The script's author wants brevity; the pipeline's audience wants evidence. Reconciling the two is a discipline, not a personality trait, and the discipline is built one operability practice at a time. The remainder of this lesson is the smallest set of practices that brings a script across the line.

Logs, Metrics, and Traces

Distinguish logs, metrics, and traces and pick the right signal for a given operational question.

Three classes of signal show up in every observability discussion: logs, metrics, and traces. The vocabulary matters because each one answers a different question and has different storage and cost characteristics. Mixing them up produces dashboards that cost too much, alerts that fire on the wrong condition, and debugging sessions that bog down because the right signal is missing. The three are sometimes called the three pillars of observability. The framing comes out of the SRE and distributed-systems communities; it predates the data engineering specialization but applies cleanly to it. A pipeline is a distributed system whether the operator thinks of it that way or not, and the same observability vocabulary that serves Kubernetes clusters serves DAGs.

Logs in One Paragraph

A log is a timestamped record of an event. It is meant to be read by a human, eventually, when something needs to be understood after the fact. Good logs are structured: each line is a JSON object with a timestamp, a level, a message, and a small set of fields like run_id, pipeline_name, and table_name. Bad logs are free text concatenations that nobody can grep effectively. Logs are expensive to store at high volume and cheap to store at low volume; the discipline is to log enough to reconstruct what happened, not enough to reconstruct what could have happened.
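
A structured log line of the kind described above can be produced with the standard library alone. The sketch below is one possible shape, not a prescribed one: the JsonFormatter class, the "fields" convention for passing extra values, and the sample values are all assumptions made for illustration.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""
    def format(self, record):
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Values passed as extra={"fields": {...}} become attributes on the record.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

log = logging.getLogger("orders_daily")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False  # keep the example's output out of the root logger

log.info(
    "extract_finished",
    extra={"fields": {"run_id": "run-42", "pipeline_name": "orders_daily", "rows": 1834}},
)
line = json.loads(stream.getvalue())
```

Because every line parses as JSON, a search for one run_id returns exactly the events of that run, which free-text concatenation cannot guarantee.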

Metrics in One Paragraph

A metric is a numeric measurement at a point in time. It is meant to be aggregated, averaged, summed, plotted on a dashboard, and alerted on when it crosses a threshold. Good metrics are small in cardinality: 'rows_written by pipeline_name' is a metric; 'rows_written by user_id' is a cardinality bomb that explodes in cost. Metrics are cheap to store at low cardinality and expensive at high cardinality. That is a different cost axis from logs, whose cost scales with volume rather than with the number of distinct tag values. The mental model is that metrics tell whether something is happening; logs tell what specifically happened.
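
The cardinality point is simple arithmetic: a metrics backend typically stores one time series per distinct combination of tag values. A rough sketch of that upper bound, using made-up tag values and a hypothetical user count of 100,000:

```python
def series_count(tag_values):
    """Upper bound on distinct time series for one metric name:
    one series per combination of tag values."""
    count = 1
    for values in tag_values.values():
        count *= len(values)
    return count

low = series_count({
    "pipeline_name": ["orders_daily", "signups_daily"],
    "table_name": ["fct_orders", "fct_signups", "dim_customer"],
})
high = series_count({
    "pipeline_name": ["orders_daily", "signups_daily"],
    "user_id": range(100_000),  # hypothetical number of users
})
```

Under these assumptions the first tagging scheme yields at most 6 series and the second at most 200,000 for the same metric name.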

Traces in One Paragraph

A trace records the path a request takes through a system. In a pipeline context, a trace ties together the operations that make up one logical run: extract started, extract finished, transform started, transform finished, write started, write finished, with parent-child relationships and durations. Traces are the right signal for understanding latency and dependency: if a daily DAG runs in two hours instead of forty minutes, a trace shows which task spent the extra time. OpenTelemetry is the common instrumentation standard, and its OTLP protocol is the usual wire format; Honeycomb, Datadog APM, and Tempo are common backends.
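
To make the parent-child structure concrete without a tracing backend, here is a toy span recorder in plain Python. It is a sketch of the idea only, not OpenTelemetry: the span helper, the spans list, and the sleep calls standing in for real work are all invented for the example.

```python
import time
from contextlib import contextmanager

spans = []    # finished spans, in completion order
_stack = []   # names of the spans that are currently open

@contextmanager
def span(name):
    """Record a span's name, its parent, and how long the wrapped block took."""
    parent = _stack[-1] if _stack else None
    _stack.append(name)
    start = time.monotonic()
    try:
        yield
    finally:
        _stack.pop()
        spans.append(
            {"name": name, "parent": parent, "duration_s": time.monotonic() - start}
        )

with span("orders_daily"):
    with span("extract"):
        time.sleep(0.005)   # stand-in for real work
    with span("transform"):
        time.sleep(0.02)
    with span("write"):
        time.sleep(0.005)

children = [s for s in spans if s["parent"] == "orders_daily"]
slowest = max(children, key=lambda s: s["duration_s"])["name"]
```

Sorting the children of the root span by duration is the pipeline equivalent of asking a trace viewer which task spent the extra time.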

Signal	Best Question to Answer	Worst Question to Force It to Answer
Logs	What happened on this specific run, in detail	What is the average duration over the last quarter
Metrics	Is the system healthy right now; how does it trend	What was the exact error message the pipeline emitted at 03:14
Traces	Where is the latency in this multi-step run	What is the row count of the third table written yesterday

Logs

Discrete events, structured

Timestamped records of what happened. Indexed for search. Examples: errors, warnings, lifecycle events. Tools: CloudWatch, Datadog Logs, Loki.

Metrics

Numeric time series

Aggregated measurements over windows. Cheap at low cardinality. Examples: rows_written, duration_seconds, error_rate. Tools: Prometheus, Datadog Metrics, CloudWatch Metrics.

Traces

Request flow with timing

Parent-child operation tree across a single logical run. Examples: extract -> transform -> write spans. Tools: Honeycomb, Datadog APM, Tempo, Jaeger.

A Tiny Worked Example

	# Same operation instrumented three ways.
	# pull_from_mysql and metric_emit stand in for the pipeline's own helpers.
	import time

	import structlog
	from opentelemetry import trace

	tracer = trace.get_tracer(__name__)
	log = structlog.get_logger()

	def extract_orders(run_id, run_date):
	    with tracer.start_as_current_span('extract_orders') as span:
	        span.set_attribute('run_date', str(run_date))
	        start = time.monotonic()
	        rows = pull_from_mysql(run_date)
	        duration = time.monotonic() - start
	        # log: what happened, for humans
	        log.info('extract_finished', run_id=run_id, rows=len(rows), duration_s=duration)
	        # metric: aggregated time series
	        metric_emit('orders.rows_extracted', len(rows), tags={'pipeline': 'orders_daily'})
	        metric_emit('orders.extract_duration_s', duration, tags={'pipeline': 'orders_daily'})
	        # trace: the span itself, plus attributes for cross-step analysis
	        span.set_attribute('rows', len(rows))
	        return rows

When confused about which signal to add:

▸Need to debug a single failure after the fact: add a log line
▸Need a dashboard or threshold alert: add a metric
▸Need to understand where time was spent across multiple steps: add a span (trace)
▸Never use a log search to compute aggregate trends; metrics are cheaper and faster

Logs answer 'what.' Metrics answer 'how much.' Traces answer 'where.' A pipeline missing any one of the three has a blind spot.

✓Do

Log structured JSON with a stable set of fields, not free text
Keep metric cardinality small: pipeline_name and table_name are fine, user_id is not
Wrap multi-step pipelines in trace spans to make latency attributable

✗Don't

Use log search to compute trend metrics; the cost grows linearly with retention
Tag metrics with high-cardinality fields like email or order_id
Skip structured logging because 'print is fine for now'

Day-One Monitoring

Choose the smallest set of monitors that makes a new pipeline operable on day one.

A new pipeline does not need fifty monitors. It needs three. Did it run, did it succeed, and was the output the right size. Those three monitors catch most of the failure modes that show up in the first month. Adding more monitors before those three exist is premature optimization; adding fewer leaves blind spots that consumers will discover before the pipeline does.

The Three Day-One Monitors

Monitor	Question Answered	Failure Mode It Catches
Did it run	Did the scheduled job actually fire today	Scheduler outage, deployment removed the job, cron expression broken
Did it succeed	Did the run exit with a success status	Code error, source unavailable, downstream write rejected
Was the output the right size	Is the row count within the expected range	Source schema change, silent filter, partial extract, empty join

Did It Run

The simplest monitor is also the most embarrassing one to forget. A pipeline scheduled for 2am is supposed to start at 2am. If no run record exists for today by 2:30am, something is wrong with the scheduler, not the pipeline. The check is one query against the orchestrator's state: count of runs for this DAG today, expected to be at least 1. Most orchestrators (Airflow, Dagster, Prefect) emit this as a metric or expose it through their API. The threshold is set against the schedule plus a generous tolerance for clock skew and queue lag.
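
The check reduces to a comparison between the schedule and the list of run start times. A minimal sketch, assuming the start times have already been fetched from the orchestrator (the function name, the 30-minute tolerance, and the sample timestamps are illustrative):

```python
from datetime import datetime, timedelta

def missed_start(run_starts, scheduled_at, now, tolerance=timedelta(minutes=30)):
    """True when the deadline has passed and no run started at or after the scheduled time."""
    deadline = scheduled_at + tolerance
    if now < deadline:
        return False  # too early to judge
    return not any(start >= scheduled_at for start in run_starts)

scheduled = datetime(2024, 10, 8, 2, 0)
yesterday_only = [datetime(2024, 10, 7, 2, 0, 12)]

alert_early = missed_start(yesterday_only, scheduled, now=datetime(2024, 10, 8, 2, 10))
alert_late = missed_start(yesterday_only, scheduled, now=datetime(2024, 10, 8, 2, 31))
alert_ok = missed_start(
    yesterday_only + [datetime(2024, 10, 8, 2, 0, 40)],
    scheduled,
    now=datetime(2024, 10, 8, 2, 31),
)
```

The tolerance keeps the check quiet during normal queue lag; only the 2:31am evaluation with no run for today raises the alert.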

Did It Succeed

A run that started but did not finish, or finished with a failure status, is the second class of failure. The monitor watches the most recent run's terminal status. The orchestrator is the source of truth for this signal; bolted-on checks that look at output tables can be fooled by a partial write. The right alert is on the orchestrator's run state, not on the data.

Was the Output the Right Size

A run that started, finished with success, and produced an output of zero rows is technically successful and operationally a disaster. The third monitor catches the case where the pipeline ran cleanly but the data is wrong. The cheapest version is a row count check: today's partition row count is between 80% and 120% of the trailing seven-day average, or a fixed range like 'between 5,000 and 50,000 per day.' This single check catches a remarkable share of silent failures: empty extracts, filters that swallowed everything, joins that lost the join key. The reason it works so well is that most pipeline failures show up as volume anomalies before they show up as anything else. A schema change that drops rows shows up as fewer rows. A timezone bug that misses a chunk of events shows up as fewer rows. A new filter accidentally added in a refactor shows up as fewer rows. The volume monitor is a generic failure detector dressed up as a row count.
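
The 80% to 120% band can be written in a few lines. This is a sketch with invented row counts; in practice the trailing counts would come from a query against the output table's recent partitions.

```python
def volume_in_band(today_rows, trailing_counts, low=0.8, high=1.2):
    """True when today's row count is within [low, high] times the trailing average."""
    if not trailing_counts:
        raise ValueError("need at least one prior day to form a baseline")
    baseline = sum(trailing_counts) / len(trailing_counts)
    return low * baseline <= today_rows <= high * baseline

# Hypothetical counts for the last seven daily partitions (average: 10,000)
last_seven = [10_200, 9_800, 10_050, 9_950, 10_400, 9_700, 9_900]

normal_day = volume_in_band(10_100, last_seven)
empty_extract = volume_in_band(0, last_seven)
timezone_bug = volume_in_band(6_500, last_seven)
```

With a 10,000-row baseline the band is 8,000 to 12,000, so the normal day passes while the empty extract and the partial day both fall outside it.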

The minimum monitor set for a new pipeline, plus a heartbeat where tasks run long:

▸A scheduler-level alert when no run record exists by the expected start time plus tolerance
▸An orchestrator-level alert on the most recent run's failure status
▸A data-level alert on the latest partition's row count being outside the expected band
▸A heartbeat for long-running tasks so a hung process is distinguishable from a finished one
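
The heartbeat bullet is the least self-explanatory of the four. One way to read it, sketched under the assumption that each task records the time of its last heartbeat and a finished flag somewhere the monitor can query (the function name and the five-minute silence limit are illustrative):

```python
from datetime import datetime, timedelta

def run_state(last_heartbeat, finished, now, max_silence=timedelta(minutes=5)):
    """Classify a run as 'finished', 'running', or 'hung' from its heartbeat."""
    if finished:
        return "finished"
    if now - last_heartbeat > max_silence:
        return "hung"
    return "running"

now = datetime(2024, 10, 8, 3, 0)
states = [
    run_state(datetime(2024, 10, 8, 2, 59), finished=False, now=now),
    run_state(datetime(2024, 10, 8, 2, 40), finished=False, now=now),
    run_state(datetime(2024, 10, 8, 2, 40), finished=True, now=now),
]
```

The second and third cases have the same silent heartbeat; only the finished flag separates a hung process from one that completed.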

Why These Three and Not Others

Sophisticated checks (null-rate by column, distribution drift, schema validation) are useful but secondary. The first three answer the questions consumers ask first: is there fresh data, did the pipeline finish, does the volume look right. The next tier answers questions consumers ask second, after the first three are passing. Building the second tier first is a common mistake: dashboards full of green checks while the pipeline silently failed to start. The order matters because each layer of monitoring catches a class of failure that the next layer presupposes. A schema validation that runs after a pipeline that did not run today validates yesterday's data and reports green; the alert that should have fired never reaches anyone. Putting the run-and-success checks in place first ensures every later check has the right substrate to operate on.

✓Day-One Monitor Set

Three checks: ran, succeeded, right size
Each check has a clear owner and a clear response
Alerts route to a channel where someone is on call
False positive rate is tolerable; alarm fatigue is low

•Premature Monitoring

Twenty checks copied from a vendor template
Half of them fire weekly with no clear owner
Alerts route to email; nobody reads the channel
Alarm fatigue is high; real alerts get missed

TIP

When inheriting a pipeline that has no monitors, the cheapest first move is to add the three day-one checks and watch them for two weeks before adding anything else.

The two-week observation period serves a specific purpose: it surfaces the false-positive rate of each check against real production data. A volume threshold set blindly will fire too often or never. The two-week window catches one or two weekend-vs-weekday patterns, holiday-traffic anomalies, or end-of-month batch effects, and the thresholds get tuned against what actually happens. Pipelines instrumented without this observation period tend to produce alerts that fire frequently in the first month and get muted by week six, which is the worst outcome: monitors that exist on paper but produce no useful signal.

[Diagram: pipeline run -> metrics + logs -> SLA breach? -> page on-call]

An operable pipeline emits logs, metrics, and traces; monitoring compares them to SLAs and pages on-call when one breaks. Without this, you find out a pipeline failed when a VP asks why the numbers are wrong.

Alerting That Stays Useful

Route alerts by severity so on-call engineers can respond without burning out on noise.

An alert is a request for human attention. Every alert that fires is a withdrawal from the on-call engineer's attention budget. A pipeline that pages on every minor anomaly bankrupts its on-call within weeks; the engineers stop reading the channel and the next real outage is missed. The discipline is to ration alerts so that the ones that fire are the ones that need a human to act now. The economics are stark: an engineer who responds to twenty pages a week treats the twenty-first as another routine interruption, which is exactly when the page that mattered slips through. The same engineer who responds to two pages a week treats both as serious by default, and the response rate stays high. The tuning that produces the second outcome is not subtle; it is restraint applied early and consistently.

Three Tiers of Severity

Tier	Routing	Example Trigger
Page	PagerDuty, phone wake-up, on-call rotation	Pipeline feeds a customer-facing system and has missed its SLA
Slack channel	Notification channel watched during business hours	Daily DAG failed; will retry at next scheduled run
Email digest	Daily or weekly roll-up read when convenient, never urgently	Row count drifted by 5% over the past week

The Test for Page-Worthy

An alert deserves a page if and only if it requires action within the hour, and that action cannot wait for business hours, and the on-call engineer can in fact do something about it. Alerts that fail any of those three tests belong in a lower tier. A pipeline that runs nightly and fails has roughly twenty-four hours before it matters; a Slack alert in the morning is sufficient. A streaming pipeline that feeds a real-time fraud system has minutes; that one pages.
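
The three-part test translates directly into a routing function. In the sketch below the three page conditions come from the paragraph above; the fourth parameter, needs_attention_today, is an assumption added to separate the Slack tier from the digest tier.

```python
def tier(action_within_hour, cannot_wait_for_business_hours, responder_can_act,
         needs_attention_today=True):
    """Route an alert: page only when all three page conditions hold."""
    if action_within_hour and cannot_wait_for_business_hours and responder_can_act:
        return "page"
    return "slack" if needs_attention_today else "digest"

fraud_stream_stalled = tier(True, True, True)
nightly_dag_failed = tier(False, False, True)
vendor_outage_no_fix = tier(True, True, False)   # nothing the responder can do
slow_null_rate_drift = tier(False, False, True, needs_attention_today=False)
```

Failing any one of the three conditions drops the alert out of the page tier, which is the "if and only if" in the test.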

Three rules for staying out of alarm fatigue:

▸Every alert names the action expected of the on-call engineer; if there is nothing to do, do not page
▸Every recurring false positive is investigated and the threshold is tuned, not silenced
▸An alert that fires more than once a week without a real cause is moved to a lower tier or removed

Page

Wake somebody up

Customer-facing impact, requires action within the hour, action exists. PagerDuty, phone, on-call rotation. Fewer than three of these per pipeline.

Slack

Tell the channel

Daily-cadence pipeline failed; off-band volume signal; non-customer impact. Watched during business hours. The default tier for most pipeline alerts.

Digest

Email the trend

Slow drift on null rates, gradual cost growth, quiet schema additions. Read once a day or once a week; never demands immediate action.

Page on Real Problems

A real problem has three properties. It is happening now. It will not resolve itself. A human can fix it. The classic page-worthy condition for a pipeline is a missed freshness SLA on a critical consumer: the dashboard the executive team looks at every Monday is supposed to be fresh by 6am, it is now 7am, and the data is still from Friday. That is page-worthy. A row count that is 12% lower than the trailing average is not. The lower volume may indicate a real anomaly, but it is not 'fix-this-now' material. It belongs in the morning Slack.
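
A freshness check of this kind compares the age of the newest loaded data against the SLA. A minimal sketch, assuming the table exposes a load timestamp that can be read with one query (the function name, the 60-minute SLA, and the timestamps are illustrative):

```python
from datetime import datetime, timedelta

def freshness_breach(latest_loaded_at, now, sla=timedelta(minutes=60)):
    """Return (breached, age_minutes) for a table whose newest data loaded at latest_loaded_at."""
    age = now - latest_loaded_at
    return age > sla, int(age.total_seconds() // 60)

breached, age_minutes = freshness_breach(
    latest_loaded_at=datetime(2024, 10, 7, 5, 55),
    now=datetime(2024, 10, 7, 7, 0),
)
```

Returning the age alongside the verdict lets the alert state the symptom as a number instead of a bare "stale" flag.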

Email on Weird-but-OK

Some signals are interesting without being urgent. Drift on a column's null rate, a slowly growing duration, a row count that has crept up 3% per week for a month. These are observations worth knowing about, but knowing about them at 3am does not help. The right channel is a daily digest email or a weekly review, not a page. The discipline is to keep the digest short enough that someone actually reads it; a digest with two hundred items is the same as no digest at all.

	# A typical alert routing config keyed to the three tiers
	alerts:
	  - name: revenue_dashboard_freshness_breach
	    condition: max(table_age_minutes{table='fct_revenue'}) > 60
	    severity: page
	    routes:
	      - pagerduty: data-platform-oncall
	    runbook: https://wiki/runbooks/revenue_dashboard_freshness
	  - name: orders_dag_failure
	    condition: airflow_dag_run_status{dag='orders_daily'} == 'failed'
	    severity: slack
	    routes:
	      - slack: '#data-platform-alerts'
	    runbook: https://wiki/runbooks/orders_dag_failure
	  - name: signups_volume_drift
	    condition: abs(daily_signups - signups_7d_avg) / signups_7d_avg > 0.05
	    severity: digest
	    routes:
	      - email: data-anomalies@company.com

What an Alert Should Contain

The pipeline name and the run identifier, so the alert is unambiguous.

A one-sentence statement of the symptom, not a stack trace.

A link to the runbook for this specific alert, so the responder is not searching at 3am.
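
Those three elements fit in a three-line message. The sketch below shows one possible layout; the function name, the run identifier, and the exact formatting are assumptions, not a required template.

```python
def format_alert(pipeline, run_id, symptom, runbook_url):
    """Build a three-line alert: identity, one-sentence symptom, runbook link."""
    return "\n".join([
        f"[{pipeline}] run {run_id}",
        symptom,
        f"Runbook: {runbook_url}",
    ])

message = format_alert(
    pipeline="revenue_dashboard",
    run_id="run-2024-10-07",  # hypothetical run identifier
    symptom="fct_revenue is 65 minutes stale, SLA 60 minutes",
    runbook_url="https://wiki/runbooks/revenue_dashboard_freshness",
)
```

Keeping the stack trace out of the message and behind the runbook link is deliberate: the responder reads the symptom first and the detail second.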

•Alerts That Get Ignored

Generic stack traces with no symptom summary
No link to a runbook; on-call has to read source code
Fire on weak signals like minor row-count drift
Route everything to one channel regardless of urgency

✓Alerts That Get Acted On

One-line symptom: 'fct_revenue is 65 minutes stale, SLA 60 minutes'
Link to a runbook with the standard response steps
Fire only on conditions that demand action within the alert tier's window
Route by severity: page for now, Slack for soon, email for FYI

Alarm fatigue is not the on-call engineer's failure of attention. It is the alert author's failure of restraint. Every alert that fires for nothing trains the team to ignore alerts.

Mature teams treat alert tuning as ongoing work. After a pipeline ships, the alerts produced by it generate weekly review data: which fired, which were actionable, which produced a runbook update, which produced no action at all. Alerts in the last category are candidates for tuning down or removal. The process is undramatic and quiet, but the absence of it is loud: a team that never reviews its alerts ends up with a noise floor it cannot hear over. The cost of the discipline is roughly an hour a week per pipeline owner; the saving is the on-call attention budget that pays back every time something real breaks.
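
The weekly review can start as a tally of which alerts fired without leading to any action. A sketch, assuming each firing is recorded as a pair of alert name and whether it was actionable (the record format and the sample week are invented):

```python
from collections import Counter

def alerts_to_tune(firings, max_noise_per_week=1):
    """Return alert names that fired without a real cause more than max_noise_per_week times."""
    noise = Counter(name for name, actionable in firings if not actionable)
    return sorted(name for name, n in noise.items() if n > max_noise_per_week)

# One hypothetical week of firings: (alert name, was it actionable?)
week = [
    ("orders_dag_failure", True),
    ("signups_volume_drift", False),
    ("signups_volume_drift", False),
    ("signups_volume_drift", False),
    ("revenue_dashboard_freshness_breach", True),
    ("orders_dag_failure", False),
]
to_tune = alerts_to_tune(week)
```

Here only signups_volume_drift crosses the once-a-week noise limit, so it is the candidate for a wider threshold or a lower tier.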

A First Runbook

Draft a runbook with symptom, impact, diagnosis, response, and escalation sections that an unfamiliar engineer can follow.

A runbook is a document that tells an on-call engineer what to do when a specific alert fires. It is not architecture documentation. It is not design rationale. It is a checklist tuned for the moment when something is wrong, the on-call has been paged, and the question is what to check first. A good runbook can be followed by an engineer who has never seen the pipeline before. A bad runbook is a wiki page that says 'contact Eric.' The shape of a useful runbook is closer to an emergency-room intake form than to a textbook chapter. The responder is under time pressure. Information that is not directly actionable in the next ten minutes is friction. The runbook prunes ruthlessly toward the next action.

What a Runbook Contains

Section	Contents	Purpose
Symptom	The alert text and what it means in plain language	Confirms that the responder is reading the right runbook
Impact	Who is affected and how badly, in business terms	Sets the urgency: customer-facing or internal-only
Diagnosis steps	An ordered list of checks to identify the cause	Threads the responder through the most likely failure modes
Response actions	The fix for each likely cause, with commands or links	Lets the responder act, not just diagnose
Escalation	Who to contact if the runbook does not resolve the issue	Bounds the responder's solo problem-solving time

A Runbook for the Daily Orders DAG

	# Runbook: orders_daily DAG failure

	## Symptom
	The orders_daily DAG has reported a failed run. Alert fires within 5 minutes of the failure.
	Slack channel: #data-platform-alerts.

	## Impact
	- analytics.fct_orders will not have data for the failed run_date.
	- Three downstream consumers depend on this table:
	  - revenue_dashboard (Looker)
	  - daily_revenue_email (cron at 8am)
	  - ml_revenue_forecast (trains at 9am)
	- A failed run before 6am has no business impact if recovered by 7am.
	- A failed run not recovered by 8am will cause stale data in the daily email.

	## Diagnosis
	1. Open the DAG run in Airflow and find the failed task.
	2. Check the task log for the exception type:
	   - OperationalError from MySQL: source database is unavailable.
	   - IntegrityError from Snowflake: schema mismatch in target table.
	   - TimeoutError: extract took longer than the 30-minute task budget.
	3. Check #ops-incidents for an ongoing platform-level issue.

	## Response
	- **Source unavailable**: wait 15 minutes, then trigger a manual rerun.
	      airflow dags trigger orders_daily --run-id manual_$(date +%s)
	- **Schema mismatch**: check dbt run --select fct_orders for migration errors. Coordinate with #data-platform on the target schema.
	- **Timeout**: check source row volume; volume above 5M typically requires bumping the task timeout in the DAG file.

	## Escalation
	If unresolved after 60 minutes or if impact extends past 8am:
	- Page: data-platform-oncall via PagerDuty
	- Backup: @marina, @diego (Slack DM)

Why Runbooks Get Written and Then Stop Working

Runbooks rot. The pipeline gets a new dependency, the alert text changes, the fix command stops working because the deployment moved. A runbook that has not been touched in a year is roughly as helpful as no runbook. The discipline is to update the runbook every time it gets used. The on-call engineer who follows the runbook to resolve an incident is the one who knows where it is wrong; their five minutes of editing saves the next responder twenty minutes.

Maintenance habits that keep runbooks alive:

▸Every postmortem produces a runbook update or a new runbook entry
▸Every alert links to its runbook; broken links are a CI check
▸Quarterly runbook review: an engineer who has not read the runbook follows it on a recent incident and edits as needed
▸Runbooks live in version control next to the pipeline code, not in a separate wiki
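
The broken-link check becomes straightforward once runbooks live in the repository. A sketch of what such a CI step might do, assuming alert definitions are available as dictionaries with a repo-relative runbook path (the file layout and field names are hypothetical):

```python
import tempfile
from pathlib import Path

def missing_runbooks(alerts, repo_root):
    """Return names of alerts whose runbook is unset or not a file under repo_root."""
    root = Path(repo_root)
    return [
        a["name"]
        for a in alerts
        if not a.get("runbook") or not (root / a["runbook"]).is_file()
    ]

# Build a throwaway repo layout so the example runs anywhere.
with tempfile.TemporaryDirectory() as repo:
    (Path(repo) / "runbooks").mkdir()
    (Path(repo) / "runbooks" / "orders_dag_failure.md").write_text(
        "# Runbook: orders_daily DAG failure\n"
    )
    alerts = [
        {"name": "orders_dag_failure", "runbook": "runbooks/orders_dag_failure.md"},
        {"name": "signups_volume_drift", "runbook": None},
        {"name": "revenue_dashboard_freshness_breach", "runbook": "runbooks/revenue.md"},
    ]
    broken = missing_runbooks(alerts, repo)
```

A CI job would fail the build when the returned list is non-empty, so an alert cannot ship without a runbook that exists.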

The First Runbook Is the One That Saves the Most Time

An organization with no runbooks discovers them under pressure. The first time a pipeline fails at 3am with no runbook, the on-call engineer reads the source code, queries the database, asks questions on Slack, and burns three hours. The second time, they remember most of it but have to look up commands. The third time, they finally write the runbook, and from then on every responder benefits. Writing the runbook before the third incident saves more time than the runbook costs to write.

✓Do

Write the runbook the first time the alert fires, not the third
Link every alert directly to its runbook URL
Update the runbook during the resolution, while the context is fresh

✗Don't

Treat runbooks as architecture documentation; they are response checklists
Let runbooks live in a separate wiki that drifts from the code
Write 'contact Eric' as the response; Eric will leave the company

TIP

A good test of a runbook: hand it to an engineer on a different team and ask if they could resolve the alert by following only the runbook. If the answer is no, the runbook is missing context or specificity.

Runbooks accumulate organizationally. A team with thirty pipelines has thirty runbooks; a platform team with three hundred pipelines has three hundred. Discoverability becomes its own problem at that scale. The fix is convention: runbooks live next to the pipeline code, named for the pipeline, linked from the alert, indexed in a known location. A new engineer joining the team learns the convention once and finds every runbook by following it. A team with no convention loses runbooks to wikis, Notion docs, Slack threads, and personal Google Drive folders. The runbook exists; nobody can find it during the incident, which is the only time it matters.

PUTTING IT ALL TOGETHER

> A new data engineer inherits the orders_daily pipeline. There are no monitors, no runbook, and the previous owner left two months ago. The pipeline runs every night at 2am and feeds three downstream consumers. The first task is to make the pipeline operable before the next failure happens. Where does the engineer start, and what do they build first?

Start by adding the three day-one monitors: did it run, did it succeed, was the output the right size. The first two come from the orchestrator. The third is a row-count band against the trailing seven-day average.

Route the failure alert to the team Slack channel, not to a page. The pipeline is daily, so the response window is hours, not minutes. Reserve the page tier for the freshness alert on the customer-facing consumer.

Write a runbook with the five standard sections (symptom, impact, diagnosis, response, escalation) and link it from every alert. The runbook does not need to be perfect; it needs to be present and improvable on the next incident.

Add structured logging and two metrics (rows_pulled, rows_written) to the pipeline itself, so the next investigation does not start from a stack trace. Thirty extra lines of code prevent the next three-hour debugging session.

KEY TAKEAWAYS

Operability is a separate property from correctness: a working script becomes an operable pipeline by adding run identity, structured logs, metrics, and recovery instructions.

Logs answer what, metrics answer how much, traces answer where: each signal has a different cost profile and a different best-use question; missing any one leaves a blind spot.

Three day-one monitors are enough to start: did it run, did it succeed, was the output the right size. Adding more before these exist is premature.

Alerts are a withdrawal from the on-call attention budget: page on real problems, Slack on failures that can wait for business hours, email digest on weird-but-OK drift. An alert that fires more than once a week without a real cause gets tuned, moved down a tier, or removed before its audience mutes it.

A runbook is a checklist for an unfamiliar responder: five sections (symptom, impact, diagnosis, response, escalation), updated every time it gets used, linked from every alert.

A pipeline that runs once is a script; one that survives Monday morning is operated.

Category: Pipeline Architecture
Difficulty: beginner
Duration: 25 minutes
Challenges: 0 hands-on challenges

Topics covered: Script vs Operable Pipeline, Logs, Metrics, and Traces, Day-One Monitoring, Alerting That Stays Useful, A First Runbook
