An e-commerce company with 80 engineers had a Python script that pulled orders from MySQL every night at 2am, transformed them, and loaded the results into Snowflake. The script had been written eighteen months earlier by an engineer who had since changed teams. On a Tuesday in October, the script ran for forty seconds and exited with a zero status. Nobody noticed. The dashboard the script fed kept showing data, but the data stopped advancing. Five days later the head of finance asked why October revenue had been flat since the third. The script had silently started skipping rows because of a timezone change in the source. The cost of finding the bug was fifteen engineering days. The cost of preventing it was a single freshness check that would have fired the first morning the table did not advance. The difference between a script and a pipeline is the existence of those checks, the alerts they raise, and the runbook that explains what to do when they fire. This lesson is about the smallest set of operational practices that turn the first outcome into the second. None of the practices is exotic. All of them are skipped routinely in early-stage data work, and skipping them is the most common reason a pipeline that ships fast ends up costing more than the chart it produced.
Script vs Operable Pipeline
Recognize the operational gap between a working script and a pipeline that can be run as a service.
A working script is a piece of code that produces the right answer when nothing goes wrong. An operable pipeline is a piece of code that someone can run, watch, debug, and recover from at three in the morning, six months after it was written, by a person who has never read its source. The two are not on the same axis. A script can be technically excellent and operationally useless. A pipeline can have ugly code and survive years of production because it tells operators what is happening. The bar for operability is set not by the original author but by the worst-case responder: a tired engineer who has never seen the pipeline before, has limited context on the surrounding system, and has perhaps fifteen minutes before the consequences become visible to consumers. Code that holds up under that bar is operable code. Code that does not is, by definition, a script.
Five Things a Script Does Not Have
| Operational Property | What a Script Lacks | What an Operable Pipeline Provides |
| --- | --- | --- |
| Run identity | No record of which run produced which output | A run_id stamped on every artifact and every log line |
| Visibility | Standard out scrolls past and disappears | Structured logs and metrics flow to a durable store |
| Failure signal | A nonzero exit code that nobody is watching | Alerts routed to a channel where someone is paged |
| Recovery | A human reads the source code and guesses | A runbook names the symptom and the response |
| Idempotent retries | Re-running corrupts state or duplicates rows | Re-running produces the same result as the first run |
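Of the five properties, idempotent retries is the one that benefits most from a concrete picture. The sketch below is one common way to get it, not the only one: replace the whole partition for a date inside a single transaction. It uses Python's built-in sqlite3 as a stand-in for a real warehouse, and the daily_orders table, its columns, and the write_partition helper are all hypothetical names chosen for illustration.

```python
import sqlite3

def write_partition(conn, run_date, run_id, rows):
    """Replace the whole partition for run_date so a rerun cannot duplicate rows."""
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("DELETE FROM daily_orders WHERE run_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO daily_orders (run_date, run_id, customer_id, total) "
            "VALUES (?, ?, ?, ?)",
            [(run_date, run_id, customer_id, total) for customer_id, total in rows],
        )

def make_demo_db():
    # Hypothetical schema, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE daily_orders "
        "(run_date TEXT, run_id TEXT, customer_id INTEGER, total REAL)"
    )
    return conn

conn = make_demo_db()
rows = [(1, 20.0), (2, 35.5)]
write_partition(conn, "2024-10-03", "run-a", rows)
write_partition(conn, "2024-10-03", "run-b", rows)  # retry of the same date
row_count = conn.execute("SELECT COUNT(*) FROM daily_orders").fetchone()[0]
```

After the retry the table still holds two rows, both stamped with the second run's identifier, so the rerun is safe and the output remains auditable.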
The Operability Gap, Made Concrete
Consider two implementations of the same daily orders aggregation. One is a Python file that loads orders, groups them, and writes the result to a table. The other does all of that and emits a structured log line at start, a row count metric at finish, a duration metric, and a heartbeat that increments every minute it runs. The first works. The second can be operated. The cost of writing the second is roughly thirty extra lines. The benefit shows up the first time the pipeline behaves strangely and an on-call engineer needs to understand what happened without reading the source.
```python
# A script: runs, exits, leaves no trace
def daily_orders_agg(run_date):
    rows = pull_orders(run_date)
    agg = aggregate(rows)
    write(agg, run_date)

# An operable pipeline: same logic, plus the operational shell
```
▸Every log line carries the run_id, which threads logs back to a single execution
▸Two metrics turn 'did anything happen' into a number on a dashboard
▸The start and done lines bound the run; their absence is itself a signal
▸The output write carries the run_id so a downstream reader can audit which run produced what
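The annotations above describe an operational shell around the same three calls. What follows is a minimal sketch of what such a shell might look like, not the lesson's original listing. The bodies of pull_orders, aggregate, and write are stand-ins, emit_metric is a hypothetical sink that simply logs, the metric names rows_pulled and rows_written are borrowed from later in this lesson, and the heartbeat is left out to keep the example short.

```python
import json
import sys
import time
import uuid
from collections import defaultdict

# Stand-ins for the helpers the script calls; the real ones are not shown in this lesson.
def pull_orders(run_date):
    return [{"customer_id": 1, "total": 20.0}, {"customer_id": 1, "total": 5.0},
            {"customer_id": 2, "total": 35.5}]

def aggregate(rows):
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer_id"]] += row["total"]
    return dict(totals)

WRITTEN = []  # pretend output table

def write(agg, run_date, run_id):
    # Every output row carries the run_id so a reader can audit which run wrote it.
    for customer_id, total in agg.items():
        WRITTEN.append({"run_date": run_date, "run_id": run_id,
                        "customer_id": customer_id, "total": total})

def log(run_id, event, **fields):
    # One JSON object per line; every line carries the run_id.
    print(json.dumps({"ts": time.time(), "run_id": run_id, "event": event, **fields}),
          file=sys.stderr)

def emit_metric(name, value, run_id):
    # Hypothetical metric sink; a real pipeline would call its metrics client here.
    log(run_id, "metric", name=name, value=value)

def daily_orders_agg(run_date):
    run_id = uuid.uuid4().hex
    started = time.monotonic()
    log(run_id, "start", pipeline="daily_orders_agg", run_date=run_date)
    try:
        rows = pull_orders(run_date)
        emit_metric("rows_pulled", len(rows), run_id)
        agg = aggregate(rows)
        write(agg, run_date, run_id)
        emit_metric("rows_written", len(agg), run_id)
    except Exception as exc:
        log(run_id, "failed", error=repr(exc))
        raise
    finally:
        emit_metric("duration_seconds", round(time.monotonic() - started, 3), run_id)
    log(run_id, "done", run_date=run_date)
    return run_id
```

The business logic is unchanged; everything added exists so that a responder can answer "which run, how many rows, how long" without opening the source.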
The Three Audiences for an Operable Pipeline
An operable pipeline produces evidence for three different audiences. Engineers debugging a failure want detailed logs, ideally indexed by run_id. Operators watching the system want metrics: counts, durations, error rates, dashboard tiles. Auditors and consumers want traces of which run wrote which output, with timestamps and provenance. The same pipeline, well instrumented, serves all three audiences without making any of them dig through the others' material.
•Script Mindset
Success is defined as 'it ran without an exception'
Output is the only artifact that matters
Logs are stdout, examined when something goes wrong
Re-running is dangerous because state is unpredictable
✓Pipeline Mindset
Success is defined as 'it ran, succeeded, and produced output of the right shape'
Output and metadata are both first-class artifacts
Structured logs and metrics flow continuously to a durable store
Re-running is safe by construction; idempotency is built in
Operability is a separate property from correctness; a correct script can still be inoperable.
Run identity, visibility, and recovery instructions are the smallest operability set.
Operable pipelines emit evidence aimed at three audiences: engineers, operators, auditors.
TIP
Before adding a new pipeline to production, list what an on-call engineer would see if it failed at 3am. If the answer is 'a stack trace in stdout,' the pipeline is not yet operable.
The shift from script to pipeline tracks a shift in mindset. A script's author optimizes for the moment the code is written: short, clever, expressive. A pipeline's author optimizes for the moments years later when the code is debugged, modified, and eventually retired by people the original author has never met. Those audiences want different things. The script's author wants brevity; the pipeline's audience wants evidence. Reconciling the two is a discipline, not a personality trait, and the discipline is built one operability practice at a time. The remainder of this lesson is the smallest set of practices that brings a script across the line.
Logs, Metrics, and Traces
Distinguish logs, metrics, and traces and pick the right signal for a given operational question.
Three classes of signal show up in every observability discussion: logs, metrics, and traces. The vocabulary matters because each one answers a different question and has different storage and cost characteristics. Mixing them up produces dashboards that cost too much, alerts that fire on the wrong condition, and debugging sessions that bog down because the right signal is missing. The three are sometimes called the three pillars of observability. The framing comes out of the SRE and distributed-systems communities; it predates the data engineering specialization but applies cleanly to it. A pipeline is a distributed system whether the operator thinks of it that way or not, and the same observability vocabulary that serves Kubernetes clusters serves DAGs.
Logs in One Paragraph
A log is a timestamped record of an event. It is meant to be read by a human, eventually, when something needs to be understood after the fact. Good logs are structured: each line is a JSON object with a timestamp, a level, a message, and a small set of fields like run_id, pipeline_name, and table_name. Bad logs are free text concatenations that nobody can grep effectively. Logs are expensive to store at high volume and cheap to store at low volume; the discipline is to log enough to reconstruct what happened, not enough to reconstruct what could have happened.
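A structured log line of the kind described above can be produced with the standard logging module alone. The sketch below is one possible arrangement: JsonFormatter and make_logger are illustrative names, and the three extra fields are the ones mentioned in the paragraph, not a required schema.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a stable set of fields."""
    FIELDS = ("run_id", "pipeline_name", "table_name")

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Fields passed through `extra=` become attributes on the record.
        for field in self.FIELDS:
            payload[field] = getattr(record, field, None)
        return json.dumps(payload)

def make_logger(stream):
    logger = logging.getLogger("orders_daily")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger

buffer = io.StringIO()  # stands in for the durable log store
logger = make_logger(buffer)
logger.info(
    "write finished",
    extra={"run_id": "run-42", "pipeline_name": "orders_daily", "table_name": "fct_orders"},
)
line = json.loads(buffer.getvalue())
```

Because every line is a JSON object with the same keys, a search for one run_id returns the whole story of a run instead of a pile of free text.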
Metrics in One Paragraph
A metric is a numeric measurement at a point in time. It is meant to be aggregated, averaged, summed, plotted on a dashboard, and alerted on when it crosses a threshold. Good metrics are small in cardinality: 'rows_written by pipeline_name' is a metric; 'rows_written by user_id' is a cardinality bomb that explodes in cost. Metrics are cheap to store at low cardinality and expensive at high cardinality: their cost scales with the number of distinct label combinations, whereas the cost of logs scales with the number of events. The mental model is that metrics tell whether something is happening; logs tell what specifically happened.
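The cardinality point is simple arithmetic: in the worst case, one metric name produces one time series per combination of label values. The sketch below makes that upper bound concrete; series_count is an illustrative helper, and the label values are invented.

```python
def series_count(label_values):
    """Upper bound on distinct time series for one metric name:
    the product of the number of values each label can take."""
    count = 1
    for values in label_values.values():
        count *= len(values)
    return count

low = {"pipeline_name": ["orders_daily", "refunds_daily"],
       "table_name": ["fct_orders", "fct_refunds", "dim_customer"]}
high = dict(low, user_id=range(100_000))  # one extra high-cardinality label

low_series = series_count(low)    # 2 * 3
high_series = series_count(high)  # 2 * 3 * 100_000
```

Adding a single user_id label moves the same metric from 6 possible series to 600,000, which is why per-user detail belongs in logs rather than in metric tags.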
Traces in One Paragraph
A trace records the path a request takes through a system. In a pipeline context, a trace ties together the operations that make up one logical run: extract started, extract finished, transform started, transform finished, write started, write finished, with parent-child relationships and durations. Traces are the right signal for understanding latency and dependency: if a daily DAG runs in two hours instead of forty minutes, a trace shows which task spent the extra time. OpenTelemetry is the de facto standard for instrumentation, with OTLP as its wire protocol; Honeycomb, Datadog APM, and Tempo are common backends.
| Signal | Best Question to Answer | Worst Question to Force It to Answer |
| --- | --- | --- |
| Logs | What happened on this specific run, in detail | What is the average duration over the last quarter |
| Metrics | Is the system healthy right now; how does it trend | What was the exact error message the pipeline emitted at 03:14 |
| Traces | Where is the latency in this multi-step run | What is the row count of the third table written yesterday |
Logs (discrete events, structured): Timestamped records of what happened. Indexed for search. Examples: errors, warnings, lifecycle events. Tools: CloudWatch, Datadog Logs, Loki.
Metrics (numeric time series): Aggregated measurements over windows. Cheap at low cardinality. Examples: rows_written, duration_seconds, error_rate. Tools: Prometheus, Datadog Metrics, CloudWatch Metrics.
Traces (request flow with timing): Parent-child operation tree across a single logical run. Examples: extract -> transform -> write spans. Tools: Honeycomb, Datadog APM, Tempo, Jaeger.
```python
    # trace: the span itself, plus attributes for cross-step analysis
    span.set_attribute('rows', len(rows))
    return rows
```
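To show what a span buys without depending on any tracing library, here is a toy tracer built on the standard library. It is deliberately not the OpenTelemetry API: Tracer, span, and the record layout are illustrative names, and a production pipeline would use a real tracing SDK and export the spans to a backend.

```python
import time
from contextlib import contextmanager

class Tracer:
    """Toy tracer: records name, parent, duration, and attributes per span."""
    def __init__(self):
        self.finished = []
        self._stack = []

    @contextmanager
    def span(self, name):
        record = {"name": name,
                  "parent": self._stack[-1]["name"] if self._stack else None,
                  "attributes": {}}
        self._stack.append(record)
        started = time.perf_counter()
        try:
            yield record
        finally:
            record["duration_s"] = time.perf_counter() - started
            self._stack.pop()
            self.finished.append(record)

tracer = Tracer()

def run_pipeline():
    with tracer.span("orders_daily"):
        with tracer.span("extract") as span:
            rows = [1, 2, 3]  # placeholder for the real extract
            span["attributes"]["rows"] = len(rows)
        with tracer.span("transform"):
            total = sum(rows)
        with tracer.span("write"):
            pass
    return total

run_pipeline()
# The step that consumed the most time under the root span.
slowest_child = max((s for s in tracer.finished if s["parent"] == "orders_daily"),
                    key=lambda s: s["duration_s"])
```

Each child span knows its parent and its own duration, so "which step got slow" becomes a lookup instead of an investigation.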
When confused about which signal to add:
▸Need to debug a single failure after the fact: add a log line
▸Need a dashboard or threshold alert: add a metric
▸Need to understand where time was spent across multiple steps: add a span (trace)
▸Never use a log search to compute aggregate trends; metrics are cheaper and faster
Logs answer 'what.' Metrics answer 'how much.' Traces answer 'where.' A pipeline missing any one of the three has a blind spot.
✓Do
Log structured JSON with a stable set of fields, not free text
Keep metric cardinality small: pipeline_name and table_name are fine, user_id is not
Wrap multi-step pipelines in trace spans to make latency attributable
✗Don't
Use log search to compute trend metrics; the cost grows linearly with retention
Tag metrics with high-cardinality fields like email or order_id
Skip structured logging because 'print is fine for now'
Day-One Monitoring
Choose the smallest set of monitors that makes a new pipeline operable on day one.
A new pipeline does not need fifty monitors. It needs three. Did it run, did it succeed, and was the output the right size. Those three monitors catch most of the failure modes that show up in the first month. Adding more monitors before those three exist is premature optimization; adding fewer leaves blind spots that consumers will discover before the pipeline does.
The Three Day-One Monitors
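| Monitor | Question Answered | Failure Mode It Catches |
| --- | --- | --- |
| Did it run | Did the scheduled job actually fire today | Scheduler outage, deployment removed the job, cron expression broken |
| Did it succeed | Did the most recent run reach a successful terminal status | A run that started but did not finish, or finished with a failure status |
| Was the output the right size | Is today's row count inside the expected band | Empty extracts, filters that swallowed everything, joins that lost the join key |

Did It Run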
The simplest monitor is also the most embarrassing one to forget. A pipeline scheduled for 2am is supposed to start at 2am. If no run record exists for today by 2:30am, something is wrong with the scheduler, not the pipeline. The check is one query against the orchestrator's state: count of runs for this DAG today, expected to be at least 1. Most orchestrators (Airflow, Dagster, Prefect) emit this as a metric or expose it through their API. The threshold is set against the schedule plus a generous tolerance for clock skew and queue lag.
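The check itself is small. The sketch below assumes the orchestrator can return a list of run start times for the DAG; how that list is fetched differs by orchestrator and is not shown. The missed_run helper and its thirty-minute default tolerance are illustrative choices that mirror the 2:30am example above.

```python
from datetime import datetime, timedelta

def missed_run(run_starts, scheduled_at, now, tolerance=timedelta(minutes=30)):
    """True when the deadline has passed and no run started near the scheduled time."""
    if now < scheduled_at + tolerance:
        return False  # too early to judge
    window_start = scheduled_at - tolerance  # allow for clock skew
    return not any(window_start <= start <= now for start in run_starts)

scheduled = datetime(2024, 10, 3, 2, 0)
yesterday_only = [datetime(2024, 10, 2, 2, 0, 5)]
alert_needed = missed_run(yesterday_only, scheduled, now=datetime(2024, 10, 3, 2, 31))
```

Here alert_needed is True: the only run on record is yesterday's, and the deadline for today's run has passed.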
Did It Succeed
A run that started but did not finish, or finished with a failure status, is the second class of failure. The monitor watches the most recent run's terminal status. The orchestrator is the source of truth for this signal; bolted-on checks that look at output tables can be fooled by a partial write. The right alert is on the orchestrator's run state, not on the data.
Was the Output the Right Size
A run that started, finished with success, and produced an output of zero rows is technically successful and operationally a disaster. The third monitor catches the case where the pipeline ran cleanly but the data is wrong. The cheapest version is a row count check: today's partition row count is between 80% and 120% of the trailing seven-day average, or a fixed range like 'between 5,000 and 50,000 per day.' This single check catches a remarkable share of silent failures: empty extracts, filters that swallowed everything, joins that lost the join key. The reason it works so well is that most pipeline failures show up as volume anomalies before they show up as anything else. A schema change that drops rows shows up as fewer rows. A timezone bug that misses a chunk of events shows up as fewer rows. A new filter accidentally added in a refactor shows up as fewer rows. The volume monitor is a generic failure detector dressed up as a row count.
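The row-count band described above fits in a dozen lines. This is a sketch under the paragraph's own numbers (80% to 120% of the trailing average); volume_alert is an illustrative name, and the counts in the example are invented.

```python
def volume_alert(today_rows, trailing_counts, low=0.8, high=1.2):
    """Return None when today's count is inside the band, else a one-line symptom."""
    if not trailing_counts:
        return None  # no history yet; nothing to compare against
    average = sum(trailing_counts) / len(trailing_counts)
    lower, upper = low * average, high * average
    if lower <= today_rows <= upper:
        return None
    return (f"row count {today_rows} outside expected band "
            f"[{lower:.0f}, {upper:.0f}] (trailing average {average:.0f})")

last_seven_days = [10_200, 9_800, 10_050, 9_900, 10_100, 10_000, 9_950]
ok = volume_alert(10_300, last_seven_days)
empty_extract = volume_alert(0, last_seven_days)
```

A normal day returns None, while a run that "succeeded" with zero rows returns a symptom line ready to drop into an alert.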
The minimum monitor set for a new pipeline:
▸A scheduler-level alert when no run record exists by the expected start time plus tolerance
▸An orchestrator-level alert on the most recent run's failure status
▸A data-level alert on the latest partition's row count being outside the expected band
▸A heartbeat for long-running tasks so a hung process is distinguishable from a finished one
Why These Three and Not Others
Sophisticated checks (null-rate by column, distribution drift, schema validation) are useful but secondary. The first three answer the questions consumers ask first: is there fresh data, did the pipeline finish, does the volume look right. The next tier answers questions consumers ask second, after the first three are passing. Building the second tier first is a common mistake: dashboards full of green checks while the pipeline silently failed to start. The order matters because each layer of monitoring catches a class of failure that the next layer presupposes. A schema validation that runs after a pipeline that did not run today validates yesterday's data and reports green; the alert that should have fired never reaches anyone. Putting the run-and-success checks in place first ensures every later check has the right substrate to operate on.
✓Day-One Monitor Set
Three checks: ran, succeeded, right size
Each check has a clear owner and a clear response
Alerts route to a channel where someone is on call
False positive rate is tolerable; alarm fatigue is low
•Premature Monitoring
Twenty checks copied from a vendor template
Half of them fire weekly with no clear owner
Alerts route to email; nobody reads the channel
Alarm fatigue is high; real alerts get missed
TIP
When inheriting a pipeline that has no monitors, the cheapest first move is to add the three day-one checks and watch them for two weeks before adding anything else.
The two-week observation period serves a specific purpose: it surfaces the false-positive rate of each check against real production data. A volume threshold set blindly will fire too often or never. The two-week window catches one or two weekend-vs-weekday patterns, holiday-traffic anomalies, or end-of-month batch effects, and the thresholds get tuned against what actually happens. Pipelines instrumented without this observation period tend to produce alerts that fire frequently in the first month and get muted by week six, which is the worst outcome: monitors that exist on paper but produce no useful signal.
Alerting That Stays Useful
Route alerts by severity so on-call engineers can respond without burning out on noise.
An alert is a request for human attention. Every alert that fires is a withdrawal from the on-call engineer's attention budget. A pipeline that pages on every minor anomaly bankrupts its on-call within weeks; the engineers stop reading the channel and the next real outage is missed. The discipline is to ration alerts so that the ones that fire are the ones that need a human to act now. The economics are stark: an engineer who responds to twenty pages a week treats the twenty-first as another routine interruption, which is exactly when the page that mattered slips through. The same engineer who responds to two pages a week treats both as serious by default, and the response rate stays high. The tuning that produces the second outcome is not subtle; it is restraint applied early and consistently.
Three Tiers of Severity
| Tier | Routing | Example Trigger |
| --- | --- | --- |
| Page | PagerDuty, phone wake-up, on-call rotation | Pipeline feeds a customer-facing system and has missed its SLA |
| Slack channel | Notification channel watched during business hours | Daily DAG failed; will retry at next scheduled run |
| Email digest | Daily roll-up that nobody opens until something is wrong | Row count drifted by 5% over the past week |
The Test for Page-Worthy
An alert deserves a page if and only if it requires action within the hour, and that action cannot wait for business hours, and the on-call engineer can in fact do something about it. Alerts that fail any of those three tests belong in a lower tier. A pipeline that runs nightly and fails has roughly twenty-four hours before it matters; a Slack alert in the morning is sufficient. A streaming pipeline that feeds a real-time fraud system has minutes; that one pages.
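The three-part test can be written down as a routing function, which makes it harder to page by accident. The sketch below is one possible encoding: AlertCondition, its field names, and the is_slow_trend flag used to separate the two lower tiers are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class AlertCondition:
    needs_action_within_hour: bool
    cannot_wait_for_business_hours: bool
    responder_can_act: bool
    is_slow_trend: bool = False

def route(condition):
    """Page only when all three tests pass; otherwise fall to a lower tier."""
    if (condition.needs_action_within_hour
            and condition.cannot_wait_for_business_hours
            and condition.responder_can_act):
        return "page"
    return "digest" if condition.is_slow_trend else "slack"

nightly_failure = AlertCondition(False, False, True)
fraud_stream_down = AlertCondition(True, True, True)
null_rate_drift = AlertCondition(False, False, False, is_slow_trend=True)
```

With these inputs, the nightly failure routes to Slack, the fraud stream outage pages, and the null-rate drift lands in the digest.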
Three rules for staying out of alarm fatigue:
▸Every alert names the action expected of the on-call engineer; if there is nothing to do, do not page
▸Every recurring false positive is investigated and the threshold is tuned, not silenced
▸An alert that fires more than once a week without a real cause is moved to a lower tier or removed
Page (wake somebody up): Customer-facing impact, requires action within the hour, action exists. PagerDuty, phone, on-call rotation. Fewer than three of these per pipeline.
Slack (tell the channel): Daily-cadence pipeline failed; off-band volume signal; non-customer impact. Watched during business hours. The default tier for most pipeline alerts.
Digest (email the trend): Slow drift on null rates, gradual cost growth, quiet schema additions. Read once a day or once a week; never demands immediate action.
Page on Real Problems
A real problem has three properties. It is happening now. It will not resolve itself. A human can fix it. The classic page-worthy condition for a pipeline is a missed freshness SLA on a critical consumer: the dashboard the executive team looks at every Monday is supposed to be fresh by 6am, it is now 7am, and the data is still from Friday. That is page-worthy. A row count that is 12% lower than the trailing average is not. The lower volume may indicate a real anomaly, but it is not 'fix-this-now' material. It belongs in the morning Slack.
Email on Weird-but-OK
Some signals are interesting without being urgent. Drift on a column's null rate, a slowly growing duration, a row count that has crept up 3% per week for a month. These are observations worth knowing about, but knowing about them at 3am does not help. The right channel is a daily digest email or a weekly review, not a page. The discipline is to keep the digest short enough that someone actually reads it; a digest with two hundred items is the same as no digest at all.
Whatever the tier, a useful alert carries three things:
▸The pipeline name and the run identifier, so the alert is unambiguous.
▸A one-sentence statement of the symptom, not a stack trace.
▸A link to the runbook for this specific alert, so the responder is not searching at 3am.
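Those three parts are easy to enforce at the point where the alert text is built. The sketch below is illustrative only: format_alert is a hypothetical helper, and the runbook URL is a placeholder, not a real address.

```python
def format_alert(pipeline, run_id, symptom, runbook_url):
    """Build an alert body with three parts: identity, symptom, runbook link."""
    if "\n" in symptom:
        raise ValueError("symptom must be one sentence, not a stack trace")
    return (f"[{pipeline}] run {run_id}\n"
            f"Symptom: {symptom}\n"
            f"Runbook: {runbook_url}")

message = format_alert(
    pipeline="fct_revenue",
    run_id="run-42",
    symptom="fct_revenue is 65 minutes stale, SLA 60 minutes",
    runbook_url="https://example.com/runbooks/fct_revenue-freshness",  # placeholder URL
)
```

Rejecting multi-line symptoms is a blunt rule, but it keeps stack traces in the logs, where the run identifier can find them.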
•Alerts That Get Ignored
Generic stack traces with no symptom summary
No link to a runbook; on-call has to read source code
Fire on weak signals like minor row-count drift
Route everything to one channel regardless of urgency
✓Alerts That Get Acted On
One-line symptom: 'fct_revenue is 65 minutes stale, SLA 60 minutes'
Link to a runbook with the standard response steps
Fire only on conditions that demand action within the alert tier's window
Route by severity: page for now, Slack for soon, email for FYI
Alarm fatigue is not the on-call engineer's failure of attention. It is the alert author's failure of restraint. Every alert that fires for nothing trains the team to ignore alerts.
Mature teams treat alert tuning as ongoing work. After a pipeline ships, the alerts produced by it generate weekly review data: which fired, which were actionable, which produced a runbook update, which produced no action at all. Alerts in the last category are candidates for tuning down or removal. The process is undramatic and quiet, but the absence of it is loud: a team that never reviews its alerts ends up with a noise floor it cannot hear over. The cost of the discipline is roughly an hour a week per pipeline owner; the saving is the on-call attention budget that pays back every time something real breaks.
A First Runbook
Draft a runbook with symptom, impact, diagnosis, response, and escalation sections that an unfamiliar engineer can follow.
A runbook is a document that tells an on-call engineer what to do when a specific alert fires. It is not architecture documentation. It is not design rationale. It is a checklist tuned for the moment when something is wrong, the on-call has been paged, and the question is what to check first. A good runbook can be followed by an engineer who has never seen the pipeline before. A bad runbook is a wiki page that says 'contact Eric.' The shape of a useful runbook is closer to an emergency-room intake form than to a textbook chapter. The responder is under time pressure. Information that is not directly actionable in the next ten minutes is friction. The runbook prunes ruthlessly toward the next action.
What a Runbook Contains
| Section | Contents | Purpose |
| --- | --- | --- |
| Symptom | The alert text and what it means in plain language | Confirms that the responder is reading the right runbook |
| Impact | Who is affected and how badly, in business terms | Sets the urgency: customer-facing or internal-only |
| Diagnosis steps | An ordered list of checks to identify the cause | Threads the responder through the most likely failure modes |
| Response actions | The fix for each likely cause, with commands or links | Lets the responder act, not just diagnose |
| Escalation | Who to contact if the runbook does not resolve the issue | |
Runbooks rot. The pipeline gets a new dependency, the alert text changes, the fix command stops working because the deployment moved. A runbook that has not been touched in a year is roughly as helpful as no runbook. The discipline is to update the runbook every time it gets used. The on-call engineer who follows the runbook to resolve an incident is the one who knows where it is wrong; their five minutes of editing saves the next responder twenty minutes.
Maintenance habits that keep runbooks alive:
▸Every postmortem produces a runbook update or a new runbook entry
▸Every alert links to its runbook; broken links are a CI check
▸Quarterly runbook review: an engineer who has not read the runbook follows it on a recent incident and edits as needed
▸Runbooks live in version control next to the pipeline code, not in a separate wiki
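The "broken links are a CI check" habit is cheap to implement when runbooks live in the repository. The sketch below assumes alerts are declared as a mapping from alert name to a runbook path relative to the repo root; that layout, the file names, and the missing_runbooks helper are all hypothetical.

```python
import tempfile
from pathlib import Path

def missing_runbooks(alerts, repo_root):
    """Return alert names whose runbook path is absent or does not exist under repo_root."""
    root = Path(repo_root)
    missing = []
    for name, runbook_path in alerts.items():
        if not runbook_path or not (root / runbook_path).is_file():
            missing.append(name)
    return sorted(missing)

# Demo with a throwaway repo layout; file and alert names are hypothetical.
with tempfile.TemporaryDirectory() as repo:
    runbook = Path(repo) / "pipelines" / "orders_daily" / "RUNBOOK.md"
    runbook.parent.mkdir(parents=True)
    runbook.write_text("Symptom\nImpact\nDiagnosis steps\nResponse actions\nEscalation\n")
    alerts = {
        "orders_daily_failed": "pipelines/orders_daily/RUNBOOK.md",
        "orders_daily_volume": "pipelines/orders_daily/VOLUME.md",  # file does not exist
        "orders_daily_stale": None,                                  # no link at all
    }
    broken = missing_runbooks(alerts, repo)
```

A CI job would fail the build whenever the returned list is non-empty, so an alert cannot ship without a runbook behind it.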
The First Runbook Is the One That Saves the Most Time
An organization with no runbooks discovers them under pressure. The first time a pipeline fails at 3am with no runbook, the on-call engineer reads the source code, queries the database, asks questions on Slack, and burns three hours. The second time, they remember most of it but have to look up commands. The third time, they finally write the runbook, and from then on every responder benefits. Writing the runbook before the third incident saves more time than the runbook costs to write.
✓Do
Write the runbook the first time the alert fires, not the third
Link every alert directly to its runbook URL
Update the runbook during the resolution, while the context is fresh
✗Don't
Treat runbooks as architecture documentation; they are response checklists
Let runbooks live in a separate wiki that drifts from the code
Write 'contact Eric' as the response; Eric will leave the company
TIP
A good test of a runbook: hand it to an engineer on a different team and ask if they could resolve the alert by following only the runbook. If the answer is no, the runbook is missing context or specificity.
Runbooks accumulate organizationally. A team with thirty pipelines has thirty runbooks; a platform team with three hundred pipelines has three hundred. Discoverability becomes its own problem at that scale. The fix is convention: runbooks live next to the pipeline code, named for the pipeline, linked from the alert, indexed in a known location. A new engineer joining the team learns the convention once and finds every runbook by following it. A team with no convention loses runbooks to wikis, Notion docs, Slack threads, and personal Google Drive folders. The runbook exists; nobody can find it during the incident, which is the only time it matters.
PUTTING IT ALL TOGETHER
> A new data engineer inherits the orders_daily pipeline. There are no monitors, no runbook, and the previous owner left two months ago. The pipeline runs every night at 2am and feeds three downstream consumers. The first task is to make the pipeline operable before the next failure happens. Where does the engineer start, and what do they build first?
Start by adding the three day-one monitors: did it run, did it succeed, was the output the right size. The first two come from the orchestrator. The third is a row-count band against the trailing seven-day average.
Route the failure alert to the team Slack channel, not to a page. The pipeline is daily, so the response window is hours, not minutes. Reserve the page tier for the freshness alert on the customer-facing consumer.
Write a runbook with the five standard sections (symptom, impact, diagnosis, response, escalation) and link it from every alert. The runbook does not need to be perfect; it needs to be present and improvable on the next incident.
Add structured logging and two metrics (rows_pulled, rows_written) to the pipeline itself, so the next investigation does not start from a stack trace. Thirty extra lines of code prevent the next three-hour debugging session.
KEY TAKEAWAYS
Operability is a separate property from correctness: a working script becomes an operable pipeline by adding run identity, structured logs, metrics, and recovery instructions.
Logs answer what, metrics answer how much, traces answer where: each signal has a different cost profile and a different best-use question; missing any one leaves a blind spot.
Three day-one monitors are enough to start: did it run, did it succeed, was the output the right size. Adding more before these exist is premature.
Alerts are a withdrawal from the on-call attention budget: page on real problems, Slack on failures that can wait for business hours, email digest on weird-but-OK drift. Above one false positive per week, alerts get muted by their audience.
A runbook is a checklist for an unfamiliar responder: five sections (symptom, impact, diagnosis, response, escalation), updated every time it gets used, linked from every alert.
A pipeline that runs once is a script; one that survives Monday morning is operated
Category: Pipeline Architecture
Difficulty: beginner
Duration: 25 minutes
Challenges: 0 hands-on challenges
Topics covered: Script vs Operable Pipeline, Logs, Metrics, and Traces, Day-One Monitoring, Alerting That Stays Useful, A First Runbook