
Your On-Call Rotation Is Broken. Here's How to Fix It.

June 18, 2026 7 min read

The engineering lead was getting paged 43 times a week. Not 43 incidents — 43 pages. Some were real. Most were not. High CPU on a batch job that self-resolved in 90 seconds. Disk usage at 74% on a server with a month of headroom. A health check failing for three seconds during a scheduled deploy. Each one woke someone up at 3am. Each one trained the on-call engineer to treat the next page as probably nothing.

That's the trap. When everything pages, nothing matters. And when nothing matters, the real incident, the one that actually needs a human, gets the same level of urgency as the disk-at-74% alert that's been firing every night for a year.

Most on-call rotations don't fail because of bad engineers. They fail because of bad systems, built up incrementally by people who were trying to be safe. More alerts felt more careful. More notifications felt more covered. By the time anyone looked at it directly, the signal-to-noise ratio was so bad that on-call had become a nightly ordeal that nobody wanted, and everyone dreaded their rotation week.

Here's the process we use to fix it.

Step 1: The alert audit — close everything that doesn't require a human

The starting point is brutal honesty about what each alert is actually for. For every alert in the system, ask one question: if this fires at 3am, what does the on-call engineer do?

If the answer is "wait and see if it resolves" — it's not an alert. It's a log entry. Retire it or route it to a dashboard, not a page.

If the answer is "there's nothing they can do until business hours" — it's not an alert. It's a Slack notification. Route it to a low-priority channel, not PagerDuty.

If the answer involves a runbook step that results in a human pressing a button — it's a candidate for automation, not a page.

In practice, we export every alert from the monitoring system, go through each one in a spreadsheet, and classify it: page (needs a human immediately), ticket (important but not urgent), log (informational only), or retire (nobody can articulate why this exists). The retire pile is usually larger than anyone expects.
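The spreadsheet pass can be sketched as a small decision function. This is a minimal illustration, not the tooling we use: the `AuditRow` columns and the order of the checks are assumptions chosen to mirror the three questions above.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE = "page"      # needs a human immediately
    TICKET = "ticket"  # important but not urgent
    LOG = "log"        # informational only
    RETIRE = "retire"  # nobody can articulate why this exists


@dataclass
class AuditRow:
    # Hypothetical spreadsheet columns, one row per exported alert.
    name: str
    has_known_purpose: bool
    needs_human_action: bool
    can_wait_for_business_hours: bool


def classify(row: AuditRow) -> Route:
    """Apply the audit questions in order, from cheapest outcome to most expensive."""
    if not row.has_known_purpose:
        return Route.RETIRE
    if not row.needs_human_action:
        return Route.LOG
    if row.can_wait_for_business_hours:
        return Route.TICKET
    return Route.PAGE
```

Checking "retire" first keeps alerts with no stated purpose from being argued into a higher tier by default.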

In the case above: 43 pages a week dropped to 9 actionable pages a week after the audit. No reliability change. Just honesty about what actually needed a human.

Step 2: Thresholds based on real impact, not technical state

Most alert thresholds are set once and never revisited. CPU at 80%? Sure. Disk at 70%? Sounds reasonable. These numbers feel safe but they're decoupled from the actual impact on users.

Replace them with questions about user impact. At what CPU level does response latency degrade enough that users notice? At what disk level is there actual risk of running out in the next 6 hours? Those are the thresholds worth alerting on. If CPU can run at 90% for a week without affecting users, then 80% is noise.
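As a rough sketch of the disk question, the threshold can be expressed as time until full rather than percent used. The function names are hypothetical, and the projection assumes linear growth, which real disks often do not follow.

```python
def hours_until_full(used_pct: float, growth_pct_per_hour: float) -> float:
    """Linear projection of the hours left before the disk reaches 100%."""
    if growth_pct_per_hour <= 0:
        return float("inf")  # flat or shrinking usage never fills the disk
    return max(0.0, 100.0 - used_pct) / growth_pct_per_hour


def should_page_on_disk(
    used_pct: float, growth_pct_per_hour: float, horizon_hours: float = 6.0
) -> bool:
    """Page only when the disk is projected to fill within the horizon."""
    return hours_until_full(used_pct, growth_pct_per_hour) <= horizon_hours
```

Under this rule, a disk at 74% growing slowly enough to last a month does not page, while the same 74% growing at 5 points per hour does.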

For services with real SLOs, alert on SLO burn rate rather than raw metrics. A CPU spike that doesn't affect your API's error rate or latency is irrelevant. A p99 latency that is burning through your error budget faster than planned is immediately actionable, regardless of what CPU is doing.
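Burn rate is the observed bad-request ratio divided by the error budget, so a value of 1.0 means the budget is being spent exactly on schedule. The sketch below assumes you can count "bad" requests over a window (errors, or requests slower than your latency target). The default threshold of 14.4 is an assumption; it corresponds to spending 2% of a 30-day budget in one hour, and you should pick a value that suits your own SLO window.

```python
def burn_rate(bad_requests: int, total_requests: int, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total_requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    observed_bad_ratio = bad_requests / total_requests
    return observed_bad_ratio / error_budget


def should_page(
    bad_requests: int,
    total_requests: int,
    slo_target: float,
    threshold: float = 14.4,  # assumed: 2% of a 30-day budget in 1 hour (0.02 * 720)
) -> bool:
    """Page when the budget is burning at or above the chosen multiple."""
    return burn_rate(bad_requests, total_requests, slo_target) >= threshold
```

With a 99.9% SLO, 50 bad requests out of 10,000 is a burn rate of roughly 5, which would not page under this threshold; 200 out of 10,000 is roughly 20, which would.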

Step 3: Runbook before you alert

If an alert doesn't have a runbook, the on-call engineer can't handle it without tribal knowledge. That means the alert escalates to whoever owns the service, who gets woken up, who fixes it in 5 minutes because they know exactly what the problem is — and the fix never gets written down.

The rule we implement: no alert without a runbook link. The alert body must contain a link to a runbook that describes what to look at first, how to determine severity, and what the remediation steps are. Not a 40-page doc — a tight procedure. Usually 5–8 steps. Write it the first time someone fixes the issue, not the second.
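One way to enforce the rule is a small check run against the exported alert definitions. This is a sketch under assumptions: the `name` and `body` keys and the `Runbook: <url>` line format are hypothetical conventions, not a feature of any particular monitoring tool.

```python
import re

# Assumed convention: the alert body contains a line like "Runbook: https://..."
RUNBOOK_LINE = re.compile(r"^runbook:\s*https?://\S+", re.IGNORECASE | re.MULTILINE)


def alerts_missing_runbook(alerts: list[dict]) -> list[str]:
    """Return the names of alerts whose body has no runbook link."""
    return [
        alert["name"]
        for alert in alerts
        if not RUNBOOK_LINE.search(alert.get("body", ""))
    ]
```

Running a check like this in CI against the alert config, if your alerts live in version control, would stop a new alert from shipping without its runbook.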

The secondary rule: if the runbook is "run this command and the issue resolves," then automate it. On-call is for situations that require judgment, not for situations that require executing a known fix at 3am.

Step 4: Track MTTD and MTTR, actually

Mean time to detect (MTTD) and mean time to resolve (MTTR) are the two numbers that tell you whether your on-call process is working. Most teams don't track them because they're not wired into the incident management workflow — somebody resolves a PagerDuty alert and that's the end of it, no timestamps captured.

Wire them in. PagerDuty, Opsgenie, and most incident platforms can export MTTD and MTTR per service per month. Start reviewing them in your weekly engineering sync. Within two months of tracking, the pattern becomes clear: a handful of services account for most of the pages and most of the resolution time. Those are your highest-leverage improvement targets.
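If your platform only exports raw timestamps, the two numbers are simple to compute yourself. The sketch below assumes three timestamps per incident and measures MTTR from detection to resolution; some teams measure from the start of impact instead, so treat the field names and that choice as assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    service: str
    started_at: datetime   # when impact began
    detected_at: datetime  # when the page fired
    resolved_at: datetime  # when the incident was closed


def _minutes(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60


def mttd_mttr_by_service(incidents: list[Incident]) -> dict[str, tuple[float, float]]:
    """Map each service to (MTTD, MTTR) in minutes."""
    grouped: dict[str, list[Incident]] = defaultdict(list)
    for incident in incidents:
        grouped[incident.service].append(incident)
    return {
        service: (
            mean(_minutes(i.started_at, i.detected_at) for i in group),
            mean(_minutes(i.detected_at, i.resolved_at) for i in group),
        )
        for service, group in grouped.items()
    }
```

Sorting the result by MTTR is a quick way to surface the handful of services worth fixing first.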

The metric that matters most to individual engineers: how many nights per rotation week did something page overnight? Track it. More than 2 nights per week on average is the point at which engineers start gaming their schedules and quietly updating their résumés.
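Counting disturbed nights, rather than pages, takes a little care because a page at 23:30 and another at 02:10 belong to the same night. A minimal sketch, assuming "overnight" means 22:00 to 06:00 local time (that window is an assumption; adjust it to your team):

```python
from datetime import datetime, timedelta


def is_overnight(t: datetime) -> bool:
    """Assumed overnight window: 22:00 up to (not including) 06:00."""
    return t.hour >= 22 or t.hour < 6


def nights_paged(page_times: list[datetime]) -> int:
    """Count distinct nights with at least one overnight page."""
    # Shifting back 12 hours maps late-evening and early-morning pages
    # from the same night onto the same calendar date.
    nights = {(t - timedelta(hours=12)).date() for t in page_times if is_overnight(t)}
    return len(nights)


def rotation_is_unsustainable(nights_per_week: list[int], limit: float = 2.0) -> bool:
    """True when the average number of disturbed nights per week exceeds the limit."""
    if not nights_per_week:
        return False
    return sum(nights_per_week) / len(nights_per_week) > limit
```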

Step 5: Blameless postmortems for every page that took more than 15 minutes to resolve

Most startups do postmortems after major outages. Few do them for the smaller incidents — the 30-minute production degradation, the failed deployment that required a rollback, the database query that held a lock longer than it should have. But that's where the systemic patterns live.

The format we use is deliberately short: five sections, each one or two paragraphs. Timeline. Impact (user-facing, not just technical). Root cause. Contributing factors. Action items. No blame framing anywhere in the document. The question is always how did the system allow this to happen, not who did what.

The action items from a postmortem should be small and completable. "Improve monitoring" is not an action item. "Add an alert on error rate for the payment API with threshold >1% for >5 minutes and link it to runbook #47" is an action item. Assign it, set a deadline, and check it off in the next postmortem review.
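To show how concrete that example action item is, the alert condition itself fits in a few lines. This sketch assumes one error-rate sample per minute and treats "for 5 minutes" as five consecutive samples above the threshold; the function name and sampling interval are illustrative, not taken from any monitoring product.

```python
def sustained_breach(
    error_rates: list[float], threshold: float = 0.01, for_samples: int = 5
) -> bool:
    """True if the most recent `for_samples` readings all exceed `threshold`."""
    if len(error_rates) < for_samples:
        return False  # not enough history to call it sustained
    return all(rate > threshold for rate in error_rates[-for_samples:])
```

A single dip below 1% inside the window resets the condition, which is what keeps a brief blip from paging anyone.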

After three months of consistent postmortems, the same issues stop appearing. Not because everyone got better — because the action items actually get done, and the system actually changes.

Step 6: Define escalation paths before the incident

When a 2am page turns out to be something the on-call engineer can't resolve alone, who do they call? If the answer is "whoever they can reach" or "DM the team lead and hope," the escalation is going to be slow and stressful.

Define it explicitly: for each service, who is the escalation contact, and at what point does escalation happen? The trigger should be time-based — if I've been on this for 20 minutes and haven't made progress, I escalate — not judgment-based, because judgment at 3am is impaired. The escalation contact should know they're the escalation contact for that service. Ideally they've done a dry run of the relevant scenarios in an incident drill.

Keep the escalation path in the runbook. The on-call engineer opens the runbook, works through the steps, and the last step is: "If unresolved after 20 minutes, escalate to @name via phone call (not Slack)."
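The time-based trigger is deliberately simple enough to write down as code. A minimal sketch, assuming the clock starts when the on-call engineer acknowledges the page (the source only says "on this for 20 minutes", so the starting point is an assumption):

```python
from datetime import datetime, timedelta

ESCALATE_AFTER = timedelta(minutes=20)


def should_escalate(acknowledged_at: datetime, now: datetime, resolved: bool) -> bool:
    """Escalate on elapsed time alone, with no judgment call required."""
    return not resolved and (now - acknowledged_at) >= ESCALATE_AFTER
```

Because the rule depends only on the clock and the incident state, it can also drive an automatic reminder instead of relying on the engineer to notice the time.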

The results from doing this systematically

For the team above: 43 pages/week went to 7 pages/week within six weeks. MTTR dropped from 38 minutes to 12 minutes, because the runbooks existed and the on-call engineers could actually use them. Two engineers who had quietly indicated they were burning out from on-call both said explicitly that the changes made the rotation tolerable again.

None of it required new infrastructure. No observability platform replacement, no new tooling, no migrations. Just an honest audit of what was alerting and why, clear runbooks for the things that actually needed them, and the discipline to run a short postmortem on every incident worth learning from.

"I used to put my phone in another room during my rotation week so my partner couldn't see how often it was going off. Now I leave it on the nightstand. That's the difference."

Where to start if you're in the middle of this right now

Export your last 90 days of alert data. Count how many unique alert types fired. Then count how many of those required a human action that couldn't have been automated or deferred to business hours. That ratio is your baseline.
Pick the three highest-frequency noisy alerts. For each one: what does the on-call engineer actually do? If the answer is "wait, usually it resolves" — retire the alert today.
Find the three services with the highest MTTR. Write a draft runbook for the most common failure mode of each, even if it's rough. Something is better than nothing at 3am.
Set up a monthly 30-minute meeting to review MTTD/MTTR numbers and postmortem action item status. This meeting is the forcing function that makes everything else happen.
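The first two steps can be run against an alert export in a few lines. This is a sketch under assumptions: each firing is a record with hypothetical `alert` and `needed_human` fields, and an alert type counts as actionable if any of its firings needed a human.

```python
from collections import Counter


def audit_baseline(firings: list[dict]) -> tuple[float, list[tuple[str, int]]]:
    """Return (actionable ratio of alert types, top three noisy alert types)."""
    counts = Counter(f["alert"] for f in firings)
    actionable = {f["alert"] for f in firings if f["needed_human"]}
    ratio = len(actionable) / len(counts) if counts else 0.0
    # Noisy alerts: types that fired but never needed a human.
    noisy = Counter({name: n for name, n in counts.items() if name not in actionable})
    return ratio, noisy.most_common(3)
```

The ratio is the baseline from the first step, and the second value is the shortlist of alerts to question in the second.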

The rotation won't be fixed in a day. But if you do one alert audit and write three runbooks this week, next month's on-call will be measurably better than this month's. That's the goal — not perfection, just a direction.

