- An SOP tells you how a process should work; a runbook tells you exactly what to do when something breaks. Most teams blur the two and end up with neither.
- The best runbooks read like a recipe under pressure: numbered steps, real commands, screenshots, and a clear 'when to escalate' line.
- Documentation rots fast. Build in an owner, a review date, and a way to update it without a half-day rewrite.
- Capture the steps while you're actually doing the task, not from memory a week later, when the details are already fuzzy.
It's 2 a.m. The on-call engineer's phone is buzzing. A payment service is down, and the one person who knows how to restart it cleanly is asleep in another time zone. There's a wiki page, technically. It says "see Dave for details." Dave left in March.
We've all worked somewhere like this. The fix isn't more heroics. It's a real runbook that a tired stranger can follow at 2 a.m. without guessing. That's the whole job of this kind of documentation: turn one person's head knowledge into something the team can run.
So let's talk about how to document SOPs and runbooks properly. Not the version that sounds nice in an audit and helps nobody. The version people actually open.
SOP vs. runbook: not the same thing
People use these words like they're interchangeable. They're not, and the confusion causes real damage.
An SOP (standard operating procedure) describes how a process should work, start to finish. It's the policy-flavored document. "All production database changes must be peer-reviewed and applied during the change window." It answers what we do and why.
A runbook is the hands-on guide for one specific task or failure. "Database CPU is pegged at 100%. Here's how to find the bad query and kill it." It answers "what do I type, right now, to fix this?"
Think of the SOP as the driving rules and the runbook as the GPS turn-by-turn. You need both, but you reach for them at different moments. Nobody reads a 12-page SOP while a server is on fire.
Why most IT runbooks fail
We've read a lot of bad runbooks. The failures rhyme. Here's what most teams get wrong.
They're written from memory. Someone sits down a week after solving a problem and tries to reconstruct the steps. The result skips the obvious-at-the-time details, which are exactly the details a newcomer needs.
They explain instead of instruct. Three paragraphs on how the caching layer works, zero commands to actually flush it. Background is fine, but the steps have to be findable in two seconds.
They have no owner. A runbook nobody owns is a runbook nobody updates. Within six months the commands are wrong and the screenshots show an interface that no longer exists.
They assume too much. "Restart the service." Which service? On which host? With what command? "Obvious" is doing a lot of heavy lifting there, and at 2 a.m. nothing is obvious.
They live in five places. Half in a wiki, half in Slack threads, a bit in someone's Notion, a critical piece in a Google Doc only Dave could edit. People give up looking and just ping whoever's online.
What a good runbook actually contains
A good runbook is boring in the best way. Predictable structure, no surprises. Here's what earns its place.
- A title that's searchable. "Restart payment-api after OOM crash," not "Payment stuff." People search in a panic. Match the words they'll type.
- When to use this. One line describing the trigger. The alert that fires, the symptom, the ticket type. This stops people from running the wrong runbook.
- Prerequisites. Access, VPN, permissions, tools. Nothing kills momentum like discovering on step 4 that you don't have the right credentials.
- Numbered steps with real commands. Exact commands, exact button names, exact file paths. Copy-paste ready. Screenshots where a UI is involved.
- Expected output. What does success look like after each risky step? "You should see active (running)." Otherwise people can't tell if it worked.
- Rollback. If a step makes things worse, how do you undo it? This is the part everyone forgets and everyone needs.
- Escalation. A hard line: if you're not unblocked in X minutes, page Y. No hero should burn an hour alone.
- Owner and last-reviewed date. So readers know whether to trust it.
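If your runbooks live somewhere scriptable, the checklist above can be sketched as a small data model with a gap check. This is a minimal illustration, not a standard: the class names, field names, and the `find_gaps` helper are all our own assumptions, and the sample restart command and alert wording are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    action: str          # exact command or UI action, copy-paste ready
    expected: str = ""   # what success looks like, e.g. "active (running)"
    rollback: str = ""   # how to undo the step if it makes things worse
    risky: bool = False  # risky steps should carry expected output and rollback


@dataclass
class Runbook:
    title: str
    trigger: str         # the "when to use this" line
    owner: str
    last_reviewed: str   # kept as plain text in this sketch
    escalation: str      # e.g. "not unblocked in 15 minutes: page on-call"
    prerequisites: list[str] = field(default_factory=list)
    steps: list[Step] = field(default_factory=list)


def find_gaps(rb: Runbook) -> list[str]:
    """Return a readable list of the parts this runbook is missing."""
    gaps = []
    for name in ("title", "trigger", "owner", "last_reviewed", "escalation"):
        if not getattr(rb, name).strip():
            gaps.append(f"missing {name}")
    if not rb.steps:
        gaps.append("no steps")
    for number, step in enumerate(rb.steps, start=1):
        if step.risky and not step.expected.strip():
            gaps.append(f"step {number}: no expected output")
        if step.risky and not step.rollback.strip():
            gaps.append(f"step {number}: no rollback")
    return gaps


# Hypothetical runbook: the owner is blank and the risky step has no rollback.
example = Runbook(
    title="Restart payment-api after OOM crash",
    trigger="OOM alert fires for payment-api",
    owner="",
    last_reviewed="2024-01-15",
    escalation="Not unblocked in 15 minutes: page the payments on-call",
    steps=[
        Step(
            action="sudo systemctl restart payment-api",
            expected="active (running)",
            risky=True,
        )
    ],
)
print(find_gaps(example))  # ['missing owner', 'step 1: no rollback']
```

Prerequisites are deliberately not enforced here, since some runbooks genuinely need none. The point is only that the "forgotten" parts, rollback and ownership, are cheap to check for mechanically.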
Screenshots matter more than people admit. A line of text says "open the deployment settings." A screenshot with the exact menu circled removes all doubt. The catch is that screenshots go stale and are a pain to keep current. This is one spot where WriteHow earns its keep: you record yourself doing the task once, and it turns the recording into step-by-step text with auto-captured screenshots and annotations, so updating a runbook means re-recording instead of re-screenshotting by hand. The manual screenshot step is the part that quietly kills documentation, so removing it helps a lot.
The IT runbook template
Here's a template you can copy straight into your wiki or docs tool. Keep it tight. If a section doesn't apply, delete it rather than padding it.
IT Runbook Template
- Title: [Action + system + trigger, e.g. "Restart payment-api after OOM crash"]
- Owner: [Name or team]
- Last reviewed: [Date] | Review every: [90 days / quarter]
- Severity / priority: [P1 / P2 / P3]
1. When to use this
[The alert, symptom, or ticket type that triggers this runbook. One or two lines.]
2. Prerequisites
- [Access / role needed, e.g. prod SSH or admin console]
- [VPN or network requirement]
- [Tools or CLI installed]
3. Steps
- [Exact action or command.] Expected result: [what you should see.]
- [Next action or command.] Expected result: [what you should see.]
- [Continue. Add a screenshot for any UI step.]
4. Verify it worked
- [The check that confirms the issue is resolved, e.g. health endpoint returns 200.]
5. Rollback
- [How to undo each risky step if things get worse.]
6. Escalation
[If not resolved in __ minutes, page __ (name / on-call rotation / channel). Link to the incident process.]
7. Related links
- [Dashboards, parent SOP, related runbooks, post-incident docs.]
Notice there's no fluff section for "introduction" or "purpose." A runbook is read under pressure. Every line that isn't a step or a safety net is a line someone has to scroll past while the clock runs.
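If copying the template by hand feels like friction, one option is to stamp out the skeleton with a few lines of code. The sketch below is only an illustration under our own assumptions: the `new_runbook` function, its parameters, and the "Payments on-call" owner are hypothetical, and the section names simply mirror the template above.

```python
from datetime import date
from typing import Optional

# Skeleton mirroring the template: header fields first, then the seven sections.
SKELETON = """\
- Title: {title}
- Owner: {owner}
- Last reviewed: {reviewed} | Review every: {review_days} days
- Severity / priority: {severity}
1. When to use this
2. Prerequisites
3. Steps
4. Verify it worked
5. Rollback
6. Escalation
7. Related links
"""


def new_runbook(
    title: str,
    owner: str,
    severity: str = "P2",
    review_days: int = 90,
    today: Optional[date] = None,
) -> str:
    """Return an empty runbook with the owner and review date already filled in."""
    if not title.strip() or not owner.strip():
        raise ValueError("a runbook needs a title and an owner")
    reviewed = (today or date.today()).isoformat()
    return SKELETON.format(
        title=title.strip(),
        owner=owner.strip(),
        reviewed=reviewed,
        review_days=review_days,
        severity=severity,
    )


if __name__ == "__main__":
    print(new_runbook("Restart payment-api after OOM crash", "Payments on-call"))
```

Refusing to create a runbook without an owner is a design choice, not a requirement. It just makes the "no owner" failure impossible on day one.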
A writing process that doesn't waste a day
The reason runbooks don't get written isn't that people don't care. It's that writing them the traditional way is slow and miserable. So here's a process that fits in the gaps.
- Write it while you do it. The next time you handle the task, capture as you go. A screen recording, or just narrate each step into a doc. Memory is a liar; the live run is the truth.
- Draft the steps first, prose later. Get the numbered actions down before you worry about phrasing. The steps are the product. Polish is optional.
- Add the safety nets. Now go back and fill in expected output, rollback, and escalation. These take five minutes and save hours.
- Hand it to someone who's never done it. Watch them follow it without help. Every place they pause or ask a question is a gap in the doc. Fix those, then ship.
- Put it where people already look. If your runbooks live somewhere nobody opens, they don't exist. Publish into the same wiki or knowledge base the team uses every day.
That last point is where a tool like WriteHow can save a step or two, since it publishes the finished guide straight into Zendesk, Notion, Confluence, or GitBook instead of you copy-pasting screenshots across tools. But the principle stands no matter what you use: capture live, keep it where people work.
Keeping it current after you hit publish
Here's the uncomfortable truth. A runbook is never finished. Systems change, commands change, that one button moves to a new menu. A wrong runbook is worse than no runbook, because it sends people confidently down the wrong path.
So build maintenance in from the start.
- Every runbook gets an owner. A person or a team, named at the top. Shared ownership is no ownership.
- Set a review date. Quarterly is reasonable for most. Critical incident runbooks, more often. Put the date in the doc so its age is visible.
- Update right after you use it. The best moment to fix a runbook is the second you notice a step is wrong, mid-incident. Leave a note even if you can't fix it then.
- Run a game day. Once in a while, trigger a controlled failure and have someone follow the runbook cold. You'll find the rot fast, on your schedule instead of the outage's.
- Track which ones get opened. If a runbook hasn't been viewed in a year, either nobody needs it or nobody can find it. Both are worth knowing.
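Several of these habits can be checked on a schedule rather than remembered. Here's a rough sketch, assuming you can export each runbook's owner, last-reviewed date, and last-viewed date from your docs tool. The `RunbookRecord` fields and the `needs_attention` function are hypothetical names, and the 365-day threshold is just the "hasn't been viewed in a year" rule of thumb from above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class RunbookRecord:
    title: str
    owner: str
    last_reviewed: date
    review_every_days: int = 90         # quarterly by default
    last_viewed: Optional[date] = None  # None means no recorded views


def needs_attention(
    rb: RunbookRecord, today: date, unviewed_after_days: int = 365
) -> list[str]:
    """Return the reasons this runbook should be looked at, if any."""
    reasons = []
    if not rb.owner.strip():
        reasons.append("no owner")
    due = rb.last_reviewed + timedelta(days=rb.review_every_days)
    if today > due:
        reasons.append(f"review overdue by {(today - due).days} days")
    if rb.last_viewed is None:
        reasons.append("never opened")
    elif (today - rb.last_viewed).days > unviewed_after_days:
        reasons.append(f"not opened in the last {unviewed_after_days} days")
    return reasons


if __name__ == "__main__":
    record = RunbookRecord(
        title="Restart payment-api after OOM crash",
        owner="Payments on-call",
        last_reviewed=date(2024, 1, 1),
        last_viewed=date(2024, 5, 1),
    )
    for reason in needs_attention(record, today=date(2024, 6, 1)):
        print(f"{record.title}: {reason}")
```

A report like this won't fix a stale runbook, but it puts the rot in front of the owner before the next outage does.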
None of this is glamorous. But the team that documents its SOPs and runbooks properly is the team where 2 a.m. is a 15-minute fix instead of a four-hour scramble. And nobody has to find Dave.