IT SOPs and Runbooks: How to Document Them Properly

TL;DR

An SOP tells you how a process should work; a runbook tells you exactly what to do when something breaks. Most teams blur the two and end up with neither.
The best runbooks read like a recipe under pressure: numbered steps, real commands, screenshots, and a clear 'when to escalate' line.
Documentation rots fast. Build in an owner, a review date, and a way to update it without a half-day rewrite.
Capture the steps while you're actually doing the task, not from memory a week later, when the details are already fuzzy.

It's 2 a.m. The on-call engineer's phone is buzzing. A payment service is down, and the one person who knows how to restart it cleanly is asleep in another time zone. There's a wiki page, technically. It says "see Dave for details." Dave left in March.

We've all worked somewhere like this. The fix isn't more heroics. It's a real runbook that a tired stranger can follow at 2 a.m. without guessing. That's the whole job of this kind of documentation: turn one person's head knowledge into something the team can run.

So let's talk about how to document SOPs and runbooks properly. Not the version that sounds nice in an audit and helps nobody. The version people actually open.

SOP vs. runbook: not the same thing

People use these words like they're interchangeable. They're not, and the confusion causes real damage.

An SOP (standard operating procedure) describes how a process should work, start to finish. It's the policy-flavored document. "All production database changes must be peer-reviewed and applied during the change window." It answers what we do and why.

A runbook is the hands-on guide for one specific task or failure. "Database CPU is pegged at 100%. Here's how to find the bad query and kill it." It answers "What do I type, right now, to fix this?"

Think of the SOP as the driving rules and the runbook as the GPS turn-by-turn. You need both, but you reach for them at different moments. Nobody reads a 12-page SOP while a server is on fire.

Pro tip: A quick gut check. If the document would be read calmly during planning, it's an SOP. If it'll be read by someone whose hands are shaking, it's a runbook. Write them differently.

Why most IT runbooks fail

We've read a lot of bad runbooks. The failures rhyme. Here's what most teams get wrong.

They're written from memory. Someone sits down a week after solving a problem and tries to reconstruct the steps. The result skips the obvious-at-the-time details, which are exactly the details a newcomer needs.

They explain instead of instruct. Three paragraphs on how the caching layer works, zero commands to actually flush it. Background is fine, but the steps have to be findable in two seconds.

They have no owner. A runbook nobody owns is a runbook nobody updates. Within six months the commands are wrong and the screenshots show an interface that no longer exists.

They assume too much. "Restart the service." Which service? On which host? With what command? "Obvious" is doing a lot of heavy lifting there, and at 2 a.m. nothing is obvious.

They live in five places. Half in a wiki, half in Slack threads, a bit in someone's Notion, a critical piece in a Google Doc only Dave could edit. People give up looking and just ping whoever's online.

Common mistake: Writing the runbook for the person who already knows the system. The whole point is the person who doesn't. If your most junior teammate can't follow it alone, it isn't done.

What a good runbook actually contains

A good runbook is boring in the best way. Predictable structure, no surprises. Here's what earns its place.

A title that's searchable. "Restart payment-api after OOM crash," not "Payment stuff." People search in a panic. Match the words they'll type.
When to use this. One line describing the trigger. The alert that fires, the symptom, the ticket type. This stops people from running the wrong runbook.
Prerequisites. Access, VPN, permissions, tools. Nothing kills momentum like discovering on step 4 that you don't have the right credentials.
Numbered steps with real commands. Exact commands, exact button names, exact file paths. Copy-paste ready. Screenshots where a UI is involved.
Expected output. What does success look like after each risky step? "You should see active (running)." Otherwise people can't tell if it worked.
Rollback. If a step makes things worse, how do you undo it? This is the part everyone forgets and everyone needs.
Escalation. A hard line: if you're not unblocked in X minutes, page Y. No hero should burn an hour alone.
Owner and last-reviewed date. So readers know whether to trust it.
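
If your runbooks live somewhere scriptable, the checklist above can be turned into a quick completeness check. What follows is a minimal sketch, not a standard: the `Runbook` class, its field names, and the "Payments on-call" owner are all made up for illustration and simply mirror the list.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Runbook:
    """Hypothetical record mirroring the checklist; field names are illustrative."""
    title: str = ""
    when_to_use: str = ""
    prerequisites: list[str] = field(default_factory=list)
    steps: list[str] = field(default_factory=list)
    expected_output: str = ""
    rollback: str = ""
    escalation: str = ""
    owner: str = ""
    last_reviewed: str = ""


def missing_sections(runbook: Runbook) -> list[str]:
    """Return the names of checklist sections that are empty or whitespace-only."""
    missing = []
    for f in fields(runbook):
        value = getattr(runbook, f.name)
        # Strings count as empty when blank; lists count as empty when they have no items.
        empty = not value.strip() if isinstance(value, str) else len(value) == 0
        if empty:
            missing.append(f.name)
    return missing


# A half-finished draft: only title, steps, and owner are filled in (placeholder values).
draft = Runbook(
    title="Restart payment-api after OOM crash",
    steps=["<exact restart command goes here>"],
    owner="Payments on-call",
)
print(missing_sections(draft))
```

A check like this only proves the sections exist, not that they're any good. It catches the forgotten rollback, which is the common failure, and leaves the quality judgment to a human reviewer.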

Screenshots matter more than people admit. A line of text says "open the deployment settings." A screenshot with the exact menu circled removes all doubt. The catch is that screenshots go stale and are a pain to keep current. This is one spot where WriteHow earns its keep. You record yourself doing the task once, and it turns the recording into step-by-step text with auto-captured screenshots and annotations. Updating a runbook then means re-recording instead of re-screenshotting by hand. Manual screenshots are the chore that quietly kills documentation, so removing that chore makes updates far more likely to happen.

The IT runbook template

Here's a template you can copy straight into your wiki or docs tool. Keep it tight. If a section doesn't apply, delete it rather than padding it.

IT Runbook Template

Title: [Action + system + trigger, e.g. "Restart payment-api after OOM crash"]
Owner: [Name or team]
Last reviewed: [Date] | Review every: [90 days / quarter]
Severity / priority: [P1 / P2 / P3]

1. When to use this

[The alert, symptom, or ticket type that triggers this runbook. One or two lines.]

2. Prerequisites

[Access / role needed, e.g. prod SSH or admin console]
[VPN or network requirement]
[Tools or CLI installed]

3. Steps

[Exact action or command.] Expected result: [what you should see.]
[Next action or command.] Expected result: [what you should see.]
[Continue. Add a screenshot for any UI step.]

4. Verify it worked

[The check that confirms the issue is resolved, e.g. health endpoint returns 200.]

5. Rollback

[How to undo each risky step if things get worse.]

6. Escalation

[If not resolved in __ minutes, page __ (name / on-call rotation / channel). Link to the incident process.]

7. Related links

[Dashboards, parent SOP, related runbooks, post-incident docs.]

Notice there's no fluff section for "introduction" or "purpose." A runbook is read under pressure. Every line that isn't a step or a safety net is a line someone has to scroll past while the clock runs.
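
One cheap guard when you copy the template: make sure nobody publishes it with the bracketed slots still in place. Here's a rough sketch of that check. It assumes a finished runbook contains no other square-bracketed text, which won't hold if your steps include commands that use brackets, so treat the hits as prompts to look rather than hard errors. The function name and the sample draft are illustrative.

```python
import re

# Matches square-bracket placeholders such as "[Name or team]" from the template.
# Assumption: a finished runbook has no other square-bracketed text on a single line.
PLACEHOLDER = re.compile(r"\[[^\[\]\n]+\]")


def unfilled_placeholders(text: str) -> list[tuple[int, str]]:
    """Return (line_number, placeholder) pairs for every leftover template slot."""
    found = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in PLACEHOLDER.finditer(line):
            found.append((line_number, match.group(0)))
    return found


# A draft where the title is filled in but the owner and review date are not.
draft = """Title: Restart payment-api after OOM crash
Owner: [Name or team]
Last reviewed: [Date] | Review every: 90 days
"""

for line_number, placeholder in unfilled_placeholders(draft):
    print(f"line {line_number}: {placeholder}")
```

Run against the sample draft, this flags the owner slot on line 2 and the date slot on line 3.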

A writing process that doesn't waste a day

The reason runbooks don't get written isn't that people don't care. It's that writing them the traditional way is slow and miserable. So here's a process that fits in the gaps.

Write it while you do it. The next time you handle the task, capture as you go. A screen recording, or just narrate each step into a doc. Memory is a liar; the live run is the truth.
Draft the steps first, prose later. Get the numbered actions down before you worry about phrasing. The steps are the product. Polish is optional.
Add the safety nets. Now go back and fill in expected output, rollback, and escalation. These take five minutes and save hours.
Hand it to someone who's never done it. Watch them follow it without help. Every place they pause or ask a question is a gap in the doc. Fix those, then ship.
Put it where people already look. If your runbooks live somewhere nobody opens, they don't exist. Publish into the same wiki or knowledge base the team uses every day.

That last point is where a tool like WriteHow can save a step or two, since it publishes the finished guide straight into Zendesk, Notion, Confluence, or GitBook instead of you copy-pasting screenshots across tools. But the principle stands no matter what you use: capture live, keep it where people work.

Pro tip: The "hand it to a stranger" test is the single highest-value thing on this list. A runbook that's never been followed by a second person is a hypothesis, not a procedure.

Keeping it current after you hit publish

Here's the uncomfortable truth. A runbook is never finished. Systems change, commands change, that one button moves to a new menu. A wrong runbook is worse than no runbook, because it sends people confidently down the wrong path.

So build maintenance in from the start.

Every runbook gets an owner. A person or a team, named at the top. Shared ownership is no ownership.
Set a review date. Quarterly is reasonable for most. Critical incident runbooks, more often. Put the date in the doc so its age is visible.
Update right after you use it. The best moment to fix a runbook is the second you notice a step is wrong, mid-incident. Leave a note even if you can't fix it then.
Run a game day. Once in a while, trigger a controlled failure and have someone follow the runbook cold. You'll find the rot fast, on your schedule instead of the outage's.
Track which ones get opened. If a runbook hasn't been viewed in a year, either nobody needs it or nobody can find it. Both are worth knowing.
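
The review-date habit is easy to automate once the date lives in the doc. Below is a small sketch of an overdue check, assuming you can pull each runbook's last-reviewed date and review interval from wherever it's stored. The inventory list, the second runbook title, and the dates are invented for the example.

```python
from datetime import date, timedelta
from typing import Optional


def review_due_date(last_reviewed: date, review_every_days: int = 90) -> date:
    """Date by which the next review should have happened."""
    return last_reviewed + timedelta(days=review_every_days)


def is_overdue(last_reviewed: date, review_every_days: int = 90,
               today: Optional[date] = None) -> bool:
    """True when the review due date has already passed."""
    today = today or date.today()
    return today > review_due_date(last_reviewed, review_every_days)


# Hypothetical inventory: (title, last reviewed, review interval in days).
runbooks = [
    ("Restart payment-api after OOM crash", date(2024, 1, 15), 90),
    ("Find and kill a runaway database query", date(2024, 5, 1), 30),
]

check_date = date(2024, 5, 20)  # fixed date so the example is reproducible
for title, last_reviewed, interval in runbooks:
    if is_overdue(last_reviewed, interval, today=check_date):
        due = review_due_date(last_reviewed, interval)
        print(f"OVERDUE: {title} (was due {due})")
```

With these sample dates only the first runbook is flagged, since the second is still inside its 30-day window. A shorter interval for critical incident runbooks is just a smaller number in the third column.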

None of this is glamorous. But the team that documents its SOPs and runbooks properly is the team where 2 a.m. is a 15-minute fix instead of a four-hour scramble. And nobody has to find Dave.

Frequently asked questions

What is the difference between an SOP and a runbook in IT?

An SOP describes how a process should work overall, including the rules and the reasoning behind it. A runbook is a focused, step-by-step guide for handling one specific task or failure, written to be followed quickly under pressure. You use an SOP during planning and a runbook during an incident.

What should every IT runbook include?

At minimum: a searchable title, a clear trigger for when to use it, prerequisites like access and tools, numbered steps with exact commands, the expected output for each risky step, rollback instructions, an escalation path, and an owner with a last-reviewed date. Screenshots help a lot for any step that involves a user interface.

How often should you update IT runbooks?

Review most runbooks at least quarterly, and review critical incident runbooks more often. Beyond the schedule, update a runbook the moment you notice a step is wrong while using it. A stale runbook can be worse than none because it sends people down the wrong path with confidence.

How do you write a runbook quickly without it taking all day?

Capture the steps while you actually perform the task instead of from memory later. Record your screen or narrate each step, draft the numbered actions first, then add safety nets like rollback and escalation. Finally, have someone who has never done the task follow it to find the gaps.

Where should IT runbooks be stored?

Keep them in the single knowledge base or wiki your team already uses every day, not scattered across Slack threads, personal docs, and email. Centralizing them in one searchable place is what makes people actually find and follow them during an incident.

Divya Krishnan · Growth Marketer at WriteHow
Writes about documentation, customer support, and SEO.
