[KNOWLEDGE HUB] OPERATIONS RESOURCES

Operations Resources.

Practical references for the people on the bridge at 02:14 — severity classification, pre-call checklists, escalation workflows, and decision frameworks. Useful whether or not you ever pick up the phone.

[01]

Incident Severity Classification

Severity is decided by blast radius and recoverability, not by alert volume. A loud alert on a redundant subsystem is SEV-3. A silent loss of redundancy on production storage is SEV-1. Use these definitions to classify before you call.

SEV-1 · Production Down

Active business impact. Site/app offline, customer-facing failure, revenue or safety loss. Trigger: immediate engineer dispatch, parallel remote and on-site work, executive notification path engaged.

SEV-2 · Degraded, Redundancy Lost

User impact contained but the system cannot survive a second fault. One host down with HA running, one storage path lost, one PSU dead, one controller offline. Trigger: engineer engagement within 15 minutes, on-site if hardware fault suspected.

SEV-3 · Recovering Or Recoverable

Active resync, missed backup window, predictive failure alerts, single non-critical service degraded with workaround in place. Trigger: scheduled engineer engagement, monitored remotely, change ticket through normal channels.

SEV-4 · Advisory / Planned

Certificate expiry, firmware lag, capacity warning, decommission, asset audit. Trigger: maintenance-window planning, scheduled visit, batched with adjacent work where possible.
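The four definitions above can be read as an ordered decision: check for production impact first, then lost redundancy, then recoverable conditions, and treat everything else as advisory. The sketch below is one possible encoding, not an official tool. The `Incident` fields and the `classify` function are hypothetical names introduced for illustration, and alert volume is deliberately not an input.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # Production Down
    SEV2 = 2  # Degraded, Redundancy Lost
    SEV3 = 3  # Recovering Or Recoverable
    SEV4 = 4  # Advisory / Planned


@dataclass(frozen=True)
class Incident:
    # Hypothetical fields: assumptions made for this sketch, not part of
    # the definitions themselves.
    production_impact: bool = False  # site/app offline, revenue or safety loss
    redundancy_lost: bool = False    # system cannot survive a second fault
    recoverable_issue: bool = False  # resync, missed backup, predictive alert,
                                     # degraded service with a workaround


def classify(incident: Incident) -> Severity:
    """Return the most severe level whose condition is met."""
    if incident.production_impact:
        return Severity.SEV1
    if incident.redundancy_lost:
        return Severity.SEV2
    if incident.recoverable_issue:
        return Severity.SEV3
    return Severity.SEV4


# One PSU dead with the host still serving traffic: no user impact,
# but no tolerance for a second fault.
print(classify(Incident(redundancy_lost=True)).name)  # prints "SEV2"
```

Checking the conditions from most to least severe means an incident that matches several definitions always lands on the highest applicable level.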

[02]

Universal Pre-Call Checklist

Eight items. Stage them before you call. Each one removes a minute of triage on the bridge and lets the on-call engineer start real diagnostic work the moment the call connects.

01 System identifier: hostname, service tag, asset ID, VM name, FQDN.
02 Physical location: facility, suite, cage, rack, U position, PDU side.
03 Exact alert text or error condition. Screenshots over paraphrasing.
04 Last known good state: timestamp of the last green dashboard.
05 Changes in the previous 24–72 hours: patches, firmware, network, certificates.
06 Backup posture: last successful job, type, retention, and whether a restore has been tested.
07 Authorized actions: read-only, power cycle, replacement, configuration changes.
08 Decision-maker names and contact paths if escalation crosses authority.
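
If the checklist is staged in a ticket template or a script, a small record type makes gaps visible before the call. The following is a minimal sketch under that assumption. `PreCallChecklist`, its field names, and `missing_items` are hypothetical names chosen to mirror the eight items above.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class PreCallChecklist:
    # One optional free-text field per checklist item, in the same order.
    system_identifier: Optional[str] = None
    physical_location: Optional[str] = None
    alert_text: Optional[str] = None
    last_known_good: Optional[str] = None
    recent_changes: Optional[str] = None
    backup_posture: Optional[str] = None
    authorized_actions: Optional[str] = None
    decision_makers: Optional[str] = None


def missing_items(checklist: PreCallChecklist) -> List[str]:
    """Return the names of items that are unset or contain only whitespace."""
    return [
        f.name
        for f in fields(checklist)
        if not (getattr(checklist, f.name) or "").strip()
    ]


# Illustrative values only; the hostname and location are made up.
staged = PreCallChecklist(
    system_identifier="db-01.example.internal",
    physical_location="Suite 2, rack 14, U22, PDU side A",
)
print(missing_items(staged))
```

An empty list from `missing_items` means all eight items are staged. Anything it returns is what the engineer would otherwise have to ask for on the bridge.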

[03]