Operations Resources.

Practical references for the people on the bridge at 02:14 — severity classification, pre-call checklists, escalation workflows, and decision frameworks. Useful whether or not you ever pick up the phone.

[01]

Incident Severity Classification

Severity is decided by blast radius and recoverability, not by alert volume. A loud alert on a redundant subsystem is SEV-3. A silent loss of redundancy on production storage is SEV-2, because the system cannot survive a second fault. Use these definitions to classify before you call.

SEV-1 · Production Down

Active business impact. Site/app offline, customer-facing failure, revenue or safety loss. Trigger: immediate engineer dispatch, parallel remote and on-site work, executive notification path engaged.

SEV-2 · Degraded, Redundancy Lost

User impact contained but the system cannot survive a second fault. One host down with HA running, one storage path lost, one PSU dead, one controller offline. Trigger: engineer engagement within 15 minutes, on-site if hardware fault suspected.

SEV-3 · Recovering Or Recoverable

Active resync, missed backup window, predictive failure alerts, single non-critical service degraded with workaround in place. Trigger: scheduled engineer engagement, monitored remotely, change ticket through normal channels.

SEV-4 · Advisory / Planned

Certificate expiry, firmware lag, capacity warning, decommission, asset audit. Trigger: maintenance-window planning, scheduled visit, batched with adjacent work where possible.
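
The four definitions above can be read as an ordered set of checks, worst case first. The following is a minimal sketch under that assumption, not a definitive implementation: the `Incident` field names are hypothetical paraphrases of the definitions, and `alert_count` is carried only to show that alert volume plays no part in the decision.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = "SEV-1 · Production Down"
    SEV2 = "SEV-2 · Degraded, Redundancy Lost"
    SEV3 = "SEV-3 · Recovering Or Recoverable"
    SEV4 = "SEV-4 · Advisory / Planned"


@dataclass
class Incident:
    # Hypothetical field names that paraphrase the definitions above.
    active_business_impact: bool = False    # site/app offline, customer-facing failure
    redundancy_lost: bool = False           # system cannot survive a second fault
    recovering_or_workaround: bool = False  # resync, missed backup, predictive alert
    alert_count: int = 0                    # recorded, deliberately not used below


def classify(incident: Incident) -> Severity:
    """Check blast radius first, then recoverability; ignore alert volume."""
    if incident.active_business_impact:
        return Severity.SEV1
    if incident.redundancy_lost:
        return Severity.SEV2
    if incident.recovering_or_workaround:
        return Severity.SEV3
    return Severity.SEV4


# One dead PSU with no user impact: quiet, but redundancy is gone.
print(classify(Incident(redundancy_lost=True, alert_count=1)).value)
# A predictive failure alert that pages repeatedly is still SEV-3.
print(classify(Incident(recovering_or_workaround=True, alert_count=500)).value)
```

Ordering the checks from most to least severe means an incident that matches several definitions is always classified at the higher severity.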

[02]

Universal Pre-Call Checklist

Eight items. Stage them before you call. Each one removes a minute of triage on the bridge and lets the on-call engineer start real diagnostic work the moment the call connects.

  1. System identifier — hostname, service tag, asset ID, VM name, FQDN.
  2. Physical location — facility, suite, cage, rack, U position, PDU side.
  3. Exact alert text or error condition. Screenshots over paraphrasing.
  4. Last known good state — timestamp of the last green dashboard.
  5. Changes in previous 24–72 hours — patches, firmware, network, certificates.
  6. Backup posture — last successful job, type, retention, tested restore?
  7. Authorized actions — read-only, power cycle, replacement, configuration changes.
  8. Decision-maker names and contact paths if escalation crosses authority.
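
One way to stage the eight items is a small record that reports what is still missing before the call is placed. This is a sketch only: the class name, field names, and example values are hypothetical, chosen to mirror the list above.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class PreCallChecklist:
    # One field per checklist item, in the same order as the list above.
    system_identifier: Optional[str] = None
    physical_location: Optional[str] = None
    alert_text: Optional[str] = None
    last_known_good: Optional[str] = None
    recent_changes: Optional[str] = None
    backup_posture: Optional[str] = None
    authorized_actions: Optional[str] = None
    decision_makers: Optional[str] = None


def missing_items(checklist: PreCallChecklist) -> List[str]:
    """Return the names of items that are unset or blank."""
    return [
        f.name
        for f in fields(checklist)
        if not (getattr(checklist, f.name) or "").strip()
    ]


# Hypothetical example values.
staged = PreCallChecklist(
    system_identifier="esx-host-07",
    alert_text="Path redundancy to storage device degraded",
)
print(missing_items(staged))
```

An empty result means all eight items are staged and the call can start with diagnosis rather than triage.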

[03]

Escalation Workflow

00:00 · Detect

Monitoring alert fires or user reports degradation. Severity classified using the matrix above.

00:01 · Stage

Pull the eight checklist items into the channel. Page the on-call.

00:02 · Call

Live engineer pickup in under 60s. Bridge opens, severity confirmed, dispatch decision made.

00:15 · Roll

Second engineer rolling from Ashburn staging with the relevant parts cart. Remote diagnosis begins in parallel.

01:00 · On-Site

Engineer in the cage. Console access established. Actions logged and photo-documented.
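
The timeline above can be tracked as a set of target offsets from the moment of detection. The sketch below assumes the timestamps are elapsed time in HH:MM (so 00:15 is fifteen minutes and 01:00 is one hour); the function names and the example clock times are hypothetical.

```python
from datetime import datetime, timedelta

# Target offsets from detection, read from the timeline above as HH:MM.
ESCALATION_TARGETS = [
    ("Detect", timedelta(minutes=0)),
    ("Stage", timedelta(minutes=1)),
    ("Call", timedelta(minutes=2)),
    ("Roll", timedelta(minutes=15)),
    ("On-Site", timedelta(hours=1)),
]


def deadlines(detected_at):
    """Return (stage, due time) pairs for an incident detected at `detected_at`."""
    return [(stage, detected_at + offset) for stage, offset in ESCALATION_TARGETS]


def overdue_stages(detected_at, completed, now):
    """Return stages whose target time has passed and that are not completed."""
    return [
        stage
        for stage, due in deadlines(detected_at)
        if due <= now and stage not in completed
    ]


# Alert detected at 02:14; at 02:30 the first three stages are done.
detected = datetime(2024, 1, 1, 2, 14)
now = datetime(2024, 1, 1, 2, 30)
print(overdue_stages(detected, {"Detect", "Stage", "Call"}, now))  # ['Roll']
```

In the example, Roll was due at 02:29 and is flagged, while On-Site is not due until 03:14.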

[04]

Vendor Support vs Emergency Engineering

They are not substitutes — they are complements. Use both for the right reasons. Confusing them is the most common cause of extended outages we are called in to recover.

Dimension          | Vendor Support                              | Emergency Engineering
Best for           | Code-level bugs, warranty, license, RMA     | Operational recovery, on-site work, decision support
Response time      | Hours to days depending on contract tier    | < 60 seconds to a senior engineer
Physical presence  | None                                        | On-site < 60 min in NOVA core
Scope of authority | Limited to their product                    | Whole-stack: hypervisor, storage, network, OS
Cost shape         | Annual contract                             | Per-incident or retainer
When to use        | After stabilization, for root cause         | During the incident, for stabilization and recovery
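
The "Best for" and "When to use" rows can be combined into a simple routing rule. This is an illustrative sketch, not a prescribed procedure: the need labels and the `first_call` function are hypothetical names that paraphrase the table.

```python
VENDOR_SUPPORT = "Vendor Support"
EMERGENCY_ENGINEERING = "Emergency Engineering"

# Need categories taken from the "Best for" row; the keys are hypothetical labels.
BEST_FOR = {
    "code_level_bug": VENDOR_SUPPORT,
    "warranty": VENDOR_SUPPORT,
    "license": VENDOR_SUPPORT,
    "rma": VENDOR_SUPPORT,
    "operational_recovery": EMERGENCY_ENGINEERING,
    "on_site_work": EMERGENCY_ENGINEERING,
    "decision_support": EMERGENCY_ENGINEERING,
}


def first_call(need: str, incident_active: bool) -> str:
    """Pick who to engage first for a given need.

    During an active incident the priority is stabilization, so emergency
    engineering goes first; the vendor follows afterwards for root cause.
    """
    if incident_active:
        return EMERGENCY_ENGINEERING
    return BEST_FOR[need]


print(first_call("rma", incident_active=True))   # Emergency Engineering
print(first_call("rma", incident_active=False))  # Vendor Support
```

The same need routes differently depending on whether the incident is still active, which is the sense in which the two services are complements rather than substitutes.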