Practical references for the people on the bridge at 02:14 — severity classification, pre-call checklists, escalation workflows, and decision frameworks. Useful whether or not you ever pick up the phone.
Severity is decided by blast radius and recoverability, not by alert volume. A loud alert on a redundant subsystem is SEV-3. A silent loss of redundancy on production storage is SEV-1. Use these definitions to classify before you call.
Active business impact. Site/app offline, customer-facing failure, revenue or safety loss. Trigger: immediate engineer dispatch, parallel remote and on-site work, executive notification path engaged.
User impact contained but the system cannot survive a second fault. One host down with HA running, one storage path lost, one PSU dead, one controller offline. Trigger: engineer engagement within 15 minutes, on-site if hardware fault suspected.
Active resync, missed backup window, predictive failure alerts, single non-critical service degraded with workaround in place. Trigger: scheduled engineer engagement, monitored remotely, change ticket through normal channels.
Certificate expiry, firmware lag, capacity warning, decommission, asset audit. Trigger: maintenance-window planning, scheduled visit, batched with adjacent work where possible.
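The four definitions above can be read as an ordered decision: check for active business impact first, then for lost fault tolerance, then for contained degradation, and treat everything else as planned work. The sketch below is one hedged way to encode that ordering. The tier names (`ACTIVE_IMPACT`, `NO_FAULT_TOLERANCE`, `DEGRADED`, `PLANNED`) and the `Incident` fields are illustrative placeholders, not labels or a schema taken from this page. Map them onto your own SEV numbering.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    # Placeholder tier names (assumption), listed from most to least severe.
    # Each value is the trigger text from the matching definition above.
    ACTIVE_IMPACT = "immediate engineer dispatch, parallel remote and on-site work, executive notification"
    NO_FAULT_TOLERANCE = "engineer engagement within 15 minutes, on-site if hardware fault suspected"
    DEGRADED = "scheduled engineer engagement, monitored remotely, normal change ticket"
    PLANNED = "maintenance-window planning, scheduled visit, batched with adjacent work"


@dataclass(frozen=True)
class Incident:
    # Hypothetical fields chosen for illustration only.
    active_business_impact: bool = False    # site/app offline, revenue or safety loss
    survives_second_fault: bool = True      # False: one PSU, path, host, or controller left
    degraded_with_workaround: bool = False  # resync, missed backup, predictive failure alert
    alert_count: int = 0                    # recorded, but deliberately ignored below


def classify(incident: Incident) -> Tier:
    """Classify by blast radius and recoverability, never by alert volume."""
    if incident.active_business_impact:
        return Tier.ACTIVE_IMPACT
    if not incident.survives_second_fault:
        return Tier.NO_FAULT_TOLERANCE
    if incident.degraded_with_workaround:
        return Tier.DEGRADED
    return Tier.PLANNED


# A noisy but still-redundant subsystem versus a silent loss of redundancy.
noisy = Incident(degraded_with_workaround=True, alert_count=500)
silent = Incident(survives_second_fault=False, alert_count=0)
print(classify(noisy).name)   # DEGRADED
print(classify(silent).name)  # NO_FAULT_TOLERANCE
```

Note that `alert_count` never appears in `classify`. The quiet incident outranks the loud one because it has lost its ability to absorb a second fault.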
Eight items. Stage them before you call. Each one removes a minute of triage on the bridge and lets the on-call engineer start real diagnostic work the moment the call connects.
Monitoring alert fires or user reports degradation. Severity classified using the matrix above.
Pull the eight checklist items into the channel. Page the on-call.
Live engineer pickup in under 60s. Bridge opens, severity confirmed, dispatch decision made.
Second engineer rolling from Ashburn staging with the relevant parts cart. Remote diagnosis begins in parallel.
Engineer in the cage. Console access established. Actions logged and photo-documented.
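For a post-incident review, the workflow above can be checked against the response targets quoted on this page: live pickup in under 60 seconds, and on-site in under 60 minutes in the NOVA core. The sketch below is a minimal example. The event names (`pickup`, `on_site`) are hypothetical labels, and it assumes both targets are measured from the moment the on-call is paged, which this page does not state explicitly.

```python
from datetime import datetime, timedelta

# Targets quoted in this document; the keys are hypothetical event labels.
TARGETS = {
    "pickup": timedelta(seconds=60),   # live engineer answers the page
    "on_site": timedelta(minutes=60),  # engineer in the cage (NOVA core)
}


def check_timeline(paged_at: datetime, events: dict) -> dict:
    """Return {event: (elapsed, met_target)} for each logged event with a target.

    Assumption: elapsed time is measured from `paged_at` for every event.
    """
    results = {}
    for name, limit in TARGETS.items():
        if name in events:
            elapsed = events[name] - paged_at
            results[name] = (elapsed, elapsed < limit)
    return results


# Example timestamps are made up for illustration.
paged = datetime(2024, 1, 1, 2, 14, 0)
log = {
    "pickup": paged + timedelta(seconds=45),
    "on_site": paged + timedelta(minutes=72),
}
for event, (elapsed, ok) in check_timeline(paged, log).items():
    print(f"{event}: {elapsed} ({'met' if ok else 'missed'})")
```

Events without a logged timestamp are skipped, not counted as misses, so a remote-only incident does not show a failed on-site target.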
Vendor support and emergency engineering are not substitutes; they are complements. Use both for the right reasons. Confusing them is the most common cause of the extended outages we are called in to recover.
| Dimension | Vendor Support | Emergency Engineering |
|---|---|---|
| Best for | Code-level bugs, warranty, license, RMA | Operational recovery, on-site work, decision support |
| Response time | Hours to days depending on contract tier | < 60 seconds to a senior engineer |
| Physical presence | None | On-site < 60 min in NOVA core |
| Scope of authority | Limited to their product | Whole-stack: hypervisor, storage, network, OS |
| Cost shape | Annual contract | Per-incident or retainer |
| When to use | After stabilization, for root cause | During the incident, for stabilization and recovery |
PSOD, vCenter, vSAN, HA, DRS — full runbook, severity matrix, and field scenarios.
URE math, controller foreign config, write-hole events, controlled rebuild sequencing.
Realistic ETAs per facility, dispatch workflow, badge mechanics across NOVA.
Per-city ETAs, facility-specific access notes, what we stage in Ashburn.
Terminology engineers actually use — APD/PDL, write hole, MMR, vSAN witness, foreign config.