[INC-02]  Storage Incident Response

RAID / SAN Recovery.

Multi-disk failures, controller crashes, degraded arrays, and emergency rebuilds — handled on-site by senior storage engineers, never by remote chat support.

Escalate Now · +1 (703) 343-9850
[CONTEXT]

What This Actually Is

What a RAID failure actually is

A RAID array is a contract between disks and a controller: the controller promises a single virtual disk protected by parity or mirroring, and the disks promise to behave. When that contract breaks (through a failed disk, a controller fault, a firmware bug, or a write-hole event during power loss), the array enters a degraded or offline state. Recovery is about restoring the contract without losing the data it protected.

Why second-disk-out is so dangerous

When a RAID 5 array loses one disk, it runs degraded until a replacement or hot spare is present, and the rebuild that follows reads every sector on every surviving disk. Consumer multi-TB SATA drives are typically specified at one unrecoverable read error (URE) per ~10^14 bits read, so on a large array the chance of hitting an unreadable sector during a rebuild is substantial. The second disk does not have to fail outright; one bad sector on it is enough for the rebuild to fail and the array to go offline.

How recovery actually works

We image the surviving disks bit-for-bit before any controller action. Only then do we attempt parity reconstruction, virtual reassembly, or controlled rebuild. This sequence — image first, recover second — is what separates a successful recovery from a destroyed array. Most failed self-recoveries we are called in to clean up violate this order.
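The parity reconstruction step can be illustrated with a minimal sketch. It assumes a single-parity (RAID 5 style) stripe where the parity strip is the XOR of the data strips, and it works on in-memory byte strings standing in for strips read from disk images. The function names are hypothetical; real reassembly also has to account for stripe size, parity rotation, and disk order.

```python
from functools import reduce


def xor_strips(strips: list[bytes]) -> bytes:
    """XOR equal-length byte strings together, byte by byte."""
    if not strips:
        raise ValueError("need at least one strip")
    length = len(strips[0])
    if any(len(s) != length for s in strips):
        raise ValueError("all strips must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))


def reconstruct_missing_strip(surviving: list[bytes]) -> bytes:
    """Rebuild the one missing strip of a single-parity stripe.

    `surviving` holds every other strip in the stripe (data and parity).
    Because parity = d0 ^ d1 ^ ... ^ dn, XOR-ing all survivors yields the
    missing strip, whether it was a data strip or the parity strip.
    """
    return xor_strips(surviving)


# Toy stripe: three data strips plus parity (values are made up).
d0, d1, d2 = b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"
parity = xor_strips([d0, d1, d2])

# Pretend the disk holding d1 is gone; recover it from the other three.
recovered = reconstruct_missing_strip([d0, d2, parity])
assert recovered == d1
```

Working from images rather than the original disks means a wrong guess about layout costs nothing: the same strips can be re-read and reassembled with different parameters.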

What variables decide the outcome

The outcome depends on array type (5/6/10/50/60), stripe size, controller make and firmware, disk age and SMART history, whether write-back caching was battery-backed at failure, and time since the first disk dropped. The filesystem layer on top (VMFS, NTFS, ext4, btrfs, ZFS) determines which reassembly tools we use once the block layer is restored.

[SYMPTOMS]

Field Indicators

Two or more failed drives in a RAID 5 or RAID 6 array
SAN controller offline, dual-controller failover failed
Synology / QNAP NAS volume crashed, btrfs/ext4 unmountable
Dell PERC / HPE Smart Array foreign configuration
NetApp / Pure Storage degraded aggregate
Backup window missed, ransomware on file shares
[TRIAGE MATRIX]

Severity Classification

Use this to decide whether to page an engineer at 2 a.m. or wait until business hours. Severity is determined by blast radius and recoverability, not how loud the alert is.

Tier | Definition | Field Signals | Response
SEV-1 | Array offline, no read access | Multiple disks failed, controller foreign config, VMs/shares unmounted | Immediate freeze + on-site within 60 min
SEV-2 | Degraded with active rebuild | One disk failed, rebuild in progress, risk of second failure | Engineer engagement within 30 min, monitor rebuild
SEV-3 | Predictive failure / SMART warning | Disk reporting reallocated sectors but still online | Scheduled replacement during maintenance window
SEV-4 | Controller battery / cache warning | Cache flushed to disk, write-through mode active | Battery replacement scheduled, performance monitored
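
The matrix above can be encoded as a small decision function for an alerting pipeline. This is a sketch only: the `ArrayStatus` fields and the `classify` function are hypothetical names, the "OK" result is an addition for the no-signal case, and it assumes your monitoring already supplies these five signals. It checks readability first, so an array that has lost read access is always SEV-1 regardless of other flags.

```python
from dataclasses import dataclass


@dataclass
class ArrayStatus:
    """Observed state of one array (field names are illustrative)."""
    readable: bool         # can hosts still read the virtual disk?
    failed_disks: int      # members marked failed or missing
    rebuild_active: bool   # controller is currently rebuilding
    smart_warning: bool    # predictive failure / reallocated sectors
    cache_warning: bool    # battery or cache fault, write-through active


def classify(status: ArrayStatus) -> str:
    """Return the most severe matching tier, or 'OK' if nothing matches."""
    if not status.readable:
        return "SEV-1"
    if status.failed_disks >= 1 or status.rebuild_active:
        return "SEV-2"
    if status.smart_warning:
        return "SEV-3"
    if status.cache_warning:
        return "SEV-4"
    return "OK"


# One failed member, rebuild running, array still serving reads.
print(classify(ArrayStatus(True, 1, True, False, False)))
```

Ordering the checks from most to least severe keeps the function in line with the rule that severity follows blast radius, not alert volume.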
[RUNBOOK]

Response Procedure

01
Freeze

Halt rebuilds, image disks before any write activity.

02
Diagnose

Controller logs, SMART status, parity reconstruction analysis.

03
Rebuild

Controlled array rebuild with matched replacement spindles.

04
Restore

Restore from Veeam, Commvault, or native snapshots, then validate the restored data.
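
The "image before any write activity" rule in step 01 can be sketched as a read-only copy with a checksum. This is an illustration, not a field tool: the function names and chunk size are assumptions, and the sketch has no handling for read errors or bad sectors. Real imaging of a failing disk calls for an error-tolerant imager such as GNU ddrescue, ideally behind a hardware write blocker.

```python
import hashlib
from pathlib import Path

CHUNK = 1024 * 1024  # 1 MiB per read; an arbitrary choice for this sketch


def image_disk(source: Path, image: Path) -> str:
    """Copy `source` to `image` and return the SHA-256 of the bytes read.

    The source is opened read-only. Mode 'xb' refuses to overwrite an
    existing image file, so an earlier image is never clobbered.
    """
    digest = hashlib.sha256()
    with open(source, "rb") as src, open(image, "xb") as dst:
        while chunk := src.read(CHUNK):
            dst.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large images do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            digest.update(chunk)
    return digest.hexdigest()


def verify_image(image: Path, expected_sha256: str) -> bool:
    """Re-read the image and confirm it matches the hash taken while copying."""
    return sha256_of(image) == expected_sha256
```

Recording the hash at imaging time gives every later step a known state to return to: if the image still verifies, any damage happened downstream of it.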

[PRE-CALL CHECKLIST]

Before You Escalate

Have this information staged. It cuts triage time roughly in half and lets the on-call engineer start work the moment the bridge opens.

  1. Array type (RAID 5/6/10/50/60), stripe size, and disk count.
  2. Controller model and firmware version (MegaRAID, PERC, Smart Array, Adaptec).
  3. Exact failure sequence: which disk first, what alert, what action was taken.
  4. Whether any rebuild, replace, or initialize was attempted before the call.
  5. Most recent backup: date, type (full/incremental/snapshot), tested restore?
  6. Filesystem on top (VMFS, NTFS, XFS, ext4, btrfs, ZFS) and consumer (Hyper-V, ESXi, file share, database).
  7. Power event in the last 24 hours: UPS, generator, or utility transfer.
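
One way to keep the checklist above staged is a small record that reports what is still missing before the bridge opens. A sketch under stated assumptions: `PreCallInfo`, its field names, and `missing_items` are hypothetical, and free-text strings stand in for whatever format your ticketing system uses.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class PreCallInfo:
    """One field per checklist item, in checklist order."""
    array_layout: Optional[str] = None             # type, stripe size, disk count
    controller: Optional[str] = None               # model and firmware version
    failure_sequence: Optional[str] = None         # which disk, alert, action taken
    actions_attempted: Optional[str] = None        # rebuild / replace / initialize
    last_backup: Optional[str] = None              # date, type, restore tested?
    filesystem_and_consumer: Optional[str] = None  # e.g. "VMFS on ESXi"
    recent_power_event: Optional[str] = None       # last 24 hours, or "none"


def missing_items(info: PreCallInfo) -> list[str]:
    """Names of checklist fields that are still empty."""
    return [f.name for f in fields(info) if not getattr(info, f.name)]


info = PreCallInfo(
    array_layout="RAID 5, 64 KiB stripe, 4 disks",
    controller="PERC H750, firmware unknown",
)
print(missing_items(info))
```

Writing "none" or "unknown" explicitly is better than leaving a field blank, since it tells the on-call engineer the question was considered.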
[FAILURE MODES]

Common Mistakes

Patterns we see repeatedly on inbound calls. Avoiding any one of them measurably improves recoverability.

Letting the controller auto-rebuild onto a used hot spare

If the spare has aged or has bad sectors, the rebuild becomes the failure event. Always image first, validate spare second.

Replacing the controller before the disks

A new controller with different firmware can refuse to import the foreign config or interpret stripe layout differently — turning a degraded array into an unrecoverable one.

Initializing a 'foreign' configuration to clear the warning

Initialize writes new metadata and destroys the existing array layout. The single most common irrecoverable mistake we see.

Powering the chassis off and on to 'reset' the controller

If write-back cache was holding dirty data and the battery is degraded, the power cycle is the moment data is lost — not the original disk failure.

[FIELD SCENARIOS]

Real-World Incidents

Sanitized accounts of incidents we have actually run in Northern Virginia. Names removed, sequence and decisions intact.

Herndon · Synology RS-series · double disk failure

A 12-bay RS1619xs+ in SHR-2 lost two disks during a Saturday scrub. The on-site IT lead pulled and replaced both before calling. We aborted the second rebuild, imaged the surviving 10 disks, reconstructed the btrfs metadata from the images, and recovered the 38 TB volume with three days of journal replay. Zero file loss.

Chantilly · Dell PERC H750 · controller fault mid-write

A PERC controller faulted during a heavy VMFS write window. On reboot, the virtual disk appeared as foreign on a replacement controller with a newer firmware revision. Rather than import (which would have rewritten metadata), we matched the original firmware on a bench unit, imported cleanly, and let the VMware layer handle the partial-write recovery.

[TERMINOLOGY]

Operational Glossary

URE
Unrecoverable Read Error: a sector the disk cannot read. Consumer SATA drives are typically specified at one URE per ~10^14 bits read; that math is why RAID 5 is risky on large arrays.
Foreign config
An array configuration the current controller did not create. Importing is usually safe; initializing destroys it.
Write hole
Window during which a parity write is partially complete. Power loss here corrupts the stripe; battery-backed cache exists to prevent it.
Hot spare
A disk reserved to replace a failed member automatically. Useful only if regularly health-checked.
Resilver / rebuild
Process of rewriting a replacement disk from parity or mirror. Read-heavy on surviving disks.
SHR
Synology Hybrid RAID — flexible array layout allowing mixed disk sizes. Recovers like RAID 5/6 underneath.
[VENDORS]

Certified Platforms

Dell EMC · HPE · NetApp · Pure Storage · Synology · QNAP · Veeam · Commvault
[SECTOR]

Service Areas

Ashburn
Reston
Herndon
Sterling
Chantilly
Tysons
Dulles
Leesburg
Fairfax
[QUESTIONS FROM THE FIELD]

Questions Engineers Ask

[How we sourced this]
Pricing benchmarks were assembled from a Q1 2026 audit of seven active storage-recovery partners serving the Northern Virginia metro, cross-referenced against actual invoiced recoveries in our dispatch log (Ashburn, Sterling, Herndon, Chantilly). All responders we connect callers with are fully insured and carry the vendor certifications listed per platform.

What does emergency RAID or SAN recovery cost in Northern Virginia?

Most NOVA-area emergency array recoveries land between $1,800 and $9,500 all-in. The deciding factors are array size, whether disks need to be imaged in a clean facility, and how many prior recovery attempts were made before we were called. The diagnostic plus written quote is a flat $450, credited toward the recovery.

Scenario | Disk Count | NOVA Cost Range | Typical RTO
RAID 5/6, single disk out | 4 – 8 | $1,800 – $3,400 | 6 – 12 hr
RAID 5, double disk failure | 4 – 12 | $3,500 – $6,800 | 24 – 48 hr
SAN controller crash | 12 – 48 | $4,500 – $9,500 | 12 – 36 hr
NAS (Synology/QNAP) volume crash | 4 – 12 | $1,800 – $4,200 | 12 – 24 hr

All ranges include on-site dispatch within the Ashburn–Sterling–Reston corridor. No trip fees inside that ring.

Can you actually recover a RAID 5 array that lost two disks?

Yes, the majority of the time — provided no one ran an initialize or forced a rebuild after the second failure. We image every surviving spindle first, then perform virtual parity reassembly off the images. Originals stay quarantined so we can restart from a known state if anything goes sideways.

  • Both disks mechanically alive, just parity-inconsistent → ~90% full recovery rate.
  • One fully dead disk, second readable → ~75% with partial file loss possible.
  • Both dead + initialize already attempted → forensic extraction only, file-level not volume-level.

Are you licensed and insured to handle our production storage?

Every storage engineer in our NOVA dispatch network is independently insured ($2M professional liability minimum), background-checked, and certified on the specific platform they are dispatched to. We never send a generalist to a NetApp incident or a Windows tech to a ZFS pool.

  • Dell EMC: EMCDSA or active partner-tier engineer.
  • HPE: ASE Storage Solutions Architect minimum.
  • NetApp: NCDA with WAFL recovery experience.
  • Pure Storage: vendor-coordinated, host-side owned by our team.
  • Synology / QNAP: btrfs and ext4 forensic experience verified.

What is your response time across Northern Virginia?

Remote engineer on a bridge within 15 minutes, 24/7. On-site disk imaging in Ashburn, Sterling, and Herndon runs 45–75 minutes from our staging point; Reston, Chantilly, and Tysons add 10–20 minutes depending on Dulles Toll Road and Route 28 conditions.

Will my data be safer if I just let the controller finish its rebuild?

Often no — and this is the single most common reason recoveries fail. A rebuild reads every sector on every surviving disk, which is exactly when an aging second disk throws a URE and takes the array offline. If your array is degraded and the data matters, freeze the rebuild and image first.

URE math: the consumer SATA spec is one unrecoverable error per ~10^14 bits read. Rebuilding a 4 × 6 TB RAID 5 reads the three surviving disks in full, 18 TB or ~144 trillion bits (1.44 × 10^14), which is more than the interval over which the spec allows one error.
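
The arithmetic can be checked with a short script. It is a back-of-the-envelope model, not a prediction: it assumes bit errors are independent and occur at exactly the spec-sheet rate, whereas real drives often do better than spec and real errors tend to cluster. The function names are illustrative.

```python
import math

URE_RATE_PER_BIT = 1e-14  # spec quoted above: 1 error per ~10^14 bits read


def bits_read_in_rebuild(total_disks: int, disk_tb: int) -> int:
    """Bits read from surviving members when one RAID 5 disk is rebuilt."""
    surviving = total_disks - 1
    return surviving * disk_tb * 10**12 * 8  # decimal TB -> bytes -> bits


def p_at_least_one_ure(bits: int, rate: float = URE_RATE_PER_BIT) -> float:
    """P(>=1 URE) = 1 - (1 - rate)**bits, computed stably via log1p/expm1."""
    return -math.expm1(bits * math.log1p(-rate))


bits = bits_read_in_rebuild(total_disks=4, disk_tb=6)
print(f"bits read:     {bits:.3e}")
print(f"expected UREs: {bits * URE_RATE_PER_BIT:.2f}")
print(f"P(>=1 URE):    {p_at_least_one_ure(bits):.1%}")
```

Under these assumptions the 4 × 6 TB example expects about 1.44 errors per rebuild, which works out to roughly a three-in-four chance of hitting at least one.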

Do you coordinate with our cyber insurance carrier on ransomware events?

Yes. We isolate affected hosts, preserve volume snapshots before they age out of retention, document chain of custody, and work directly with your IR firm and carrier. We do not negotiate with threat actors; we restore from the cleanest available state and let counsel handle the rest.

Transparency note: this page connects you with vetted Northern Virginia infrastructure specialists in our dispatch network. Every responder is independently insured, badge-credentialed at the major Data Center Alley facilities, and audited annually for vendor certification currency.