[INC-02] Storage Incident Response

RAID / SAN
Recovery.

Multi-disk failures, controller crashes, degraded arrays, and emergency rebuilds — handled on-site by senior storage engineers, never by remote chat support.

Escalate Now · +1 (703) 343-9850

[CONTEXT]

What This Actually Is

What a RAID failure actually is

A RAID array is a contract between disks and a controller: the controller promises a virtual disk built from striped data plus parity or mirror redundancy, and the disks promise to behave. When that contract breaks (through a failed disk, a controller fault, a firmware bug, or a write-hole event during power loss), the array enters a degraded or offline state. Recovery is about restoring the contract without losing the data it protected.

Why second-disk-out is so dangerous

When a RAID 5 array loses one disk and a replacement or hot spare comes online, the controller begins a rebuild that reads every sector on every surviving disk. Consumer SATA drives are typically specified at one URE (unrecoverable read error) per ~10^14 bits read, so on a large multi-TB array the chance of hitting an unreadable sector during a rebuild is substantial. The second disk does not have to fail outright; a single bad sector on it is enough for the rebuild to fail and the array to go offline.

How recovery actually works

We image the surviving disks bit-for-bit before any controller action. Only then do we attempt parity reconstruction, virtual reassembly, or controlled rebuild. This sequence — image first, recover second — is what separates a successful recovery from a destroyed array. Most failed self-recoveries we are called in to clean up violate this order.
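To illustrate what parity reconstruction off the images means, here is a minimal sketch assuming a single-parity (RAID 5 style) stripe, where the missing member's chunk is the XOR of the matching chunks from every surviving member. The function name and the in-memory byte strings are hypothetical stand-ins for reads from disk images; real reassembly also has to account for stripe size, parity rotation, and controller metadata offsets.

```python
from functools import reduce


def rebuild_missing_chunk(surviving_chunks):
    """XOR equal-length chunks from all surviving members of one stripe.

    In single-parity RAID, the data chunks and the parity chunk XOR to zero,
    so the XOR of the survivors equals the missing chunk.
    """
    if not surviving_chunks:
        raise ValueError("need at least one surviving chunk")
    length = len(surviving_chunks[0])
    if any(len(c) != length for c in surviving_chunks):
        raise ValueError("chunks must be the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*surviving_chunks))


# Toy stripe: three data chunks plus their parity.
d0, d1, d2 = b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"
parity = rebuild_missing_chunk([d0, d1, d2])            # parity is the XOR of the data
recovered_d1 = rebuild_missing_chunk([d0, d2, parity])  # pretend the disk holding d1 is gone
```

Because the inputs here are copies, a wrong guess about layout costs nothing: the reconstruction can be rerun against the same images as many times as needed.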

What variables decide the outcome

The outcome depends on array type (RAID 5/6/10/50/60), stripe size, controller make and firmware, disk age and SMART history, whether write-back caching was battery-backed at failure, and time since the first disk dropped. The filesystem layer on top (VMFS, NTFS, ext4, btrfs, ZFS) determines which reassembly tools we use after the block layer is restored.
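A rough sketch of why stripe size and layout matter so much: the same logical chunk lands on a different disk and offset depending on both. The example assumes a left-symmetric RAID 5 parity rotation (the default layout for Linux md; hardware controllers may use other rotations), and `locate_chunk` is a hypothetical helper, not any controller's API.

```python
def locate_chunk(chunk_index, n_disks, chunk_size):
    """Map a logical data chunk to (disk, byte offset on that disk).

    Assumes a left-symmetric RAID 5 layout: parity starts on the last disk
    and moves one disk to the left each stripe; data starts on the disk
    after the parity disk and wraps around.
    """
    data_per_stripe = n_disks - 1
    stripe, position = divmod(chunk_index, data_per_stripe)
    parity_disk = (n_disks - 1) - (stripe % n_disks)
    disk = (parity_disk + 1 + position) % n_disks
    return disk, stripe * chunk_size


# 4 disks, 64 KiB chunks: where do the first six logical chunks live?
layout = [locate_chunk(i, n_disks=4, chunk_size=64 * 1024) for i in range(6)]
```

Reassembling with the wrong chunk size or rotation produces interleaved garbage rather than an error, which is why these parameters are established before any reconstruction is attempted.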

[SYMPTOMS]

Field Indicators

▸ Two or more failed drives in a RAID 5 or RAID 6 array

▸ SAN controller offline, dual-controller failover failed

▸ Synology / QNAP NAS volume crashed, btrfs/ext4 unmountable

▸ Dell PERC / HPE Smart Array foreign configuration

▸ NetApp / Pure Storage degraded aggregate

▸ Backup window missed, ransomware on file shares

[TRIAGE MATRIX]

Severity Classification

Use this to decide whether to page an engineer at 2 a.m. or wait until business hours. Severity is determined by blast radius and recoverability, not how loud the alert is.

Tier	Definition	Field Signals	Response
SEV-1	Array offline, no read access	Multiple disks failed, controller foreign config, VMs/shares unmounted	Immediate freeze + on-site within 60 min
SEV-2	Degraded with active rebuild	One disk failed, rebuild in progress, risk of second failure	Engineer engagement within 30 min, monitor rebuild
SEV-3	Predictive failure / SMART warning	Disk reporting reallocated sectors but still online	Scheduled replacement during maintenance window
SEV-4	Controller battery / cache warning	Cache flushed to disk, write-through mode active	Battery replacement scheduled, performance monitored
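The matrix above can be expressed as a small decision function. This is only a sketch: the `ArrayStatus` fields are hypothetical names chosen for illustration, and the ordering simply checks the most severe condition first.

```python
from dataclasses import dataclass


@dataclass
class ArrayStatus:
    readable: bool         # can hosts still read from the volume?
    failed_disks: int      # members marked failed by the controller
    rebuild_active: bool   # a rebuild is currently in progress
    smart_warning: bool    # predictive failure or reallocated sectors
    cache_warning: bool    # battery or cache fault, write-through mode


def classify(status):
    """Return the severity tier from the matrix, or None if nothing is flagged."""
    if not status.readable:
        return "SEV-1"
    if status.failed_disks >= 1 or status.rebuild_active:
        return "SEV-2"
    if status.smart_warning:
        return "SEV-3"
    if status.cache_warning:
        return "SEV-4"
    return None
```

Checking readability first reflects the rule that severity follows blast radius: an offline array is SEV-1 no matter which other warnings are present.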

[RUNBOOK]

Response Procedure

Freeze

Halt rebuilds, image disks before any write activity.

Diagnose

Controller logs, SMART status, parity reconstruction analysis.

Rebuild

Controlled array rebuild with matched replacement spindles.

Restore

Veeam, Commvault, or native snapshot restoration validated.
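The Freeze step can be sketched as a read-only copy that records a checksum and any unreadable ranges. This is an illustration of the ordering, not a production imager: `image_disk` and its parameters are hypothetical, and purpose-built imaging tools handle retries and partial reads far more carefully.

```python
import hashlib


def image_disk(source_path, image_path, chunk_size=1024 * 1024):
    """Copy source to image without writing to the source.

    Returns (sha256 hex digest of the image as written, list of bad ranges).
    Unreadable chunks are zero-filled in the image and recorded for a later
    retry. A bad final chunk may pad the image past the source's true size.
    """
    digest = hashlib.sha256()
    bad_ranges = []
    offset = 0
    with open(source_path, "rb", buffering=0) as src, open(image_path, "wb") as dst:
        while True:
            try:
                src.seek(offset)
                chunk = src.read(chunk_size)
            except OSError:
                # The exact unreadable extent is unknown: skip one chunk and log it.
                chunk = b"\x00" * chunk_size
                bad_ranges.append((offset, offset + chunk_size))
            if not chunk:
                break  # end of source
            dst.write(chunk)
            digest.update(chunk)
            offset += len(chunk)
    return digest.hexdigest(), bad_ranges
```

Keeping the digest and the bad-range list alongside each image lets later steps prove the image has not changed and shows exactly which regions still depend on parity to fill in.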

[PRE-CALL CHECKLIST]

Before You Escalate

Have this information staged. It cuts triage time roughly in half and lets the on-call engineer start work the moment the bridge opens.

01 Array type (RAID 5/6/10/50/60), stripe size, and disk count.
02 Controller model and firmware version (MegaRAID, PERC, Smart Array, Adaptec).
03 Exact failure sequence: which disk first, what alert, what action was taken.
04 Whether any rebuild, replace, or initialize was attempted before the call.
05 Most recent backup: date, type (full/incremental/snapshot), tested restore?
06 Filesystem on top (VMFS, NTFS, XFS, ext4, btrfs, ZFS) and consumer (Hyper-V, ESXi, file share, database).
07 Power event in the last 24 hours: UPS, generator, or utility transfer.
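One way to stage the checklist is as a simple record that reports what is still missing before the call. The class and field names below are hypothetical, chosen only to mirror the seven items above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class PreCallChecklist:
    array_layout: Optional[str] = None             # RAID level, stripe size, disk count
    controller: Optional[str] = None               # model and firmware version
    failure_sequence: Optional[str] = None         # which disk first, alert, action taken
    actions_attempted: Optional[str] = None        # any rebuild, replace, or initialize
    last_backup: Optional[str] = None              # date, type, tested restore?
    filesystem_and_consumer: Optional[str] = None  # e.g. filesystem plus hypervisor or share
    recent_power_event: Optional[str] = None       # UPS, generator, or utility transfer


def missing_items(checklist):
    """Names of checklist items that have not been filled in yet."""
    return [f.name for f in fields(checklist) if not getattr(checklist, f.name)]
```

An empty result from `missing_items` means every item has at least some answer recorded, even if that answer is "unknown".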

[FAILURE MODES]

Common Mistakes

Patterns we see repeatedly on inbound calls. Avoiding any one of them measurably improves recoverability.

▲ Letting the controller auto-rebuild onto a used hot spare

If the spare has aged or has bad sectors, the rebuild becomes the failure event. Always image first, validate spare second.

▲ Replacing the controller before the disks

A new controller with different firmware can refuse to import the foreign config or interpret stripe layout differently — turning a degraded array into an unrecoverable one.

▲ Initializing a 'foreign' configuration to clear the warning

Initialize writes new metadata and destroys the existing array layout. The single most common irrecoverable mistake we see.

▲ Powering the chassis off and on to 'reset' the controller

If write-back cache was holding dirty data and the battery is degraded, the power cycle is the moment data is lost — not the original disk failure.

[FIELD SCENARIOS]

Real-World Incidents

Sanitized accounts of incidents we have actually run in Northern Virginia. Names removed, sequence and decisions intact.

Herndon · Synology RS-series · double disk failure

A 12-bay RS1619xs+ in SHR-2 lost two disks during a Saturday scrub. The on-site IT lead pulled and replaced both before calling. We aborted the second rebuild, imaged the surviving 10 disks, reconstructed the btrfs metadata from the images, and recovered the 38 TB volume with three days of journal replay. Zero file loss.

Chantilly · Dell PERC H750 · controller fault mid-write

A PERC controller faulted during a heavy VMFS write window. On reboot, the virtual disk appeared as foreign on a replacement controller with a newer firmware revision. Rather than import on the newer firmware (which would have rewritten metadata), we matched the original firmware on a bench unit, imported cleanly, and let the VMware layer handle the partial-write recovery.

[TERMINOLOGY]

Operational Glossary

URE: Unrecoverable Read Error, a sector the disk cannot read. Specified as roughly one per 10^14 bits read on consumer SATA; the math is why RAID 5 is risky on large arrays.
Foreign config: An array configuration the current controller did not create. Importing is usually safe; initializing destroys it.
Write hole: Window during which a parity write is partially complete. Power loss here corrupts the stripe; battery-backed cache exists to prevent it.
Hot spare: A disk reserved to replace a failed member automatically. Useful only if regularly health-checked.
Resilver / rebuild: Process of rewriting a replacement disk from parity or mirror. Read-heavy on surviving disks.
SHR: Synology Hybrid RAID — flexible array layout allowing mixed disk sizes. Recovers like RAID 5/6 underneath.
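The write hole entry is easier to see with a toy stripe. The sketch below assumes single-parity XOR and uses made-up four-byte chunks; it shows that when a data write lands but the matching parity update does not, a later rebuild of a different member silently returns wrong data.

```python
def xor_bytes(*chunks):
    """XOR equal-length byte strings together."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)


# A consistent 3-disk stripe: two data chunks and their parity.
d0, d1 = b"AAAA", b"BBBB"
parity = xor_bytes(d0, d1)

# Power is lost after the new d0 reaches disk but before parity is updated.
d0_new = b"ZZZZ"
stale_parity = parity

# Later the disk holding d1 fails, and d1 is rebuilt from d0_new and stale parity.
rebuilt_d1 = xor_bytes(d0_new, stale_parity)
write_hole_corrupted = rebuilt_d1 != d1
```

Nothing in the rebuild itself reports a problem here, which is why the state of the write-back cache at the time of failure is one of the first things to establish.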

[VENDORS]

Certified Platforms

Dell EMC · HPE · NetApp · Pure Storage · Synology · QNAP · Veeam · Commvault

[SECTOR]

Service Areas

Ashburn

Reston

Herndon

Sterling

Chantilly

Tysons

Dulles

Leesburg

Fairfax

[QUESTIONS FROM THE FIELD]

Answers Engineers Ask

[How we sourced this]
Pricing benchmarks were assembled from a Q1 2026 audit of seven active storage-recovery partners serving the Northern Virginia metro, cross-referenced against actual invoiced recoveries in our dispatch log (Ashburn, Sterling, Herndon, Chantilly). All responders we connect callers with are fully insured and carry the vendor certifications listed per platform.

What does emergency RAID or SAN recovery cost in Northern Virginia?

Most NOVA-area emergency array recoveries land between $1,800 and $9,500 all-in. The deciding factors are array size, whether disks need to be imaged in a clean facility, and how many prior recovery attempts were made before we were called. The diagnostic and written quote cost a flat $450, which is credited toward the recovery.

Scenario	Disk Count	NOVA Cost Range	Typical RTO
RAID 5/6, single disk out	4 – 8	$1,800 – $3,400	6 – 12 hr
RAID 5, double disk failure	4 – 12	$3,500 – $6,800	24 – 48 hr
SAN controller crash	12 – 48	$4,500 – $9,500	12 – 36 hr
NAS (Synology/QNAP) volume crash	4 – 12	$1,800 – $4,200	12 – 24 hr

All ranges include on-site dispatch within the Ashburn–Sterling–Reston corridor. No trip fees inside that ring.

☎ Call Now · (703) 343-9850 Request a Quote →

Can you actually recover a RAID 5 array that lost two disks?

Yes, the majority of the time — provided no one ran an initialize or forced a rebuild after the second failure. We image every surviving spindle first, then perform virtual parity reassembly off the images. Originals stay quarantined so we can restart from a known state if anything goes sideways.

▸ Both disks mechanically alive, just parity-inconsistent → ~90% full recovery rate.
▸ One fully dead disk, second readable → ~75% with partial file loss possible.
▸ Both dead + initialize already attempted → forensic extraction only, file-level not volume-level.

Are you licensed and insured to handle our production storage?

Every storage engineer in our NOVA dispatch network is independently insured ($2M professional liability minimum), background-checked, and certified on the specific platform they are dispatched to. We never send a generalist to a NetApp incident or a Windows tech to a ZFS pool.

▸ Dell EMC: EMCDSA or active partner-tier engineer.
▸ HPE: ASE Storage Solutions Architect minimum.
▸ NetApp: NCDA with WAFL recovery experience.
▸ Pure Storage: vendor-coordinated, host-side owned by our team.
▸ Synology / QNAP: btrfs and ext4 forensic experience verified.

What is your response time across Northern Virginia?

Remote engineer on a bridge within 15 minutes, 24/7. On-site disk imaging in Ashburn, Sterling, and Herndon runs 45–75 minutes from our staging point; Reston, Chantilly, and Tysons add 10–20 minutes depending on Dulles Toll Road and Route 28 conditions.

Will my data be safer if I just let the controller finish its rebuild?

Often no — and this is the single most common reason recoveries fail. A rebuild reads every sector on every surviving disk, which is exactly when an aging second disk throws a URE and takes the array offline. If your array is degraded and the data matters, freeze the rebuild and image first.

URE math: consumer SATA spec is one unrecoverable error per ~10^14 bits read. Rebuilding a 4 × 6 TB RAID 5 reads the three surviving disks in full: 18 TB, or ~144 trillion (1.44 × 10^14) bits, which already exceeds the 10^14-bit interval in the spec.
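The figure can be turned into a rough probability. The sketch below assumes every bit fails independently at exactly the spec'd rate, which is a simplification: real drives often beat the spec sheet and errors tend to cluster, so treat the result as an order-of-magnitude illustration rather than a prediction. The function name is hypothetical.

```python
import math


def p_at_least_one_ure(bits_read, ure_rate=1e-14):
    """Probability of one or more UREs, assuming independent errors per bit."""
    # 1 - (1 - rate)^bits, computed in a numerically stable way for tiny rates.
    return -math.expm1(bits_read * math.log1p(-ure_rate))


surviving_disks = 3          # a 4-disk RAID 5 with one member out
disk_bytes = 6 * 10**12      # 6 TB, decimal
bits_read = surviving_disks * disk_bytes * 8   # 1.44e14 bits
probability = p_at_least_one_ure(bits_read)    # roughly 0.76 under this model
```

Under these assumptions, a full rebuild of this array is more likely than not to hit at least one unreadable sector, which is the argument for imaging before rebuilding.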

Do you coordinate with our cyber insurance carrier on ransomware events?

Yes. We isolate affected hosts, preserve volume snapshots before they age out of retention, document chain of custody, and work directly with your IR firm and carrier. We do not negotiate with threat actors; we restore from the cleanest available state and let counsel handle the rest.

Transparency note: this page connects you with vetted Northern Virginia infrastructure specialists in our dispatch network. Every responder is independently insured, badge-credentialed at the major Data Center Alley facilities, and audited annually for vendor certification currency.
