Information Technology Services

Storage Failure at DR Site – Services Impacted

Outage category: Applications

Location: All users of Red Hat Satellite, NetMRI, and several authentication services. Downstream applications that depend on those authentication services (such as Zoom) may have experienced slowness.

Status: Closed

Resolved Alert:

Initial Symptoms

During work with a Dell technician, the technician rebooted the only working DR Storage Controller.

Root Cause Analysis

Cause

A mistake was made during a fix performed under RFC 273521. Dell had a typographical error identifying which server needed to be repaired, which led the technician to insist on rebooting the working Controller. During the reboot, the previously working Controller had errors loading, and these took time to resolve.

Resolution

Worked with Dell support to replace the faulty part within the Storage Controller, then got both DR Controllers upgraded, running, and in sync with one another.

Prevention

Require multiple parties to agree on each step before it is carried out when the work involves a single point of dependency.
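
The prevention step could also be enforced as a simple sign-off gate in a change checklist. The following is a minimal sketch only, not a description of any tooling in use: the PlannedStep structure, the may_proceed function, and the party names are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PlannedStep:
    """One maintenance action awaiting sign-off (hypothetical structure)."""
    description: str
    target: str                        # component the step acts on
    single_point_of_dependency: bool   # True if no working redundant peer exists
    approvals: set = field(default_factory=set)

    def approve(self, party: str) -> None:
        """Record that a party has agreed to this step."""
        self.approvals.add(party)


def may_proceed(step: PlannedStep, required_parties: set) -> bool:
    """Return True only if the step is safe to run under the sign-off rule.

    Steps on a single point of dependency need agreement from every
    required party; other steps are not gated by this check.
    """
    if not step.single_point_of_dependency:
        return True
    return required_parties <= step.approvals


# Illustrative use: party names are assumptions, not taken from the incident.
required = {"vendor technician", "storage administrator"}
reboot = PlannedStep(
    description="Reboot controller",
    target="DR Storage Controller",
    single_point_of_dependency=True,
)
reboot.approve("vendor technician")
blocked = may_proceed(reboot, required)      # only one party has agreed
reboot.approve("storage administrator")
allowed = may_proceed(reboot, required)      # all required parties have agreed
```

In this sketch a single party insisting on a step is not enough to proceed; the gate opens only after every listed party has agreed.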