Storage Failure at DR Site – Services Impacted

Outage category: 
Applications
Location: 
All users of Red Hat Satellite, NetMRI, and several authentication services. Downstream applications that depend on those authentication services (such as Zoom) may have experienced slowness.
Status: 
Closed
Resolved alert: 
02/01/2024 5:40 pm

Users may have experienced slower logins.

Initial symptoms: 

During work with a Dell technician, the technician rebooted the only working DR Storage Controller.

Duration: 
02/01/2024 1:54 pm - 02/01/2024 5:40 pm
Impact to Mason: 

End users in the Mason community may have experienced slower login attempts. Other impacts were isolated to EIS team members who use NetMRI or Red Hat Satellite.

Affected Services: 
Application Software
ROOT CAUSE ANALYSIS
Cause: 

A mistake was made during a fix performed under RFC 273521. Dell had a typographical error identifying which server needed to be repaired, which led the technician to insist on rebooting the working Controller. During the reboot, the previously working Controller encountered errors while loading, and these took time to resolve.

Resolution: 

Worked with Dell support to replace the faulty part within the Storage Controller and then get both DR Controllers upgraded, running, and in sync with one another.

Prevention: 

Require multiple parties to agree on each step before it is carried out when performing work that has a single point of dependency.

STATISTICS
Service Team: 
CCSO, CCSE