Mason Power Outage ITS Service Impacts

Outage category: 
Applications, Banner, Website, Wired Network, Wireless Network
Location: 
Fairfax Data Center
Status: 
Closed
Resolved alert: 
10/08/2023 3:00 am

A power outage at 6:45PM caused a cooling failure in the Data Center. Heat in the Data Center became a severe problem, which was addressed by turning off hosts and services to avoid hardware failure. Below is the timeline of events.

The impact was primarily on ORC systems and on Virtual Machines in the NODUS cluster that EIS supports.

Between 8:20PM and 11:00PM, users experienced service outages for Banner, Blackboard, Authentication, Email, and many other services hosted within the Data Center.

=========

More detailed explanation and timeline:
6:45PM Power Outage
Heat increases in Data Center
8:00PM ORC Hosts are powered off to attempt to reduce heat in Data Center
8:20PM VMs start coming down so CCSO can power off NODUS hosts because of excessive heat.
9:15PM Technician from facilities arrives.
9:35PM Temperature starts to drop
10:20PM NODUS hosts powered on again.
10:30PM VMs start powering back on.
11:00PM All production VMs back online.
12:45AM Most Dev / Test / Outstanding systems powered on
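
The intervals implied by this timeline can be worked out with a short script. The following is a minimal sketch, not part of the original alert: the event labels, the `TIMELINE` dictionary, and the `elapsed` helper are illustrative names, and the calendar dates are assumed from the alert's Duration field (evening of 10/07/2023 into 10/08/2023).

```python
from datetime import datetime, timedelta

# Format of the timestamps below, e.g. "10/07/2023 6:45PM".
TIME_FORMAT = "%m/%d/%Y %I:%M%p"

# Times are copied from the timeline; dates are an assumption based on
# the Duration field of the alert. Labels are shortened for illustration.
TIMELINE = {
    "Power outage": "10/07/2023 6:45PM",
    "ORC hosts powered off": "10/07/2023 8:00PM",
    "VMs start coming down": "10/07/2023 8:20PM",
    "Temperature starts to drop": "10/07/2023 9:35PM",
    "NODUS hosts powered on": "10/07/2023 10:20PM",
    "All production VMs back online": "10/07/2023 11:00PM",
    "Most Dev / Test systems powered on": "10/08/2023 12:45AM",
}


def elapsed(start_event: str, end_event: str) -> timedelta:
    """Return the time between two labeled timeline events."""
    start = datetime.strptime(TIMELINE[start_event], TIME_FORMAT)
    end = datetime.strptime(TIMELINE[end_event], TIME_FORMAT)
    return end - start


if __name__ == "__main__":
    # Window during which production services were unavailable.
    print(elapsed("VMs start coming down", "All production VMs back online"))
    # Time from loss of power until the room began to cool.
    print(elapsed("Power outage", "Temperature starts to drop"))
```

Under these assumptions, the production service window from 8:20PM to 11:00PM comes to 2 hours 40 minutes, and the temperature began to drop 2 hours 50 minutes after the power outage.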

Brief overview of cooling problem: Facilities was not supplying chilled water, and the CRACs did not switch into DX (Direct Expansion) mode, so the Data Center overheated. Chilled water was turned back on around 9:20PM, and a technician was on site to work on the CRACs to resolve the DX issue.

Initial symptoms: 
Power outage at 6:45PM, which resulted in cooling issues within the Data Center. Hosts within the Data Center were preemptively powered off at 8:20PM to avoid damage from heat.
Duration: 
10/07/2023 6:18 pm - 10/08/2023 3:00 am
Impact to Mason: 
All users (students, faculty, and staff) were unable to access the listed resources.
Affected Services: 
Authentication, Authorization & Account Services (AAA), Patriot Web, Banner Admin, Office 365 Email, Blackboard Courses, myMason, RAMP (Research Administration Management Portal) System, Patriot Virtual Computing
ROOT CAUSE ANALYSIS
Cause: 

Preemptive powering down of EIS hardware supporting Virtual Machines within Data Center.

Resolution: 

Cooling was restored with help from Facilities, who re-enabled chilled water to the Data Center.

Prevention: 

Data Center Operations is working with Facilities to improve the response to incidents like this one and to assist with preventative work.

STATISTICS
Service Team: 
CCSO, CCSE, AT, ITSO, NET