COLO Servers Are Unavailable

Outage category: 
Applications, Website
Location: 
COLO VMware Environment
Status: 
Closed
Resolved alert: 
04/12/2023 6:55 pm

Websites and services on the affected hosts were unavailable.

Initial symptoms: 

After the patch was applied to the first COLO host, esxi1, network and NFS storage issues appeared on that host. They were resolved once the configuration changes made by the patch had been identified. About 90 minutes later, while the CCSO team was applying the patch to the remaining two COLO ESXi hosts, the bug in the patch undid those fixes.

Duration: 
04/12/2023 4:14 pm - 04/12/2023 6:55 pm
Impact to Mason: 

Users of these systems, including Law School students performing registration.

 

Affected Services: 
Application Software
Other Affected Services: 
Services on the following servers were affected: bronco, coloaersv01p, colocctvv01p, colocpv01p, colohrinmdv01-, colohrlonpov01p, colohrlpsv01p, colohrlrdtv01p, coloimsv01p, colonmdvcv01p, colophotov01p, coloqnomdbv01p, coloqnomdbv02p, coloqnomv01d, coloqnomv01p, colotsv01p, colouldtv01p, dmwin10provm, drbatman, test-bmr, test-lister, testme2k12r2, testme2k19std, vug, vug70
ROOT CAUSE ANALYSIS
Cause: 

The outage occurred during normal monthly ESXi patching for the COLO environment, per RFC 254730. The patch was supposed to fix a known bug in the VMware code (see https://kb.vmware.com/s/article/88875). It did not: applying the patch triggered the bug, which changed the Standard Virtual Switch configuration and undid the normal settings. CCSO engineers corrected the configuration (specifically the firewall rules within VMware and the Virtual Switch Port Group settings) and moved on to patching the remaining hosts. The bug then recurred and undid those fixes.

Resolution: 

CCSO engineers then worked with VMware to resolve this issue.

Prevention: 

Perform the work on a single host first, and move on to the others only after that host has been upgraded successfully. Patch the environments in the following order: DR NODUS, Fairfax NODUS, COLO, CUI.
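
The one-host-at-a-time rule and the environment order above can be sketched in a few lines of Python. This is an illustration only, not the CCSO team's actual tooling: staged_rollout, patch_host, and host_is_healthy are hypothetical names, and the real patch and verification steps would be supplied by the caller. Because the configuration regression in this incident appeared about 90 minutes after the fix, a real health check would probably need to be repeated after a waiting period. That waiting period is an assumption and is not modeled here.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Environment order taken from the Prevention note above.
ROLLOUT_ORDER = ["DR NODUS", "Fairfax NODUS", "COLO", "CUI"]


def staged_rollout(
    hosts_by_env: Dict[str, List[str]],
    patch_host: Callable[[str], None],
    host_is_healthy: Callable[[str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Patch one host at a time and stop at the first failed check.

    patch_host and host_is_healthy are hypothetical callbacks standing in
    for the real patching and verification procedures.

    Returns (verified_hosts, failed_host). failed_host is None when every
    host was patched and verified.
    """
    verified: List[str] = []
    for env in ROLLOUT_ORDER:
        for host in hosts_by_env.get(env, []):
            patch_host(host)
            if not host_is_healthy(host):
                # Halt here so the remaining hosts are never touched.
                return verified, host
            verified.append(host)
    return verified, None
```

If the check fails on a host, the function returns immediately, so no later host or environment is patched until someone has investigated.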

STATISTICS