COLO Servers Are Unavailable

Outage category: Applications, Website

Location: COLO VMware Environment

Status: Closed

Resolved Alert:

Initial Symptoms

After the patch was applied to the first COLO host, esxi1, network and NFS storage issues appeared on that host. CCSO engineers resolved them after reviewing the configuration changes made by the patch, but a bug in the patch reverted those fixes about 90 minutes later, while the CCSO team was applying the patch to the remaining two COLO ESXi hosts.

Root Cause Analysis

Cause

During normal monthly ESXi patching for the COLO environment, per RFC 254730, the patch being applied was expected to fix a known bug in the VMware code (see https://kb.vmware.com/s/article/88875). The patch did not fix the bug. Instead, applying it triggered the bug, which changed the Standard Virtual Switch configuration and undid the normal settings. CCSO engineers corrected the configuration (specifically the firewall rules within VMware and the Virtual Switch Port Group settings) and then moved on to patching the remaining hosts. The bug was still present, however, and subsequently undid those fixes.

Resolution

CCSO engineers then worked with VMware to resolve this issue.

Prevention

Patch a single host first, and move on to the other hosts only after that first host has been upgraded successfully. Patch the environments in the following order: DR NODUS, Fairfax NODUS, COLO, CUI.
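
The staged approach described above could be sketched roughly as follows. This is only an illustration, not the CCSO team's actual tooling: the function staged_patch and the callables patch_host and host_is_healthy are hypothetical placeholders for whatever patching and verification steps are used in practice. Only the environment order is taken from this report.

```python
from typing import Callable, Dict, List

# Environment order taken from the Prevention section of this report.
ENVIRONMENT_ORDER = ["DR NODUS", "Fairfax NODUS", "COLO", "CUI"]


def staged_patch(
    hosts_by_env: Dict[str, List[str]],
    patch_host: Callable[[str], None],        # hypothetical: applies the patch to one host
    host_is_healthy: Callable[[str], bool],   # hypothetical: verifies network/storage afterwards
) -> List[str]:
    """Patch hosts one at a time, stopping at the first host that fails verification."""
    patched: List[str] = []
    for env in ENVIRONMENT_ORDER:
        for host in hosts_by_env.get(env, []):
            patch_host(host)
            # Gate: do not touch any further host until this one is verified.
            if not host_is_healthy(host):
                raise RuntimeError(
                    f"{host} ({env}) failed verification; halting before remaining hosts"
                )
            patched.append(host)
    return patched
```

Because the configuration reverted about 90 minutes after it was fixed in this incident, the verification step would likely need to include a waiting period or a repeated check rather than a single check immediately after patching.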