Outage category: Telephone
Location: All campuses
Status: Open
Resolved Alert:
Initial Symptoms
Users who were already logged into the system continued to work normally. Users who were not logged in when the outage began, or who logged out during the outage, could not log into the system. Calls could be answered normally.
Root Cause Analysis
Cause
From the vendor:
Investigation confirmed the symptoms and showed that some of the backend instances responsible for storing user information were unreachable and needed to be restored. These instances were found to be out of memory, which made them inaccessible. The monitoring agents responsible for alerting on memory thresholds were not operational. It was determined that the monitoring agents were misconfigured. The team is still investigating what triggered the out-of-memory condition. We have a hypothesis that certain bulk configurations acted as a possible trigger. We remediated the issue by bringing these instances back and verified that the configuration was intact. Validation efforts concluded that the services were operational.
Resolution
From the vendor:
We have corrected the configuration of the monitoring agents for memory-threshold-based alerts. This should help us detect similar issues proactively.
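For illustration only, the kind of check described above can be sketched as follows. This is not the vendor's implementation. The names (AgentReport, evaluate), the 90% memory threshold, and the 300-second silence limit are all assumptions chosen for the example. The sketch raises an alert when an instance crosses the memory threshold, and also when a monitoring agent stops reporting, since a silent agent was part of this outage.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AgentReport:
    """Last report received from a monitoring agent (hypothetical shape)."""
    instance: str            # hypothetical instance name
    memory_used_pct: float   # last reported memory usage, 0-100
    last_seen_s: float       # time of the agent's last report, in seconds


def evaluate(reports: List[AgentReport], now_s: float,
             memory_threshold_pct: float = 90.0,   # assumed threshold
             max_silence_s: float = 300.0) -> List[str]:  # assumed limit
    """Return alert messages for high memory usage and for silent agents."""
    alerts = []
    for r in reports:
        if now_s - r.last_seen_s > max_silence_s:
            # A silent agent is itself an alert: its memory reading is stale.
            alerts.append(f"{r.instance}: monitoring agent silent")
        elif r.memory_used_pct >= memory_threshold_pct:
            alerts.append(f"{r.instance}: memory at {r.memory_used_pct:.0f}%")
    return alerts
```

Checking for silent agents separately from memory usage means a misconfigured or stopped agent is reported rather than being mistaken for a healthy instance.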
Prevention
From the vendor:
We are also reviewing the design to improve data resiliency against similar failures (i.e., user access). We will provide further updates on the triggering condition, as well as the progress on design improvements, once identified and upon request.