Outage category: VoIP phone service
Location: All Vonage users who were not logged in to the Vonage app. When they attempted to log in, they could not use single sign-on to open the app.
Status: Closed
Resolved Alert:
Initial Symptoms
Users called the support center and telecom to report this issue. In addition, telecom admins could not log in to the portal.
Root Cause Analysis
Cause
Per Vonage:
Our engineers began to investigate and diagnose the reason for this issue. The team identified a mixture of poorly performing and healthy servers. This was traced to a configuration change that caused some servers to become unhealthy. The problem was not identified during validation testing because the change was made during a low-traffic period.
Resolution
Per Vonage:
In an attempt to restore service, additional servers were added. This provided additional resources, but there were still not enough healthy connected servers to handle traffic demands. Service recovered for connected users, but new login attempts were failing due to the volume of simultaneous requests. Measures were taken to regulate the flow of login requests and moderate system services. This activity continued until 17:44 UTC, when all services were restored and traffic volumes normalized.
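For readers unfamiliar with how a flow of login requests can be regulated, the following is a minimal sketch of one common technique, a token bucket. Vonage did not state which mechanism was used, so this is an illustration only; the class name, rate, and burst values are hypothetical.

```python
class TokenBucket:
    """Illustrative token-bucket limiter (hypothetical; not Vonage's implementation)."""

    def __init__(self, rate_per_sec: float, burst: int, now: float = 0.0):
        self.rate = rate_per_sec      # tokens added per second
        self.capacity = float(burst)  # maximum tokens that can accumulate
        self.tokens = float(burst)    # start with a full bucket
        self.last = now               # timestamp of the last refill

    def allow(self, now: float) -> bool:
        """Return True if a login request arriving at time `now` may proceed."""
        elapsed = max(0.0, now - self.last)
        # Refill according to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should reject or queue the request


# Example: allow bursts of 3 logins, refilling at 2 logins per second.
bucket = TokenBucket(rate_per_sec=2, burst=3)
decisions = [bucket.allow(0.0) for _ in range(4)]
```

In this example the first three requests at time 0 are admitted and the fourth is refused until the bucket refills, which is how a surge of simultaneous logins gets spread out over time.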
Prevention
Vonage to run a full internal review – Completed.
● Add additional unit tests to validate configuration – Completed.
● Add monitoring and alerting for all individual API servers, specifically for unhealthy message code – Completed.
● Systems database connection pool and resource review – In Progress.
● Full process and deployment review including runbook updates – In Progress.
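The per-server monitoring item above matters because a pool with a mixture of healthy and unhealthy servers can look acceptable in aggregate. A minimal sketch of a per-server check follows. It assumes each API server reports an HTTP-style status code and that 200 means healthy; the server names, the status convention, and the function names are hypothetical and do not describe Vonage's monitoring.

```python
from typing import Dict, List

HEALTHY_STATUS = 200  # assumption: 200 is the only healthy code


def unhealthy_servers(statuses: Dict[str, int]) -> List[str]:
    """Return the names of servers whose latest status code is not healthy."""
    return sorted(
        name for name, code in statuses.items() if code != HEALTHY_STATUS
    )


def should_alert(statuses: Dict[str, int]) -> bool:
    """Alert if any single server is unhealthy, regardless of the pool average."""
    return len(unhealthy_servers(statuses)) > 0


# Example: one of three servers is failing; the check flags it by name.
latest = {"api-1": 200, "api-2": 503, "api-3": 200}
flagged = unhealthy_servers(latest)
```

Alerting on each server individually, instead of on an averaged success rate, is what surfaces a partially degraded pool like the one described in the root cause.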