Vonage Single Sign-on issues

Outage category: 
VoIP phone service
Location: 
All Vonage users who were not already logged in to the Vonage app and attempted to log in were unable to use single sign-on to open the app.
Status: 
Closed
Resolved alert: 
11/27/2023 1:55 pm

Vonage users trying to sign in to the Vonage app could not do so using the single sign-on feature. The Vonage portal was also down, so admins could not view issues or make updates to the system.

Initial symptoms: 

Users called the Support Center and Telecom about this issue. Telecom admins also could not log in to the portal.

Duration: 
11/27/2023 12:05 pm - 11/27/2023 1:55 pm
Impact to Mason: 

Vonage users who were not already signed in could not access their phone through the Vonage application. They had no phone service unless they had the MS Teams integration (they could use its dial pad) or a Yealink desk phone. Telecom admins could not service accounts because the portal was unavailable.

Affected Services: 
Telephone Services
Other Affected Services: 
Phone services for users with only the Vonage app who were not logged in before the outage.
ROOT CAUSE ANALYSIS
Cause: 

Per Vonage:
Our engineers began to investigate and diagnose the reason for this issue. The team identified there was a mixture of both poorly performing, and healthy servers. This was identified to be due to a configuration change that caused some servers to become unhealthy. This was not initially identified when the change was made during validation testing as the change was made during a low traffic period.

Resolution: 

Per Vonage:
In an attempt to restore service, additional servers were added. This did provide additional resources, but insufficient healthy servers connected necessary to handle traffic demands. Service recovered for connected users, but new login attempts were failing due to the volume of simultaneous requests. Measures were taken to regulate the flow of login requests and moderate system services. This activity continued until 17:44 UTC when all services were restored, and traffic volumes normalized.
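
Vonage's statement does not say how the flow of login requests was regulated. As an illustration only, one common way to cap the rate of new login attempts is a token bucket. The sketch below is a minimal example under that assumption; the class name, rate, and capacity are hypothetical and do not describe Vonage's systems.

```python
import time
from typing import Callable


class TokenBucket:
    """Admit about `rate` requests per second, with bursts up to `capacity`.

    Hypothetical illustration of login throttling; not Vonage's implementation.
    """

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum tokens the bucket can hold
        self.clock = clock          # injectable clock, useful for testing
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        """Return True if a login attempt may proceed now, else False."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With a limiter like this in front of a login endpoint, attempts that exceed the configured rate are rejected or asked to retry later, so a surge of simultaneous sign-ins after an outage cannot overwhelm the servers that are still healthy.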

Prevention: 

● Vonage to run a full internal review – Completed.
● Add additional unit tests to validate configuration – Completed.
● Add monitoring and alerting for all individual API servers, specifically for unhealthy message code – Completed.
● Systems database connection pool and resource review – In Progress.
● Full process and deployment review including runbook updates – In Progress.

STATISTICS
Service Team: 
Telecom, Support Center