Cisco Webex Teams Issue

Outage category: 
Webex
Location: 
All locations
Status: 
Open
Resolved alert: 
03/11/2020 1:06 pm

Customers experienced issues with logging into Webex Teams or using the services within Webex Teams.

Initial symptoms: 

On March 11, Webex Engineering identified alerts from various Webex services within the infrastructure, indicating impact to media, video device calling, messaging, and administration services. Throughout the impact period, services dependent on the Webex authentication components were degraded or failed. This included logging into Webex Control Hub and Control Hub authenticated sites, hybrid services, bots, and APIs. In addition, video endpoint users connecting to Webex Meetings would have experienced call-in and call-back delays or failures. Webex Teams users experienced intermittent messaging delays, authentication delays or failures, and issues accessing the web client.

Duration: 
03/11/2020 10:25 am - 03/11/2020 1:06 pm
Impact to Mason: 

Users may have been unable to log into Webex Teams or make calls using videoconference room systems.

ROOT CAUSE ANALYSIS
Cause: 

Cisco Webex experienced an issue with its authentication service that degraded dependent services. Because of an error condition within the authentication microservice, a protective rate-limiting parameter was incorrectly triggered between the microservice components that make up the authentication and media flows within the data centers. Users who had already authenticated successfully and held active tokens were not affected, nor were users connecting to Webex Meetings via phone or VoIP, as those connections did not depend on the same media authentication flows.
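For readers unfamiliar with protective rate limiting, the sketch below shows one common form of it, a token bucket. It is only an illustration of the general technique: Cisco has not published how its limiter works, and the class name, the capacity, and the refill rate here are all hypothetical.

```python
class TokenBucket:
    """Hypothetical token-bucket rate limiter (illustration only, not Cisco's code)."""

    def __init__(self, capacity: float, refill_per_second: float) -> None:
        self.capacity = capacity                    # maximum burst size
        self.refill_per_second = refill_per_second  # sustained request rate
        self.tokens = capacity                      # start with a full bucket
        self.last_refill = 0.0                      # time of the last refill, in seconds

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` may proceed."""
        elapsed = max(0.0, now - self.last_refill)
        # Add tokens for the time that has passed, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # bucket empty: the request is rejected


def count_rejected(limiter: TokenBucket, request_times: list[float]) -> int:
    """Count how many requests in a sequence of arrival times are rejected."""
    return sum(0 if limiter.allow(t) else 1 for t in request_times)
```

With these assumed settings, a limiter tuned too tightly rejects legitimate traffic as soon as a burst exceeds its capacity, which is one way a protective limit can degrade the services that depend on it.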

During service restoration, a configuration change was completed, but the volume of pending authentication and token requests took additional time to process, which extended the time needed to restore services.

Resolution: 

Cisco made a configuration change in Webex.

Prevention: 

While monitors were in place to detect the initial impact, additional monitors are being implemented to verify microservice performance and to optimize alerting, allowing for quicker root cause analysis.

Engineering has also added incident management protocols designed to respond faster to changes in microservice health and to ensure optimized service redirection.

STATISTICS
Service Team: 
Enterprise Collaboration