A large set of users, including Mason users, were unable to connect to the Zoom web portal or join meetings from their clients.
To scale globally, Zoom has developed a geographically distributed platform for service delivery, housed across Zoom's data centers and other well-established cloud service providers. Zoom also operates a series of web servers segregated into distinct clusters.

During the incident, Zoom's web infrastructure experienced higher-than-normal traffic on its enterprise clusters. This sudden surge in web traffic triggered the automated failover mechanism. After the failover to the alternate region, a network glitch in the newly active region led to slower responses from the web servers, and the growing volume of incoming requests caused further performance degradation.

Failback in this situation is not automatic by design: to prevent the service from flapping between regions, which would further degrade the user experience, failback is gated by a cooling-off period. Because that period had not yet elapsed, Zoom engineers had to fail back manually to the original region.

When the web zone services had originally failed over, the database connection pools for the web servers in the original region were reduced to their minimum configuration. As the web zone services were rerouted back to the original region, only that minimum number of database connections was available to handle incoming requests. This caused delays in handling requests, and it strained the application and database servers as they ramped up to match the high volume of incoming traffic, which further delayed responses and left the servers unable to handle new requests, degrading the service.

A manual failover also resets the cooling-off timer, so automatic failover was triggered again in the original region, and the same scenario played out in the newly active region. This cascading effect of stuck processing threads, combined with the time required to ramp up database connections after each failover and failback, is what extended the incident.
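The pool behavior described above can be illustrated with a toy model: after a failover, the now-passive region's pool shrinks to its minimum size, so when traffic returns, each extra request must pay the cost of opening a new database connection on the request path. This is a minimal sketch; the class, sizes, and latency figure are illustrative assumptions, not Zoom's implementation.

```python
import queue
import time


class ConnectionPool:
    """Toy DB connection pool: shrinks to min_size when its region goes
    passive, then grows lazily (and slowly) when traffic returns.
    All names and numbers here are illustrative, not Zoom's code."""

    def __init__(self, min_size=2, max_size=50, connect_latency=0.05):
        self.min_size = min_size
        self.max_size = max_size
        self.connect_latency = connect_latency  # cost of one TCP/auth handshake
        self.idle = queue.SimpleQueue()
        self.total = 0
        for _ in range(min_size):
            self._open()

    def _open(self):
        time.sleep(self.connect_latency)  # simulate connection establishment
        self.total += 1
        self.idle.put(object())

    def shrink_to_min(self):
        # After failover, the passive region keeps only min_size connections.
        while self.total > self.min_size:
            self.idle.get()
            self.total -= 1

    def acquire(self):
        try:
            return self.idle.get_nowait()
        except queue.Empty:
            if self.total < self.max_size:
                self._open()  # ramp-up penalty is paid on the request path
                return self.idle.get()
            return self.idle.get()  # at max size: block until a release

    def release(self, conn):
        self.idle.put(conn)
```

In this model, a burst of requests arriving right after failback finds only `min_size` connections ready, and every request beyond that is delayed by `connect_latency`, which is the ramp-up strain the report describes.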
The Zoom DevOps team restarted the web servers in the original region and failed the cluster back over to them, and the servers were able to ramp up to the fresh incoming request load. They also restarted the web servers in the backup region to clear the stuck processing threads.
Zoom DevOps and Engineering teams will review the failover module code to identify potential failure modes and prevent such issues from recurring. Zoom DevOps has added more web server nodes to the pool to increase processing capacity. An operational procedure has been implemented to restart the servers in the original zone after a failover. Zoom Engineering will investigate tuning the database connection pool parameters for a quick ramp-up of the service and an optimal reserve capacity of database connections.
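One common way to reason about an "optimal reserve" of database connections is Little's law: the number of connections busy at once is roughly the arrival rate multiplied by the average query time, plus headroom so a failback never starts from a cold pool. The sketch below is a back-of-the-envelope estimate; the function name, traffic figures, and headroom fraction are our assumptions, not Zoom's actual parameters.

```python
import math


def pool_size_for(peak_rps, avg_query_seconds, headroom=0.25):
    """Little's law estimate of required DB connections.

    in-flight queries ~= arrival rate x average service time;
    headroom keeps a reserve so a region taking over traffic after a
    failback is not starved. All figures here are illustrative.
    """
    in_flight = peak_rps * avg_query_seconds
    return math.ceil(in_flight * (1 + headroom))


# e.g. 2,000 req/s at 20 ms per query -> 40 in flight -> 50 with 25% reserve
min_idle = pool_size_for(2000, 0.020)
```

Setting the pool's minimum (idle) size near this figure, rather than a bare minimum configuration, would shorten the ramp-up window the incident narrative describes.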