Slack released a root cause analysis report to the media this week, detailing how AWS problems set off a domino effect that left the service inaccessible. Slack relies entirely on AWS for its cloud hosting. An overtaxed AWS Transit Gateway took down the Slack messaging service.

Slack declined to discuss the problems related to the AWS Transit Gateway. However, a source familiar with the matter confirmed that the gateway failed to scale up fast enough to handle the incoming traffic.

Customers experienced occasional errors as soon as the trouble began. By 10 a.m. EST, the service was unusable for all subscribers.

The gateway problem contributed to packet loss between servers within the AWS network, which worsened over time. That led to an increase in error rates from Slack's back-end servers. Slack's IT team did not discover the escalating problem until almost an hour after it started.

At the same time, Slack experienced network problems between its back-end servers, other service hosts and its database servers. The troubles resulted in the back-end servers handling too many high-latency requests. While those requests were only 1% of the incoming traffic, they used up about 40% of the back-end server time, putting the servers in an "unhealthy" state.

"Our load balancers entered an emergency routing mode where they routed traffic to healthy and unhealthy hosts alike," Slack said. "The network problems worsened, which significantly reduced the number of healthy servers."

The result was too few servers to meet Slack's capacity needs, which led to customers receiving error messages or being unable to load Slack. The network instability also prevented Slack engineers from accessing their observability platform, a type of network management system, which complicated the debugging process.

Amazon eventually aided Slack in fixing the problem. Amazon increased the network capacity and lifted the rate limit on its AWS Transit Gateway that had prevented Slack from provisioning new back-end servers to handle the traffic. To prevent such problems from happening again, Amazon increased its network traffic systems' capacity and moved Slack to a dedicated network.
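
To put the reported imbalance in perspective, where 1% of requests used about 40% of back-end server time, the short sketch below estimates how much more server time a high-latency request consumed on average than an ordinary one. It is only an illustration: it treats the 1% and 40% figures as exact shares, and the function name `relative_cost` is hypothetical rather than anything taken from Slack's report.

```python
def relative_cost(traffic_share: float, time_share: float) -> float:
    """Average server time per request in a group, relative to all other requests.

    traffic_share: fraction of all requests that belong to the group
    time_share:    fraction of total server time that the group consumes
    """
    if not (0 < traffic_share < 1 and 0 < time_share < 1):
        raise ValueError("shares must be strictly between 0 and 1")
    # Time consumed per unit of traffic, inside and outside the group.
    cost_in_group = time_share / traffic_share
    cost_outside = (1 - time_share) / (1 - traffic_share)
    return cost_in_group / cost_outside


# Figures quoted in the article: 1% of traffic, about 40% of server time.
ratio = relative_cost(traffic_share=0.01, time_share=0.40)
print(f"Each slow request cost roughly {ratio:.0f}x an ordinary one")
```

Under those assumptions, each high-latency request tied up a server for roughly 66 times as long as a typical request, which helps explain how such a small slice of traffic could push servers into an unhealthy state.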