Amazon crashed part of the Internet last Tuesday, and has now explained why
Most of us know Amazon best for its e-commerce services, which enable us to easily order nearly anything off the internet these days, from food to clothes and furniture, with free shipping and just a few clicks on Amazon Prime. It’s exactly this business that made Jeff Bezos the (until recently) richest man in the world and continues to rake in the most cash; but Amazon does much, much more than retail.
And last Tuesday, a portion of the internet, together with Amazon.com itself, disappeared for a while when Amazon’s servers in Northern Virginia (home to the first AWS data center and still one of its biggest) experienced an unexpected crash. The downtime lasted about seven hours, starting at around 7:30 AM PST, with the network finally fully restored by 2:22 PM PST.
As it turns out, it was a very unusual crash, one that affected AWS’s own monitoring systems and, Amazon says, significantly delayed its response team’s ability to understand and diagnose the issue for the first few hours. Moreover, Amazon says that “the networking congestion impaired our Service Health Dashboard tooling from appropriately failing over to our standby region.”
Amazon says it is hard at work updating its systems to prevent its engineers (and, consequently, AWS customers) from being left in the dark should future technical issues or outages occur.
Apart from knocking significant portions of the internet offline, the Amazon outage also took down large-scale services such as Netflix, Disney+, and Ticketmaster, among others. In its post-event summary, Amazon describes how the incident unfolded:
At 7:30 AM PST, an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network. This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks.
These delays increased latency and errors for services communicating between these networks, resulting in even more connection attempts and retries. This led to persistent congestion and performance issues on the devices connecting the two networks.
This congestion immediately impacted the availability of real-time monitoring data for our internal operations teams, which impaired their ability to find the source of congestion and resolve it.
Operators instead relied on logs to understand what was happening and initially identified elevated internal DNS errors. Because internal DNS is foundational for all services and this traffic was believed to be contributing to the congestion, the teams focused on moving the internal DNS traffic away from the congested network paths. At 9:28 AM PST, the team completed this work and DNS resolution errors fully recovered. […]
This code path has been in production for many years but the automated scaling activity triggered a previously unobserved behavior. We are developing a fix for this issue and expect to deploy this change over the next two weeks. We have also deployed additional network configuration that protects potentially impacted networking devices even in the face of a similar congestion event. These remediations give us confidence that we will not see a recurrence of this issue.
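The retry amplification Amazon describes above, where errors trigger more connection attempts which in turn deepen the congestion, is a classic failure pattern. A common client-side mitigation is capped exponential backoff with jitter, so that struggling clients spread their retries out in time instead of hammering an already saturated link in lockstep. The sketch below is purely illustrative; the function and parameter names are ours, not AWS internals.

import random
import time

def call_with_backoff(request, max_attempts=5, base_delay=0.1, max_delay=5.0):
    # Retry a network call with capped exponential backoff and full jitter.
    # Clients that retry immediately on every error multiply the load on an
    # already congested link; spreading retries out in time lets it drain.
    for attempt in range(max_attempts):
        try:
            return request()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # Double the wait each attempt, cap it, and randomize the delay
            # ("full jitter") so many clients do not retry in synchronized waves.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))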
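Amazon has not published the details of the “additional network configuration” it mentions, but the general idea behind such protections is to shed excess connection attempts before they can overwhelm a device, for example with a token-bucket rate limiter. A minimal sketch of that technique, again illustrative rather than anything AWS has confirmed using, might look like this:

import time

class TokenBucket:
    # Minimal token-bucket limiter: admits at most `rate` new connections per
    # second on average, with short bursts up to `capacity`, and rejects the
    # rest instead of letting them queue up indefinitely.
    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # shed this connection attempt rather than letting it pile up

# Example: admit roughly 100 new connections per second, with bursts up to 200.
limiter = TokenBucket(rate=100, capacity=200)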