Amazon Details Elastic Compute Cloud Outage -- Campus Technology

By Jeffrey Schwartz
05/02/11

Amazon released a postmortem Friday detailing the cause of last month's massive outage of its cloud services that left numerous customers incapacitated.

The company also apologized for the event, which left certain customer sites down for days and caused the permanent loss of some data. Amazon also promised credit to those affected.

Since the outage occurred, the company had been largely silent on the matter, other than to point to its Service Health Dashboard.

"We want to apologize," the company said in the postmortem report. "We know how critical our services are to our customers' businesses and we will do everything we can to learn from this event and use it to drive improvement across our services. As with any significant operational issue, we will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make changes to improve our services and processes."

The problem began April 21, when the company was performing a routine network upgrade to an "Availability Zone," or hub, at its Northern Virginia data center in an attempt to increase capacity. The upgrade was executed incorrectly.

"During the change, one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS [Elastic Block Store] network to allow the upgrade to happen," the company explained. "The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower-capacity redundant EBS network.

"For a portion of the EBS cluster in the affected Availability Zone, this meant that they did not have a functioning primary or secondary network because traffic was purposely shifted away from the primary network and the secondary network couldn't handle the traffic level it was receiving. As a result, many EBS nodes in the affected Availability Zone were completely isolated from other EBS nodes in its cluster. Unlike a normal network interruption, this change disconnected both the primary and secondary network simultaneously, leaving the affected nodes completely isolated from one another."
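
The failure Amazon describes can be illustrated with a minimal sketch. It is only a toy model, not Amazon's tooling: the `Path` class, the `nodes_isolated` function, and every capacity and load figure below are hypothetical values chosen for illustration. The sketch assumes the router being upgraded has already been drained, so the affected nodes depend entirely on whichever path receives the shifted traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    """A network path that can receive shifted traffic (hypothetical model)."""
    name: str
    capacity: int  # arbitrary traffic units, made up for illustration


def nodes_isolated(load: int, target: Path) -> bool:
    """Return True when the path receiving the shifted traffic cannot carry it.

    The primary router under upgrade is assumed to be drained already, so an
    overloaded target leaves the affected nodes without a working network.
    """
    return load > target.capacity


# Hypothetical figures: the redundant network has far less capacity.
OTHER_PRIMARY_ROUTER = Path("other router on the primary network", 100)
REDUNDANT_NETWORK = Path("lower-capacity redundant EBS network", 20)
SHIFTED_LOAD = 80

# Intended change: the second primary router absorbs the traffic.
intended = nodes_isolated(SHIFTED_LOAD, OTHER_PRIMARY_ROUTER)

# Actual change: the traffic lands on the redundant network and overloads it.
actual = nodes_isolated(SHIFTED_LOAD, REDUNDANT_NETWORK)

print(f"Intended shift isolates nodes: {intended}")
print(f"Actual shift isolates nodes: {actual}")
```

With these made-up numbers, the intended shift keeps the nodes connected, while the mistaken shift overwhelms the redundant network and cuts them off.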

The company said it is taking steps to make sure such an event doesn't recur. "We will audit our change process and increase the automation to prevent this mistake from happening in the future. However, we focus on building software and services to survive failures. Much of the work that will come out of this event will be to further protect the EBS service in the face of a similar failure in the future."

Customers affected by the outage will automatically receive 10-day credits equal to 100 percent of their usage of EBS volumes, Elastic Compute Cloud (EC2) instances and Relational Database Service (RDS) database instances that were running in the affected Availability Zone, Amazon said. While affected customers will welcome the credits, in some cases the credits may not cover the business lost because of the outage.

The company also indicated it will improve its communications in the future: "We would like our communications to be more frequent and contain more information. We understand that during an outage, customers want to know as many details as possible about what's going on, how long it will take to fix, and what we are doing so that it doesn't happen again," the company said in its report.

Amazon Web Services' complete summary of the outage can be accessed here.

About the Author

Jeffrey Schwartz is executive editor, features, for Redmond Developer News. You can contact him at [email protected].
