Don’t blame it on the technology.
In a blog post on its site, Amazon explained why Amazon Web Services suffered a four-hour outage on Tuesday, one that caused significant problems at numerous Internet sites, including those operated by retailers.
In the post, the company explained that one of its employees was debugging an issue with the billing system and accidentally took more servers offline than intended. The mistake set off a domino effect that took down two other server subsystems, and the failures cascaded from there.
Amazon said it is making several changes to ensure a similar error would not have such a large impact.
“While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly,” the company said. “We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future.”
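To make the two safeguards concrete, here is a minimal sketch of how such a check might look. It is not Amazon's tool; the names `Subsystem`, `plan_removal`, `min_required` and `max_per_step` are hypothetical, and the numbers are invented for illustration. The sketch rejects any request that would drop a subsystem below its minimum capacity, and otherwise splits the removal into small batches so capacity comes out slowly.

```python
from dataclasses import dataclass


class CapacityError(Exception):
    """Raised when a removal request would breach the capacity floor."""


@dataclass
class Subsystem:
    name: str
    active_servers: int
    min_required: int  # hypothetical minimum capacity level


def plan_removal(subsystem, requested, max_per_step=2):
    """Return the batch sizes for removing `requested` servers.

    The whole request is refused if it would leave the subsystem below
    its minimum; otherwise it is split into batches of at most
    `max_per_step` servers (an assumed rate limit).
    """
    if requested < 0 or max_per_step < 1:
        raise ValueError("requested must be >= 0 and max_per_step >= 1")

    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise CapacityError(
            f"{subsystem.name}: removing {requested} leaves {remaining}, "
            f"below the minimum of {subsystem.min_required}"
        )

    batches = []
    left = requested
    while left > 0:
        step = min(max_per_step, left)
        batches.append(step)
        left -= step
    return batches


if __name__ == "__main__":
    index = Subsystem(name="index", active_servers=10, min_required=6)
    print(plan_removal(index, 3))  # a small request is split into batches
    try:
        plan_removal(index, 5)  # a mistyped, too-large request is refused
    except CapacityError as err:
        print("refused:", err)
```

With a check like this, an incorrect input fails up front instead of taking servers offline, which is the behavior Amazon's statement describes.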
Click here to read Amazon’s full post.