In this article, we’ll discover components of Amazon Web Service’s massive global infrastructure: Regions, Availability Zones, Services and the Amazon control plane. This architecture was created to provide high availability and resiliency for customer workloads. For an introduction to high availability concepts and how Amazon provides high availability for specific workloads, see our article AWS High Availability.
In this post, we’ll examine AWS high availability architecture, and show how NetApp Cloud Volumes ONTAP can add value to this architecture for enterprise workloads.
In this page:
When setting up robust production systems, minimizing downtime and service interruptions is often a high priority. Regardless of how reliable your systems and software are, problems can occur that can bring down your applications or your servers.
Implementing high availability for your infrastructure is a useful strategy to reduce the impact of these types of events. Highly available systems can recover from server or component failure automatically. This is because failures to a single element don’t affect the functioning of the whole infrastructure: if a virtual node goes down, virtual machines failover to other nodes, assuring the continuity of service without interruptions.
Amazon Web Services built a massive global infrastructure to provide high availability and resiliency for customer workloads. Let’s examine the key components of this architecture: Regions, Availability Zones, Services and the Amazon control plane.
Source: Amazon Web Services
Amazon provides cloud services in 21 geographical regions (see the Amazon region map). The concept of a “Region” in AWS is different from the same concept used by other cloud providers (see our article on Azure High Availability). Amazon defines a region as a geographical area in which there are at least three separate data centers, called Availability Zones (AZs).
An AWS Availability Zone is a complete, localized infrastructure with redundant power, networking and connectivity. Amazon currently operates 66 availability zones globally. Each AZ encompasses multiple data centers, typically 3, in the same location. In each region, AZs are separated by a “meaningful distance”, but no more than 100 km, to allow for fast connectivity between them.
Amazon AZs are interconnected with high-bandwidth, low-latency networking, over a dedicated metro fiber cable. Network performance is fast enough to perform synchronous replication between AZs, meaning machine instances or other Amazon services in one AZ can be replicated in real time to another AZ without affecting read or write performance.
By deploying across several AZs, you can protect your applications from natural disasters such as tornadoes and earthquakes. By deploying across regions, you can achieve even higher resilience against disasters that can affect a broader area, such as a war. Amazon offers private, high-speed networking across regions to provide an additional layer of business continuity.
Amazon’s policy is to deliver Amazon services, features and instance types across all AWS regions, although full deployment of a new service can take up to 12 months. Some Amazon services are delivered globally and are not deployed in specific regions or AZs, such as Route 53, Amazon WorkMail and Amazon WorkLink.
The Amazon control plane, including APIs, the AWS Management Console and CLI, is distributed across AWS regions with a multi-AZ architecture, ensuring their continuous availability. This means that organizations that rely on Amazon capabilities in their applications are never dependent on a single data center. It also allows Amazon to conduct maintenance without making critical services unavailable to customers.
Compliance standards and regulations often mandate that an organization’s data is stored in a specific geographical location (this is called “data sovereignty”). Amazon provides full control over the AWS region in which data is stored, allowing you to comply with data sovereignty requirements.
Amazon’s global high availability architecture is massive, robust and in many cases vastly superior to high availability measures most organizations could deploy on-premises. Nevertheless, even a massively redundant system like Amazon Web Services experiences failures. Learning about these failures is instructive, as it will help you understand the weak points of Amazon’s architecture and what you can expect with regards to the actual reliability and resilience of Amazon services.
A timeline of all reported AWS outages between 2014-2018
Key takeaways from the reported outages
Here are a few best practices you can follow to improve high availability in the Amazon ecosystem:
If your AWS infrastructure powers business critical applications, NetApp Cloud Volumes ONTAP, the leading enterprise-grade storage management solution, provides high availability to ensure business continuity with no data loss (RPO=0) and minimal recovery time (RTO < 60 secs). Add that to data protection with NetApp Snapshot™ technology, which doesn’t require additional storage or impact application performance, and your enterprise is guaranteed to remain in operation and able to recover from any unplanned downtime or disaster event.