High availability is a cornerstone of IT architecture, ensuring that enterprise workloads are protected from both planned and unplanned outages. The two leading cloud service providers, AWS and Azure, offer multiple configuration options and services for building highly available, resilient architectures that keep the lights on for your line-of-business applications.
But how can you select the solution and services that best fit your use case? To help you achieve high availability from the ground up, this blog will introduce you to the architecture considerations, best practices, and benefits of deploying highly available applications in Azure and AWS, both of which can be enhanced by NetApp® Cloud Volumes ONTAP.
There are some crucial differences between the ways that Azure and AWS achieve what they consider to be high availability. Let’s take a look at the different systems used by the major cloud providers for added resiliency.
For AWS high availability, the simplest architecture is to have multiple Amazon EC2 instances deployed behind the AWS Elastic Load Balancing service. The instances are deployed across AWS locations, which are defined as Regions and Availability Zones (AZs). While an AWS Region denotes a geographical area, such as a country or a group of states, Availability Zones are isolated locations within a region.
Amazon EC2 instances should be deployed across multiple Availability Zones so that they are resilient to the failure of any single zone. If an AZ becomes unhealthy, Elastic Load Balancing can be configured to route traffic to instances in an alternate Availability Zone.
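The failover behavior described above can be sketched as a toy routing function. This is an illustration of the concept, not the AWS API; the AZ names and instance IDs are hypothetical.

```python
# Toy illustration (not the actual AWS API) of AZ-aware routing: a load
# balancer only sends traffic to instances in Availability Zones that are
# currently passing health checks.

def route_request(targets_by_az, healthy_azs):
    """Return the instances eligible to receive traffic: only those
    registered in Availability Zones that are currently healthy."""
    return [
        instance
        for az, instances in sorted(targets_by_az.items())
        if az in healthy_azs
        for instance in instances
    ]

targets = {
    "us-east-1a": ["i-0aaa", "i-0bbb"],
    "us-east-1b": ["i-0ccc"],
}

# Both AZs healthy: traffic can reach all three instances.
print(route_request(targets, {"us-east-1a", "us-east-1b"}))

# us-east-1a fails its health checks: traffic flows only to us-east-1b.
print(route_request(targets, {"us-east-1b"}))  # ['i-0ccc']
```

In the real service, Elastic Load Balancing performs the health checks and target selection for you; the point here is simply that surviving zones keep serving traffic.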
For storage resiliency, Amazon Elastic Block Store (Amazon EBS), which EC2 instances use for persistent data storage, is automatically replicated within its Availability Zone. Amazon EBS is designed for 99.999% availability, with an annual failure rate (AFR) as low as 0.1-0.2%. For additional data resiliency, it is recommended to configure point-in-time backups of volumes through Amazon EBS snapshots, which are stored in Amazon S3.
For Azure high availability, it's recommended to deploy virtual machines (VMs) in Azure availability sets or availability zones. With availability sets, multiple VMs are placed in separate fault domains and update domains within a single data center. Separate fault domains protect instances from a single point of failure in power, network connectivity, and so on, while separate update domains ensure that at least one instance remains available when another is brought down for planned maintenance.
Azure availability zones are distinct locations within an Azure region, each made up of one or more data centers with independent power, networking, and cooling. It is recommended to place VMs in multiple availability zones to protect against data center failures within a region. Doing so also offers a better SLA (99.99%) than placing VMs in availability sets (99.95%).
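The difference between those two SLA figures is easier to appreciate as allowed downtime per year, which a few lines of arithmetic make concrete:

```python
# What the SLA difference means in practice: the maximum downtime per year
# permitted at 99.95% (availability sets) vs. 99.99% (availability zones).

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def max_downtime_minutes(sla_percent):
    """Annual downtime budget, in minutes, for a given availability SLA."""
    return MINUTES_PER_YEAR * (1 - sla_percent / 100)

print(round(max_downtime_minutes(99.95)))  # 263 minutes (~4.4 hours) per year
print(round(max_downtime_minutes(99.99)))  # 53 minutes per year
```

In other words, moving from availability sets to availability zones cuts the permitted annual downtime roughly fivefold.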
Azure load balancers and application gateways can be used to redirect traffic to an available instance of the application. Azure Load Balancer is available in basic and standard SKUs. While the basic SKU supports only availability sets, the standard SKU supports both availability sets and availability zones. The standard SKU is recommended when designing for HA, since it enables cross-zone load balancing of your application.
To protect Amazon EC2 instances from transient failures in the underlying hardware, customers can also leverage placement groups in AWS. AWS partition placement groups deploy instances into logical groups, or partitions. Each partition has its own isolated racks, and the partitions in a single partition placement group can span multiple Availability Zones in a region. This is best suited for distributed computing applications like HBase, Cassandra, etc. With spread placement groups, each instance is placed in a separate rack, spanning multiple Availability Zones. This HA architecture suits applications made up of a small number of critical instances, which makes it ideal to deploy them across multiple racks for high availability.
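The blast-radius benefit of partition placement groups can be illustrated with a small sketch. This is not the AWS API; it is a toy model of how spreading instances over isolated partitions limits the impact of losing one rack.

```python
# Toy sketch (not the AWS API) of partition placement: instances are spread
# round-robin over isolated partitions/racks, so a single rack failure takes
# out at most one partition's worth of instances.

def assign_partitions(instances, partition_count):
    """Distribute instances round-robin across partitions, as a partition
    placement group spreads them across isolated racks."""
    partitions = {p: [] for p in range(partition_count)}
    for i, instance in enumerate(instances):
        partitions[i % partition_count].append(instance)
    return partitions

layout = assign_partitions(["node-1", "node-2", "node-3", "node-4"], partition_count=2)
print(layout)  # {0: ['node-1', 'node-3'], 1: ['node-2', 'node-4']}

# If the rack behind partition 0 fails, partition 1 is untouched.
print(layout[1])  # ['node-2', 'node-4']
```

A distributed datastore such as Cassandra can map these partitions to its own replica placement so that no replica set lives entirely on one rack.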
It should be noted that because these configurations are limited to Availability Zones in a single region, they don't provide cross-region high availability. To get that level of multi-region availability, additional configuration is needed for each component. For example, EBS snapshots of your data stored in Amazon S3 can be copied over to a different AWS region; the copies can then be used to deploy the application in the target region if the primary region becomes unavailable. EC2 AMIs (Amazon Machine Images) can also be copied to a different region to deploy the application, and the EBS volumes created from the snapshots can then be attached to the new instances. Amazon Route 53, the AWS cloud DNS web service, can be used to redirect incoming traffic to available instances of the application deployed across multiple AWS regions, protecting against the failure of services in a single region.
Azure Site Recovery can be used to deploy an active-passive architecture in which virtual machines in one Azure region are replicated in real time to storage in a paired Azure region. The replicated data can then be used to spin up instances in the target region should the source region become unavailable.
For active-active Azure high availability, it is important to ensure that data is synchronized and available in both regions. For example, a database-level high availability configuration such as SQL AlwaysOn can be set up with SQL instances in two regions. Another option is to use PaaS services such as Azure SQL that offer active geo-replication, which creates a read-only copy of the data in the secondary region. Such an architecture would also need the Azure Traffic Manager service, which uses DNS-based load balancing to direct traffic to application instances in the available Azure region.
AWS offers multiple storage classes with different resiliency levels for its Amazon S3 cloud storage. The S3 Standard and S3 Standard-Infrequent Access storage classes store data in a minimum of three AZs in a single region, so the data remains available even if an entire Availability Zone is lost. The Amazon S3 One Zone-Infrequent Access storage class, on the other hand, stores data in a single AZ and hence is not protected against AZ failures. It is intended as a low-cost option for infrequently accessed data that does not need the resiliency of the other storage classes.
Amazon S3 also supports cross-region replication, where automated, asynchronous copies of the data in an S3 bucket are maintained in a different AWS Region. Resiliency is also built in to other AWS services that manage data. For example, when you deploy a Multi-AZ DB instance, Amazon RDS provisions a primary and a secondary instance with synchronous data replication between them.
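As a rough sketch of what enabling cross-region replication involves, the configuration below is built in the shape that boto3's `put_bucket_replication` call accepts. The bucket names and IAM role ARN are placeholders, and the source bucket must have versioning enabled; check the current AWS API reference before relying on the exact schema.

```python
# A minimal S3 cross-region replication configuration, shaped for boto3's
# s3.put_bucket_replication. All names/ARNs below are hypothetical examples.

replication_config = {
    # IAM role that grants S3 permission to replicate objects (placeholder ARN)
    "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
    "Rules": [
        {
            "ID": "replicate-to-eu-west-1",
            "Status": "Enabled",
            "Prefix": "",  # empty prefix: replicate every object in the bucket
            "Destination": {
                # Destination bucket in the other region (placeholder ARN)
                "Bucket": "arn:aws:s3:::my-dr-bucket-eu-west-1",
            },
        }
    ],
}

# With credentials configured, this would be applied roughly as:
# import boto3
# boto3.client("s3").put_bucket_replication(
#     Bucket="my-source-bucket", ReplicationConfiguration=replication_config
# )
print(replication_config["Rules"][0]["Status"])  # Enabled
```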
Locally-redundant storage (LRS) is the default resiliency level offered by Azure Storage. It ensures that three synchronously replicated copies of the data are kept within a single data center in the region. However, this protects the data only from node-level failures, not from the loss of the data center itself. For additional resiliency, zone-redundant storage (ZRS) can be used, which replicates data across three storage clusters in separate availability zones within an Azure region. Thus, data is protected from node- as well as zone-level failures.
For resiliency from node-, zone-, and region-level failures, geo-redundant storage (GRS) can be used, which replicates data asynchronously to a paired Azure region in addition to the primary region. Geo-zone-redundant storage (GZRS), currently in preview at the time of this writing, combines the benefits of ZRS and GRS: data is replicated across three availability zones in the primary region and also to a secondary paired geographical region.
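The redundancy options above differ chiefly in how many copies of the data are kept and where; summarizing them as data makes the comparison explicit. Copy counts follow Azure's documented behavior (three copies in the primary region, plus three more in the paired region for the geo-redundant options).

```python
# Copy counts for the Azure Storage redundancy options discussed above.

COPIES = {
    "LRS":  {"primary": 3, "secondary": 0},  # 3 copies, one data center
    "ZRS":  {"primary": 3, "secondary": 0},  # 3 copies across zones, one region
    "GRS":  {"primary": 3, "secondary": 3},  # LRS in primary + async paired-region copy
    "GZRS": {"primary": 3, "secondary": 3},  # ZRS in primary + async paired-region copy
}

def total_copies(option):
    """Total number of copies Azure keeps for a redundancy option."""
    c = COPIES[option]
    return c["primary"] + c["secondary"]

print(total_copies("LRS"))   # 3
print(total_copies("GZRS"))  # 6
```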
For virtual machines in Azure, it is recommended to use managed disks, as they are compatible with availability sets as well as availability zones. Disks associated with VMs in availability sets or availability zones are deployed in isolated storage scale units to avoid a single point of failure.
Additional resiliency through a high availability architecture in the cloud often involves extra cost. For that reason, it's prudent to choose the right configuration for each resource to strike a balance between cost optimization and high availability.
For example, Amazon S3 One Zone-Infrequent Access offers cheaper storage ($0.01 per GB) when compared to S3 Standard ($0.023 per GB) or S3 Standard-IA ($0.0125 per GB), but those savings come with a significant risk: a single AZ failure can destroy the only copy of the data. It should be reserved for expendable data, such as secondary backup copies.
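Using the per-GB prices quoted above (check current AWS pricing before relying on them), the monthly cost difference at 1 TB works out as follows:

```python
# Monthly storage cost comparison at the per-GB prices quoted in the text.
# Prices are illustrative; consult current AWS pricing pages.

PRICE_PER_GB = {
    "S3 Standard":     0.023,
    "S3 Standard-IA":  0.0125,
    "S3 One Zone-IA":  0.01,
}

def monthly_cost(storage_class, gib):
    """Storage-only monthly cost (ignores request and retrieval charges)."""
    return PRICE_PER_GB[storage_class] * gib

for cls in PRICE_PER_GB:
    print(f"{cls}: ${monthly_cost(cls, 1024):.2f}/month for 1 TB")

# One Zone-IA saves roughly 57% versus S3 Standard, but a single AZ loss
# can destroy the only copy of the data.
```

Note that the IA classes also add per-GB retrieval charges, so for frequently accessed data the headline storage savings can evaporate.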
Azure Blob storage accounts can be configured with Hot, Cool, and Archive access tiers, with the Hot tier having higher storage charges but lower access charges compared to the Cool and Archive tiers. These tiers are available with the LRS, GRS, ZRS, and GZRS replication options for the general-purpose v2 account type, and with LRS, GRS, and RA-GRS for the Blob storage account type. LRS costs the least, with charges increasing as the resiliency level increases. Based on data usage patterns, the optimal storage tier and replication option should be chosen for the architecture.
The data tier is the most critical part of any application architecture, and ensuring its high availability is of paramount importance. AWS and Azure offer many options for storage resiliency, but a number of them can be challenging to configure without sacrificing data protection or driving up costs. NetApp has an alternative.
NetApp Cloud Volumes ONTAP offers an enterprise-class data management solution for AWS, Azure, and Google Cloud, providing file- and block-level storage services for your workloads. It delivers advanced data management features such as data protection with NetApp Snapshot™ technology, cloning, storage efficiencies that lower cloud data storage costs, and SnapMirror® data replication, as well as the Cloud Volumes ONTAP HA configuration, which uses a dual-node architecture to ensure AWS and Azure high availability.
Cloud Volumes ONTAP HA mirrors data synchronously between two nodes to ensure enterprise reliability and nondisruptive operations in case of failures in your cloud environment. Storage is shared between the nodes, and an automated storage takeover/giveback process together with multiple network access paths enables high availability of data. Very stringent SLA requirements of RPO=0 and RTO<60 seconds can be achieved using this technology.