Data durability and recoverability are major pillars of AWS high availability. Mission-critical systems require “five nines” of uptime, with strong guarantees on the durability of data storage. These systems must recover transparently when outages occur and do so without data loss, that is, while maintaining an RPO (recovery point objective) of 0. Large organizations that make use of a variety of storage formats, such as file shares, iSCSI disks and others, have considerably more data to protect and therefore face an even greater challenge in preventing data loss.
Guaranteeing that data is not lost during a system outage requires resilience against hardware, network and even site failure. Cloud platforms are no different from on-premises systems in this regard and must ensure adequate protection at all levels in the stack. The cloud, however, offers sophisticated services that make creating highly available systems less burdensome, although choosing the right services and making them work together still requires expert knowledge.
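To put “five nines” in concrete terms, the short calculation below converts an availability percentage into a yearly downtime budget. It is only an arithmetic sketch: the function name is illustrative, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Yearly downtime budget, in minutes, for a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for label, pct in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
    print(f"{label} ({pct}%): {allowed_downtime_minutes(pct):.2f} minutes/year")
```

At 99.999% availability the budget works out to a little over five minutes of downtime per year, which leaves no room for manual recovery steps.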
In this article, we’ll first look at how AWS high availability services can be used to build highly-available systems in the cloud to reach RPO=0. We’ll then compare this to NetApp’s Cloud Volumes ONTAP HA, which provides a ready-made solution for robust data storage in the cloud.
RPO=0 is essential to successful business continuity of critical workloads. Creating a highly available data platform in AWS to reach RPO=0 means eliminating single points of failure, as they relate to compute, network and storage. In this section we will look at each of these aspects in turn and discuss the issues involved with AWS high availability.
To provide site redundancy and protect against localized outages, AWS splits each region into multiple Availability Zones. Each Availability Zone is an isolated physical location backed by one or more data centers, and no two Availability Zones share the same premises. Ensuring high availability across zones provides a much greater level of protection against failure, as the data center itself is removed as a single point of failure.
Monitoring is paramount when it comes to high availability, and for this AWS provides Amazon CloudWatch, which can be used to monitor most of the services mentioned below.
Amazon Elastic Compute Cloud (Amazon EC2) compute resources are made highly available by running multiple instances at the same time, which collectively process incoming requests or share distributed workloads. A group of Amazon EC2 instances can be managed with Amazon EC2 Auto Scaling.
Auto Scaling monitors the health of Amazon EC2 instances and self-heals the group by launching a new instance when an existing one fails, with features that help cater for stateful applications. With dynamic scaling, policies based on metrics such as aggregate CPU load can trigger the creation of new instances. New instances are created from Amazon Machine Images, which can be user-customized.
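The self-healing and dynamic scaling behaviour described above can be illustrated with a small model. This is a simplified sketch and not the algorithm AWS uses: the `Instance` class, the `reconcile` function, the proportional sizing rule and the default thresholds are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Instance:
    instance_id: str      # hypothetical identifier
    healthy: bool         # result of the latest health check
    cpu_percent: float    # latest CPU utilisation sample


def reconcile(instances: List[Instance], target_cpu: float = 50.0,
              min_size: int = 2, max_size: int = 10) -> Tuple[List[str], int]:
    """Return (ids of instances to replace, desired group size)."""
    healthy = [i for i in instances if i.healthy]
    to_replace = [i.instance_id for i in instances if not i.healthy]
    if not healthy:
        # Nothing is serving traffic: fall back to the minimum group size.
        return to_replace, min_size
    avg_cpu = sum(i.cpu_percent for i in healthy) / len(healthy)
    # Size the group so that average CPU would land near the target.
    desired = math.ceil(len(healthy) * avg_cpu / target_cpu)
    return to_replace, max(min_size, min(max_size, desired))


fleet = [
    Instance("i-a", True, 80.0),
    Instance("i-b", True, 90.0),
    Instance("i-c", False, 0.0),
]
print(reconcile(fleet))
```

In this example the failed instance is flagged for replacement and the group is asked to grow, because the two healthy instances are running well above the assumed 50% CPU target.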
Amazon ECS offers an alternative approach when Docker containers are being used to package up an application. Amazon ECS is compatible with a wide range of other AWS services and can be used to scale an application for both performance and high availability.
One of the most important facets of highly available systems is management of connectivity. In the event of a failure, clients need to be able to find any currently active nodes in order to re-establish communication and data operations. Amazon Elastic Load Balancing is an Amazon EC2 load balancer that handles this by distributing incoming requests to a group of Amazon EC2 instances.
Requests can be distributed to an auto-scaled Amazon EC2 instance group, which can even be spread across multiple availability zones in order to provide site resilience. As Amazon Elastic Load Balancers exist outside of any particular Availability Zone, they are themselves insulated from Availability Zone failure.
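As a rough illustration of how requests keep flowing when a whole Availability Zone is lost, the sketch below models a round-robin balancer that skips unhealthy targets. It is a toy model, not how Amazon Elastic Load Balancing is implemented; the class, the instance IDs and the zone names are hypothetical.

```python
from itertools import cycle
from typing import Dict


class RoundRobinBalancer:
    def __init__(self, targets: Dict[str, str]):
        # targets maps an instance id to the Availability Zone it runs in
        self.targets = dict(targets)
        self.healthy = set(self.targets)
        self._ring = cycle(sorted(self.targets))

    def mark_zone_down(self, zone: str) -> None:
        """Treat every target in the given zone as unhealthy."""
        self.healthy -= {t for t, z in self.targets.items() if z == zone}

    def route(self) -> str:
        """Return the next healthy target, skipping unhealthy ones."""
        for _ in range(len(self.targets)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy targets available")


lb = RoundRobinBalancer({"i-a1": "az-a", "i-a2": "az-a", "i-b1": "az-b"})
print([lb.route() for _ in range(3)])  # requests spread over both zones
lb.mark_zone_down("az-a")
print([lb.route() for _ in range(3)])  # only the surviving zone is used
```

Because the group spans two zones, losing one zone reduces capacity but does not interrupt service.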
In addition to this, Amazon Route 53 provides highly available DNS services that can perform health checks on its targets and fail over automatically in either active-active or active-passive configurations.
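The difference between the two failover styles can be shown with a minimal sketch. The functions and IP addresses below are hypothetical and do not use the Route 53 API; returning an empty answer when nothing is healthy is a simplification of this sketch, not a statement about how Route 53 behaves in that case.

```python
from typing import Callable, List


def resolve_active_passive(primary: str, secondary: str,
                           is_healthy: Callable[[str], bool]) -> List[str]:
    """Answer with the primary while it is healthy, otherwise the secondary."""
    if is_healthy(primary):
        return [primary]
    if is_healthy(secondary):
        return [secondary]
    return []  # simplification: no healthy endpoint to hand out


def resolve_active_active(endpoints: List[str],
                          is_healthy: Callable[[str], bool]) -> List[str]:
    """Answer with every endpoint that currently passes its health check."""
    return [e for e in endpoints if is_healthy(e)]


# Hypothetical health-check results: the primary has failed.
health = {"10.0.1.10": False, "10.0.2.10": True}
print(resolve_active_passive("10.0.1.10", "10.0.2.10", health.get))
print(resolve_active_active(["10.0.1.10", "10.0.2.10"], health.get))
```

In both modes, clients are steered away from the failed endpoint once its health check fails.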
Amazon EBS provides block-level storage that can be attached to Amazon EC2 instances. Although Amazon EBS storage is internally replicated to a number of servers within an Availability Zone, there is no redundancy outside of that zone. To protect data beyond a single zone, you would need to use Amazon EBS snapshots to create copies of the data in Amazon S3, which provides data durability across Availability Zones. In the event of an Availability Zone outage, however, any data written since the most recent snapshot would be lost, so this approach cannot deliver RPO=0.
Amazon EFS can be used to provide redundancy across Availability Zones, as well as providing the capability to mount the same filesystem concurrently from multiple Amazon EC2 instances. Amazon EFS achieves this by exposing data over NFS v4.1. It should be noted that Amazon EFS is currently offered in only a limited number of AZs.
Amazon S3 provides cost-effective cloud storage with high levels of data protection. However, performance considerations may make it unsuitable for some use cases.
As we have seen so far, there are a number of moving parts to consider when creating a highly available system in the cloud. This is by no means a simple venture, and although the AWS high availability services exist to create such a platform, putting these building blocks together correctly requires a high degree of technical ability.
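The gap left by snapshot-based protection is easy to quantify. The sketch below computes how much recent data would be lost if an outage struck between snapshots; the function name and the hourly schedule are assumptions for illustration, and each snapshot is treated as an instantaneous point-in-time copy.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def data_loss_window(snapshot_times: List[datetime],
                     outage_time: datetime) -> Optional[timedelta]:
    """Time between the last snapshot taken before the outage and the outage."""
    earlier = [t for t in snapshot_times if t <= outage_time]
    if not earlier:
        return None  # no snapshot exists yet, so nothing can be restored
    return outage_time - max(earlier)


# Hypothetical schedule: hourly snapshots starting at midnight.
start = datetime(2024, 1, 1, 0, 0)
snapshots = [start + timedelta(hours=h) for h in range(4)]
outage = datetime(2024, 1, 1, 2, 45)
print(data_loss_window(snapshots, outage))
```

With hourly snapshots, an outage at 02:45 loses 45 minutes of writes, and the worst case approaches a full hour. Shorter intervals shrink the window but never bring it to zero.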
NetApp’s Cloud Volumes ONTAP HA simplifies this whole endeavor by providing a flexible and highly available enterprise data storage system that is ready-to-use for AWS high availability. You can very quickly create a single platform for hosting all of your data that also provides redundancy across Availability Zones.
Just as with Cloud Volumes ONTAP, Cloud Volumes ONTAP HA uses Amazon EC2 to provide the underlying compute and Amazon EBS for storage resources. However, Cloud Volumes ONTAP HA goes further by replicating the environment to a secondary location, usually in another Availability Zone.
After setting up Cloud Volumes ONTAP HA, all data volumes are synchronously mirrored, with write operations completing only after the information has been written to each Cloud Volumes ONTAP node. This ensures there is no data loss if an outage occurs. The nodes can be set up to work in either an active-active configuration, where data can be written to either node, or an active-passive configuration, where the passive node serves out reads.
If one of the nodes goes down, the other node can continue to serve all data requests from its own independent copy of the storage, which is how to reach RPO=0. For NFS and CIFS file shares, floating IP addresses are used to re-point clients to the currently active node.
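The principle behind synchronous mirroring can be sketched in a few lines. This is a toy model, not NetApp's implementation: the `Node` and `MirroredPair` classes are hypothetical, and the sketch leaves out the mediator, floating IP addresses and resynchronization after a failed node returns.

```python
class Node:
    def __init__(self, name: str):
        self.name = name
        self.online = True
        self.blocks = {}  # this node's independent copy of the data

    def write(self, key: str, value: str) -> None:
        if not self.online:
            raise ConnectionError(f"{self.name} is offline")
        self.blocks[key] = value


class MirroredPair:
    def __init__(self, first: Node, second: Node):
        self.nodes = [first, second]

    def write(self, key: str, value: str) -> bool:
        """Acknowledge only after every online node holds the data."""
        online = [n for n in self.nodes if n.online]
        if not online:
            raise RuntimeError("no node available")
        for node in online:
            node.write(key, value)
        return True  # the client sees success only at this point

    def read(self, key: str) -> str:
        """Serve the read from the first node that is still online."""
        for node in self.nodes:
            if node.online:
                return node.blocks[key]
        raise RuntimeError("no node available")


pair = MirroredPair(Node("node-1"), Node("node-2"))
pair.write("block-42", "payload")
pair.nodes[0].online = False   # simulate losing one node
print(pair.read("block-42"))   # the surviving node still has the write
```

Because the write was acknowledged only after both copies existed, the surviving node can serve it immediately, which is the property that makes RPO=0 possible.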
In addition to this, a mediator instance is used to ensure communication between the Cloud Volumes ONTAP nodes and to assist in storage failover and failback. While the Cloud Volumes ONTAP nodes make use of more powerful Amazon EC2 instance types, the mediator node uses only a t2.micro. Setup of all three nodes is managed by the wizard interface in NetApp Cloud Manager, which is used to control all aspects of Cloud Volumes ONTAP deployment and management.
As well as providing high availability for your data, Cloud Volumes ONTAP provides a comprehensive feature set for management of your cloud storage resources. Space efficiency features, such as data compression and deduplication to mention a few, greatly reduce cloud storage footprint, and therefore costs.
Other features include data replication, storage cloning, storage tiering and many more. These are important features that will play a part in your backup and disaster recovery solution and business continuity plan.
As we’ve seen in this article, creating a highly available platform that can reach RPO=0 and protect against any data loss is a complex task. Even though the cloud services for creating such a system are readily available in the AWS cloud, putting them together in order to fully protect data storage requires a high degree of skill.
Using Cloud Volumes ONTAP HA, you can achieve high availability for your data out-of-the-box, reach RPO=0 and gain the significant advantages of NetApp’s enterprise platform for the management of your cloud storage.