Disaster recovery (DR) is how companies respond to and recover from damaging incidents that can take entire operations offline. A disaster can be anything that impacts business continuity or finances, including hardware and software failures, network or power outages, human error, malicious attacks, ransomware and other security threats, and natural disasters that can physically damage on-premises infrastructure, such as floods or earthquakes. For many of these threats, the cloud can offer big advantages.
How does disaster recovery work for users who choose AWS as their public cloud provider? In this article we will explain how AWS users can prepare a plan for disaster recovery and take a look at the tools that are available to make these plans more secure and efficient, including NetApp’s Cloud Volumes ONTAP.
There are a number of advantages to having a disaster recovery plan ready when you use AWS. These include:
Minimized downtime: In case of a disaster, a disaster recovery plan enables users to quickly resume running business-critical applications.
Protecting critical data: A DR plan establishes proper replication intervals to make sure that little to no data is lost in a disaster.
Maintaining business reputation and compliance: Long downtime periods can upset users who rely on or pay for a service. Regulations may also require that services remain available without interruption. By having a disaster recovery plan and solution in place, companies can limit downtime and reduce the impact on customers and the business’ reputation and liability.
When it comes to AWS, which doesn’t offer a native solution for disaster recovery, users need to be prepared with a plan and a way to ensure their sites and services remain available. There are a number of ways a DR plan can fail, so make sure to follow the proper steps when crafting your plan.
Create a disaster recovery management contingency statement: This document will outline the specific rules that the company will follow in order to develop and implement a DR plan. Every part of the company’s IT infrastructure should be considered in the plan, and every role in the company should have its responsibilities defined in this statement.
Conduct a Business Impact Analysis: Companies can identify the effects that a disaster will have on their operations by conducting a Business Impact Analysis (BIA). With BIA you can identify which applications and components are the most critical to protect and recover in a disaster, and how downtime will affect business operations and growth.
DR Plan Testing and Team Readiness: If the plan doesn’t work when a failure event occurs, the company is going to suffer. Once the plan is in place, it will be crucial to test it to make sure that every component and player reacts properly. Training to follow up on changes and to keep the team up to date will also be key.
Plan Upkeep: The DR plan should be familiar to everyone at the company, and constantly evolving with the company’s changing needs and makeup. This document needs to be reviewed regularly: if team members lose touch with the plan, the plan is likely to fail.
The most critical part of crafting a disaster recovery plan is determining the company’s Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
Recovery Point Objective (RPO) measures the maximum acceptable amount of data that can be lost in a disaster scenario, typically expressed as a window of time (for example, the last 15 minutes of writes).
Recovery Time Objective (RTO) is the maximum amount of time a company can withstand an outage before the impact becomes unacceptable.
Both objectives come with trade-offs. To meet a tight RTO, a secondary DR site can be kept in constant operation, but that drives up costs. To meet a tight RPO, data can be backed up frequently enough to prevent almost any loss; however, those backups and the associated storage fees will also heavily impact your monthly AWS bill. The shorter the RPO, the more expensive it is to achieve.
In this section we’ll look at four different ways that a DR architecture can be configured for use in AWS.
Backup and restore is the scenario that takes the longest time to recover from a disaster. Here, almost all data is backed up to Amazon S3. This data is available from any location and can be accessed quickly when something fails and a restore is required. If you are running your infrastructure on-premises and want to use AWS for disaster recovery, AWS Storage Gateway can be used to snapshot your volumes, which are then copied and stored on Amazon S3.
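For teams scripting their own cloud-native backups, the copy to Amazon S3 can be as simple as an upload, with a download later when a restore is needed. The sketch below is a minimal illustration using boto3; the bucket name, object key, and file paths are hypothetical placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Back up: push a backup artifact (e.g., a database dump) to S3.
# The bucket name and paths are placeholder values.
s3.upload_file("/backups/app-db-2024-01-01.dump",
               "my-dr-backups",
               "db/app-db-2024-01-01.dump")

# Restore: pull the same object back down when recovery is required.
s3.download_file("my-dr-backups",
                 "db/app-db-2024-01-01.dump",
                 "/restore/app-db-2024-01-01.dump")
```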
In the pilot light disaster recovery scenario, you are running a minimal version of your production environment: only a core of crucial infrastructure, typically elements such as database servers and load balancers. The recovery operation in this case involves provisioning the rest of the environment around that core as quickly as possible.
Using a pilot light architecture allows for quicker recovery than backup and restore because the core infrastructure is already running at the disaster recovery site and is kept up to date. With AMIs, you can easily build the rest of your infrastructure around that core.
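As an illustration of bringing up the non-core tier from pre-built AMIs during failover, the sketch below launches application servers with boto3. The region, AMI ID, instance type, and counts are hypothetical values that would come from your own DR runbook.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # DR region (placeholder)

# Launch the application tier from a pre-baked AMI around the
# already-running core (database, load balancer, etc.).
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI kept current for DR
    InstanceType="m5.large",
    MinCount=2,
    MaxCount=2,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "purpose", "Value": "dr-failover"}],
    }],
)
instance_ids = [i["InstanceId"] for i in response["Instances"]]
print("Launched DR instances:", instance_ids)
```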
Warm standby is a disaster recovery scenario in which you are running a scaled-down, fully functional version of your production environment. Because a warm standby architecture already has services running in the cloud, recovery is even faster than with the previous methods covered. Servers are kept running on a minimum-sized fleet of instances and can be used for testing, quality assurance, and internal use. When a disaster hits, these instances need to be able to scale to meet the full production workload.
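If the standby fleet sits behind an Auto Scaling group, scaling to full production capacity is largely a matter of raising the group’s limits. The following is a minimal sketch; the group name and sizes are hypothetical.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-west-2")

# Scale the warm-standby fleet from its minimal footprint up to
# full production capacity. Group name and sizes are placeholders.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="app-warm-standby",
    MinSize=4,
    MaxSize=12,
    DesiredCapacity=8,
)
```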
In this multi-site disaster recovery scenario, you are running a fully replicated copy of your production environment in the cloud. All data is copied to one or more destinations, and when a disaster occurs, switching DNS to an unaffected site keeps downtime to a minimum, and you continue working as if nothing had happened.
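The DNS switch itself can be scripted, for example against Amazon Route 53 (it can also be automated with Route 53 failover routing and health checks). The sketch below repoints a record at the DR site; the hosted zone ID, record name, and target are hypothetical placeholders.

```python
import boto3

route53 = boto3.client("route53")

# Repoint the application's DNS record at the DR site's load balancer.
# Zone ID, record name, and target DNS name are placeholder values.
route53.change_resource_record_sets(
    HostedZoneId="Z0123456789EXAMPLE",
    ChangeBatch={
        "Comment": "Fail over to DR site",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com",
                "Type": "CNAME",
                "TTL": 60,
                "ResourceRecords": [{"Value": "dr-lb.us-west-2.example.com"}],
            },
        }],
    },
)
```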
In each of these scenarios, the burden to configure and maintain the solution lies with the user. With Cloud Volumes ONTAP, AWS users get a different kind of disaster recovery solution. Cloud Volumes ONTAP for DR works out of the box, leveraging SnapMirror® replication to enable seamless DR failover and failback operations across AWS environments. With its data tiering capabilities, Cloud Volumes ONTAP allows DR secondary copies to be tiered between Amazon S3 and Amazon EBS. This saves costs by storing the copy on Amazon S3 until a crisis requires the copy to be tiered back to Amazon EBS so the business can bounce back from the failure event.
For disaster recovery in AWS, one of the safest ways to protect a system is to leverage different availability zones and regions to store data. By doing so, users eliminate the chance that the loss of an entire AZ or region will impact the company’s normal functioning.
With Amazon S3, cross-region replication (CRR) allows objects in buckets to be copied automatically between regions. With Cloud Volumes ONTAP, enabling cross-region replication is even easier through the use of OnCommand® Cloud Manager and SnapMirror®, NetApp’s data replication technology. This replication is not only efficient in terms of the amount of storage it consumes but also in the amount of time it takes to transfer.
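Enabling CRR on a bucket comes down to applying a replication configuration; both the source and destination buckets must have versioning enabled, and the IAM role must allow S3 to replicate on your behalf. The example below is a minimal sketch with placeholder bucket names and role ARN.

```python
import boto3

s3 = boto3.client("s3")

# Versioning is required on both the source and destination buckets.
for bucket in ("my-primary-bucket", "my-dr-bucket"):
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )

# Replicate every object in the source bucket to the DR bucket
# (the destination bucket lives in another region). Bucket names
# and the IAM role ARN are placeholders.
s3.put_bucket_replication(
    Bucket="my-primary-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-crr-role",
        "Rules": [{
            "ID": "replicate-all",
            "Status": "Enabled",
            "Prefix": "",
            "Destination": {"Bucket": "arn:aws:s3:::my-dr-bucket"},
        }],
    },
)
```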
With Amazon EBS, point-in-time snapshots can be created as backups of the volumes. Once a snapshot is created, it is stored on Amazon S3; from there it can be copied within or between regions. The initial snapshot that Amazon EBS creates will be a full copy of the volume data. Subsequent snapshots that Amazon EBS creates for that volume will be incremental, only saving the blocks that have changed since the previous snapshot. This both speeds up the copying process and lowers storage costs.
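Taking a snapshot and copying it to the DR region takes only a few API calls. The sketch below assumes a hypothetical volume ID and region pair.

```python
import boto3

ec2_primary = boto3.client("ec2", region_name="us-east-1")
ec2_dr = boto3.client("ec2", region_name="us-west-2")

# Take a point-in-time snapshot of the volume (volume ID is a placeholder).
snapshot = ec2_primary.create_snapshot(
    VolumeId="vol-0123456789abcdef0",
    Description="Nightly DR snapshot",
)

# Wait until the snapshot completes, then copy it to the DR region.
ec2_primary.get_waiter("snapshot_completed").wait(
    SnapshotIds=[snapshot["SnapshotId"]]
)
ec2_dr.copy_snapshot(
    SourceRegion="us-east-1",
    SourceSnapshotId=snapshot["SnapshotId"],
    Description="DR copy of nightly snapshot",
)
```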
However, the efficiency of Cloud Volumes ONTAP’s snapshots far exceeds that of Amazon EBS snapshots. This is because the initial copy is optimized through the use of NetApp’s WAFL technology. Snapshots are also incremental and benefit from smaller sizes thanks to cost-reducing NetApp storage efficiencies.
It is important to run regular backup and replication jobs. When a disaster occurs, if everything is properly backed up and replicated, you can continue doing business. Also, when you switch to the disaster recovery site, all of the changes made there need to be properly backed up and replicated back to the primary site as part of failback. When using Cloud Volumes ONTAP, automatic snapshot backups and replication are easily scheduled with Cloud Manager.
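Cloud Volumes ONTAP users get this scheduling through Cloud Manager; for native AWS resources, one common pattern is a scheduled CloudWatch Events (EventBridge) rule that triggers a backup Lambda function. The sketch below assumes a hypothetical Lambda ARN and a one-hour interval, and the function must also grant EventBridge permission to invoke it.

```python
import boto3

events = boto3.client("events")

# Run the backup/replication job every hour. The rule name and the
# Lambda function ARN are placeholder values.
events.put_rule(
    Name="hourly-dr-backup",
    ScheduleExpression="rate(1 hour)",
    State="ENABLED",
)
events.put_targets(
    Rule="hourly-dr-backup",
    Targets=[{
        "Id": "backup-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:run-dr-backup",
    }],
)
```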
Make sure that your critical infrastructure is properly monitored. With good monitoring you can detect potential failures such as application shutdowns, server failures, etc. Good monitoring will also show you which parts of your infrastructure are down and help reduce the time it takes to switch over to the disaster recovery site.
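As one concrete example, an Amazon CloudWatch alarm on the EC2 status-check metric can notify the team through an SNS topic when an instance goes down. The instance ID and topic ARN below are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when an instance fails its status checks for two consecutive
# minutes; the instance ID and SNS topic ARN are placeholder values.
cloudwatch.put_metric_alarm(
    AlarmName="web-1-status-check-failed",
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```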
Once you have created your Disaster Recovery site, you should test it to make sure everything goes according to plan. These tests can be scheduled, for example, every six months. Periodic testing will make sure all the components of the DR architecture are working to ensure RTO and RPO are met. Cloud Volumes ONTAP users can use FlexClone® data cloning technology to test DR copies.
In this article we covered some common scenarios and steps to prepare your DR site on AWS. While AWS doesn’t have its own disaster recovery service, it does allow users to control the building blocks to create appropriate DR solutions on their own. For users who need a DR solution but don’t want to invest the time and money in configuring one themselves, Cloud Volumes ONTAP offers a faster solution for disaster recovery on AWS.