Disaster Recovery Orchestrator (DRO), a scripted solution with VMware Cloud on AWS and Amazon FSx for NetApp ONTAP

FSx for ONTAP® is an increasingly popular choice for VMware Cloud on AWS customers who need to expand their storage and for new customers who want to adopt VMware Cloud on AWS quickly. If you are planning to migrate from on-premises (VMware on any storage vendor) or expand your existing software-defined datacenter (SDDC) within VMware Cloud, FSx for ONTAP simplifies deployment. Keep in mind that your SDDC version must be 1.20 or above. For a detailed step-by-step approach to provisioning the datastore, see the resources “VMware Cloud on AWS integration with Amazon FSx for NetApp ONTAP Deployment Guide” and “TR-4938: Mount Amazon FSx for ONTAP as a NFS datastore with VMware Cloud on AWS.”

Block-level replication is a widely adopted feature set of FSx for ONTAP that dramatically simplifies disaster recovery. You can quickly activate a destination volume after a disaster and then reactivate the source volume after the primary site is back up. Recovering virtual machines manually, however, involves several siloed steps. In this blog post, I demonstrate how easy it is to perform failover and failback using SnapMirror with a scripted, UI-based solution. It’s time to say goodbye to siloed scripts when you can instead rely on DRO for automated, single-click disaster recovery with a predictable RTO.

Disaster Recovery Orchestrator (DRO) can be used to seamlessly automate the recovery of workloads replicated from on-premises to FSx for ONTAP or between two VMC SDDCs using FSx for ONTAP. DRO automates recovery with NetApp® SnapMirror® technology, from VM registration in VMC through to network mappings directly on NSX. A recent AWS blog post covers the benefits of using SnapMirror for disaster recovery.

Getting started

To get started with DRO, use an Ubuntu operating system on a designated EC2 instance or virtual machine on-premises. Then install the package. Make sure you meet the following criteria:

  • The designated DRO instance has connectivity to the source and destination vCenter and storage systems.
  • DNS resolution should be in place if you are using DNS names. Otherwise, you should use IP addresses for the vCenter and storage systems.
  • SnapMirror replication is configured for the designated datastore volumes.
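The first two criteria can be verified from the DRO instance before installation. The following is a minimal sketch, not part of DRO itself: it assumes the vCenter and storage management endpoints listen on TCP port 443, and the function names (`resolves`, `tcp_reachable`, `precheck`) are made up for illustration.

```python
import socket


def resolves(host):
    """Return True if the host name (or IP literal) can be resolved."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def tcp_reachable(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def precheck(endpoints, port=443):
    """Check name resolution and TCP reachability for each named endpoint.

    `endpoints` maps a label to a host name or IP address.
    Port 443 is an assumption; adjust it to match your environment.
    """
    report = {}
    for label, host in endpoints.items():
        dns_ok = resolves(host)
        # Only attempt the TCP connection when the name resolves.
        report[label] = {"dns": dns_ok, "tcp": dns_ok and tcp_reachable(host, port)}
    return report
```

To use it, call `precheck` with a dictionary of your own source and destination vCenter and storage addresses. Any entry where `dns` or `tcp` is `False` points to a name-resolution or network-path problem to fix before you add sites to DRO.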

After connectivity is established (it can be VPC peering, AWS Transit Gateway, or a VPN connection between the source and destination sites), proceed with the following installation steps, which should take 2 to 3 minutes.

  1. Download the installation package on the designated EC2 instance or onto a virtual machine running on an on-premises vSphere instance.

    Note: NetApp recommends deploying the DRO agent in AWS, in the same VPC where FSx for ONTAP is deployed (a peered VPC also works), so that the DRO agent can communicate over the network with on-premises components as well as with the FSx for ONTAP and VMC resources (even in the case of a cross-region SDDC).
  2. Unzip the package, navigate to the directory, and run the deploy script.
  3. It is as simple as that. After you have completed these tasks, access the UI at https://<host-ip-address> with the default credentials (admin/admin).

    Note: DRO is currently in private preview.


DRO configuration

In the context of this blog post, Prod always refers to the original Production site and DR always refers to the original disaster recovery site, regardless of where VMs or workloads are currently active.

The first step in preparing for disaster recovery is to discover and add the on-premises or cloud resources (both vCenter and storage) to DRO. Open DRO in a supported browser, use the default username and password (admin/admin), and click Add Sites. Sites can also be added using the Discover option.

Add the following platforms:

  • Source. On-premises vCenter along with ONTAP storage system or VMC SDDC vCenter with FSx for ONTAP.
  • Destination. VMC SDDC vCenter along with FSx for ONTAP.

In this blog post, I cover disaster recovery between two cloud SDDCs that are using FSx for ONTAP.

What DRO can do for you

After the source and destination sites are added, DRO performs automatic discovery and displays the VMs along with associated metadata. DRO also automatically detects the networks and port groups used by the VMs and populates them.

After the sites have been added, VMs can be grouped into resource groups. A DRO resource group is a logical set of dependent VMs together with the boot order and boot delays that are applied when the group is recovered. To start creating resource groups, navigate to Resource Groups and click Create New Resource Group.
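To make the idea of boot orders and boot delays concrete, here is a small illustrative model. It is a sketch under stated assumptions and does not reflect how DRO stores or executes resource groups internally: the `VM` and `ResourceGroup` classes, their field names, and the rule that VMs with the same boot order start together are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    name: str
    boot_order: int        # lower values power on first
    boot_delay_s: int = 0  # seconds to wait before this VM's boot step


@dataclass
class ResourceGroup:
    name: str
    vms: list = field(default_factory=list)

    def power_on_schedule(self):
        """Return (offset_seconds, vm_name) pairs in power-on order.

        VMs sharing a boot_order start together. The wait before a step
        is the largest boot_delay_s among the VMs in that step.
        """
        schedule = []
        offset = 0
        for order in sorted({vm.boot_order for vm in self.vms}):
            step = [vm for vm in self.vms if vm.boot_order == order]
            offset += max(vm.boot_delay_s for vm in step)
            for vm in sorted(step, key=lambda v: v.name):
                schedule.append((offset, vm.name))
        return schedule


# Hypothetical three-tier application: database, then app servers, then web.
group = ResourceGroup(
    "web-app",
    [VM("web01", 3, 30), VM("app01", 2, 60), VM("app02", 2, 60), VM("db01", 1)],
)
for offset, name in group.power_on_schedule():
    print(f"t+{offset:>3}s  power on {name}")
```

In this model the database starts immediately, both application servers start 60 seconds later, and the web server follows after a further 30 seconds, which mirrors how dependent VMs are typically sequenced during recovery.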

The next step is to create the blueprint or a plan to recover virtual machines and applications in the event of a disaster. As mentioned in the prerequisites, SnapMirror replication should be configured before creating the replication plan.

After SnapMirror is in place, configure the plan by selecting the source (Prod) and destination (DR) vCenter platforms from the drop-down lists and picking the resource groups to include in the plan, along with the order in which applications should be restored and powered on (for example, tier 0, tier 1, tier 2, and so on). To define the recovery plan, navigate to the Replication Plan tab and click New Replication Plan.

After you create the replication plan, the dashboard provides all the necessary details, including the number of sites, storage environments, protected VMs, replication plan health, and so on.

After you create the replication plan, you can perform the failover option, the test-failover option, or the migrate option, depending on the requirements. During the failover and test-failover options, you can use the most recent SnapMirror Snapshot copy, or you can select a specific point-in-time Snapshot copy (per the SnapMirror retention policy). The point-in-time option can be very helpful if there is a corruption event like ransomware, where the most recent replicas are already compromised or encrypted. DRO shows all available recovery points. To trigger failover or test failover with the configuration specified in the replication plan, click Failover or Test failover.
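The choice between the most recent Snapshot copy and an earlier point in time can be expressed as a simple selection rule. The sketch below is only an illustration of that reasoning, not DRO's implementation: the function name, the Snapshot copy names, and the notion of a known "compromised at" time (the moment after which replicas are no longer trusted) are assumptions.

```python
from datetime import datetime


def pick_recovery_point(snapshots, compromised_at=None):
    """Pick a Snapshot copy name from a {name: creation_time} mapping.

    With no compromise time, return the most recent copy. Otherwise return
    the most recent copy created strictly before that time, or None if no
    such copy exists.
    """
    candidates = {
        name: created
        for name, created in snapshots.items()
        if compromised_at is None or created < compromised_at
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Hypothetical recovery points retained on the SnapMirror destination.
recovery_points = {
    "hourly-0800": datetime(2024, 1, 10, 8, 0),
    "hourly-0900": datetime(2024, 1, 10, 9, 0),
    "hourly-1000": datetime(2024, 1, 10, 10, 0),
}

print(pick_recovery_point(recovery_points))
print(pick_recovery_point(recovery_points, compromised_at=datetime(2024, 1, 10, 9, 30)))
```

The first call selects the latest copy, as in a normal failover. The second skips the 10:00 copy because it was taken after the assumed compromise time, which is the situation the point-in-time option is designed for.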

From BlueXP or the FSx for ONTAP CLI, you can monitor the replication health status for the appropriate datastore volumes (those that were mapped to VMC as read-write volumes). During test failover, DRO does not map the SnapMirror target volume. Instead, it makes a FlexClone copy of the required SnapMirror (or Snapshot) instance and exposes the FlexClone instance, which does not consume additional physical capacity for FSx for ONTAP. This process makes sure that the volume is not modified, and replica jobs can continue even during DR tests or triage workflows. Additionally, this process makes sure that, if errors occur or corrupted data is recovered, the recovery can be cleaned up without the risk of the replica being destroyed.

In the case of a real failover event, DRO enables reverse resync for SnapMirror and also enables failback, which again can be performed with the click of a button.

To summarize, disaster recovery to cloud is a resilient and cost-effective way of protecting workloads against site outages and data corruption events (for example, ransomware). With NetApp SnapMirror technology, on-premises VMware workloads can be replicated to FSx for ONTAP running in AWS. Similarly, disaster recovery can be performed between two SDDCs.

The benefits of this solution include the following:

  • Efficient and resilient SnapMirror replication, with recovery to any available point in time through Snapshot copy retention.
  • Full automation of all required steps to recover hundreds to thousands of VMs across storage, compute, and network.
  • Workload recovery with ONTAP FlexClone technology, using a method that doesn’t alter the replicated volume.
  • No replication interruptions during disaster recovery test workflows.
  • Use of disaster recovery data with cloud computing resources for workflows beyond disaster recovery, such as dev/test, security testing, patch or upgrade testing, and remediation testing.
  • Optimization of VM resources, which lowers compute requirements by allowing recovery to smaller compute clusters.


If you are using FSx for ONTAP with VMC SDDC or planning to migrate to VMC SDDC using FSx for ONTAP, DRO is here to help. Try it now for free and follow the detailed, step-by-step documentation at NetApp Docs.

Principal Technical Marketing Engineer