A business continuity and disaster recovery (BCDR) strategy helps organizations secure data, applications, and workloads during planned or unplanned outages.
To help organizations implement BCDR, Azure provides Azure Site Recovery (ASR). ASR ensures business continuity during outages by replicating applications and workloads from their primary location to a secondary site.
This post covers two common DR architectures: Azure-to-Azure disaster recovery and physical server to Azure disaster recovery.
This article is part of our series of articles on Azure Backup.
In this article, you will learn:
Azure Site Recovery (ASR) is a disaster recovery as a service (DRaaS) that can be used in both public cloud and hybrid cloud architectures. It lets you use Azure as an on-demand disaster recovery site, without needing to invest in disaster recovery equipment upfront.
ASR creates replicas of computer systems which are synchronized through a near-real-time data replication process. It provides application-consistent snapshots, which make sure data is usable after a failover.
ASR provides support for several migration and disaster recovery scenarios:
Azure Site Recovery continuously replicates Azure VMs to different regions, which serve as a secondary location. In case of an outage, organizations can use the secondary region to access their data and workloads - this is known as failover. Once the primary site works, normal operations can resume there - this is called failback.
Note: ASR is only supported between two regions in the same geographical cluster (for more details see the documentation). VMs you want to replicate must run one of the supported operating systems (which include most popular versions of Windows and Linux).
Here are the components involved in this process:
Site Recovery takes snapshots of VM disks. By default, the system takes crash-consistent snapshots of data. It is possible to define a frequency, and then Site Recovery will take app-consistent snapshots.
The Site Recovery system uses snapshots to create recovery points, and then stores them according to the retention configurations pre-defined in the replication policy. Once a failure occurs, Site Recovery lets you restore a VM from a recovery point.
This process helps ensure VMs start with no data loss or corruption. Depending on the snapshot taken, the process also helps make sure VM data remains consistent for the OS and the applications running on the VM.
The replication process performs several actions. First, the system installs the service extension of Site Recovery Mobility on the virtual machine. The extension can then register the virtual machine with Site Recovery. Then, the system starts continuously replicating the VM, immediately transferring disk writes to the Standard cache storage that was set up in the source location.
During the next phase, Site Recovery starts processing cache data, sending it to replicated managed disks or to a storage account. When this is done, the system starts generating crash-consistent recovery points every 5 minutes. If the replication policy defined a process for app-consistent recovery points, these will be created too.
The following diagram shows what happens when you enable replication. ASR starts replicating disk data to the target environment. However, virtual machines are not started yet.
When you initiate a failover via ASR, VMs are created in the target region based on the virtual network, subnet, and availability set you defined, and the data previously replicated from the source environment.
Azure Site Recovery can also be used to replicate physical servers deployed in an on-premises data center to the Azure cloud.
Here are the components involved in the process:
When replicating an on-premises physical server to Azure, replication works as follows:
1. You set up a deployment with local and Azure components. Recovery Services Vault specifies a replication source and destination, sets up a configuration server, creates a replication policy, and enables replication.
2. ASR replicates an initial copy of server data to Azure storage, using the predefined replication strategy.
3. When the first copy is complete, ASR begins copying incremental changes to Azure. Changes are tracked using a log file with extension .HRL.
4. Traffic can be rerouted to the new machine on Azure over the Internet or using Azure ExpressRoute with public peering.
In case of a disaster the failover process from the on-premises data center to Azure works as follows:
1. You select the VM you want to fail over.
2. You select a recovery point to fail over to. There are several options:
a. Latest - provides the lowest Recovery Point Objective (RPO), by failing over to the very latest version of the on-premises server. This option is crash consistent.
b. Latest processed - provides the lowest RTO (Recovery Time Objective) because it doesn’t wait for processing of the latest replicated data, instead using the latest version that is ready for failover. This option is also crash consistent.
c. Latest app-consistent - fails over using the latest recovery point that is app consistent, guaranteeing that application data is not corrupted.
d. Custom - lets you set a recovery point manually.
3. You can optionally shut down the source machine before starting the failover (but failover continues even if this fails)
4. Failover begins - the Azure Site Recovery Jobs page shows progress.
5. When failover ends, connect to the target VM to validate it works correctly, and then commit the failover. This deletes all other available recovery points.
After failing over to Azure, you may need to re-protect Azure VMs by replicating them back to the on-premises site. This is a detailed procedure - see the documentation for more details.
When your primary on-premises site is available again, you can fail back as follows.
Note: ASR failback requires a local VMware environment - even if you originally protected physical machines. ASR is only able to fail back to VMware VMs in your local data center.
In the ASR console, you select the VM you need to fail back, and request “unplanned failover”. Failback is called “failover” in ASR - make sure you select the failover direction as “from Azure”.
See the Azure documentation for more details, and guidance for automating some of these steps for faster failover and recovery.
While Azure offers some capable disaster recovery solutions on its own, users can get more protection and better DR processes, all at a lower cost, with NetApp Cloud Volumes ONTAP.
Cloud Volumes ONTAP brings the trusted enterprise-class storage management features of on-prem ONTAP storage systems to Azure. It can reinforce your DR strategy in Azure, serving as the underlying storage management system, or serve as a DR storage management on its own. In either case, Cloud Volumes ONTAP adds value with features such as data replication, high availability, and storage efficiency, as we’ll see below.
Ensure Cloud Business Continuity
The Cloud Volumes ONTAP HA (High Availability) configuration for Azure uses shared storage between two Cloud Volumes ONTAP nodes that are part of different fault and update domains. In the event that one of the nodes becomes unavailable, the surviving node takes over and provides access to data without any disruption.
Easy Azure storage replication
Cloud Volumes ONTAP uses SnapMirror® technology to replicate data across hybrid and multicloud architectures. It can be used to replicate data to a secondary site for DR and keep it in sync. SnapMirror replication is set up using the simple drag-and-drop controls in NetApp Cloud Manager.
Failover and Failback
A failover can be initiated by breaking the replication relationship in SnapMirror. During failback the synchronization can be reversed and data from the DR site can be replicated back to the primary location.
Storage Efficiencies
Cloud Volumes ONTAP’s storage efficiency features, including thin-provisioning, data compression, deduplication, and data tiering, help to reduce the DR copy’s storage footprint and costs. These features are available out of the box and can be used by the customer with no configuration overheads.
FlexClone for DR Testing
Cloud Volumes ONTAP data cloning technology can be used to clone instant, writable volumes for DR testing that won’t affect ongoing operations. These volumes are created instantly, and with zero capacity penalty.