
Cloud File Share High Availability Nightmares (and How Cloud Volumes ONTAP Can Help)

Cloud storage changed the game when it enabled data to be available and stored in different environments and locations without heavy upfront IT investments. This change was especially important for organizations that require highly available storage capabilities and robust IT operations, as is the case with cloud file sharing.

With the cloud, it's easier to provision storage for file shares that can meet demanding recovery point and recovery time objectives (RPO/RTO). Yet there are still many challenges, and without a good understanding of them, IT leaders can make or break their business continuity plans.

In this article, we are going to cover the key challenges with file share availability and how they can be solved with the help of NetApp Cloud Volumes ONTAP.

Jump down to a specific part of this article:

    • Key Challenges to File Share High Availability
    • File Share High Availability with the Cloud Providers’ Managed Services (and Their Pitfalls)
    • Ensuring File Share High Availability with NetApp Cloud Volumes ONTAP
    • Summary and Key Takeaways

Key Challenges to File Share High Availability

Any software architect who has ever designed and implemented resilient, highly available cloud-based solutions understands that while the cloud simplified a lot of technical aspects, it didn’t fully eliminate the challenges.

To build a highly scalable and robust file storage solution that can continue to operate despite hardware failures, network outages, or region-wide losses, there are many key business and technical aspects to consider.

It all starts with the business requirements and expected service levels. Is data availability of 99.9% okay? Or is 99.999% required? In other words, how much data loss and service disruption are we and our customers willing to tolerate?
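
As a rough illustration of what those numbers mean, the arithmetic below (a minimal Python sketch, with example targets rather than recommendations) converts an availability percentage into the maximum downtime it allows per year:

    # Convert an availability target into the maximum downtime allowed per year.
    MINUTES_PER_YEAR = 365 * 24 * 60

    for availability in (0.999, 0.9999, 0.99999):
        allowed_downtime_minutes = (1 - availability) * MINUTES_PER_YEAR
        print(f"{availability:.3%} availability -> "
              f"~{allowed_downtime_minutes:.0f} minutes of downtime per year")

Roughly 8.8 hours of downtime per year is acceptable at 99.9%, but barely five minutes at 99.999%, which is why each additional "nine" changes the architecture (and the budget) considerably.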

Redundancy and Data Replication

Depending on the degree of availability needed, there are different cloud infrastructure building blocks we can leverage to design a great solution. The most important part of the design is to avoid (or eliminate) single points of failure and establish redundancies. The challenge with redundancy and data replication is not just the cost of duplicating infrastructure but also the technical complexity and operational overhead it generates.

Since the cloud provider takes care of the hardware and other physical requirements—such as bare metal servers, storage devices, and network connections—a big part of the redundancy challenge is already solved compared to on-premises infrastructure. And because provisioning, modifying, and decommissioning cloud resources are carried out via APIs, it is very easy to duplicate infrastructure resources, which is key to building highly available solutions.
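
To make that concrete, here is a minimal sketch (not a recommended design) of duplicating a managed file system in a second region purely through the provider's API, using AWS EFS and the boto3 SDK as an example; the regions and tags are illustrative, and a real setup would still need to replicate the data itself:

    import uuid
    import boto3

    def create_file_system(region: str):
        """Provision a managed NFS file system in the given region via the API."""
        efs = boto3.client("efs", region_name=region)
        return efs.create_file_system(
            CreationToken=str(uuid.uuid4()),   # idempotency token
            PerformanceMode="generalPurpose",
            Encrypted=True,
            Tags=[{"Key": "purpose", "Value": "file-share-replica"}],
        )

    # The same call works in any region, which is what makes duplicating
    # infrastructure for redundancy so straightforward (example regions below).
    primary = create_file_system("us-east-1")
    replica = create_file_system("eu-west-1")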

In the cloud, the technical complexity and challenge then come down to how customers design their solutions. Can they leverage cloud resources to ensure the needed redundancy? Data replication and high availability in file services are still big challenges that the cloud providers’ built-in services struggle to solve beyond a single region.

In the end, it's always up to the customer to ensure that their data is available when it's needed and that, when a component fails, there are auxiliary resources ready to take over and minimize disruption.

Leveraging Multiple Geographical Regions

Typically, a single cloud geographical region in a given cloud provider has two or more availability zones. Think of an availability zone as an independent data center. With fast connectivity between multiple availability zones within a region, it’s fairly easy to achieve a certain degree of availability in a single geographic region. The built-in cloud services usually have inherent features that allow multi-AZ file storage replication.

The challenge to file share high availability comes when we need to leverage other geographical regions or different cloud providers. At a bare minimum, a good practice is to keep data backups in a different geographic location for disaster recovery purposes.

Since a cloud provider has multiple regions across the globe, configuring this should be easy, at least in theory. However, as any cloud expert will tell you, setting up a multi-region architecture that can ensure your data and applications can withstand unexpected system failures or disasters is not quite so straightforward. The built-in file services usually lack those capabilities and have fairly limited options to ensure data replication can happen outside the specific region where cloud data resources were originally provisioned.

Engineers and software architects can apply different methods to achieve a robust degree of overall system availability for file services. There are battle-tested approaches that can be implemented depending on the requirements, helping balance cost against the business value created:

    • Backups in a secondary region are a great practice to enable a minimal disaster recovery setup and one step towards achieving a 3-2-1 backup strategy. This can be key to ensuring business continuity in case of a severe failure.

      Duplicating your data to another cloud region (within the same provider) is usually achievable with low engineering effort and cost. This method assumes the customer is willing to tolerate a high recovery time objective (often several days) and has prepared operating procedures for manual recovery from a system failure. However, for many businesses that’s simply too long a downtime to tolerate.

    • Active-Standby is a common and balanced approach when high availability with a short and automated recovery time is required. In this scenario a standby replica of the system is deployed in a secondary region. In the event of a failure in the primary (active) region, the system fails over automatically (or manually) to the standby replica in the other geographical region.

      Keeping data in sync across locations and correctly instrumenting and testing overall system availability are key to Active-Standby setups, and usually a big challenge (a minimal failover sketch follows this list). This approach also implies that a duplicate replica of the infrastructure is mostly idling, which is not ideal from a cost perspective. The elasticity of cloud resources can help minimize this, but it still significantly increases operational costs for storage resources that see no active usage.

    • Active-Active is the most technically challenging type of high availability to design, implement, and operate. Therefore, it is only recommended when the business requirements absolutely demand recovery time and point objectives under a few minutes.

      In this scenario the system is deployed across two or more geographical regions that can operate independently. In case of a failure, traffic can be redirected to a region where the system is known to be in a healthy state.

      This approach is quite costly, not only because of the large infrastructure footprint it requires but also because it significantly increases the technical complexity of development and operations. Storage and data persistence are especially challenging due to the added latency and the potential inconsistencies it can cause if not properly managed.
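
To illustrate the moving parts of an Active-Standby failover, here is a minimal sketch assuming the file share is reached through a DNS name in AWS Route 53 and that each region exposes a health check URL; the hosted zone ID, record name, and endpoints are hypothetical placeholders. In practice you would rely on the provider's health checks and failover routing policies (plus thorough failover testing) rather than a hand-rolled script:

    import boto3
    import requests

    # Hypothetical values for illustration only.
    HOSTED_ZONE_ID = "Z0000000EXAMPLE"
    RECORD_NAME = "files.example.com."
    PRIMARY_HEALTH_URL = "https://files-primary.example.com/health"
    STANDBY_TARGET = "files-standby.example.com"

    def primary_is_healthy(timeout_seconds: int = 5) -> bool:
        """Return True if the primary file share endpoint answers its health check."""
        try:
            response = requests.get(PRIMARY_HEALTH_URL, timeout=timeout_seconds)
            return response.status_code == 200
        except requests.RequestException:
            return False

    def fail_over_to_standby() -> None:
        """Point the shared DNS name at the standby region's endpoint."""
        route53 = boto3.client("route53")
        route53.change_resource_record_sets(
            HostedZoneId=HOSTED_ZONE_ID,
            ChangeBatch={
                "Comment": "Failover: primary region unhealthy",
                "Changes": [{
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": RECORD_NAME,
                        "Type": "CNAME",
                        "TTL": 60,
                        "ResourceRecords": [{"Value": STANDBY_TARGET}],
                    },
                }],
            },
        )

    if __name__ == "__main__":
        if not primary_is_healthy():
            fail_over_to_standby()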

Performance and Scalability

When tackling file availability challenges, it’s not enough to plan how to respond to, mitigate, and recover from failures. While you naturally need a disaster recovery plan that outlines how your data and applications will be recovered, a big item to consider is the performance and scalability of the file storage layer.

Having good monitoring and alerting capabilities is critical to identify and resolve potential issues that might cause system degradation or impairment. Cloud storage resources usually have capabilities that allow them to scale and adjust automatically based on monitoring metrics, and users can enable and configure these easily.
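
As one example of what that can look like in practice (a sketch assuming an AWS EFS-backed share; the file system ID, threshold, and notification topic are placeholders), the boto3 call below creates a CloudWatch alarm on the file system's burst credit balance so operators are warned before throughput, and with it the user experience, degrades:

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    # Alert when burst credits run low enough that file share throughput
    # is about to be throttled.
    cloudwatch.put_metric_alarm(
        AlarmName="file-share-burst-credits-low",
        Namespace="AWS/EFS",
        MetricName="BurstCreditBalance",
        Dimensions=[{"Name": "FileSystemId", "Value": "fs-0123456789abcdef0"}],  # placeholder
        Statistic="Average",
        Period=300,                     # evaluate 5-minute averages
        EvaluationPeriods=3,
        Threshold=1_000_000_000_000,    # example threshold, in bytes of burst credit
        ComparisonOperator="LessThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:storage-alerts"],  # placeholder topic
    )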

Performance is directly linked to how the system scales to cope with these automatic changes. That means architects need to design the file storage solution for optimal performance, considering factors such as data access patterns, caching, and network latency.

From a file service availability perspective, ensuring good performance and scalability is not only technically challenging but also subjective. Without experimentation, historical data, and metrics, it’s not something that can easily be predetermined and implemented. Leveraging a modern managed cloud file service makes a huge difference in lowering the technical complexity and costs (both for development and operations).

File Share High Availability with the Cloud Providers’ Managed Services (and Their Pitfalls)

Today the cloud providers each offer options for fully managed file storage services. However, these offerings come with some considerations.

Part of the challenge is related to the history of the cloud itself: managed object storage services were introduced early on by the top cloud providers, followed by block storage options, while file services took much longer to arrive. Because managed file services came later, those offerings tend to lack the features that would let customers easily meet file share high availability and performance requirements without significant custom development.

The lack of advanced features and configurations also shows in how these service offerings are structured. A typical pitfall is that each provider has completely different services depending on the protocol you want to use: selecting NFS or SMB often translates into separate services within the same vendor. That makes it hard for customers to have a holistic view over their storage volumes and limits use cases like data sharing, collaboration, and data mobility. When it comes to file share high availability, juggling multiple services also means dealing with different SLAs, which might affect the overall user experience.

There are other hurdles customers face when building highly available and distributed file storage solutions. One of the biggest issues with the hyperscalers’ file service offerings is that it is difficult or nearly impossible to use the managed service capabilities with other cloud vendors or on-premises environments, which makes them hard to adopt for customers with hybrid or multicloud strategies.

While in recent years providers have expanded their offerings to start bridging that gap, when it comes to file storage they are still quite far from providing a fully managed, low-overhead, hands-off experience.

Ensuring File Share High Availability with NetApp Cloud Volumes ONTAP

Organizations that need data and file shares to be accessible across multiple geographical regions and/or different cloud providers do have an option. NetApp Cloud Volumes ONTAP offers a way to ensure data availability without costly engineering efforts to develop in-house data synchronization and integrations for file storage.

Cloud Volumes ONTAP leverages enterprise-grade data management capabilities to solve the challenges of file storage. By hosting data across on-premises and cloud deployments on AWS, Azure, and Google Cloud, it provides a vendor-agnostic approach that makes highly available file shares accessible anywhere.

In its high availability setup, Cloud Volumes ONTAP leverages two nodes—residing either in a single zone or across multiple zones—that are kept in constant sync. If one node fails (or if its entire zone fails), the other takes over operations seamlessly. This ensures an RPO of zero and an RTO of less than 60 seconds.

Read about how Cloud Volumes ONTAP HA works in AWS, in Azure, and in GCP.

In addition to the high availability setup, Cloud Volumes ONTAP’s disaster recovery solution can help shares withstand and recover from major disasters. NetApp SnapMirror® data replication keeps data in sync between the primary and DR copies, which can be deployed across regions or even across cloud providers.

Summary and Key Takeaways

Understanding the different challenges that cloud-based storage and file share services face is the most important step IT experts and leaders can take to ensure business continuity and that best practices are met.

This allows architects and other IT experts to build solutions that ensure data is always accessible and business operations can continue without interruption. This availability is the only way businesses can truly decrease the risk of downtime, maintain data integrity, and meet SLAs.

Through BlueXP, NetApp Cloud Volumes ONTAP offers protection levels and SLAs that can address the entire range of availability challenges and use cases of enterprise-grade file sharing.

Read these file storage success stories to see why so many enterprises choose Cloud Volumes ONTAP as their file share solution.

Bruno Almeida, Technology Advisor
