NoSQL Cloud Databases and The Power of Big Data Analytics

Written by Yifat Perry, Technical Content Manager | Dec 6, 2018 3:27:36 PM

NoSQL databases are revolutionizing database deployment, shifting many organizations away from more traditional database models. However, differences in the way NoSQL operates can make it challenging to build NoSQL cloud databases at the enterprise level.

In this entry of our databases in the cloud series, we will discuss the NoSQL architecture and then consider how NetApp’s Cloud Volumes ONTAP can help NoSQL in cloud deployments. We will also explore the way in which Cloud Volumes ONTAP supports big data analytics, a typical use case for NoSQL databases in the cloud.

What Is a NoSQL Database?

NoSQL database systems represent a paradigm shift from traditional, relational databases, which manifests itself in two overarching areas. Firstly, NoSQL databases primarily make use of non-relational data structures, such as graphs, key-value maps, and semi-structured documents in formats like JSON and XML.

Secondly, whereas an RDBMS normally scales vertically, NoSQL systems adopt a horizontal scale-out strategy, allowing them to work with larger volumes of data and improve the availability and performance of the database. Building a NoSQL database cluster in the cloud therefore involves a significant amount of compute and storage management.

With NoSQL databases, each node in a cluster requires access to its own block-level storage allocations, with new compute and storage allocated as the cluster expands. Most relational database systems simply scale to larger compute hosts, possibly with the addition of read-only database replicas.

The disparity in cloud resource usage between NoSQL and relational databases requires DBAs and cloud architects to re-evaluate database systems deployment.

Types of NoSQL Databases and How They Work

MongoDB, Apache Cassandra, Hadoop, and Couchbase are some of the prominent platforms in the NoSQL ecosystem. Each is deployed as a cluster of nodes that work together to provide high availability and performance at scale. Data is distributed across the database nodes automatically through a process called sharding, which is usually based on a hashed value of a set of fields in each data record. To remove single points of failure, each node replicates its own data to a number of other nodes in the cluster, as determined by the replication factor. Automatic sharding and replication are among the main advantages of NoSQL.
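
As a rough illustration of hash-based sharding and the replication factor, the sketch below maps a record to a primary node and then to the nodes that would hold its copies. It is a minimal sketch under stated assumptions, not how any particular database implements placement: the function names, the sample record, the six-node cluster, and the "next nodes in ring order" replica rule are all hypothetical.

```python
import hashlib


def shard_for(record: dict, shard_fields: list[str], num_nodes: int) -> int:
    """Map a record to a primary node by hashing its shard-key fields."""
    key = "|".join(str(record[field]) for field in shard_fields)
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


def replica_nodes(primary: int, num_nodes: int, replication_factor: int) -> list[int]:
    """Return the primary plus the following nodes (in ring order) that hold copies."""
    copies = min(replication_factor, num_nodes)  # cannot have more copies than nodes
    return [(primary + i) % num_nodes for i in range(copies)]


# Hypothetical record and cluster size, for illustration only.
record = {"customer_id": 1042, "region": "eu-west", "total": 99.5}
primary = shard_for(record, ["customer_id", "region"], num_nodes=6)
placement = replica_nodes(primary, num_nodes=6, replication_factor=3)
print(placement)
```

Because the node is derived from the hash of the shard-key fields alone, any client can compute where a record lives without consulting a central lookup table.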

Migrating data from on-premises NoSQL database clusters to the cloud requires the use of tools and processes specific to each database platform. Most mature NoSQL database systems support a form of cross-datacenter replication that is used to create a second cluster and keep it incrementally synchronized with the primary site. A failover to the second cluster is used to switch database operations over to the new location.

Large NoSQL database clusters are able to utilize the aggregate processor, memory, and storage resources of all participating nodes. When a node fails, the database system remains operational, and a new replacement node can be added back into the cluster. As the new node does not contain any data, the database cluster will rebalance data onto the node from the rest of the cluster. This operation, however, can take time to complete and system performance may degrade while it is taking place. Rebalancing also occurs when new nodes are added to grow a cluster.
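
To give a feel for how much data a rebalance has to move when a node joins, here is a small sketch that uses a consistent hash ring with virtual nodes. Consistent hashing is an assumption made for illustration, since the placement scheme differs between database systems; the node names, key names, and the choice of 64 virtual nodes per node are hypothetical.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Deterministic 128-bit hash of a string."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def build_ring(nodes, vnodes=64):
    """Build a ring in which each node owns several tokens (virtual nodes)."""
    ring = sorted((_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes))
    tokens = [token for token, _ in ring]
    owners = [node for _, node in ring]
    return tokens, owners


def owner(ring, key):
    """A key belongs to the first token clockwise from the key's hash."""
    tokens, owners = ring
    idx = bisect.bisect_right(tokens, _hash(key)) % len(tokens)
    return owners[idx]


keys = [f"record-{i}" for i in range(10_000)]
before = build_ring(["node-a", "node-b", "node-c", "node-d"])
after = build_ring(["node-a", "node-b", "node-c", "node-d", "node-e"])

# Keys whose owner changed are the data the cluster must copy during rebalancing.
moved = [k for k in keys if owner(before, k) != owner(after, k)]
print(f"{len(moved) / len(keys):.0%} of keys moved to the new node")
```

With this scheme only the keys claimed by the new node change owner, yet that is still a sizable share of the dataset that has to be copied over the network, which is why rebalancing takes time.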

Working with large volumes of data dispersed around a sizable cluster of database nodes makes administrative operations, such as backup and restore, creating database test environments, and storage management, all more complicated than with a traditional database system.

Cloud Volumes ONTAP and NoSQL

Managing storage with Cloud Volumes ONTAP provides a wide range of benefits for NoSQL deployments in the cloud using AWS, Azure, or Google Cloud storage, including:

Block-level storage management: Cloud Volumes ONTAP can store multiple LUNs, or block-level storage allocations, in the same volume, with the option to organize this storage using a directory-type structure called a qtree. A single volume snapshot then captures all LUNs at the same time; however, individual LUN snapshots are also supported. If a node fails, its LUN storage can simply be mounted to a new node that rejoins the NoSQL database cluster, removing the need for a full rebalance operation.
Data protection: Cloud Volumes ONTAP HA storage volumes are highly available, with synchronous replication of data across Availability Zones. In the event of a failure, immediate failover of storage services means that operations can continue with zero data loss (RPO=0). Volume snapshots make it easy to create instant, point-in-time backup copies of your data, with the ability to restore instantly to any snapshot later on. For consistent snapshots, the NoSQL database should be properly quiesced prior to performing the snapshot.
Storage efficiency: There are many storage efficiency technologies built into Cloud Volumes ONTAP that dramatically reduce the cloud storage footprint, and therefore costs. For example, data deduplication and compression can reduce the amount of storage used by up to 50-70% and are especially effective with database systems. As NoSQL database systems store three or more redundant copies of the same data on different nodes, Cloud Volumes ONTAP helps to reduce the storage overhead this creates. Other storage efficiency technologies provided by Cloud Volumes ONTAP include thin provisioning, data compaction, and data tiering.
Volume cloning: Using FlexClone® technology, Cloud Volumes ONTAP can instantly create writable clones of an existing volume, based on a specific snapshot, at zero initial capacity cost. This makes provisioning data for database test environments very fast and easy to accomplish. Each clone only requires storage for the changes that are made to it, making the clones very space efficient. If all NoSQL block-level storage allocations are in the same storage volume, they can all be cloned at once with a single operation.
Simple cloud onboarding: Cloud Volumes ONTAP comes with an enterprise, block-level, data replication solution called SnapMirror that incrementally synchronizes data between on-premises and cloud environments. For those not already using NetApp ONTAP systems, Cloud Sync provides an alternative, platform-agnostic solution for file replication and incremental synchronization.
Data security: All data is fully encrypted to ensure the highest levels of data security.
RESTful APIs: Although Cloud Manager is the graphical, web-based UI used to manage and deploy Cloud Volumes ONTAP, external processes can also integrate with the system over REST. Cloning, volume creation, and block-level storage allocations can all be controlled through this programmatic interface, allowing for a much greater level of process automation.
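
To make the storage efficiency point above concrete, the following back-of-the-envelope sketch estimates the footprint of a replicated NoSQL dataset before and after a given savings ratio. The 1,000 GB dataset and the helper names are hypothetical, and the 50% and 70% ratios are simply the upper-bound figures quoted above; real savings depend on the data.

```python
def raw_footprint_gb(dataset_gb: float, replication_factor: int) -> float:
    """Capacity the cluster consumes before any storage efficiency is applied."""
    return dataset_gb * replication_factor


def footprint_after_savings_gb(raw_gb: float, savings_ratio: float) -> float:
    """Capacity remaining after a fractional reduction (0.5 means 50% saved)."""
    if not 0.0 <= savings_ratio < 1.0:
        raise ValueError("savings_ratio must be in the range [0, 1)")
    return raw_gb * (1.0 - savings_ratio)


# Hypothetical 1,000 GB dataset stored with a replication factor of 3.
raw = raw_footprint_gb(dataset_gb=1000, replication_factor=3)
for ratio in (0.5, 0.7):
    reduced = footprint_after_savings_gb(raw, ratio)
    print(f"{ratio:.0%} savings: {reduced:.0f} GB instead of {raw:.0f} GB")
```

The replication factor multiplies the raw footprint, so any percentage saved applies to three or more times the logical dataset size.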

Big Data Analytics with Cloud Volumes ONTAP

NoSQL in cloud deployments is frequently used for big data management and analytics projects. Organizations use big data analytics to examine huge datasets in order to uncover hidden patterns and insights and to improve business decisions.

For successful big data analytics projects, users need to make sure they have a way to get all the data synced and consolidated from disparate places into one environment while ensuring a high level of data scalability, availability, mobility and security. This is where the many data management features of Cloud Volumes ONTAP can help with big data analytics.

Cloud Volumes ONTAP serves out block-level storage, as is used by NoSQL database systems, and also NFS and SMB file shares, which can be used to store the large datasets consumed by cloud-based analytics services. Apache Hadoop can connect through to NFS storage using NetApp In-place Analytics. This allows data files to be stored in a single, central repository and accessed uniformly by all users and services.

Each compute node in an Apache Hadoop cluster normally stores a part of the full dataset to be processed, in a similar way to NoSQL database systems. Separating out the data storage, however, by using Cloud Volumes ONTAP to create a data lake has the following benefits:

No data copying: By using NFS, all users and processes can readily access the same data without the need to make copies to Apache Hadoop clusters or other storage platforms. This reduces both storage requirements and the time it takes to start processing analytics workloads.
Less redundancy: As with NoSQL database systems, Apache Hadoop clusters also make multiple, redundant copies of the data they store across a number of different nodes. With Cloud Volumes ONTAP, this is not necessary, and a single copy of the data can be used, which increases storage efficiency and reduces costs.
Cluster independence: When the compute cluster is not being used, it can safely be taken down without affecting the data, which resides separately in Cloud Volumes ONTAP. Similarly, a compute cluster could be reused to process other workloads and datasets.
Improved scalability: Separating out compute and storage means that both can be scaled up individually. If more compute power is required, Cloud Volumes ONTAP will scale to serve out more concurrent requests, and storage volumes can easily be grown to accommodate more data.
Data management: Cloud Volumes ONTAP features many storage efficiency and management technologies, as described above. Snapshots, data cloning, deduplication, compression, and tiering are all available for NFS and SMB file shares hosted on Cloud Volumes ONTAP.

Conclusion

NoSQL database clusters benefit from storage systems that support advanced features for managing block-level storage. Cloud Volumes ONTAP provides unparalleled levels of storage management in the cloud, catering for block-level storage for NoSQL database systems, as well as NFS shares that can be accessed directly from cloud big data analytics clusters.

The built-in storage efficiency features have a direct impact on costs for NoSQL in cloud deployments. The data protection and flexibility provided by features such as snapshots and FlexClone® give NoSQL database administrators and big data engineers the power to manage large volumes of data effectively.

For more in our cloud databases series, check out the previous entries on database challenges, SQL, and Oracle, as well as the next part, which will focus on database storage tiering in the cloud.
