
Deep Learning and AI in the Cloud with NFS Storage

July 30, 2019


Though it may seem like a rapid technological development, deep learning didn’t happen overnight. Its foundations have been around since the 1940s, gradually ushering us toward where we are today: closer than ever to developing an authentic artificial intelligence (AI). But what tipped the scales in the past few decades? The answer is simple: data. 

We have more data at our disposal, along with greater access and wider distribution, thanks in no small part to the growth of the cloud. Coupled with gains in computational power, that data has given deep learning broad applications. It has directly contributed to highly accurate medical diagnostics software and to advancements in self-driving cars, video games, marketing, and machine translation.

As you might imagine, deep learning demands highly performant storage to handle all of that data. Cloud Volumes Service for AWS is one such solution, and it also makes it easy to deploy and manage large NFS shares.

How Deep Learning Training Works

How are deep learning workloads consumed? At the heart of deep learning and AI development are data scientists, who analyze data sets to inform and support an organization’s strategy or applications. Deep learning is one of the tools they use: they train a model on a collected data set so that the model can later perform a specific task by itself with high accuracy.

It begins with the data ingest phase. This is where the data that’s going to be used for training is obtained or imported from a source and then, depending on the case, cleansed and normalized. Once the data ingest phase is complete, data scientists have a data set available to train the AI. Now the deep learning training phase kicks in. This phase is characterized by heavy parallel computing demands, which are usually met by GPU-based clusters. The whole data set needs to be fed to the neural network several times so that the model’s parameters can be tuned and its results get closer to the desired output.

In the AI lexicon, these cycles (wherein the whole data set is fed to the neural network) are called “epochs.” Within an epoch, the whole data set can’t be fed to the neural network all at once: it has to be partitioned into smaller sections of data called “batches.” The batch size is the number of data samples passed through the neural network in a single pass. Each pass of one batch is known as an “iteration,” so the number of iterations in an epoch equals the number of batches needed to cover the whole data set.
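The relationship between data set size, batch size, and iterations can be sketched in a few lines of Python. This is only an illustration: the 1,000-sample data set, the batch size of 64, and the helper names are assumptions made for the example, not values from any particular workload.

```python
import math


def iterations_per_epoch(num_samples: int, batch_size: int) -> int:
    """Return how many batches are needed to cover the data set once."""
    if num_samples < 0 or batch_size <= 0:
        raise ValueError("num_samples must be >= 0 and batch_size must be > 0")
    # The last batch may be smaller than batch_size, hence the ceiling.
    return math.ceil(num_samples / batch_size)


def make_batches(samples, batch_size):
    """Yield consecutive slices of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


# Hypothetical values, chosen only for illustration.
dataset = list(range(1000))   # 1,000 placeholder samples
batch_size = 64
epochs = 3

per_epoch = iterations_per_epoch(len(dataset), batch_size)
total_iterations = per_epoch * epochs
batches = list(make_batches(dataset, batch_size))

print(f"{per_epoch} iterations per epoch, {total_iterations} in total")
print(f"last batch holds {len(batches[-1])} samples")
```

Because 1,000 is not a multiple of 64, the final batch of each epoch is smaller than the rest. Some training setups drop that partial batch instead of processing it.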

Deep learning starts as a random process: the parameters (weights) that the model uses to calculate its outputs are initialized almost at random. At the end of each batch, the training algorithm measures how far the results predicted by the model deviate from the “truth.” Using a method called back propagation, it then adjusts the weights that influence those results. The next batch is processed with the adjusted values, and the cycle repeats until the outputs generated by the model begin to resemble realistic results.
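The batch-by-batch adjustment described above can be sketched with a deliberately tiny stand-in for a neural network. The example below is an assumption-laden illustration, not a real deep learning workload: it fits a one-variable linear model, the values it adjusts are called weights, and the gradient is written out by hand, which is the step that back propagation automates for multi-layer networks. The function name, learning rate, and data are all hypothetical.

```python
import random


def train_linear_model(xs, ys, epochs=500, batch_size=4,
                       learning_rate=0.05, seed=0):
    """Fit y = w * x + b with mini-batch gradient descent."""
    rng = random.Random(seed)
    # Start from (almost) random values, as described in the text.
    w = rng.uniform(-1.0, 1.0)
    b = rng.uniform(-1.0, 1.0)
    indices = list(range(len(xs)))

    for _ in range(epochs):                      # one epoch = one full pass
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = 0.0
            grad_b = 0.0
            for i in batch:
                # How far the prediction is from the "truth".
                error = (w * xs[i] + b) - ys[i]
                # Gradient of the squared error with respect to w and b.
                grad_w += 2.0 * error * xs[i]
                grad_b += 2.0 * error
            # Adjust the values; the next batch uses the updated w and b.
            w -= learning_rate * grad_w / len(batch)
            b -= learning_rate * grad_b / len(batch)
    return w, b


# Synthetic, noise-free data generated from y = 2x + 1 (illustrative only).
xs = [i / 10 for i in range(20)]
ys = [2 * x + 1 for x in xs]

w, b = train_linear_model(xs, ys)
print(f"learned w={w:.3f}, b={b:.3f}")
```

With noise-free data the learned values should end up close to the slope and intercept used to generate it. A real network has many more adjustable values, and frameworks compute the gradients automatically.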

Data Storage for Deep Learning 

The general flow that a deep learning workload follows during a training phase largely shapes data scientists’ storage needs. Deep learning training places heavy demands on parallel computing, which in turn demands high input bandwidth from the data lake it feeds on. Deep learning training is a read-intensive I/O task that requires high storage performance.

Versions of all of the work that’s been done need to be stored and accessible for use at any time. Training runs can last days, and accidental interruptions can be expensive and time-consuming. Data scientists need to constantly save the progress of each training epoch in case they need to resume training from a given point. They also need to store the results of each epoch and fine-tune the values to manage the training of each model and to track changes in output accuracy.
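One common way to save progress per epoch is to write a checkpoint file at the end of each epoch and look for the newest one on startup. The sketch below shows the idea using only the Python standard library. The file naming scheme, the JSON format, and the placeholder "training" step are assumptions made for illustration. In practice the checkpoint directory would sit on shared storage such as an NFS mount, and a training framework's own checkpoint format would usually replace the JSON file.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(directory: Path, epoch: int, state: dict) -> Path:
    """Write one checkpoint per epoch; rename last so readers never see a partial file."""
    directory.mkdir(parents=True, exist_ok=True)
    final_path = directory / f"epoch_{epoch:04d}.json"
    tmp_path = final_path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps({"epoch": epoch, "state": state}))
    os.replace(tmp_path, final_path)
    return final_path


def load_latest_checkpoint(directory: Path):
    """Return the newest checkpoint as a dict, or None if there is none."""
    if not directory.is_dir():
        return None
    checkpoints = sorted(directory.glob("epoch_*.json"))
    if not checkpoints:
        return None
    return json.loads(checkpoints[-1].read_text())


def train(directory: Path, total_epochs: int) -> int:
    """Resume from the latest checkpoint if present; return the starting epoch."""
    latest = load_latest_checkpoint(directory)
    start_epoch = latest["epoch"] + 1 if latest else 0
    state = latest["state"] if latest else {"loss": 1.0}
    for epoch in range(start_epoch, total_epochs):
        # Placeholder for a real training epoch: just halve a fake loss value.
        state = {"loss": state["loss"] * 0.5}
        save_checkpoint(directory, epoch, state)
    return start_epoch


# A temporary directory stands in for a shared checkpoint location.
with tempfile.TemporaryDirectory() as tmp:
    checkpoint_dir = Path(tmp) / "checkpoints"
    first_start = train(checkpoint_dir, total_epochs=3)    # fresh run
    resumed_start = train(checkpoint_dir, total_epochs=5)  # picks up where it stopped
    final = load_latest_checkpoint(checkpoint_dir)

print(first_start, resumed_start, final["epoch"])
```

The second call to train simulates a restart after an interruption: it finds the checkpoint from the last completed epoch and continues from the next one instead of starting over.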

With these big data requirements comes the need for highly capable file systems and storage solutions that meet the performance demands of intense workloads, such as feeding your cluster during deep learning training. Data scientists commonly use a shared file system such as NFS, a leading contender in the field. Doing so helps them meet the training process’s demands for high capacity, density, throughput, and I/O operations, along with low latency. Those requirements underlie the entire deep learning process. NFS in the cloud offers specific benefits in this case. For instance, a file share in the cloud can scale up on demand and provide cost-effective availability in ways that are hard to match with on-premises systems.

It’s also important for data scientists to be able to instantly create and share unlimited copies of each version so that other team members can access them, copy the model, and perform the constant tests required during the training phase.

Another prerequisite is an easy-to-manage file system that’s capable of handling both random and sequential I/O access patterns without sacrificing performance. This is where Cloud Volumes Service for AWS comes in. 

Meeting the Demands of Deep Learning Workloads

Cloud Volumes Service for AWS supports the storage infrastructure required by deep learning during its most resource-demanding phase, when the AI algorithm is trained. Using Cloud Volumes Service, you can deploy large NFS shares in just a few clicks, without having to manage the underlying storage stack’s configuration.

Those NFS shares can host your training data sets. Each highly available volume delivers more than 200k IOPS when accessed by multiple clients, and sequential reads reach about 3,000 MB/s. And that’s just one volume. You can also adjust any volume’s performance level on the fly.
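To put the sequential read figure in context, a rough back-of-the-envelope estimate shows how long it would take to stream a data set from one volume once per epoch. Only the 3,000 MB/s figure comes from the text above. The 1.5 TB data set size and the 10 epochs are hypothetical, and the estimate assumes storage is the only bottleneck, with no caching and no time spent on computation.

```python
def seconds_to_read(dataset_mb: float, throughput_mb_per_s: float) -> float:
    """Ideal time to read a data set once at a sustained throughput."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return dataset_mb / throughput_mb_per_s


SEQUENTIAL_READ_MB_PER_S = 3000   # figure quoted in the text
dataset_mb = 1_500_000            # hypothetical 1.5 TB data set (decimal units)
epochs = 10                       # hypothetical number of passes

per_epoch_s = seconds_to_read(dataset_mb, SEQUENTIAL_READ_MB_PER_S)
total_hours = per_epoch_s * epochs / 3600

print(f"{per_epoch_s:.0f} s per epoch, {total_hours:.2f} h for {epochs} epochs")
```

Real training runs rarely read at the peak rate for the whole job, so treat the result as a lower bound on read time rather than a prediction.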

Cloud Volumes Service also provides schedulable, automatic Snapshot copy policies. Snapshot copies are read-only copies of your volumes that you can restore and replicate instantly for shared access. This makes it simpler for other team members to access information and run tests at the same time by cloning volumes and mounting them from other instances.

Is Cloud Volumes Service For You?

Sign up for a free Cloud Volumes Service for AWS demo and learn more about how this fully managed service can help you with your AI demands.

Kristina Brand, Cloud Data Services
