
Genomics Data Puts NetApp and AWS to the Test

March 14, 2019

6 minute read

How long would it take you to run a query against 20 trillion data points? That’s not a rhetorical question. Can your NFS solution even do it? WuXi NextCODE’s solution can, and they know exactly how long it will take them.

A few months ago, we told you about a leading genomics company that had begun to use NetApp® Cloud Volumes Service to bolster its analysis capabilities in genetic sequencing. (See Personalizing Healthcare Through Faster Genome-Sequence Analysis.) WuXi NextCODE gathers vast caches of information about “cohorts,” or subsets of the population that share specific traits, and then breaks that information down for their pharmaceutical customers. The pharmaceutical companies use analytics to uncover hidden patterns, unknown correlations, and other insights that improve their ability to prevent and cure diseases based on differences in genetic makeup, lifestyle, and environmental factors.

As you might imagine, creating cohorts and sequencing genomes can be challenging. The sheer amount of data generated is mind-boggling: A single complete genome is made up of 3 billion “base pairs” of DNA molecules, and sequencing a whole genome generates more than 100GB of data. By 2025, over 100 million human genomes could be sequenced. That’s over a zettabyte of data!
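As a rough sanity check on those figures, here is a back-of-envelope sketch using the article's numbers. (The jump from this figure to a zettabyte presumably comes from raw sequencing reads and analysis intermediates, which are typically several times larger than the finished 100GB per genome.)

```python
# Back-of-envelope scale check using the article's figures.
genomes = 100_000_000            # ~100 million genomes sequenced by 2025
bytes_per_genome = 100 * 10**9   # ~100GB of data per whole genome

total_bytes = genomes * bytes_per_genome
exabytes = total_bytes / 10**18
print(f"~{exabytes:.0f} EB of finished sequence data")
```

Finished sequence data alone lands on the order of 10 exabytes; once raw reads and intermediate files are included, total estimates climb toward zettabyte scale.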

A Path to the Cloud

Highly specialized high-performance workloads like genome sequencing, rendering, and database workloads have traditionally had a hard time making their way into the cloud. Many companies have relied on an on-premises architecture built around a high-performance computing (HPC) cluster with highly scalable local storage. That setup creates an enormous hurdle when thousands of research scientists need access to the complete clinical data of millions of individuals.

After the data is moved to shared files in the cloud, access becomes much easier and more fluid. But early cloud services weren’t a viable alternative; they provided only conventional data processing and storage. Customers struggled with the scalability of NFS solutions: the high-speed I/O that’s fundamental to their analysis simply wasn’t available. Often, the next step was to spin up a homegrown, self-managed NFS service in the cloud, which increased the complexity and cost of processing data and obviated an important reason for moving to the cloud in the first place.

That’s what happened to WuXi NextCODE before NetApp came onto the scene. NetApp has a long history of expertise in managing large-scale file datasets. Not only have we solved the NFS bottleneck, but we also deliver data management capabilities and performance tiers that aren’t otherwise available. With NetApp Cloud Volumes Service for AWS, our fully managed cloud service, we helped WuXi NextCODE move their entire production set into Amazon Web Services (AWS).

The combination of NetApp and AWS is a great fit for WuXi NextCODE’s architecture. They’re running a traditional database using 53 Amazon EC2 instances that all push their analytic data to NetApp Cloud Volumes Service. They tried other enterprise file services first, but found that NFS with our Cloud Volumes Service delivered performance the others couldn’t match.

Putting NetApp and AWS to the Test

Before committing to Cloud Volumes Service, WuXi NextCODE tested it. First, they wanted to see how long it would take to onboard their data. Using NetApp Cloud Sync, they onboarded 50TB of data in less than a weekend. Cloud Sync is included in Cloud Volumes Service and automates data migration, whether between on-premises systems or to the cloud.
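As a rough illustration of the sustained throughput that migration implies (assuming "less than a weekend" means about 48 hours; the duration is our assumption, not a figure from the article):

```python
# Implied sustained throughput for the 50TB onboarding.
terabytes = 50
hours = 48        # assumption: "less than a weekend" taken as ~48 hours

bytes_total = terabytes * 10**12
seconds = hours * 3600
mb_per_second = bytes_total / seconds / 10**6
print(f"~{mb_per_second:.0f} MB/s sustained")
```

Sustaining roughly 290 MB/s for two days straight is the kind of bulk-transfer rate that makes onboarding at this scale practical at all.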

The second test was to compare Cloud Volumes Service with the other file services they’d tried. When they ran their first test cases, they found that NetApp’s product was over three times faster than the other cloud file systems. As an added measure, they tested certain use cases that the other file systems had trouble running, and were able to run them to completion.

The big test came when they decided to run a genome query they had created but never been able to complete: it touched some 20 trillion data points across more than 500 genetic cohorts in a proprietary database for genomic analytics. They wanted to see whether switching to our Cloud Volumes Service would let them finish it, because other services hadn’t. In previous testing with other NFS solutions, the query would time out after 3 or 4 hours. Using our service, they ran it to completion for the first time, and far more swiftly than they’d anticipated.

Beyond Performance

In addition to its outstanding performance, NetApp Cloud Volumes Service differentiates itself from other file-sharing options through its cloning capabilities and NetApp Snapshot technology.

It’s crucial that data scientists, such as genetic researchers, be able to work on the most up-to-date production datasets. They also need to replicate the data across many different environments; clones allow them to make multiple copies of datasets for testing, staging, development, and production. Cloud Volumes Service creates space- and time-efficient clones of datasets that take zero capacity and near-zero time to produce, improving quality and time to market while lowering costs.

Protecting datasets is critical. With hundreds of scientists constantly accessing shared files, the data is always changing, and accidental deletion, corruption, or modification of genetic data can have a devastating effect. Our Cloud Volumes Service uses Snapshot copies to make and maintain frequent, low-impact, user-recoverable copies of files, directory hierarchies, and application data. Snapshot technology imposes minimal performance overhead, and copies can be safely created on a running system; a Snapshot copy typically takes less than 1 second to create, regardless of the size of the volume or the level of activity. After the first Snapshot copy is created, only subsequent changes in the data need to be recorded. For example, instead of copying 40TB chunks over and over again, NetApp Snapshot technology simply records the differences between copies. This incremental behavior reduces storage capacity consumption, radically cuts the time required to back up data, and increases overall performance. Some alternative implementations consume storage volumes rivaling that of the active file system, which raises capacity requirements and increases costs.
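The incremental behavior described above can be sketched in a few lines. This is a toy model, not NetApp’s implementation: each snapshot physically stores only the blocks that changed since the previous snapshot, while a full point-in-time view can still be reconstructed by replaying the deltas.

```python
class Volume:
    """Toy block store with incremental snapshots (block deletions omitted)."""

    def __init__(self):
        self.active = {}       # block_id -> data in the active file system
        self.snapshots = []    # list of deltas: blocks changed since prior snapshot

    def write(self, block_id, data):
        self.active[block_id] = data

    def take_snapshot(self):
        """Record only the blocks that differ from the previous snapshot's view."""
        previous = self._view(len(self.snapshots) - 1)
        delta = {b: d for b, d in self.active.items() if previous.get(b) != d}
        self.snapshots.append(delta)
        return len(delta)      # physical blocks this snapshot consumes

    def _view(self, index):
        """Reconstruct the point-in-time contents at snapshot `index`."""
        view = {}
        for delta in self.snapshots[:index + 1]:
            view.update(delta)
        return view

    def restore(self, index):
        self.active = dict(self._view(index))


vol = Volume()
for i in range(5):
    vol.write(i, f"data-{i}")
first = vol.take_snapshot()    # stores all 5 blocks

vol.write(2, "data-2-modified")
second = vol.take_snapshot()   # stores only the 1 changed block

vol.restore(0)                 # roll back to the first snapshot
print(first, second, vol.active[2])
```

The second snapshot consumes one block instead of five, which is the same principle that lets a snapshot of a 40TB volume record only what changed rather than another 40TB copy.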

Choose Your Own Performance Level

Cloud Volumes Service also lets you change a volume’s performance tier dynamically as your needs change from moment to moment. You can move freely between the standard, premium, and extreme levels, and you pay for each service level only while you’re using it.

Learn More

To learn more about WuXi NextCODE, the results of their benchmark testing, and how Cloud Volumes Service is able to satisfy the extreme demands of genomics analytics, read The Internet of DNA: Cloud Enabled.

Take Cloud Volumes Service for a Spin

Cloud Volumes Service for AWS is a fully managed cloud service that enables you to move your workloads and applications to the cloud and manage them with ease. You might not have a query with 20 trillion data points, but your workload would probably benefit from optimized performance. Put Cloud Volumes Service for AWS to the test. For more information, visit our Cloud Volumes Service for AWS page at NetApp Cloud Central, or sign up to request a demo.