The first whole human genome sequence took 20 years and $3 billion to complete. The modern process takes a fraction of that time and cost. But companies, universities, and laboratories still have a big problem: How can they efficiently secure, mine, process, and share billions to trillions of sequence data objects?
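The solution is high-performance genomics cloud computing enabled by three core components: the WuXi NextCODE genomic database, Amazon Web Services (AWS), and NetApp Cloud Volumes Service for AWS.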
Genomic sequencing involves multiple steps and generates multi-terabyte data sets. The process starts in the field, where samples are taken from individuals. The samples go to labs, where researchers run them through sequencers to produce raw data files that are a terabyte or more in size. Users then stream or batch these files to specialized databases in the cloud, such as WuXi NextCODE's.
WuXi NextCODE leads the world in human genomic data research for precision medicine, a type of medicine that takes into account individual genetic and environmental variability, among other factors. Let’s use an example to illustrate the sheer amount of data WuXi NextCODE handles: the average number of differences between just two individuals’ DNA is 5 million. Now, multiply 5 million differences by thousands, millions, or billions of people. WuXi NextCODE’s purpose-built database enables researchers to efficiently discover critical differences or mutations. This data drives research into the causes of rare diseases and cancers. Once identified, these discoveries enable researchers to develop more effective treatments.
The core technology of the WuXi NextCODE platform is the genomic relational database. Of the many genomic software products in the world, this database is the only purpose-built architecture to organize, mine, and share large-sequence genomic databases.
Amazon Web Services (AWS) optimizes its cloud offerings for genomics processing and collaboration. Dynamic scalability and a broad ecosystem of genomics tools and partners enable researchers to process and share massive genomics data sets and workloads.
AWS customers can retain their on-premises computing environment and seamlessly bridge to AWS for low-cost big data storage and high-performance dataset processing.
The third core component of cloud genomics processing is NetApp Cloud Volumes Service for AWS, a fully managed file service suited to HPC in the cloud. It provides highly scalable, durable, high-performance SMB shares on AWS for genomics processing.
NetApp Cloud Volumes Service: Lowers Storage and High-Performance Computing Costs
AWS has three options for storing and analyzing large data sets:
There is a fourth option: Deploy NetApp Cloud Volumes Service to lower costs and accelerate genomics processing on AWS.
Genomics users frequently cite low storage costs as the reason for moving to AWS.
S3 is popular because storage costs only $0.01-$0.02 per GB per month. At first glance, that is lower than Cloud Volumes Service (CVS), which charges $0.10 per GB per month for Standard. However, S3 costs rise significantly during processing because S3 charges for data access. Although the access charge is nominally low, only $0.0004 per 1,000 GET requests, a large sequencing job can easily generate thousands of GET requests per second. Over a long-running job, this can add hundreds of dollars to a single processing task on S3.
NetApp CVS does not charge for data access, which considerably lowers overall processing costs for HPC on AWS to about 60%-75% of a similar operation on AWS S3.
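To see how small per-request charges add up, here is a minimal sketch of the arithmetic in Python. The prices are the ones quoted above; the request rate, job duration, and data set size are hypothetical values chosen only for illustration, and the function names are not part of any AWS or NetApp API.

```python
# Prices quoted in this article (USD). Everything else below is a
# hypothetical workload, not a measured result.
S3_PRICE_PER_1000_GETS = 0.0004
S3_STORAGE_PER_GB_MONTH = 0.02       # upper end of the $0.01-$0.02 range
CVS_STANDARD_PER_GB_MONTH = 0.10     # CVS does not charge for data access


def s3_get_cost(requests_per_second: float, hours: float) -> float:
    """GET request charges for a job sustaining a steady request rate."""
    total_requests = requests_per_second * hours * 3600
    return total_requests / 1000 * S3_PRICE_PER_1000_GETS


def storage_cost(size_gb: float, price_per_gb_month: float, months: float = 1.0) -> float:
    """Flat storage charge for a data set held for a number of months."""
    return size_gb * price_per_gb_month * months


if __name__ == "__main__":
    # Assumed job: 5,000 GET requests per second sustained for 48 hours.
    print(f"S3 GET charges:  ${s3_get_cost(5000, 48):.2f}")
    # Assumed data set: 1,000 GB stored for one month.
    print(f"S3 storage:      ${storage_cost(1000, S3_STORAGE_PER_GB_MONTH):.2f}")
    print(f"CVS storage:     ${storage_cost(1000, CVS_STANDARD_PER_GB_MONTH):.2f}")
```

Under these assumed numbers, the request charges for a single job exceed a month of storage on either service. The sketch covers only storage and GET charges, so it should not be read as a full cost model for either service.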
CVS raises performance with high IOPS: up to 460k IOPS with low latency on large genomics databases. WuXi NextCODE tested CVS on their real-world cloud genomics database on AWS. Here’s what they found.
Read the WuXi NextCODE Case Study and visit Cloud Central to learn more about Cloud Volumes Service for AWS.