Archiving in the cloud has become a necessity for most organizations. Having adopted cloud computing at a rapid pace and on a vast scale, companies now need low-cost solutions for storing massive amounts of data that will only be accessed infrequently.
All the data that a company stores in the cloud has to be backed up to ensure high availability and disaster recovery. Cloud service providers offer various backup tools, but those tools differ depending on data criticality and frequency of access.
Fortunately, most cloud providers offer storage services for durable backup that are also cost-effective. This article will take a close look at one cloud archival storage service – Amazon Glacier – and walk you through its features and archiving process.
The major differences between standard cloud storage and archival cloud storage are cost and retrieval time: archival storage rates are much cheaper than standard storage rates, but retrieval times are much longer. The slower retrieval is acceptable because the data is not going to be accessed frequently.
Archiving this kind of infrequently-accessed data in the cloud achieves reliability and cost-optimization. The data can be archived automatically from the cloud storage when it reaches a certain age, and it can always be retrieved later, if required.
Cloud archiving provides the flexibility of retaining a large amount of infrequently accessed data, at a lower storage cost. Archiving can also come with an expiration date, upon which the archived data will be deleted permanently.
Consider how this might work with something like log files. Suppose you have an application that generates gigabytes of log files every month. Over a period of several months, those files may soon be measured in terabytes.
Generally, the log files are only accessed during the first month after they are generated; however, for audit purposes, your organization's compliance team wants to preserve these files for one year. In this case, it would make sense to automatically archive the data to archival storage after one month to save on storage costs.
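To make this concrete, here is a minimal sketch of what such a rule might look like as an S3 lifecycle configuration. The bucket layout and the logs/ prefix are assumptions for the example; the rule transitions log objects to Glacier 30 days after creation and expires them after one year:

    {
        "Rules": [
            {
                "ID": "archive-then-expire-logs",
                "Filter": { "Prefix": "logs/" },
                "Status": "Enabled",
                "Transitions": [ { "Days": 30, "StorageClass": "GLACIER" } ],
                "Expiration": { "Days": 365 }
            }
        ]
    }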
AWS Glacier is a low-cost, durable, and secure cloud archive solution for archiving your data in the AWS cloud. It offers options to retrieve archived data with durations ranging from a few minutes to several hours. It provides annual durability of 99.999999999% and uses checksums to maintain data consistency.
When it comes to security and data protection in the cloud, AWS Glacier provides a number of security features, such as default encryption, immutable archives, request signing, and flexible access control with AWS Identity and Access Management (AWS IAM) policies.
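For example, vault access can be limited with a standard IAM policy. The following is only a sketch, with a placeholder account ID, region, and vault name; it allows a user to upload archives to one vault and to initiate and read retrieval jobs, and nothing else:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "glacier:UploadArchive",
                    "glacier:InitiateJob",
                    "glacier:GetJobOutput"
                ],
                "Resource": "arn:aws:glacier:us-west-2:123456789012:vaults/testVault"
            }
        ]
    }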
Initially, AWS Glacier only had a single option for data retrieval, one which took three to five hours to restore an archive. Since then, Amazon has expanded Glacier's options and now offers three different retrieval options (a sample retrieval request follows the list):
1. Expedited Retrieval - allows you to download data in just 1-5 minutes
2. Standard Retrieval - the original option, which takes 3-5 hours to restore an archive
3. Bulk Retrieval - allows you to download terabytes of data, but takes around 5-12 hours
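The retrieval option is chosen per request when a retrieval job is initiated. As a minimal sketch, assuming a vault named testVault and with the archive ID left as a placeholder, an Expedited retrieval could be requested like this (replacing "Expedited" with "Standard" or "Bulk" selects one of the other tiers):

    aws glacier initiate-job --account-id - --vault-name testVault \
        --job-parameters '{"Type": "archive-retrieval", "ArchiveId": "<archive_id>", "Tier": "Expedited"}'

Once the job completes, the restored data is downloaded with the get-job-output command.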
To optimize the cost of data retrieval, AWS offers three data retrieval policies, which apply only to the Standard retrieval option. These policies are Free Tier Only, Max Retrieval Rate, and No Retrieval Limit.
For applications that don't need to retrieve huge amounts of data, the Free Tier Only policy is the best choice. The user can configure their account so that retrievals never exceed the free retrieval limit set for each day.
If you need to download more data than the free retrieval limit allows, Max Retrieval Rate is the more suitable policy, since it allows you to set a bytes-per-hour limit. With both the Free Tier Only and Max Retrieval Rate policies, retrieval requests that exceed the set limit are rejected.
If your retrieval needs are higher still, the No Retrieval Limit policy should be used. The pricing for these policies varies: the retrieval limit is specified in GB per hour, and the estimated cost is calculated from that value.
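As a rough sketch of how such a policy can be set from the CLI, the Max Retrieval Rate policy corresponds to the BytesPerHour strategy. The 1 GB/hour limit below is an arbitrary example value; using "FreeTier" or "None" as the strategy selects the Free Tier Only or No Retrieval Limit policy instead:

    aws glacier set-data-retrieval-policy --account-id - \
        --policy '{"Rules":[{"Strategy":"BytesPerHour","BytesPerHour":1073741824}]}'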
AWS Glacier is useful in many cases, such as archiving large media files, healthcare data, log data, and content for large video streaming services, to name just a few examples. Archiving is also helpful for maintaining regulatory and compliance data, digital preservation of physical copies, and scientific data stored for future reference, as well as for replacing tape drives.
Consider the use case of a media streaming application hosted in the AWS cloud. The application lets the organization go live daily and stores the streamed videos on AWS S3 as a cloud storage backup. Streamed videos older than one month are no longer needed day to day, but may have to be retrieved at some future date.
The application also generates logs, which are stored on AWS S3. These archived logs can be examined if the server ever goes down or if the organization is unable to go live at some point.
A solution is needed to back up the data and also to save on costs.
The older streamed videos will be archived to Glacier, and the older logs will be permanently deleted. Retrieval requests are governed by the retrieval policy set in Glacier. Streamed videos older than one month are archived using Glacier's Free Tier Only policy, while videos older than two months are deleted using lifecycle policies.
If the organization changes its methodology and asks to retrieve older videos at more frequent intervals, other retrieval options can be selected.
An AWS S3 bucket is used to archive data to Amazon Glacier. Lifecycle policy rules that manage AWS S3 objects can be defined for both current and previous versions of objects. These rules provide the option to automatically archive objects to Glacier a predefined number of days after the object creation date, as well as the option to automatically expire objects once a set expiration age is reached.
The following steps show how to provision lifecycle policies for the AWS S3 bucket. The images are taken from the AWS S3 console.
The lifecycle rule is now defined. The S3 objects will automatically be archived and then expired once they reach the specified number of days after their creation date.
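If you prefer scripting over the console, a roughly equivalent rule can be applied with the AWS CLI. The bucket name and videos/ prefix below are assumptions for the streaming use case described above; the rule archives videos to Glacier after 30 days and deletes them after 60:

    aws s3api put-bucket-lifecycle-configuration --bucket streaming-backup-bucket \
        --lifecycle-configuration '{
            "Rules": [
                {
                    "ID": "archive-and-expire-videos",
                    "Filter": { "Prefix": "videos/" },
                    "Status": "Enabled",
                    "Transitions": [ { "Days": 30, "StorageClass": "GLACIER" } ],
                    "Expiration": { "Days": 60 }
                }
            ]
        }'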
There is also an option to back up data directly to Glacier. Less important data can be uploaded straight to Glacier to save on storage costs, with different options for retrieval frequency. Glacier provides a management console that can create vaults (containers for archives), but files can only be uploaded using the AWS CLI, SDKs, or REST APIs. The AWS Import/Export service can also be used to transfer data directly to Glacier; it moves data between physical storage devices and Glacier, bypassing the Internet.
The AWS CLI provides a command line interface for uploading data such as photos, documents, videos, and logs directly to Glacier. The steps include creating a vault using the Glacier console and then using the CLI to transfer the data directly to Glacier. There is a 4 GB size limit for uploading an archive in a single operation, and a 40 TB limit for multipart uploads. You won't be able to see the objects in the vault as soon as you upload them: the Glacier console updates the vault inventory only about once a day.
Information about newly archived objects will be available in the next vault inventory update. Every object has an archive ID associated with it. This ID is used to retrieve, delete, and otherwise manage the archives inside the vault. Once an archive is uploaded, it cannot be edited.
After installing the AWS CLI, various parameters can be used to upload an archive. The available parameters are vault-name, account-id, archive-description, checksum, and body. The steps below show how to create a vault and upload files to it:
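The vault can also be created from the CLI instead of the console; here is a minimal sketch, assuming the same testVault name and us-west-2 region used in the upload command below:

    aws glacier create-vault --account-id - --vault-name testVault --region us-west-2

With the vault in place, an archive such as test.zip is uploaded with upload-archive: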
aws glacier upload-archive --account-id <account_id> --vault-name testVault --body test.zip --region us-west-2
{
    "checksum": "241234234234gfsdgfdsgsfsfdbcbbe76cdde932d4646fa7de5f21e18aa67",
    "archiveId": "kKB7ymWcasfsfsffsfJVpPSwhGP6ycSOAekp9ZYe_--zM_mw6k76ZFGEIWQX-ybtRDvc2VkPSDtfKmdsfdgsaQrj0IRQLDUmZwKbfdsfdsfdsfhHO0bjbGehXTcApVud_wyDw",
    "location": "/0123456789012/vaults/testVault/archives/kKB7ysfwfsgfwdfZYe_--zM_mw6k76ZFGEIWQX-ybtRDvc2fffffffwfwrj0IRQLSGsNuDp-AJVlu2ccmDSyDUmZwKbwbpAdGATGDiB3hHO0bjbfefefefe_wyDw"
}
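The archiveId returned above is the handle used for later operations against that archive. For example (a sketch only, with the ID left as a placeholder), the archive can be removed from the vault with delete-archive:

    aws glacier delete-archive --account-id - --vault-name testVault \
        --archive-id "<archive_id>" --region us-west-2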
Best practices for cloud archiving start with understanding the basic difference between cloud backup and cloud archiving. People often confuse these two terms and end up storing their important data in an archive.
That results in higher spending, since retrieving data above the specified limit is more expensive. Cloud backup, on the other hand, is used to increase the availability of data in case of an unexpected data loss.
Storing unwanted, older files in a backup may result in higher storage costs for non-essential data. Your data's criticality and frequency of access are very important factors when deciding on a storage class.
Another thing to consider is how to move data from a backup to an archive. Data that has to be retained for a certain period of time but is unlikely to be accessed still requires storage: this is the perfect case for archiving.
Archiving data that still needs to be accessed, or backing up data that does not need to be accessed, may result in higher storage costs.
Cloud backup and archiving are vital elements of cloud computing.
Every application needs a backup and archiving strategy to ensure high availability and support its disaster recovery solution, and there is always a need to keep storage costs low. Different cloud providers offer different tools for archiving data.
The user must prepare archiving solutions and best practices that balance storage costs and retrieval options. Methodologies must be used to categorize data for backup or archive, and archiving and expiration policies for older data should be set properly.
Want to get started? Try out Cloud Volumes ONTAP today with a 30-day free trial.