Archiving in the cloud has become a necessity for most organizations. Having adopted cloud computing at a rapid pace and on a vast scale, companies now need low-cost solutions for storing massive amounts of data that will only be accessed infrequently.
All the data that a company stores in the cloud has to be backed up to ensure high availability and disaster recovery. Cloud service providers offer various backup tools, but those tools differ depending on data criticality and frequency of access.
Fortunately, most cloud providers offer storage services for durable backup that are also cost-effective. This article will take a close look at one cloud archival storage service – Amazon Glacier – and walk you through its features and archiving process.
The major differences between standard cloud storage and archival cloud storage are cost and retrieval time: archival storage rates are much cheaper than standard storage rates, but retrieval times are much longer. The slower retrieval is acceptable because the data is not going to be accessed frequently.
Archiving this kind of infrequently-accessed data in the cloud achieves reliability and cost-optimization. The data can be archived automatically from the cloud storage when it reaches a certain age, and it can always be retrieved later, if required.
Cloud archiving provides the flexibility of retaining a large amount of infrequently accessed data, at a lower storage cost. Archiving can also come with an expiration date, upon which the archived data will be deleted permanently.
Consider how this might work with something like log files. Suppose you have an application that generates gigabytes of log files every month. Over a period of several months, those files may soon be measured in terabytes.
Generally, the log files are only accessed during the first month after they are generated; however, for audit purposes, your organization's compliance team wants to preserve these files for one year. In this case, it would make sense to automatically archive the data to archival storage after one month to save on storage costs.
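To make this concrete, here is a minimal sketch of what such a rule might look like as an S3 lifecycle configuration. The bucket layout and the logs/ prefix are assumptions for the example; the rule transitions log objects to Glacier 30 days after creation and expires them after one year:

    {
        "Rules": [
            {
                "ID": "archive-then-expire-logs",
                "Filter": { "Prefix": "logs/" },
                "Status": "Enabled",
                "Transitions": [ { "Days": 30, "StorageClass": "GLACIER" } ],
                "Expiration": { "Days": 365 }
            }
        ]
    }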
AWS Glacier is a low-cost, durable, and secure cloud archive solution for archiving your data in the AWS cloud. It offers options to retrieve archived data with durations ranging from a few minutes to several hours. It provides annual durability of 99.999999999% and uses checksums to maintain data consistency.
When it comes to security and data protection in the cloud, AWS Glacier provides a number of security features, such as default encryption, immutable archives, request signing, and flexible access control with AWS Identity and Access Management (AWS IAM) policies.
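For example, vault access can be limited with a standard IAM policy. The following is only a sketch, with a placeholder account ID, region, and vault name; it allows a user to upload archives to one vault and to initiate and read retrieval jobs, and nothing else:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "glacier:UploadArchive",
                    "glacier:InitiateJob",
                    "glacier:GetJobOutput"
                ],
                "Resource": "arn:aws:glacier:us-west-2:123456789012:vaults/testVault"
            }
        ]
    }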
Initially, AWS Glacier only had a single option for data retrieval, one which took three to five hours to restore an archive. Since then, Amazon has expanded Glacier's options and now offers three different retrieval options (a sample retrieval request follows the list):
1. Expedited Retrieval - allows you to download data in just 1-5 minutes
2. Standard Retrieval - the original option, which takes 3-5 hours to restore an archive
3. Bulk Retrieval - allows you to download terabytes of data, but takes around 5-12 hours
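The retrieval option is chosen per request when a retrieval job is initiated. As a minimal sketch, assuming a vault named testVault and with the archive ID left as a placeholder, an Expedited retrieval could be requested like this (replacing "Expedited" with "Standard" or "Bulk" selects one of the other tiers):

    aws glacier initiate-job --account-id - --vault-name testVault \
        --job-parameters '{"Type": "archive-retrieval", "ArchiveId": "<archive_id>", "Tier": "Expedited"}'

Once the job completes, the restored data is downloaded with the get-job-output command.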
To optimize the cost of data retrieval, AWS offers three data retrieval policies, which apply only to the Standard retrieval option. These policies are Free Tier Only, Max Retrieval Rate, and No Retrieval Limit.
For applications that don't need to retrieve huge amounts of data, the Free Tier Only policy is the best choice. The user can configure their account so that retrievals never exceed the free retrieval limit set for each day.
If you need to download more data than the free retrieval limit allows, Max Retrieval Rate is the more suitable policy, since it allows you to set a bytes-per-hour limit. With both the Free Tier Only and Max Retrieval Rate policies, retrieval requests that exceed the set limit are rejected.
If your retrieval needs are higher still, the No Retrieval Limit policy should be used. The pricing for these policies varies: the retrieval limit is specified in GB per hour, and the estimated cost is calculated from that value.
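As a rough sketch of how such a policy can be set from the CLI, the Max Retrieval Rate policy corresponds to the BytesPerHour strategy. The 1 GB/hour limit below is an arbitrary example value; using "FreeTier" or "None" as the strategy selects the Free Tier Only or No Retrieval Limit policy instead:

    aws glacier set-data-retrieval-policy --account-id - \
        --policy '{"Rules":[{"Strategy":"BytesPerHour","BytesPerHour":1073741824}]}'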
AWS Glacier is useful in many cases, such as archiving large media files, healthcare data, log data, and content for large video streaming services, to name just a few examples. Archiving is also helpful for maintaining regulatory and compliance data, digital preservation of physical copies, and scientific data stored for future reference, as well as for replacing tape drives.
Consider the use case of a media streaming application hosted in the AWS cloud. The application lets the organization go live daily and stores the streamed videos on AWS S3 as a cloud storage backup. Streamed videos older than one month are no longer needed day to day, but may have to be retrieved at some future date.
The application also generates logs, which are stored on AWS S3. These archived logs can be examined if the server ever goes down or if the organization is unable to go live at some point.
A solution is needed to back up the data and also to save on costs.
The older streamed videos will be archived to Glacier, and the older logs will be permanently deleted. Retrieval requests are governed by the retrieval policy set in Glacier. Streamed videos older than one month are archived using Glacier's Free Tier Only policy, while videos older than two months are deleted using lifecycle policies.
If the organization changes its methodology and asks to retrieve older videos at more frequent intervals, other retrieval options can be selected.
An AWS S3 bucket is used to archive data to Amazon Glacier. Lifecycle policy rules that manage AWS S3 objects can be defined for both current and previous versions of objects. These rules provide the option to automatically archive objects to Glacier a predefined number of days after the object creation date, as well as the option to automatically expire objects once a set expiration age is reached.
The following steps show how to provision lifecycle policies for the AWS S3 bucket. The images are taken from the AWS S3 console.
The lifecycle rule is now defined. The S3 objects will automatically be archived and then expired once they reach the specified number of days after their creation date.
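If you prefer scripting over the console, a roughly equivalent rule can be applied with the AWS CLI. The bucket name and videos/ prefix below are assumptions for the streaming use case described above; the rule archives videos to Glacier after 30 days and deletes them after 60:

    aws s3api put-bucket-lifecycle-configuration --bucket streaming-backup-bucket \
        --lifecycle-configuration '{
            "Rules": [
                {
                    "ID": "archive-and-expire-videos",
                    "Filter": { "Prefix": "videos/" },
                    "Status": "Enabled",
                    "Transitions": [ { "Days": 30, "StorageClass": "GLACIER" } ],
                    "Expiration": { "Days": 60 }
                }
            ]
        }'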
There is also an option to back up data directly to Glacier. Less important data can be uploaded straight to Glacier to save on storage costs, with different options for retrieval frequency. Glacier provides a management console that can create vaults (containers for archives), but files can only be uploaded using the AWS CLI, SDKs, or REST APIs. The AWS Import/Export service can also be used to transfer data directly to Glacier; it moves data between physical storage devices and Glacier, bypassing the Internet.
The AWS CLI provides a command line interface for uploading data such as photos, documents, videos, and logs directly to Glacier. The steps include creating a vault using the Glacier console and then using the CLI to transfer the data directly to Glacier. There is a 4 GB size limit for uploading an archive in a single operation, and a 40 TB limit for multipart uploads. You won't be able to see the objects in the vault as soon as you upload them: the Glacier console updates the vault inventory only about once a day.
Information about newly archived objects will be available in the next vault inventory update. Every object has an archive ID associated with it. This ID is used to retrieve, delete, and otherwise manage the archives inside the vault. Once an archive is uploaded, it cannot be edited.
After installing the AWS CLI, various parameters can be used to upload an archive. The available parameters are vault-name, account-id, archive-description, checksum, and body. The steps below show how to create a vault and upload files to it:
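The vault can also be created from the CLI instead of the console; here is a minimal sketch, assuming the same testVault name and us-west-2 region used in the upload command below:

    aws glacier create-vault --account-id - --vault-name testVault --region us-west-2

With the vault in place, an archive such as test.zip is uploaded with upload-archive: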
aws glacier upload-archive --account-id <account_id> --vault-name testVault --body test.zip --region us-west-2
{
    "checksum": "241234234234gfsdgfdsgsfsfdbcbbe76cdde932d4646fa7de5f21e18aa67",
    "archiveId": "kKB7ymWcasfsfsffsfJVpPSwhGP6ycSOAekp9ZYe_--zM_mw6k76ZFGEIWQX-ybtRDvc2VkPSDtfKmdsfdgsaQrj0IRQLDUmZwKbfdsfdsfdsfhHO0bjbGehXTcApVud_wyDw",
    "location": "/0123456789012/vaults/testVault/archives/kKB7ysfwfsgfwdfZYe_--zM_mw6k76ZFGEIWQX-ybtRDvc2fffffffwfwrj0IRQLSGsNuDp-AJVlu2ccmDSyDUmZwKbwbpAdGATGDiB3hHO0bjbfefefefe_wyDw"
}
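The archiveId returned above is the handle used for later operations against that archive. For example (a sketch only, with the ID left as a placeholder), the archive can be removed from the vault with delete-archive:

    aws glacier delete-archive --account-id - --vault-name testVault \
        --archive-id "<archive_id>" --region us-west-2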
Best practices for cloud archiving start with understanding the basic difference between cloud backup and cloud archiving. People often confuse these two terms and end up storing their important data in an archive.
That results in higher spending, since retrieving data above the specified limit is more expensive. Cloud backup, on the other hand, is used to increase the availability of data in case of an unexpected data loss.
Storing unwanted, older files in a backup may result in higher storage costs for non-essential data. Your data's criticality and frequency of access are very important factors when deciding on a storage class.
Another thing to consider is how to move data from a backup to an archive. Data that has to be retained for a certain period of time but is unlikely to be accessed still requires storage: this is the perfect case for archiving.
Archiving data that still needs to be accessed, or backing up data that does not need to be accessed, may result in higher storage costs.
Cloud backup and archiving are vital elements of cloud computing.
Every application needs a backup and archiving strategy to ensure high availability and support its disaster recovery solution, and there is always a need to keep storage costs low. Different cloud providers offer different tools for archiving data.
The user must prepare archiving solutions and best practices that balance storage costs and retrieval options. Methodologies must be used to categorize data for backup or archive, and archiving and expiration policies for older data should be set properly.
Want to get started? Try out Cloud Volumes ONTAP today with a 30-day free trial.