
The S3 Outage: Be Prepared For Unavoidable Cloud Failures

The cloud can always come crashing down. That was the case back in February 2017, when an Amazon S3 outage in the US-EAST-1 region took large parts of the web offline for most users in the US. Would you be able to handle an outage that bad if it happened again? What if it affected just one service, such as Amazon S3? And where do AWS snapshots fit in?

In this article, we will discuss concepts and tools that will help you focus your efforts on protecting your cloud deployment, so that when something like the Amazon S3 outage happens again, you will be prepared. One of the main services we’ll examine is Amazon S3, and how snapshots can help protect it.

The Damage That Was Done

When AWS went down, many websites, services, and even IoT gadgets were heavily affected. Perhaps the biggest story from the outage was that one of the businesses suffering from it was AWS itself, since the company relies on its own infrastructure, including Amazon S3, for various services.

Amazon is a big name, but it’s by no means the only big company that has taken a hit of this kind.

Check out the list below:

  • DNS provider Dyn: a DDoS attack by a botnet consisting of a large number of Internet-connected devices, such as printers and IP cameras, that were infected with the Mirai malware.
  • Microsoft Office 365: physical components that degraded under heavy user demand.
  • Salesforce: a service disruption caused by a database failure.
  • Google Cloud Platform: an outage affecting cloud instances and the VPN service in all regions, caused by bugs in the network-management layer.
  • Amazon Web Services in Australia: as storms pummeled Sydney, Australia, on June 4, 2016, an Amazon Web Services region in the area lost power, and a number of EC2 instances and EBS volumes hosting critical workloads for name-brand companies subsequently failed.

Before continuing, it is important to understand where AWS’s responsibility ends and yours begins. The general rule is that security of the cloud, the infrastructure itself, is Amazon’s responsibility, while what’s in the cloud, your data, configuration, and workloads, is yours.

To understand more, make sure you carefully read the AWS shared responsibility model – and make sure your colleagues do as well.

Responsibility at the Vendor Level

Over the last decade, both AWS and Azure, the cloud infrastructure industry leaders, have built up a strong global presence. In fact, AWS has 42 Availability Zones (AZs) spread across 16 regions, while Azure operates in 34 regions.

According to the cloud vendors, these AZs and regions are segregated physical data centers. They are well secured and adhere to strict compliance standards.

In addition, each cloud vendor has its own out-of-the-box replication and recovery mechanism for its Database-as-a-Service (DBaaS) offering or for its object storage pool.

Amazon provides users with the option to create a cross-region read replica of an RDS database, so in case of a failure you can promote your replica to be the new master database.
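For illustration, here is a minimal boto3 sketch of that flow: creating a read replica in a second region, then promoting it during a failover. The instance identifiers, account ID, instance class, and regions are all placeholders, not prescriptions.

```python
import boto3

# Create the replica in the *destination* region; for cross-region
# replicas, the source must be referenced by its full ARN.
dr_rds = boto3.client("rds", region_name="us-west-2")

dr_rds.create_db_instance_read_replica(
    DBInstanceIdentifier="myapp-db-replica",  # placeholder name
    SourceDBInstanceIdentifier=(
        "arn:aws:rds:us-east-1:123456789012:db:myapp-db"  # placeholder ARN
    ),
    DBInstanceClass="db.r5.large",
)

# During a regional failure, promote the replica to a standalone master.
dr_rds.promote_read_replica(DBInstanceIdentifier="myapp-db-replica")
```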

The same goes for Azure SQL Active Geo-Replication. You get a robust physical site and “cloud building blocks” that enable you to build and customize your backup and recovery processes.

You can use the AWS console to take your next EBS volume snapshot, but that won’t work at scale. And it definitely won’t work if you tend to forget things.

The same is true for failover processes and those responsible for running them. To stay prepared:

  • Automate your compute and block storage snapshots, continually and consistently (see the sketch after this list).
  • Stream your data between regions.
  • Separate privileges between your backup repositories and your other environments, especially production.
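As a minimal sketch of the first item, the boto3 script below snapshots every EBS volume carrying a hypothetical Backup=true tag; run on a schedule (cron, Lambda, or similar), it replaces manual console work. The tag key and region are assumptions for illustration.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Find every volume tagged for backup instead of snapshotting by hand.
volumes = ec2.describe_volumes(
    Filters=[{"Name": "tag:Backup", "Values": ["true"]}]
)["Volumes"]

for volume in volumes:
    snapshot = ec2.create_snapshot(
        VolumeId=volume["VolumeId"],
        Description=f"Automated backup of {volume['VolumeId']}",
    )
    print(f"Started {snapshot['SnapshotId']} for {volume['VolumeId']}")
```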


4 Key Cloud Building Blocks at the Application Level


1. Data Replication and Backup

First, it is recommended that your cloud deployment blueprint include data replication and backup as part of the complete application architecture plan.

2. External Disks
On the application level, you should also consider the link between your application and your storage. The most basic and obvious point (though still important to mention, as violating it is not an uncommon occurrence) is that data should be stored on external disks, not on the application’s local server.

Apart from the scenario in which the server goes down, doing so also supports cloud elasticity, where instances come and go.

3. Persistent Backup
Another important checklist item is keeping a persistent backup. When relying on the aforementioned DBaaS options, you should also consider their limitations.

For example, Amazon RDS retains automated backups for at most 35 days, so you will need to come up with a data migration and sync solution that moves data to another region or even to another cloud. Mature databases such as Oracle do come with native support for persistent backup, but they might not leverage the cloud’s scalability.
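As one hedged example of such a solution, the boto3 sketch below copies an RDS snapshot into a second region, where the manual copy persists beyond the 35-day automated-backup window until you delete it. The snapshot identifiers, account ID, and regions are placeholders.

```python
import boto3

# The copy is created by the client in the *destination* region.
dr_rds = boto3.client("rds", region_name="us-west-2")

dr_rds.copy_db_snapshot(
    # Placeholder ARN of an existing snapshot in the source region.
    SourceDBSnapshotIdentifier=(
        "arn:aws:rds:us-east-1:123456789012:snapshot:myapp-db-snap"
    ),
    # Manual copies like this one are retained until explicitly deleted.
    TargetDBSnapshotIdentifier="myapp-db-archive",
    SourceRegion="us-east-1",
)
```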

4. Backup Consistency
This should also be considered: AWS provides EBS snapshots, which are point-in-time backups of your volumes. Automating around this cloud building block supports the need for consistent backups in the cloud, which means no data loss and reliable recovery.

Another option is the Linux Logical Volume Manager (LVM), which allows your disks to be backed up while the data on the volume keeps changing, eliminating the need to take your storage device offline.
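On the EBS side, one way to automate around this building block is the EC2 CreateSnapshots API, which snapshots all volumes attached to an instance in a single crash-consistent, point-in-time operation. A minimal boto3 sketch, with a placeholder instance ID and region:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# One call snapshots every EBS volume attached to the instance,
# all captured at the same point in time (crash-consistent).
response = ec2.create_snapshots(
    InstanceSpecification={
        "InstanceId": "i-0123456789abcdef0",  # placeholder instance
        "ExcludeBootVolume": False,
    },
    Description="Crash-consistent multi-volume backup",
)

for snap in response["Snapshots"]:
    print(snap["SnapshotId"], snap["VolumeId"])
```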

Automate Disaster Recovery Tests

The Amazon S3 outage took a considerable part of the internet down with it. Replicating that level of disaster in your testing process can help ensure that your DR process can handle such an event. No matter how good the planning, backup and recovery processes are ultimately measured during a real event. So why not simulate these events on a regular basis, measuring your system’s robustness continually?

Some of the most interesting and inspiring developments in testing disaster recovery systems were presented by Netflix, the Amazon cloud poster child.

Over the years, Netflix engineers have open-sourced tools such as Chaos Monkey and Chaos Kong, which create random malfunctions across Netflix’s cloud stack. These put the system’s self-healing processes to the test, making sure that points of failure are proactively revealed and fixed.

With the cloud, you should strive to automate a test routine that checks that your backup repositories are up to date and that your failover scripts are doing their job.

As a result, you will not only gain confidence in your system’s robustness but will also identify problems; even if they can’t be fixed on the spot, you will at least be able to identify the manual steps the team needs to take to resolve issues in a timely manner.
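A minimal sketch of such a routine, assuming EBS snapshots are the backup repository being checked: the script below finds the newest snapshot per volume and flags any volume whose latest snapshot is older than 24 hours. The threshold and region are assumptions to adapt to your own recovery-point objectives.

```python
import boto3
from datetime import datetime, timedelta, timezone

# Fail the check if a volume's newest snapshot is older than this window.
MAX_AGE = timedelta(hours=24)

ec2 = boto3.client("ec2", region_name="us-east-1")

# Snapshots owned by this account (pagination omitted for brevity).
snapshots = ec2.describe_snapshots(OwnerIds=["self"])["Snapshots"]

# Keep only the newest snapshot per volume.
latest = {}
for snap in snapshots:
    vol = snap["VolumeId"]
    if vol not in latest or snap["StartTime"] > latest[vol]["StartTime"]:
        latest[vol] = snap

now = datetime.now(timezone.utc)
for vol, snap in latest.items():
    age = now - snap["StartTime"]
    status = "OK" if age <= MAX_AGE else "STALE"
    print(f"{vol}: {snap['SnapshotId']} is {age} old [{status}]")
```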


Want to get started? Try out Cloud Volumes ONTAP today with a 30-day free trial.

