Apache Cassandra can be an integral part of your AWS big data workloads. In this article we’ll take a look at Apache Cassandra and the two deployment options for running it on AWS: as a managed service or self-managed.
Read on below to find out:
- What Is Apache Cassandra?
- AWS and Apache Cassandra
- Amazon Keyspaces: The Apache Cassandra Compatible Managed Service from AWS
- Running Apache Cassandra on EC2: The Self-Managed Option
- Apache Cassandra with Cloud Volumes ONTAP: Managed Service Style with Full Control
What Is Apache Cassandra?
Initially developed at Facebook for its inbox search feature, Apache Cassandra is now an open-source project of the Apache Software Foundation. Its design combines the storage engine and data model of Google’s Bigtable with the distributed storage and replication techniques of Amazon’s Dynamo. The result is a highly scalable, distributed NoSQL database that can handle large amounts of data and deliver sub-millisecond performance.
With a masterless architecture, meaning that every node is the same, Apache Cassandra is linearly scalable whether it runs on commodity hardware or on any cloud infrastructure. Thanks to automatic data replication and high availability features, failed nodes can be replaced without any downtime or performance impact, so Apache Cassandra provides a fault-tolerant platform for mission-critical data scenarios. It also comes with its own query language, CQL (Cassandra Query Language), which is similar to SQL (Structured Query Language), for your database operations.
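To make the masterless idea more concrete, here is a minimal sketch of how a token ring can decide which nodes hold a given row. It is a toy model under stated assumptions, not Cassandra's implementation: the `Ring` class and node names are hypothetical, MD5 stands in for Cassandra's default Murmur3 partitioner, and each node owns a single token where real clusters use many virtual nodes per host.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a ring position (MD5 is a stand-in for Murmur3)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class Ring:
    """Toy token ring in which every node plays the same role."""

    def __init__(self, nodes, replication_factor=3):
        if replication_factor > len(nodes):
            raise ValueError("replication factor cannot exceed node count")
        self.replication_factor = replication_factor
        # Sort nodes by token so the ring can be walked clockwise.
        self._ring = sorted((token(node), node) for node in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, partition_key: str):
        """Return the first node at or after the key's token, plus the next ones clockwise."""
        start = bisect.bisect_left(self._tokens, token(partition_key)) % len(self._ring)
        return [
            self._ring[(start + i) % len(self._ring)][1]
            for i in range(self.replication_factor)
        ]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
owners = Ring(nodes).replicas("customer:42")

# Simulate losing the first replica: the remaining two still hold the row.
survivors = Ring([n for n in nodes if n != owners[0]]).replicas("customer:42")
```

Because any node can compute the same placement from the key alone, no coordinator or primary is needed, and losing one replica leaves the other copies in place.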
Apache Cassandra Use Cases
Thanks to its distributed nature, linear scalability, and performance, Apache Cassandra addresses a wide variety of use cases. It is often used when globally distributed data is necessary, such as in an ecommerce platform where keeping data close to customers lowers latency, and for storing time-series data, such as logs or chat history.
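As an illustration of the time-series use case, the sketch below shows one common way to model chat history in CQL: partition by room and day, and order rows by timestamp within each partition. The keyspace, table, and column names are hypothetical, and the daily bucket size is an assumption that would need tuning for a real workload.

```python
from datetime import datetime, timezone

# Hypothetical keyspace and table, used only for illustration.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS chat.messages_by_room (
    room_id  text,
    day      date,
    sent_at  timestamp,
    sender   text,
    body     text,
    PRIMARY KEY ((room_id, day), sent_at)
) WITH CLUSTERING ORDER BY (sent_at DESC);
"""

# Reads the newest messages of one room for one day; '?' marks bind
# variables for a prepared statement.
LATEST_MESSAGES = """
SELECT sent_at, sender, body
FROM chat.messages_by_room
WHERE room_id = ? AND day = ?
LIMIT 50;
"""


def day_bucket(sent_at: datetime) -> str:
    """Return the UTC calendar day (YYYY-MM-DD) used in the partition key."""
    if sent_at.tzinfo is None:
        raise ValueError("sent_at must be timezone-aware")
    return sent_at.astimezone(timezone.utc).date().isoformat()
```

Bucketing by day keeps each partition bounded in size, while the descending clustering order makes "latest messages first" queries cheap.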
Apache Cassandra on AWS
Does AWS support Cassandra?
To run Apache Cassandra on AWS, you have two options to choose from. You can either run your Apache Cassandra workloads on a managed service, such as the native Amazon Keyspaces or DataStax Astra, or deploy Apache Cassandra on AWS compute services and manage it yourself.
Is DynamoDB based on Cassandra?
This is a fairly common question. While both DynamoDB and Cassandra are NoSQL databases, the answer is no. DynamoDB is a proprietary engine technology from AWS. For customers that want to use a Cassandra-like managed service, AWS suggests Amazon Keyspaces.
Amazon Keyspaces: The Apache Cassandra Compatible Managed Service from AWS
Amazon Keyspaces is a managed database service from AWS, compatible with Apache Cassandra, that allows you to easily set up and scale Cassandra workloads without the administrative overhead of server management. As it is compatible with the most common CQL APIs and Cassandra drivers, you’ll be able to easily update your existing applications to start using Amazon Keyspaces.
Amazon Keyspaces Pros
- Scaling Capacity: Amazon Keyspaces automatically provisions the necessary storage for your tables and scales it up and down according to your application’s data operations. It also offers two read/write capacity modes to choose from:
  - On-demand Capacity Mode: Automatically scales throughput to meet demand. When you have unknown workloads with unpredictable traffic, you can opt for the On-demand capacity mode, which automatically scales your table’s throughput capacity to suit your application’s demands.
  - Provisioned Throughput Capacity Mode: Set throughput to keep costs low. For predictable application traffic, opt for the Provisioned Throughput capacity mode, which lets you define a specific throughput threshold for your application and optimize cost.
- Data Protection: Every table you create is automatically replicated three times across different AWS Availability Zones within the same region, with encryption at rest enabled by default, keeping your data highly available and secure at no additional cost. Additionally, you can enable point-in-time recovery (PITR), which provides continuous backup and restore capabilities. For instance, if data is accidentally overwritten or deleted from your tables, you can restore it to any point within the last 35 days, going back no further than the time PITR was enabled. While you can query your deleted data with no additional cost, PITR backup and restore operations incur additional charges based on the size of your data.
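For the Provisioned Throughput capacity mode described above, a rough sizing calculation can look like the sketch below. The unit sizes (one read capacity unit per 4 KB read per second at LOCAL_QUORUM, one write capacity unit per 1 KB write per second) and the `CUSTOM_PROPERTIES` syntax reflect our understanding of the Amazon Keyspaces documentation and should be verified against the current docs. The function names and the `shop.orders` table are hypothetical.

```python
import math

# Assumed unit sizes; check the current Amazon Keyspaces documentation.
READ_UNIT_BYTES = 4 * 1024   # one RCU: a LOCAL_QUORUM read of up to 4 KB per second
WRITE_UNIT_BYTES = 1024      # one WCU: a write of up to 1 KB per second


def provisioned_units(row_bytes, reads_per_second, writes_per_second):
    """Estimate the capacity units needed for a steady, predictable workload."""
    rcus_per_read = math.ceil(row_bytes / READ_UNIT_BYTES)
    wcus_per_write = math.ceil(row_bytes / WRITE_UNIT_BYTES)
    return {
        "read_capacity_units": rcus_per_read * reads_per_second,
        "write_capacity_units": wcus_per_write * writes_per_second,
    }


def provisioned_mode_cql(table, units):
    """Render a CQL statement that switches a table to provisioned mode (assumed syntax)."""
    return (
        f"ALTER TABLE {table} WITH CUSTOM_PROPERTIES = {{"
        "'capacity_mode': {'throughput_mode': 'PROVISIONED', "
        f"'read_capacity_units': {units['read_capacity_units']}, "
        f"'write_capacity_units': {units['write_capacity_units']}}}}};"
    )


# Example: 2,500-byte rows, 100 reads and 20 writes per second.
units = provisioned_units(row_bytes=2500, reads_per_second=100, writes_per_second=20)
statement = provisioned_mode_cql("shop.orders", units)
```

In this example each read fits in one read unit, while each write needs three write units, so write-heavy workloads with large rows are where provisioned capacity deserves the closest look.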
However, even though it is compatible with Apache Cassandra, there are some key limitations to be aware of when using Amazon Keyspaces:
Amazon Keyspaces Cons
- Limited Cassandra CQL API support: Amazon Keyspaces doesn’t support all of the Apache Cassandra CQL APIs and features. CQL APIs are available at different cluster levels, such as the control and data planes. Examples of CQL APIs that aren’t supported include those related to indexes, triggers, aggregates, and materialized views. When it comes to data types, at the time of writing the frozen and user-defined types are also unavailable, and creating, altering, and removing types is not allowed.
- Single region deployment: At the time of writing, Amazon Keyspaces is only available for use within a single AWS region and lacks multi-region replication. This means that Amazon Keyspaces is not suitable for solutions that require high availability across geographically dispersed locations, since all redundancy is confined to different Availability Zones within the same AWS region.
- Fixed cluster settings: Because it is a managed service, you cannot change the cluster settings.
Astra by DataStax on AWS
You can also subscribe to DataStax Astra, which is available on AWS Marketplace. Astra is a fully managed DBaaS (Database-as-a-Service) from DataStax, built on Apache Cassandra, that allows you to easily deploy your database on the AWS cloud without the operational overhead. It adds exclusive features such as SAI (Storage-Attached Index) and multiple APIs that simplify your database operations and application development. You can choose the database instance whose compute and storage capacity best fits your workloads, and pick from multiple support packages for direct support from DataStax.
Running Apache Cassandra on EC2: The Self-Managed Option
You can self-manage Apache Cassandra on AWS by deploying your clusters on Amazon Elastic Compute Cloud (Amazon EC2) instances. Amazon EC2 offers a wide variety of instances with different compute and network capacity, along with different storage volume types from Amazon Elastic Block Store (EBS), enabling you to pick the most appropriate configuration for your particular use cases.
How Do I Install Cassandra on AWS?
With a self-managed option, you are not bound by the restrictions of a managed database service, so you can install Cassandra using the official instructions. On the AWS side, you can still leverage services for virtual computing (EC2), storage (EBS), and monitoring and logging (CloudWatch), among others. Alternatively, you can take a container-based approach using Amazon ECS or Amazon EKS.
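After installing Cassandra on each EC2 instance, a handful of `cassandra.yaml` settings typically need to be adjusted so the nodes can find each other. The sketch below renders such a fragment. The helper name, cluster name, and private IP addresses are hypothetical, and the choice of `Ec2Snitch` assumes a single-region cluster; treat it as a starting point rather than a complete configuration.

```python
def cassandra_yaml_overrides(cluster_name, seed_ips, node_ip, snitch="Ec2Snitch"):
    """Render the cassandra.yaml settings that usually differ per cluster or node."""
    seeds = ",".join(seed_ips)
    return "\n".join([
        f"cluster_name: '{cluster_name}'",
        "seed_provider:",
        "  - class_name: org.apache.cassandra.locator.SimpleSeedProvider",
        "    parameters:",
        f'      - seeds: "{seeds}"',
        # Address other nodes use to reach this node.
        f"listen_address: {node_ip}",
        # Address clients use to reach this node.
        f"rpc_address: {node_ip}",
        # Ec2Snitch maps the EC2 region and Availability Zone to datacenter and rack.
        f"endpoint_snitch: {snitch}",
    ])


fragment = cassandra_yaml_overrides(
    cluster_name="orders-cluster",
    seed_ips=["10.0.1.10", "10.0.2.10"],
    node_ip="10.0.3.10",
)
print(fragment)
```

Picking seed nodes in different Availability Zones, as in the example addresses above, is a common way to keep the cluster joinable if one zone is unavailable.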
The Pros and Cons of Self-Managed Cassandra
There are a number of benefits to the self-managed option for Cassandra deployment:
Self-Managed Cassandra Pros
- Full control and flexibility: You have complete control over your clusters, allowing you to configure and optimize cluster settings to best suit your workload requirements.
- Access to all the latest Cassandra features and CQL APIs: You get full access to everything Cassandra has to offer. For instance, you will be able to create indexes, user-defined types, user-defined functions, and triggers, which currently isn’t possible with Amazon Keyspaces. Furthermore, with the release of Apache Cassandra 4.0 you get additional features such as Zero Copy Streaming, which delivers data between nodes up to five times faster, and audit logging for security and compliance.
- No quotas: You are no longer subject to Amazon Keyspaces service quotas, such as the 1 MB maximum row size, the number of tables per region, or table-level read and write throughput.
- Multi-region deployment: Using raw AWS compute on EC2 allows you to deploy in multiple regions, which currently isn’t possible with Amazon Keyspaces. This capability enables data to be distributed globally, bringing it closer to your customers and ultimately reducing latency.
- Access to all AWS services: Benefit from all the advantages of AWS cloud and services. A good example would be leveraging AWS KMS for data encryption, and AWS IAM for access control to your Cassandra clusters.
However, the self-managed option also has drawbacks to be aware of:
Self-Managed Cassandra Cons
- Domain expertise: While self-managing Cassandra clusters on EC2 brings full control and flexibility, it requires additional expertise and setup work.
- Increased management overhead: You will be responsible for all software maintenance and for making sure that data is properly secured and that backup and monitoring strategies are in place, which ultimately increases the operational overhead.
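The multi-region deployment point above comes down to how a keyspace is defined. A minimal sketch, assuming a cluster whose snitch reports one datacenter per AWS region: the helper below renders a `CREATE KEYSPACE` statement using `NetworkTopologyStrategy` with a replica count per datacenter. The helper name, keyspace name, and datacenter names are hypothetical; the datacenter names must match whatever your snitch actually reports.

```python
def create_keyspace_cql(keyspace, replicas_per_datacenter):
    """Render a CREATE KEYSPACE statement with per-datacenter replication."""
    if not replicas_per_datacenter:
        raise ValueError("at least one datacenter is required")
    options = ", ".join(
        f"'{datacenter}': {replicas}"
        for datacenter, replicas in replicas_per_datacenter.items()
    )
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {options}}};"
    )


# Three replicas in each of two regions (datacenter names are placeholders).
statement = create_keyspace_cql("shop", {"us-east": 3, "eu-west": 3})
print(statement)
```

With a definition like this, clients in each region can read and write against local replicas, which is what brings data closer to customers.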
Apache Cassandra with Cloud Volumes ONTAP: Managed Service Style with Full Control
Amazon Keyspaces is a good option when you want to run your existing Apache Cassandra workloads in AWS without the operational overhead of server management. However, it does come with its own limitations. On the other hand, by self-managing Apache Cassandra clusters on EC2, you benefit from full control, flexibility, and access to everything Apache Cassandra has to offer, at the price of added operational overhead.
But is there another option?
Cloud Volumes ONTAP, the cloud-based data management solution from NetApp, is a popular option to further enhance your experience when choosing to self-manage Apache Cassandra on AWS.
As a data management platform that works on top of AWS, Azure, and Google Cloud IaaS resources, Cloud Volumes ONTAP lets you avoid the managed service limitations for Apache Cassandra deployment, while making it easy to deploy multicloud and hybrid cluster architectures that fulfill complex scenarios.
But you’re getting more than you would by just running Cassandra on raw EC2 instances. Cloud Volumes ONTAP provides essential enterprise-grade features that aren’t native to AWS, including:
- Cost-saving storage efficiency features that can reduce storage footprint by up to 70%
- Automatic data tiering that moves data between block and object storage based on usage frequency
- Dual-node, cross-region high availability ensuring business continuity with RTO=0 and RPO<60 seconds
- Data protection via point-in-time, highly efficient NetApp Snapshot™ copies
- Seamless data replication and DR with SnapMirror®
- Flexible cloning capabilities that make dev/test faster and less expensive with FlexClone®
- Increased performance through higher throughput and lower latency
- On-the-fly data migration across multiple cloud providers and on-premises deployments, avoiding region restrictions and vendor lock-in
When using Cloud Volumes ONTAP as a data management solution for your self-managed Cassandra clusters, you will benefit from additional features that will lower the operational overhead while also reducing storage costs and increasing performance.