To extract valuable business insights from big data, we need to store, process, and analyze that data. But handling petabytes has always been a challenge: even in the big data era, scaling across massive data sets and dynamically allocating server resources for variable loads remain operational challenges.
To solve these issues in the cloud, AWS offers Elastic MapReduce (EMR), which provides the infrastructure to run Apache Hadoop on Amazon Web Services. Open-source Apache Hadoop scales horizontally and can process structured as well as unstructured data.
In this post we’ll take a closer look at this crucial big data service from AWS and how you can leverage Cloud Volumes ONTAP to do more with big data analytics.
Data compression is always advisable, as it provides multiple benefits: faster access to data, lower storage costs as the stored size shrinks, reduced traffic when exchanging data between AWS Simple Storage Service (AWS S3) and EMR, and easier application of MapReduce operations on compressed data.
EMR supports multiple data compression techniques such as gzip, LZO, bzip2 and Snappy, each of which has its own advantages.
Hadoop works on data sets in parallel: in the process, it splits each file into smaller chunks and applies the MapReduce algorithm on top of them.
If the compression technique is splittable, the MapReduce algorithm has an easier job, as an individual mapper is assigned to each of the split files. Otherwise, a single mapper has to process the entire file.
Hadoop reads data from AWS S3, and the split size depends on the version of the AMI (Amazon Machine Image) being used. Hadoop splits the data on AWS S3 by triggering multiple HTTP range requests.
Generally, for 1 GB of data, Hadoop triggers around 15 parallel requests, reading roughly 64 MB with each request.
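As a rough sanity check on those numbers, the short Python sketch below estimates how many ~64 MB splits (and hence range requests and mappers) a given object size translates into. The helper function and the figures are illustrative only, not part of any AWS SDK.

# Illustrative only: estimate how many ~64 MB splits an S3 object produces.
def estimate_splits(object_size_bytes, split_size_bytes=64 * 1024 * 1024):
    # Ceiling division: every partial chunk still needs its own request/mapper.
    return -(-object_size_bytes // split_size_bytes)

print(estimate_splits(10**9))  # 1 GB object -> 15 splits of ~64 MB each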
So, how can you decide between the data compression algorithms EMR supports?
The list below details the individual benefits of the gzip, LZO, bzip2, and Snappy compression algorithms:
- Gzip: a high compression ratio and broad support, but not splittable, so a single mapper has to process each compressed file.
- LZO: fast compression and decompression with a lower compression ratio; splittable once an index file has been built.
- Bzip2: the highest compression ratio of the four and natively splittable, but also the slowest to compress and decompress.
- Snappy: very fast compression and decompression with a modest compression ratio; not splittable on its own, so it is usually used inside container file formats.
Let's also have a look at the various file formats preferred for EMR: plain text (CSV/JSON), Avro, and SequenceFile are common row-oriented choices, while columnar formats such as Parquet and ORC are generally preferred for analytic queries because they reduce the amount of data scanned.
There are two cluster types used by EMR: persistent clusters and transient clusters. The main difference between the two is their lifecycle: persistent clusters keep running after a job completes, while transient clusters shut down automatically when their work is done.
Persistent clusters remain alive even after a job has completed. They are preferred when continuous jobs have to run on the data: since a cluster takes a few minutes to boot up and spin up, keeping it alive saves the time that would otherwise be lost to repeated initialization.
Persistent clusters are also generally preferred in testing environments. On a transient cluster, an error would end the job and might shut the cluster down, but with a persistent cluster jobs can continue to be submitted and modified.
Persistent clusters also allow jobs to be queued and executed in the order in which they were submitted as soon as the cluster has freed up resources to execute each job.
It is very important to remember to shut down a persistent cluster manually, as it accrues charges even when idle.
Transient clusters shut down automatically after a job is complete. This is a cost-effective approach in which the cluster's metadata is automatically deleted when it shuts down.
The EMR File System (EMRFS) is best suited for transient clusters, as the data resides in AWS S3 and persists irrespective of the lifetime of the cluster. Since transient clusters shut down after the completion of each job, they must be restarted to begin new tasks.
When restarting, transient clusters suffer the booting lag mentioned above before they are ready for use again.
The best practice is to use a single metastore across all EMR clusters, as it provides persistent storage and read-write consistency. The warehouse location is specified in the hive-site.xml configuration file, as shown below:
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>s3n://bucket/hive_warehouse</value>
  <description>location of default database for the warehouse</description>
</property>
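The same property can also be supplied at cluster launch time instead of editing hive-site.xml by hand. Below is a minimal boto3 sketch of that approach; the region, bucket name, instance settings, and IAM roles are placeholders you would replace with your own.

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # placeholder region

response = emr.run_job_flow(
    Name="hive-warehouse-example",
    ReleaseLabel="emr-5.36.0",
    Applications=[{"Name": "Hive"}],
    # Equivalent of the hive-site.xml property shown above
    Configurations=[{
        "Classification": "hive-site",
        "Properties": {"hive.metastore.warehouse.dir": "s3n://bucket/hive_warehouse"},
    }],
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,  # persistent-style cluster
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])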
AWS provides a number of instance type options for various kinds of enterprises, ranging from CPU-intensive and memory-intensive to general-purpose instances.
These instances offer great variety in terms of vCPUs, storage, memory, and network bandwidth.
For memory-intensive workloads, X1 and R4 instances should be preferred over the others. X1 is optimized for large-scale enterprise data and offers the lowest price per GB of RAM.
R4 is aimed at memory-intensive applications such as data mining. If compute-optimized instances are required, C4 is built for enhanced networking and offers the lowest price per unit of compute performance.
T2 and M4 are general-purpose instances that offer a balance of compute, memory, storage, and networking resources. You can find more information on instance types here.
Configuring the appropriate number of instances for a job depends on the size of the data sets being processed and on how long the job needs to run. If there are not enough nodes to process jobs in parallel, the jobs are placed in a queue.
There are three types of EMR nodes: master nodes, core nodes, and task nodes. Every EMR cluster has exactly one master node, which manages the cluster and acts as the NameNode and JobTracker.
Core nodes act as DataNodes and execute all the MapReduce tasks of Hadoop jobs.
Task nodes are part of the task instance group and are optional; they only run TaskTrackers.
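For illustration, the snippet below shows one way these three roles can be expressed as instance groups in the structure accepted by boto3's run_job_flow call for its Instances parameter. The instance types and counts are assumptions, not recommendations.

# Passed as run_job_flow(..., Instances={"InstanceGroups": instance_groups, ...})
instance_groups = [
    {"Name": "Master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
     "InstanceType": "m4.large", "InstanceCount": 1},
    {"Name": "Core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
     "InstanceType": "r4.xlarge", "InstanceCount": 2},
    {"Name": "Task", "InstanceRole": "TASK", "Market": "ON_DEMAND",
     "InstanceType": "c4.xlarge", "InstanceCount": 2},
]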
When comparing EMRFS to Hadoop Distributed File System (HDFS), there are a few vital points to keep in mind.
EMRFS is cost effective, as you only pay for what you use.
HDFS generally has a replication factor of two or three. In such cases, two or three copies of the data have to be maintained, which is not the case when using EMRFS backed by AWS S3.
Scalability is another important point in this comparison. With EMRFS, scaling is not an overhead: the user can simply provision another instance from the pool of instances and the change is accommodated hassle-free.
With HDFS, on the other hand, servers have to be manually added to the cluster and configured.
EMRFS also allows the option to use transient EMR clusters, which means they can be shut down once a job has been executed.
DistCp and S3DistCp are used to move data between the two file systems. DistCp copies data from HDFS to AWS S3 in a distributed manner: it creates map tasks that copy the listed files and directories to the destination.
S3DistCp is derived from DistCp and lets you copy data from AWS S3 into HDFS, where EMR can process it. S3DistCp runs MapReduce jobs similar to DistCp's, but adds optimizations in the reduce phase, where multiple HTTP threads upload data in parallel, giving S3DistCp an edge over DistCp.
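On EMR, S3DistCp is typically submitted as a cluster step. The boto3 sketch below shows that pattern; the cluster ID, bucket, and paths are placeholders.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
    Steps=[{
        "Name": "Copy input data from S3 into HDFS",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp",
                     "--src", "s3://bucket/input/",
                     "--dest", "hdfs:///data/input/"],
        },
    }],
)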
Apache Spark is preferred for large-scale computational needs and is a credible alternative to MapReduce due to its lower latency. But deploying Spark on EMR is not trivial, and it needs to be tuned in order to provide optimized results.
To do so, Spark's default configuration can be altered, making it more effective at a lower AWS cost.
The path to spark-defaults.conf in EMR is:
/home/hadoop/spark/conf/spark-defaults.conf
Parallelism is important in order to fully utilize the cluster. By default, Spark sets the number of partitions for an input file according to its size, and for distributed shuffle operations it bases the partition count on the parent RDDs.
We can configure the level of parallelism explicitly by passing the number of partitions as a second argument to an operation, as in the sketch below.
Two to three tasks are recommended per CPU core. If tasks are taking more time than expected, increase the level of parallelism by a factor of about 1.5.
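Here is a minimal PySpark sketch of both knobs: setting spark.default.parallelism and passing a partition count directly to a shuffle operation. The path, the key-extraction logic, and the value of 200 are illustrative assumptions.

from pyspark import SparkConf, SparkContext

# Roughly 2-3 tasks per CPU core across the cluster (illustrative value)
conf = SparkConf().set("spark.default.parallelism", "200")
sc = SparkContext(conf=conf)

pairs = sc.textFile("s3://bucket/input/", minPartitions=200) \
          .map(lambda line: (line.split(",")[0], 1))

# The second argument sets the number of partitions used for this shuffle
counts = pairs.reduceByKey(lambda a, b: a + b, 200)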
Second, we sometimes run into out-of-memory errors or poor performance; this can happen due to large working sets.
Spark's shuffle operations build a hash table within each task, and it can grow very large. Again, increasing the level of parallelism helps here.
Shuffling involves heavy disk I/O and data serialization, and it is not a cheap operation, since each reducer has to pull data across the network. Shuffle files are not always cleaned up by Spark right away, so they may end up consuming the entire disk space.
To overcome this challenge, minimize the amount of data that has to be shuffled.
Using cogroup can be beneficial if two datasets are already grouped by key and we want to join them. Doing so avoids all overheads associated with the unpacking or repacking of groups.
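As a quick illustration of the cogroup pattern, the PySpark sketch below combines two small key-value RDDs while keeping each side grouped; the sample data is made up.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

orders = sc.parallelize([("cust-1", "order-a"), ("cust-2", "order-b")])
payments = sc.parallelize([("cust-1", "pay-x"), ("cust-1", "pay-y")])

# cogroup yields (key, (values_from_orders, values_from_payments)) per key,
# without flattening into one record per matching pair as join would
for key, (order_vals, payment_vals) in orders.cogroup(payments).collect():
    print(key, list(order_vals), list(payment_vals))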
Spark users should rely on the standard library for existing functionality and use DataFrames where possible for better performance. A DataFrame is similar to an RDBMS table, with data organized into named columns.
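For example, the same kind of aggregation written against a DataFrame lets Spark's optimizer plan the execution. The file location and column name below are assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-example").getOrCreate()

# Column names are inferred from the header row (assumed to include customer_id)
df = spark.read.csv("s3://bucket/input/", header=True)
df.groupBy("customer_id").count().show()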
You can find more information on Spark tuning here.
EMR provides security configurations that specify encryption for the EMR file system, local disks, and data in transit. At-rest encryption for EMRFS can be defined as server-side or client-side encryption.
For server-side encryption, two key management options are available: SSE-S3 and SSE-KMS.
For in-transit encryption there are several options available; Hadoop RPC encryption, TLS for the Tez shuffle handler, and 3DES for Spark are some of them.
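These choices are bundled into a named security configuration that clusters can then reference. Below is a minimal boto3 sketch of one such configuration (SSE-S3 at rest plus TLS in transit); the certificate bundle location and region are placeholders, and a real configuration may need additional settings such as local disk encryption keys.

import json
import boto3

emr = boto3.client("emr", region_name="us-east-1")

encryption = {
    "EncryptionConfiguration": {
        "EnableAtRestEncryption": True,
        "AtRestEncryptionConfiguration": {
            "S3EncryptionConfiguration": {"EncryptionMode": "SSE-S3"}
        },
        "EnableInTransitEncryption": True,
        "InTransitEncryptionConfiguration": {
            "TLSCertificateConfiguration": {
                "CertificateProviderType": "PEM",
                "S3Object": "s3://bucket/certs/my-certs.zip",  # placeholder
            }
        },
    }
}

emr.create_security_configuration(
    Name="emrfs-encryption-example",
    SecurityConfiguration=json.dumps(encryption),
)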
AWS CloudWatch offers basic and detailed monitoring of EMR clusters.
Basic monitoring sends data points every five minutes and detailed monitoring sends that information every minute. CloudWatch helps enterprises monitor when an EMR cluster slows down during peak business hours as the workload increases.
Similarly, it can trigger an alarm when resources are underutilized, so nodes that are not in use can be removed, saving costs.
CloudWatch keeps an eye on the number of mappers running, mappers outstanding, reducers running, and reducers outstanding.
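As an example of wiring this up, the boto3 sketch below creates an alarm on the EMR IsIdle metric so an idle cluster gets flagged; the cluster ID, SNS topic, and thresholds are placeholders.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="emr-cluster-idle",
    Namespace="AWS/ElasticMapReduce",
    MetricName="IsIdle",
    Dimensions=[{"Name": "JobFlowId", "Value": "j-XXXXXXXXXXXXX"}],
    Statistic="Average",
    Period=300,               # basic monitoring: 5-minute data points
    EvaluationPeriods=6,      # idle for roughly 30 minutes
    Threshold=1.0,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:emr-alerts"],  # placeholder topic
)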
AWS allows users to choose from Spot, Reserved, and On-Demand Instances; depending on usage and workload, a user can opt for any of these.
Spot Instances running in the task instance group increase the capacity of the cluster and are cost-effective as well. Users bid their own price for spare capacity, which can significantly lower the computing costs for time-flexible tasks.
Users can host master, core, and task nodes on Spot Instances, but because Spot capacity can be reclaimed when you are outbid, it is not good practice to host the master node on these instances.
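Building on the instance group structure sketched earlier, a Spot-backed task group could look like the following; the instance type, count, and bid price are purely illustrative.

# Appended to the instance_groups list passed to run_job_flow
spot_task_group = {
    "Name": "Task (Spot)",
    "InstanceRole": "TASK",
    "InstanceType": "c4.xlarge",
    "InstanceCount": 4,
    "Market": "SPOT",
    "BidPrice": "0.10",  # USD per instance-hour, placeholder value
}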
Reserved Instances (RIs) come in a number of utilization types, such as light, medium, and heavy, depending on usage. They give us the option to pay for each instance we want to reserve on the AWS platform.
On-Demand Instances have no minimum commitments, and users pay by the hour depending on the computing capacity of the instances. For transient jobs, On-Demand Instances are preferred when EMR hourly usage is less than about 17%, below which reserving capacity does not pay off.
EMR is a cost-effective service: scaling a cluster takes just a few clicks, and it can easily accommodate and process terabytes of data with the help of MapReduce and Spark.
As it supports both persistent and transient clusters, users can opt for the cluster type that best suits their requirements. Choosing the right instance types is also an important part of using EMR effectively, ensuring better performance, scalability, and use of time and space.
If you’re looking to leverage EMR by transferring data to AWS S3, NetApp offers the Cloud Sync service.
Cloud Sync provides an easy way to migrate and automatically synchronize data between any NFS share and AWS S3, which allows your data to be accessed by EMR and other big data AWS services.
Get more out of your data and learn more about Cloud Sync and how it works here.