Confused about choosing between Amazon EMR and Cloudera for processing large data sets? This article looks at both options in more detail and helps you decide which service fits your organization.
Here is a brief introduction to the two services:
- Amazon EMR is a web service that makes it easy to process large amounts of data efficiently. Under the hood it uses Hadoop, Spark, HBase, Presto, Hive, and other big data frameworks. You can create a cluster of any size through the console UI, the CLI, or code.
- Cloudera is open source (an Enterprise version is also available), so you have access to the source code. You can inspect it for debugging purposes and make modifications as required.
Amazon EMR VS Cloudera Comparison
Feature | AWS EMR | Cloudera on EC2 |
---|---|---|
Auto Scaling | EMR segregates slave nodes into two subtypes: core nodes and task nodes. Task nodes do not store HDFS data, so they can run on Spot Instances, which gives high scalability at low cost. | Cloudera does not categorize slave nodes into core and task nodes, so removing or losing a node increases the risk of losing HDFS data. |
Dynamic Orchestration | You can orchestrate a new cluster on demand within a very short time and terminate it once the jobs complete successfully. This improves utilization and can reduce costs drastically. | If your application is already running on EC2, it keeps consuming resources even when no jobs are running. To save cost, you need to start and stop the instances yourself around data processing. |
Access to Amazon S3 | You can access data on S3 from EMR directly or through Hive tables. EMR is highly tuned for working with data on S3 through AWS-proprietary binaries. EMR also works seamlessly with other Amazon services such as Amazon Kinesis, Amazon Redshift, and Amazon DynamoDB. | Cloudera uses the Apache library (s3a) to access data on S3, whereas EMR uses AWS-proprietary code for faster access to S3. |
High Availability | The EMR service monitors the slave nodes and replaces any unhealthy node with a new one. | Unlike EMR, Cloudera does not categorize slave nodes into core and task nodes. This increases the risk of losing HDFS data if a node is removed or lost. |
Ease of Use | AWS manages the EMR Hadoop service as well as the underlying AWS infrastructure, so you can start a new Hadoop cluster quickly and begin processing data. | Cloudera is comparatively more difficult to learn and configure. But once you have it set up, it is far more flexible than EMR, and there is no extra infrastructure cost. Cloudera Manager has an easy-to-use web GUI that helps you manage and monitor Hadoop services, clusters, and physical host hardware. |
Hadoop Management Console | For EMR, AWS does not provide a management console like Apache Ambari or Cloudera Manager. This makes it difficult to manage and track the various Hadoop services on a running cluster. | Cloudera also provides Cloudera Director to enable self-service use of CDH in the cloud. It gives central IT an administration experience that reduces costs and delivers agility, and it offers an interface for end users to provision and scale clusters. |
On-Premise and Cloud Options | AWS does not provide an on-premise option, and EMR relies on other Amazon services. | Cloudera offers both on-premise and cloud options. This lets you reuse your on-premise expertise: experience, human resources, and learnings. |
Additional Features, Data Flexibility, and Debugging | EMR has standard features and uses S3 for data processing. You cannot debug issues in the service itself, and there is not much control. | Cloudera uses the open source Hue framework for its user interface. If you need new features in your web interface, you can customize it using the Hue SDK. Also, since you have control over the source code, you can debug issues yourself. |
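To make the Auto Scaling and Dynamic Orchestration rows more concrete, here is a minimal sketch of what a transient EMR cluster request could look like when created from code. It only builds the request dictionary in the shape expected by boto3's `run_job_flow` call and does not contact AWS. The function name, the instance type, the release label, the node counts, and the use of the default EMR roles are illustrative assumptions, not values from this comparison.

```python
def build_transient_cluster_request(
    name,
    core_count=2,
    task_count=3,
    instance_type="m5.xlarge",      # assumed instance type, pick your own
    release_label="emr-6.15.0",     # assumed EMR release, pick your own
):
    """Build keyword arguments shaped for boto3's EMR run_job_flow call.

    Core nodes hold HDFS data, so they stay on-demand. Task nodes hold no
    HDFS data, so they are requested on the Spot market to lower cost.
    """
    def group(group_name, role, market, count):
        return {
            "Name": group_name,
            "InstanceRole": role,
            "Market": market,
            "InstanceType": instance_type,
            "InstanceCount": count,
        }

    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                group("master", "MASTER", "ON_DEMAND", 1),
                group("core", "CORE", "ON_DEMAND", core_count),
                group("task", "TASK", "SPOT", task_count),
            ],
            # Terminate the cluster automatically once all steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        # Default role names; assumed to already exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_transient_cluster_request("nightly-etl")
# With AWS credentials configured, the request could be submitted like this:
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

With the defaults above, the request describes one master, two core, and three task nodes, six nodes in total. In practice you would also add job steps, logging, and networking settings before submitting it.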
Cost Calculations
Let's take an example: a 6-node Hadoop cluster for our data processing, to compare the cost of Amazon EMR and Cloudera. The calculations below are for the US East region, and you can look at the details here.
Cost item | AWS EMR | Cloudera on EC2 |
---|---|---|
Instances required for a year | $0.030 per hour × 6 × 24 × 365 | $0.120 per hour × 6 × 24 × 365 |
Total Cost | $1,576.80 | $6,307.20 |
We have taken the worst-case scenario, in which the big data processing runs throughout the year. In the case of EMR, we can have one master node and five slave nodes. In the case of Cloudera, we would need 6 EC2 instances to run 6 nodes. We can reduce the cost by buying reserved instances, but in most scenarios you need these clusters only for a shorter duration. So the calculations above suggest that EMR is much cheaper than a plain EC2 cluster running Cloudera.
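The arithmetic behind the table can be reproduced with a few lines of Python. This is only a sketch: the hourly rates and the node count are the ones quoted in the table above, while the function name and the 4-hours-per-day scenario are hypothetical additions used to illustrate the "shorter duration" case.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365


def cluster_cost(hourly_rate, nodes, hours):
    """Total cost in dollars of running `nodes` instances for `hours` hours."""
    return round(hourly_rate * nodes * hours, 2)


full_year_hours = HOURS_PER_DAY * DAYS_PER_YEAR  # worst case: always on

# Rates and node count taken from the cost table above.
emr_annual = cluster_cost(0.030, 6, full_year_hours)
cloudera_annual = cluster_cost(0.120, 6, full_year_hours)

# Hypothetical scenario: the cluster only runs 4 hours per day.
emr_part_time = cluster_cost(0.030, 6, 4 * DAYS_PER_YEAR)

print(f"EMR, always on:        ${emr_annual:,.2f}")
print(f"Cloudera, always on:   ${cloudera_annual:,.2f}")
print(f"EMR, 4 hours per day:  ${emr_part_time:,.2f}")
```

Changing the hours argument shows how quickly the bill shrinks when a cluster is started only for the duration of a job and terminated afterwards.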
Conclusion
Amazon EMR VS Cloudera: your choice will depend on your particular business case. If you:
- Don’t want to invest time in managing and updating your distribution, then AWS EMR is likely the best option for you.
- Store your data in S3 and want to run the occasional job on it and dump the results back to S3, then it makes sense to use Elastic MapReduce (EMR).
- Need to run a full Hadoop/HBase stack 24×7 and have a custom data format (other than S3), then Cloudera is likely the best option for you.
- Need to debug issues and integrate with other software, then Cloudera is likely the best option for you.
Do you still have questions or want more clarity? Feel free to contact us if you need help for your organization.