Hadoop MapReduce & Apache Spark

This month we're taking a deeper dive into some of the differences between Hadoop MapReduce and Apache Spark. I have answered a few simple questions below to give a sense of both platforms, including some of the pros and cons of these highly regarded technologies. I have also included some information on Mahout.

What is the bottleneck in Hadoop MapReduce?
A bottleneck occurs when one system resource takes more time or capacity than it normally should, which slows the other resources and drags down overall system performance. In Hadoop MapReduce, the resources to watch include RAM (memory), CPU, storage I/O and network bandwidth. Of these, RAM plays a key role. RAM is the random access memory that holds data temporarily while it is being processed. MapReduce completes its tasks by reading from disk and writing intermediate results back to disk between steps, which is inefficient for workloads that revisit the same data. Memory should therefore be configured on each node to handle the expected workload; otherwise it can become a bottleneck that significantly slows Hadoop MapReduce’s performance (Apprize, 2014). For example, memory can become a bottleneck when a machine learning job needs more iterations over the data than the system can comfortably handle. A MapReduce workflow for a machine learning model is shown below in Figure 1. If the map and reduce nodes cannot keep up with the data being produced, tasks can fail and cause downtime across the system.

Figure 1
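To make the disk round trip concrete, here is a minimal Python sketch of an iterative job. It is not Hadoop code: the function names, the JSON files and the arithmetic step are all assumptions made up for illustration. One version hands data from one iteration to the next through files, roughly the way chained MapReduce jobs hand data over through HDFS, and the other keeps the data in memory.

```python
import json
import os
import tempfile


def run_iterations_on_disk(values, iterations, workdir):
    """Run an iterative job where every step reads its input from disk
    and writes its output back to disk (hypothetical stand-in for a
    chain of separate MapReduce jobs)."""
    io_ops = {"reads": 0, "writes": 0}

    # Initial input is written once before the first iteration.
    path = os.path.join(workdir, "iter_0.json")
    with open(path, "w") as f:
        json.dump(values, f)
    io_ops["writes"] += 1

    for i in range(1, iterations + 1):
        with open(path) as f:
            data = json.load(f)
        io_ops["reads"] += 1

        # Stand-in for one map + reduce step.
        data = [x * 0.5 + 1.0 for x in data]

        path = os.path.join(workdir, f"iter_{i}.json")
        with open(path, "w") as f:
            json.dump(data, f)
        io_ops["writes"] += 1

    # The final result is read back one last time.
    with open(path) as f:
        result = json.load(f)
    io_ops["reads"] += 1
    return result, io_ops


def run_iterations_in_memory(values, iterations):
    """Same computation, but the working data never leaves memory."""
    data = list(values)
    for _ in range(iterations):
        data = [x * 0.5 + 1.0 for x in data]
    return data


with tempfile.TemporaryDirectory() as workdir:
    disk_result, io_ops = run_iterations_on_disk([0.0, 4.0, 10.0], 10, workdir)
memory_result = run_iterations_in_memory([0.0, 4.0, 10.0], 10)

print(io_ops)                        # {'reads': 11, 'writes': 11}
print(disk_result == memory_result)  # True
```

Both versions produce the same answer, but the disk-based one pays for a read and a write on every iteration. On a real cluster those would be HDFS reads and writes of far larger data, which is where the time goes.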

What are the bottlenecks or issues in Mahout as a machine learning framework in the Hadoop ecosystem?
Mahout is a data mining and machine learning framework used to develop models for recommendation, classification and clustering applications within a Hadoop environment. Mahout’s historically known issues mirror those of MapReduce. Mahout’s legacy algorithms are implemented as Hadoop MapReduce jobs, which tend to run slowly because they lack the in-memory processing that iterative algorithms need. Although it is still functional, Mahout does not have the same support it originally had because much of the focus has shifted to Apache Spark and its libraries. Ultimately, Mahout is older and carries more legacy code. Mahout is now addressing its bottlenecks by integrating Spark as a back end (Barga, 2015). Figure 2 below shows where Mahout sits in the data access layer of the Hadoop ecosystem.

Figure 2
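Clustering is a good example of why iterative algorithms suffer here. The sketch below is a plain Python k-means on one-dimensional data, written only for illustration: it is not Mahout's implementation, and the function name and sample data are my own assumptions. The point is the pass counter: every pass scans the full data set, and in a MapReduce-based setup each pass would typically be its own job with its own disk reads and writes.

```python
def kmeans_1d(points, centroids, max_passes=20):
    """Toy 1-D k-means. Returns the final centroids and the number of
    full passes over the data that were needed."""
    passes = 0
    for _ in range(max_passes):
        passes += 1

        # "Map"-like step: assign every point to its nearest centroid.
        clusters = {i: [] for i in range(len(centroids))}
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)

        # "Reduce"-like step: recompute each centroid as the cluster mean.
        new_centroids = [
            sum(members) / len(members) if members else centroids[i]
            for i, members in clusters.items()
        ]

        # Stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, passes


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, passes = kmeans_1d(points, [0.0, 5.0])
print(centroids, passes)  # [2.0, 11.0] 3
```

Even this tiny data set needs three passes to settle. Real data sets can need many more, so the per-pass overhead adds up quickly when the data has to be reloaded from disk each time.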

How can Spark get over MapReduce’s bottleneck?
Apache Spark can improve on MapReduce’s memory-related bottleneck by using in-memory computing. Rather than writing intermediate data to disk, Spark keeps it in memory where possible, and has been reported to run some jobs up to 100 times faster than MapReduce. In-memory computing not only simplifies jobs but also makes Spark suitable for near-real-time analytical processing in addition to batch processing. Spark has a Directed Acyclic Graph (DAG) execution engine, which supports acyclic data flows and in-memory computing. The DAG engine helps Spark outperform MapReduce by tracking how each partition of an RDD (Resilient Distributed Dataset) is derived, so that a partition can be computed or recovered at any point in time (Apache Spark, 2017). Figure 3 below contrasts MapReduce’s workflow with Apache Spark’s DAG workflow and shows a major difference in how many times data is read from and written back to the Hadoop Distributed File System (HDFS). Because Spark can keep its computations in memory (RAM), tasks complete more quickly, which is a major advantage.

Figure 3
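The idea of a lineage graph with in-memory caching can be sketched in a few lines of plain Python. To be clear, this is a toy model and not Spark's API: `ToyDataset` and its methods are hypothetical names I made up, loosely modeled on how RDD transformations are lazy and how a cached result can be rebuilt from its parents if it is lost.

```python
class ToyDataset:
    """Toy stand-in for an RDD: remembers its parent and transformation,
    and only computes when collect() is called."""

    def __init__(self, source=None, parent=None, fn=None, name="source"):
        self.source = source
        self.parent = parent
        self.fn = fn
        self.name = name
        self.cached = None
        self.keep_in_memory = False
        self.compute_count = 0  # how many times this node was computed

    def map(self, fn, name="map"):
        return ToyDataset(parent=self, name=name,
                          fn=lambda rows: [fn(r) for r in rows])

    def filter(self, pred, name="filter"):
        return ToyDataset(parent=self, name=name,
                          fn=lambda rows: [r for r in rows if pred(r)])

    def cache(self):
        # Ask for the result to be kept in memory after first computation.
        self.keep_in_memory = True
        return self

    def lineage(self):
        # Walk back through the parents to list how this node is derived.
        node, chain = self, []
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return list(reversed(chain))

    def collect(self):
        if self.cached is not None:
            return self.cached
        self.compute_count += 1
        if self.parent is None:
            rows = list(self.source)
        else:
            rows = self.fn(self.parent.collect())
        if self.keep_in_memory:
            self.cached = rows
        return rows


numbers = ToyDataset(source=range(1, 7), name="numbers")
squares = numbers.map(lambda x: x * x, name="squares").cache()
evens = squares.filter(lambda x: x % 2 == 0, name="evens")
large = squares.filter(lambda x: x > 10, name="large")

evens_result = evens.collect()   # computes numbers -> squares -> evens
large_result = large.collect()   # reuses the cached squares
computes_before_loss = squares.compute_count

squares.cached = None            # simulate losing the in-memory copy
recovered = squares.collect()    # rebuilt from lineage

print(evens.lineage())           # ['numbers', 'squares', 'evens']
print(evens_result, large_result)
print(computes_before_loss, squares.compute_count)
```

Two downstream results share one computation of `squares` because it stays in memory, and when the cached copy is thrown away it is simply recomputed from its recorded parents. Real Spark does this per partition across a cluster, which this sketch does not attempt to model.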

References:

Apache Spark. (2017, April 8). Direct Acyclic Graph DAG in Apache Spark. Retrieved from Data Flair: https://data-flair.training/blogs/dag-in-apache-spark/

Apprize. (2014). Optimizing Hadoop for MapReduce, Detecting System Bottlenecks. Retrieved from Apprize.info: http://apprize.info/data/hadoop/4.html

Barga, M. (2015, October 22). Apache Mahout and Spark Comparison. Retrieved from MatthewBarga.com: http://matthewbarga.com/blog/index.php/2015/10/22/apache-mahout-and-spark-comparison/
