Control Systems and Computers, N5, 2016, Article 10
DOI: https://doi.org/10.15407/usim.2016.05.084
Upr. sist. maš., 2016, Issue 5 (265), pp. 84-92.
UDC 681.3:658.56
Glybovets Andriy N., PhD (Ph.-M.Sc.), National University of “Kyiv-Mohyla Academy”, E-mail: andriy@glybovets.com.ua
Dmytruk Ya.O., master, National University of “Kyiv-Mohyla Academy”, E-mail: yaroslav.dmytruk@gmail.com
The Effectiveness of Programming Languages in the Apache Hadoop MapReduce Framework
The article discusses how effective different programming languages are in the Apache Hadoop framework for processing large data collections with the MapReduce (MR) model.
Apache Hadoop is used in many industrial projects around the world, for example at Facebook and Yahoo!. It makes it possible to run diverse tasks efficiently and reliably on a cluster and to handle huge amounts of data. The MR model lets developers ignore the complexities of cluster architecture and management and start writing a program right away.
This work investigates the influence of the programming language on the speed of the program in the Apache Hadoop framework.
The subject of comparison is the execution of programs in Java, Scala and Python that solve one simple problem: counting how often each word occurs in the input collection of text documents. All three programs, regardless of language, are written in the same style so that the comparison is objective.
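The MapReduce formulation of such a word-level task can be illustrated with a minimal sketch. It assumes the task is read as counting how often each word occurs, and it simulates the map, shuffle/sort and reduce phases in a single Python process rather than on a Hadoop cluster. It is not the authors' code: the names mapper, reducer and run_word_count and the tokenization rule are hypothetical.

```python
import re
from itertools import groupby
from operator import itemgetter

# Hypothetical tokenization rule: lowercase letters, digits and apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in WORD_RE.findall(line.lower()):
        yield word, 1


def reducer(word, counts):
    """Reduce step: sum all counts emitted for one word."""
    return word, sum(counts)


def run_word_count(lines):
    """Simulate map -> shuffle/sort -> reduce in a single process."""
    pairs = [pair for line in lines for pair in mapper(line)]
    # Sorting by key stands in for the framework's shuffle-and-sort phase.
    pairs.sort(key=itemgetter(0))
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )


if __name__ == "__main__":
    docs = ["Hadoop runs MapReduce jobs", "Spark also runs jobs"]
    for word, total in sorted(run_word_count(docs).items()):
        print(word, total)
```

On a real cluster the framework distributes the map and reduce calls across nodes. The sketch only shows the data flow that each of the compared programs would have to express.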
For the experiments, we chose the Cloudera QuickStart VM virtual machine image. Its convenience is that Hadoop, HDFS and other services come preinstalled. In addition, a three-node cluster was created for the study, with CDH chosen as the distribution of Apache Hadoop and related projects, and the required configuration was set on each node.
Each program was run on inputs of different sizes: 8 MB, 34 MB, 61 MB, 106 MB and 203 MB.
In the experiments, the best results were shown by the program written for Apache Spark. It was also found that MR programs for Apache Hadoop are better written in Java or another JVM language than in Python; the speed advantage is obvious. The experiments also show that processing speed is higher for larger input collections, so there is little reason to use Hadoop for small data.
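One possible reading of the last observation is that every job pays a fixed launch cost that matters less as the input grows. The sketch below illustrates that arithmetic only. The startup time and processing rate are hypothetical values, not measurements from the article, and the fixed-overhead model itself is an assumption; only the input sizes are taken from the experiments.

```python
def effective_throughput(size_mb, startup_s, rate_mb_per_s):
    """Return MB processed per second of wall-clock time, job startup included."""
    total_s = startup_s + size_mb / rate_mb_per_s
    return size_mb / total_s


if __name__ == "__main__":
    STARTUP_S = 20.0       # hypothetical fixed job-launch cost, not measured
    RATE_MB_PER_S = 10.0   # hypothetical steady-state processing rate
    for size_mb in (8, 34, 61, 106, 203):  # input sizes used in the experiments
        mbps = effective_throughput(size_mb, STARTUP_S, RATE_MB_PER_S)
        print(f"{size_mb:>4} MB -> {mbps:.2f} MB/s")
```

Under this model the effective throughput rises with input size and approaches the steady-state rate, which is consistent with the reported trend but does not prove its cause.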
The full text is in Russian.
Keywords: Big Data, MapReduce, Apache Hadoop, Spark, Java, Python, Scala.
Received 25.08.2016