Control Systems and Computers, N5, 2016, Article 8
DOI: https://doi.org/10.15407/usim.2016.05.062
Upr. sist. maš., 2016, Issue 5 (265), pp. 62-75.
UDC 004.7:004.75:004.9:004.738.5
Oursatyev Alexey A., PhD (Eng.), Leading Research Associate, International Research and Training Centre of Information Technologies and Systems of the NAS and MES of Ukraine, Glushkov ave., 40, Kyiv, 03187, Ukraine, aleksei@irtc.org.ua
Some Frameworks for Big Data Analytics and Machine Learning
Introduction. This work is the final stage of research on applying Big Data processing theory and tools to problems of intelligent situation analysis.
Purpose. Machine learning (ML) and distributed processing of large data collections with Apache Mahout, including its ability to search automatically for relevant patterns, are considered. Its implementations on the MapReduce paradigm and on the Spark framework are compared.
Methods. Data representation and the mechanisms for recovering data after failures, the computation model, and the ability to cache data in memory are considered. The last of these is the key to fast interactive use. Spark is implemented in Scala, which combines the best features of functional and object-oriented programming languages, and uses Scala as its application development environment. Spark provides application programming interfaces for Java, Scala, Python and R and offers more than 80 high-level operators, which make it easy to build parallel applications.
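The interplay between in-memory caching and recovery after failures can be illustrated with a small sketch. It is a toy model in plain Python, not the Spark API: the class ToyDataset and its methods map, cache, collect and lose_cache are hypothetical names introduced only for this illustration. The assumption is that a dataset remembers the recipe (lineage) by which it was computed, so a lost in-memory copy can be rebuilt by re-running that recipe.

```python
from typing import Callable, List, Optional


class ToyDataset:
    """Toy stand-in for a resilient dataset (hypothetical, not a Spark class).

    It stores how to compute its elements (its lineage), and optionally
    keeps the computed elements in memory.
    """

    def __init__(self, compute: Callable[[], List[int]]):
        self._compute = compute                  # lineage: recipe for rebuilding the data
        self._cached: Optional[List[int]] = None
        self._use_cache = False
        self.computations = 0                    # how many times the recipe actually ran

    def map(self, f: Callable[[int], int]) -> "ToyDataset":
        # Lazy transformation: only records a new recipe on top of this dataset.
        return ToyDataset(lambda: [f(x) for x in self.collect()])

    def cache(self) -> "ToyDataset":
        # Ask for the result to be kept in memory after the first computation.
        self._use_cache = True
        return self

    def collect(self) -> List[int]:
        if self._use_cache and self._cached is not None:
            return self._cached                  # served from memory, no recomputation
        self.computations += 1
        data = self._compute()
        if self._use_cache:
            self._cached = data
        return data

    def lose_cache(self) -> None:
        # Simulates a failure that wipes the in-memory copy.
        self._cached = None


source = ToyDataset(lambda: list(range(5)))
squares = source.map(lambda x: x * x).cache()

first = squares.collect()     # computed from the lineage
second = squares.collect()    # served from the in-memory copy
runs_before_failure = squares.computations

squares.lose_cache()
recovered = squares.collect() # rebuilt from the lineage after the loss
```

In this sketch the second call costs nothing because the result is already in memory, which is the property that makes interactive use fast, and the simulated failure costs only one recomputation rather than a restart of the whole job.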
Results. The interactive mathematical environment for ML, Mahout Samsara, includes an extended version of Scala. Mahout Samsara, or the Scala & Spark Bindings, is needed to create a semantically friendly environment for linear algebra and is modeled on the base package of R. Its linear algebra works with scalars, vectors, matrices and distributed row matrices (DRMs). The DRM is a new abstraction introduced in Apache Mahout to make matrices convenient to represent and process. Among the main elements of Mahout Samsara are the algebraic Scala DSL and the expression optimizer. The ML library MLlib supports scalable general-purpose linear algebra and includes many modern algorithms.
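The idea of a distributed row matrix can be sketched in a few lines. The code below is a plain-Python illustration under stated assumptions, not Mahout code: the helper names partition_rows, gram_partial and gram are hypothetical. It assumes that a matrix is held as rows keyed by row index and spread over partitions, and it shows one kind of rewrite that an algebraic expression optimizer can apply: the product of a matrix transpose with the matrix itself is computed as a sum of per-row outer products, so the transpose is never materialized and each partition works independently.

```python
from typing import Dict, List

Row = List[float]
Partition = Dict[int, Row]   # row index -> row vector


def partition_rows(matrix: List[Row], n_parts: int) -> List[Partition]:
    """Spread the rows of a matrix over n_parts partitions, keyed by row index."""
    parts: List[Partition] = [dict() for _ in range(n_parts)]
    for i, row in enumerate(matrix):
        parts[i % n_parts][i] = row
    return parts


def gram_partial(part: Partition, n_cols: int) -> List[Row]:
    """Sum of the outer products row^T * row over the rows of one partition."""
    acc = [[0.0] * n_cols for _ in range(n_cols)]
    for row in part.values():
        for j in range(n_cols):
            for k in range(n_cols):
                acc[j][k] += row[j] * row[k]
    return acc


def gram(parts: List[Partition], n_cols: int) -> List[Row]:
    """Compute A^T A by adding up the partial results of all partitions."""
    total = [[0.0] * n_cols for _ in range(n_cols)]
    for part in parts:
        partial = gram_partial(part, n_cols)   # could run on a separate worker
        for j in range(n_cols):
            for k in range(n_cols):
                total[j][k] += partial[j][k]
    return total


a = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
parts = partition_rows(a, n_parts=2)
ata = gram(parts, n_cols=2)
```

The result is a small matrix whose size depends only on the number of columns, so it can be collected on a single machine even when the number of rows is very large.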
Download full text! (In Russian).
Keywords: Frameworks for Big Data, Analytics, Machine Learning data processing, Apache Mahout, MapReduce, Mahout Samsara, Spark.
Received 13.07.2016