Control Systems and Computers, N5, 2016, Article 8


Upr. sist. maš., 2016, Issue 5 (265), pp. 62-75.

UDC 004.7:004.75:004.9:004.738.5

Oursatyev Alexey A., PhD (Eng.), Leading Research Associate, International Research and Training Centre of Information Technologies and Systems of the NAS and MES of Ukraine, Glushkov ave., 40, Kyiv, 03187, Ukraine

Some Frameworks for Big Data Analytics and Machine Learning

Introduction. This work is the final stage of research on applying the theory and tools of Big Data processing to problems of intelligent situation analysis.

Purpose. Machine learning (ML) and distributed processing of large data collections with Apache Mahout, including its ability to search automatically for relevant patterns, are considered. Its implementation on the MapReduce paradigm is compared with its implementation on the Spark framework.

Methods. The representation of data and the mechanisms for recovering it after failures, the method of computation, and the ability to cache data in memory are considered. In-memory caching is the key tool for fast interactive use. Spark is implemented in Scala, which combines the best features of functional and object-oriented programming languages, and Spark uses Scala as its application development environment. Spark provides application programming interfaces for Java, Scala, Python and R and offers more than 80 high-level operators, which makes it easy to build parallel applications.
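The interplay of lazy computation, in-memory caching and recovery after a failure can be illustrated with a toy model. The sketch below is plain Python, not Spark code: the class `ToyDataset` and its methods `cache`, `evict` and `collect` are hypothetical names chosen for illustration, and the "failure" is simulated by simply discarding the cached copy. It assumes that a dataset remembers the chain of transformations that produced it, so that a lost result can be recomputed from the source.

```python
from typing import Callable, Iterable, List, Optional


class ToyDataset:
    """Toy stand-in for a fault-tolerant dataset: lazy, lineage-based, cacheable."""

    def __init__(self, source: Callable[[], Iterable], ops=()):
        self._source = source            # how to re-read the original data
        self._ops = tuple(ops)           # lineage: transformations recorded so far
        self._cache: Optional[List] = None
        self._want_cache = False
        self.computations = 0            # number of full recomputations (for illustration)

    def map(self, fn):
        # Nothing is computed here; the transformation is only recorded.
        return ToyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return ToyDataset(self._source, self._ops + (("filter", pred),))

    def cache(self):
        # Ask for the result to be kept in memory after the first computation.
        self._want_cache = True
        return self

    def evict(self):
        # Simulate losing the in-memory copy, e.g. after a node failure.
        self._cache = None

    def _compute(self) -> List:
        # Replay the recorded lineage against the source data.
        self.computations += 1
        data = self._source()
        for kind, fn in self._ops:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)

    def collect(self) -> List:
        if self._cache is not None:
            return list(self._cache)     # fast path: served from memory
        result = self._compute()
        if self._want_cache:
            self._cache = result
        return list(result)


even_squares = (
    ToyDataset(lambda: range(10))
    .map(lambda x: x * x)
    .filter(lambda x: x % 2 == 0)
    .cache()
)
first = even_squares.collect()    # computed from the source
second = even_squares.collect()   # served from the cache
even_squares.evict()              # simulated failure
third = even_squares.collect()    # rebuilt by replaying the lineage
```

In this toy model the second call does not recompute anything, which is the property that makes repeated interactive queries fast, while the third call shows that losing the cached copy costs only a recomputation, not the data.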

Results. The interactive mathematical ML environment Mahout Samsara includes an extended version of Scala. Mahout Samsara, also known as the Scala & Spark Bindings, provides a semantically friendly environment for linear algebra and is modeled on the base package of R. Its linear algebra works with scalars, vectors, matrices and distributed row matrices (DRMs). The DRM is a new abstraction introduced in Apache Mahout to make representing and processing matrices more convenient. The main elements of Mahout Samsara include an algebraic Scala DSL and an expression optimizer. MLlib, the Spark ML library, supports scalable general-purpose linear algebra and includes many modern algorithms.
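The idea behind a distributed row matrix can be sketched with a small worked example. The code below is plain Python and does not use the Mahout API: the function names `partial_gram`, `add_matrices` and `gram` are hypothetical, and the "partitions" are ordinary lists standing in for blocks of rows held on different workers. It assumes a matrix stored as groups of rows and computes the product of the transposed matrix with itself by summing small per-partition results, so the full matrix never has to be gathered in one place.

```python
from typing import List

Matrix = List[List[float]]


def partial_gram(rows: Matrix) -> Matrix:
    """Sum of outer products r^T r over the rows of one partition."""
    n = len(rows[0])
    g = [[0.0] * n for _ in range(n)]
    for r in rows:
        for i in range(n):
            for j in range(n):
                g[i][j] += r[i] * r[j]
    return g


def add_matrices(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def gram(partitions: List[Matrix]) -> Matrix:
    """A^T A for a row-partitioned matrix A: combine per-partition results."""
    partials = [partial_gram(p) for p in partitions]  # independent per partition
    total = partials[0]
    for p in partials[1:]:
        total = add_matrices(total, p)
    return total


# A 3 x 2 matrix split into two row partitions.
partitions = [
    [[1.0, 2.0], [3.0, 4.0]],   # rows 0 and 1
    [[5.0, 6.0]],               # row 2
]
result = gram(partitions)
```

Each partition contributes a small n x n matrix regardless of how many rows it holds, which is why a row-wise layout suits tall matrices with many rows and few columns. Choosing such an evaluation strategy automatically for an algebraic expression is the kind of decision an expression optimizer is meant to make.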

Full text is available in Russian.

Keywords: Frameworks for Big Data, Analytics, Machine Learning, data processing, Apache Mahout, MapReduce, Mahout Samsara, Spark.


Received 13.07.2016