Control Systems and Computers, N3, 2016, Article 4

DOI: https://doi.org/10.15407/usim.2016.03.029

Upr. sist. maš., 2016, Issue 3 (263), pp. 29-42.

UDC 004.7:004.75:004.9:004.738.5

A.A. Oursatyev, PhD in Techn. Sciences, Leading Research Associate, International Research and Training Centre of Information Technologies and Systems of the NAS and MES of Ukraine, Glushkov ave., 40, Kyiv, 03187, Ukraine, aleksei@irtc.org.ua

Some Frameworks for Big Data Analytics

Introduction. The need to extract information from new data forces developers of analytic systems to radically improve traditional processing technology and to create advanced analytics environments.

Conceptual issues of building data environments, in particular on the Hadoop cluster software platform, are presented. The Hadoop MapReduce infrastructure for parallel distributed computation over data is described, together with its constraints and the evolutionary transformation of the Hadoop platform to handle streaming dynamic loads. It is shown that introducing YARN (Yet Another Resource Negotiator) into the Hadoop computing platform makes it possible to run different workloads on a linearly scalable Hadoop YARN (Hadoop 2.0) cluster with high computational efficiency. The Spark, Tez and Storm frameworks take advantage of YARN's capabilities.
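The MapReduce model mentioned above can be illustrated with a minimal single-process sketch in Python. It is not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen for illustration, and the distribution of work across cluster nodes is deliberately left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse the values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence inside one process."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop data"]))
```

On a real cluster the map and reduce steps would run in parallel on different nodes, with the framework performing the shuffle between them; the sketch only shows the data flow between the three phases.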

The components that together make Hadoop 2.0 the de facto standard technology for working with Big Data are analyzed.

These are: Hive, oriented toward building interactive queries in the SQL-like language HQL (Hive Query Language) and working with large data stores; Pig, with its high-level procedural language Pig Latin, designed for accessing semi-distributed datasets; HBase, a distributed non-relational DBMS that works efficiently with individual records in real time; and Apache Accumulo, a distributed, scalable data store oriented toward a high level of security, with strict requirements for the protection of information and personal data.

Results. The problems of efficiently loading large volumes of data of various types into the Hadoop ecosystem using Hive and Pig are considered. A comparative analysis of the ELT (extract-load-transform) and ETL (extract-transform-load) concepts is presented. The former has become widespread with the emergence of Hadoop technology.
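The difference between the two concepts can be sketched in Python. This is only an illustration under stated assumptions: the standard-library `sqlite3` module stands in for the data store (in a Hadoop setting the in-store transformation would be written in HQL or Pig Latin), and the table names, sample rows and cleaning rule are hypothetical.

```python
import sqlite3

# Hypothetical raw input: (name, amount as untrusted text).
RAW_ROWS = [("alice", " 10 "), ("bob", "n/a"), ("carol", "7")]


def etl(conn, rows):
    """ETL: transform in the application, then load only clean rows."""
    conn.execute("CREATE TABLE clean_etl (name TEXT, amount INTEGER)")
    cleaned = [
        (name, int(amount.strip()))
        for name, amount in rows
        if amount.strip().isdigit()
    ]
    conn.executemany("INSERT INTO clean_etl VALUES (?, ?)", cleaned)


def elt(conn, rows):
    """ELT: load raw rows unchanged, then transform inside the store."""
    conn.execute("CREATE TABLE raw (name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw VALUES (?, ?)", rows)
    # Keep rows whose trimmed amount is non-empty and contains only digits.
    conn.execute(
        """
        CREATE TABLE clean_elt AS
        SELECT name, CAST(TRIM(amount) AS INTEGER) AS amount
        FROM raw
        WHERE TRIM(amount) <> ''
          AND TRIM(amount) NOT GLOB '*[^0-9]*'
        """
    )


def run_both(rows=RAW_ROWS):
    """Apply both approaches to the same input and return both results."""
    conn = sqlite3.connect(":memory:")
    etl(conn, rows)
    elt(conn, rows)
    query = "SELECT name, amount FROM {} ORDER BY name"
    etl_result = conn.execute(query.format("clean_etl")).fetchall()
    elt_result = conn.execute(query.format("clean_elt")).fetchall()
    conn.close()
    return etl_result, elt_result


if __name__ == "__main__":
    print(run_both())
```

Both paths produce the same cleaned table here. The practical difference is where the work happens: in ELT the raw data stays in the store and the transformation runs there, which suits a platform that scales storage and computation together.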

Full text available in Russian.

Keywords: Hadoop, Hadoop MapReduce, Hadoop technology, advanced analytics environments.


Received 30.03.2016