Control Systems and Computers, N2, 2019, Article 5

https://doi.org/10.15407/usim.2019.02.040

Upr. sist. maš., 2019, Issue 2 (280), pp. 40-69.

UDC 004.65:004.7:004.75:004.738.5

Alexey A. Oursatyev, PhD (Eng.), Leading Research Associate, International Research and Training Centre of Information Technologies and Systems of the NAS and MES of Ukraine, Glushkov ave., 40, Kyiv, 03187, Ukraine, aleksei@irtc.org.ua

BIG DATA. ANALYTICAL DATABASES AND DATA WAREHOUSE: GREENPLUM

Introduction. This article continues a study of Big Data and its tools, which are evolving into a new generation of technologies and of platform and storage architectures for Big Data intended for intelligent inference. This part reviews the Greenplum database. The focus is on changes in the infrastructure, the tool environment and the platform for extracting the necessary information and new knowledge from Big Data; basic information about the product is given in its general description.

Purpose. To consider and evaluate how effectively infrastructure solutions can be applied in new Big Data developments aimed at identifying new knowledge and implicit connections, and at gaining in-depth understanding of, and insight into, phenomena and processes.

Methods. Informational and analytical methods and technologies for data processing, and methods for data assessment and forecasting, taking into account the development of the most important areas of informatics and information technology.

Results. Greenplum, like Netezza and Teradata, created its own Data Computing Appliance (DCA) and later an enterprise-class analytical database, Greenplum Database, released under the Pivotal trademark and offering powerful, fast analytics on large data volumes. This relational database uses a shared-nothing massively parallel processing (MPP) architecture built on the PostgreSQL core. Internal elements of PostgreSQL were modified or supplemented to support the parallel structure of the Greenplum database. Greenplum introduced MPP Scatter/Gather Streaming technology for fast loading (unloading) and polymorphic data storage. For bulk loading and reading of data, the append-optimized (append-only) storage format is used, which provides performance advantages over heap tables. Greenplum Database, like PostgreSQL, does not rely on locks for concurrency control. Data consistency is maintained by Multiversion Concurrency Control (MVCC), which provides transaction isolation for each database session. GPORCA, the ORCA query optimizer, extends the planning and optimization capabilities of the legacy GPQUERY planner.
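
The multiversion idea summarized above can be illustrated with a minimal sketch. It is a toy model under stated assumptions, not Greenplum's implementation: the names ToyVersionStore and Snapshot are hypothetical, writes are committed one at a time, and a single logical counter stands in for the transaction-ID bookkeeping of PostgreSQL and Greenplum. The sketch only shows why a reader holding a snapshot needs no lock: a writer appends a new version instead of overwriting the one the reader sees.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple


class ToyVersionStore:
    """Hypothetical in-memory store that keeps every committed version of a key."""

    def __init__(self) -> None:
        self._clock = 0  # logical commit counter (assumption: one commit at a time)
        self._versions: Dict[str, List[Tuple[int, str]]] = {}

    def write(self, key: str, value: str) -> int:
        """Commit a new version; older versions stay readable by older snapshots."""
        self._clock += 1
        self._versions.setdefault(key, []).append((self._clock, value))
        return self._clock

    def snapshot(self) -> "Snapshot":
        """Open a read view fixed at the current commit counter."""
        return Snapshot(self, self._clock)

    def read_as_of(self, key: str, read_ts: int) -> Optional[str]:
        """Return the newest version committed at or before read_ts, if any."""
        versions = self._versions.get(key, [])
        commit_times = [ts for ts, _ in versions]  # already in ascending order
        idx = bisect_right(commit_times, read_ts)
        return versions[idx - 1][1] if idx else None


class Snapshot:
    """Read-only view of the store as of the moment the snapshot was taken."""

    def __init__(self, store: ToyVersionStore, read_ts: int) -> None:
        self._store = store
        self._read_ts = read_ts

    def read(self, key: str) -> Optional[str]:
        return self._store.read_as_of(key, self._read_ts)


if __name__ == "__main__":
    store = ToyVersionStore()
    store.write("balance", "100")
    session_a = store.snapshot()               # session A starts reading
    store.write("balance", "250")              # a concurrent writer commits
    print(session_a.read("balance"))           # session A still sees 100
    print(store.snapshot().read("balance"))    # a new session sees 250
```

In this sketch each session sees the data as of the moment its snapshot was taken, which is the isolation property the abstract attributes to MVCC; cleanup of obsolete versions and write-write conflicts are deliberately left out.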

EMC first announced its Big Data analytics development programme in 2011, presenting a comprehensive strategy for integrating and supporting the open-source Apache Hadoop software. Merging the Greenplum database with Hadoop made it possible to broaden the range of data types used in analytical studies within a single repository. Adapting the Greenplum database to the HDFS distributed file system took more than two years of work by the largest Hadoop development team. It was necessary to create a storage layer that would improve on the then-current HDFS version in performance, availability and ease of use. Along with Cloudera, Hortonworks and MapR, the largest providers of data storage and processing solutions, EMC, Greenplum and Pivotal entered the market for Hadoop infrastructure support services, promoting their own Hadoop distribution.

We studied the technologies used to adapt the Hadoop infrastructure for the analytics platform in Greenplum products, and the environment for integrating the RDBMS with Hadoop: from the writable external table to the Greenplum Platform Extension Framework (PXF), which allows data exchange with heterogeneous third-party systems; Pivotal Greenplum HAWQ, a self-contained SQL query engine for Hadoop that combines the key technological advantages of an MPP database with the scalability and convenience of Hadoop; and the Greenplum Chorus self-service data platform.

 General conclusions. We analysed a relatively small number of analytical DBMSs for Big Data from manufacturers well known in the global IT community, looking at how existing processing methods and infrastructure are being transformed into solutions that IDC defines as a new generation of technologies and architectures designed to extract economic value from very large volumes of diverse data by enabling high-velocity capture and analysis. From this analysis we conclude the following.

  1. Alongside the generally recognized MPP architecture stands the Apache Hadoop cluster: a well-known software infrastructure that integrates a number of modules (frameworks) with various target functions, and a class of computing systems consisting of many nodes organized on a shared-nothing basis. It makes it possible to create a full-featured platform for storing and processing unstructured data.
    A dilemma arises in processing different types of data within a single repository: highly parallel SQL-based systems ensure full ACID compliance, while distributed Hadoop systems have quickly become the preferred way of working with unstructured information. There are a number of interesting integration solutions, but two of them deserve special attention. The first, the HAWQ project, is a relational database tier located on top of the Hadoop Distributed File System (HDFS). HAWQ reads and writes data natively in HDFS. It is a proprietary SQL query engine for Hadoop that combines the key technological benefits of an MPP database with the scalability and convenience of Hadoop. The second is the Teradata Unified Data Architecture (UDA), a storage and data management platform that takes into account all storage applications: traditional, operational, logical and context-independent. High-performance data access, processing and virtual delivery to systems in heterogeneous analytic environments are provided by the Teradata QueryGrid™ ecosystem, a kind of matrix that uses parallel data movement between the exchanging objects. The idea of an ecosystem approach to covering different types of data comes down to linking nodal information points in different environments. By presenting data in a unified way without moving them, the Teradata® UDA™ does not contradict the concept of the logical data warehouse (LDW) and underlines the status of the LDW as the final solution for the database and for analytics.
  2. The analytical capabilities of the studied systems are not limited to SQL analysis. Greenplum extends SQL through user-defined functions written in languages such as Python, R, Java, Perl and C/C++. Teradata provides ready-to-use SQL-MapReduce® and Graph functions for high-performance analytics, time series functions, text analytics and more for Big Data research. The analytical engines (SQL, SQL-MapReduce and SQL-Graph) ensure optimal processing of analytical tasks on large amounts of data; for example, Seamless Network Analytic Processing (SNAP) in Teradata makes it possible to invoke several types of advanced analytics (graph, path/pattern, text, SQL and statistical predictive analysis).
  3. Libraries of scalable machine learning algorithms are being developed, and the use of in-memory cluster computing will enable data mining with more explainable models. This will bring us closer to Data Science platforms, provided that mechanisms for identifying the “right data” and libraries for the graphical display of information are developed.
  4. The annual growth of Big Data drives the need for higher data processing speed. In-database analytical computing, which eliminates data movement and reduces processing time, has become customary. Platforms based on in-memory computing (IMC) are still relatively rare. For example, Kx Systems (kdb+), thanks to IMC, implemented a hybrid transaction/analytical processing (HTAP) architecture, which allows applications to analyse data as it arrives and is updated by transaction processing functions. Advanced real-time analytics, such as forecasting and modelling, becomes an integral part of the observed process rather than a separate step performed afterwards. Teradata implemented an innovative database technology, Teradata Intelligent Memory (Teradata's IMDBMS): an extended memory space beyond the cache that significantly increased query performance and provided an effective technology for keeping a variety of data in memory. Pivotal GemFire implements in-memory data grid (IMDG) technology for distributed, high-performance in-memory data storage for modern high-speed, data-intensive applications. IMC technologies such as in-memory database systems (IMDS) and highly scalable, fault-tolerant, low-latency in-memory data grids will be in demand, especially for advanced analytics such as prediction and modelling.
  5. Hadoop and the components of its ecosystem are well represented and functional enough to meet the requirements of Big Data processing. The technology as a whole is mature; it has been recognized by well-known IT companies worldwide and is actively used to develop infrastructure solutions and advanced tools, including advanced analytics for business analysis and Data Science platforms. The coming phase of relative stability (the commodity phase) indicates that the technology is becoming common and accessible to all.
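
The shared-nothing organization and the in-database computing mentioned in the conclusions above can be sketched in a few lines. This is an illustrative toy, not code from Greenplum, Hadoop or Teradata: the function names distribute, local_sums and gather are hypothetical, segments are plain Python lists, and CRC32 is an assumed stand-in for whatever hash function a real system applies to its distribution key. Each segment aggregates only the rows it owns, so only small partial results travel to the coordinating node.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, float]  # (distribution key, measure)


def distribute(rows: Iterable[Row], n_segments: int) -> List[List[Row]]:
    """Assign each row to one segment by hashing its distribution key."""
    segments: List[List[Row]] = [[] for _ in range(n_segments)]
    for key, amount in rows:
        # CRC32 is deterministic across runs, unlike Python's built-in str hash.
        segment_id = zlib.crc32(key.encode("utf-8")) % n_segments
        segments[segment_id].append((key, amount))
    return segments


def local_sums(segment_rows: Iterable[Row]) -> Dict[str, float]:
    """Aggregate on the segment itself, touching only locally stored rows."""
    totals: Dict[str, float] = defaultdict(float)
    for key, amount in segment_rows:
        totals[key] += amount
    return dict(totals)


def gather(partials: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Combine the per-segment partial results on the coordinating node."""
    result: Dict[str, float] = defaultdict(float)
    for partial in partials:
        for key, subtotal in partial.items():
            result[key] += subtotal
    return dict(result)


if __name__ == "__main__":
    sales = [("kyiv", 1.0), ("lviv", 2.0), ("kyiv", 3.0), ("odesa", 4.0)]
    segments = distribute(sales, n_segments=3)
    totals = gather(local_sums(seg) for seg in segments)
    print(sorted(totals.items()))
```

Because all rows with the same key land on the same segment, the per-segment sums are already final for each key in this example; the gather step matters when the grouping column differs from the distribution key.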

 Full text available in Russian.

Keywords: Greenplum Data Computing Appliance (DCA), Shared Nothing MPP architecture based on the PostgreSQL core, MPP Scatter/Gather Streaming technology for data loading and unloading, Polymorphic Data Storage, Big Data analytics, integration platforms, HAWQ (proprietary SQL query engine for Hadoop), self-service data.

 Received 03.04.2019