Control Systems and Computers, N1, 2019, Article 5

https://doi.org/10.15407/usim.2019.01.041

Upr. sist. maš., 2019, Issue 1 (279), pp. 41-51.

UDC 004.04.043; 004.912; 004.62

K.A. BOBROVNYK, Master of the Department of Ukrainian Language and Applied Linguistics, Institute of Philology of Taras Shevchenko National University of Kyiv, boulevard Taras Shevchenko, 14, Kyiv, 01601, Ukraine, mailkatherine.bobrovnik@gmail.com

K.K. DUKHNOVSKA, Assistant of the Department of Applied Information Technologies of the Faculty of Information Technologies of Taras Shevchenko National University of Kyiv, Glushkov ave., 4, Kyiv, 03022, Ukraine, duchnov@ukr.net

M.V. PIROH, Assistant of the Department of Applied Information Technologies of the Faculty of Information Technologies of Taras Shevchenko National University of Kyiv, Glushkov ave., 4, Kyiv, 03022, Ukraine, mykola.pyroh@ukr.net

THEMATIC CLASSIFICATION OF UKRAINIAN TEXTS, DIFFICULTIES OF ITS INTRODUCTION

Introduction. Classification is one of the major tasks of artificial intelligence, information search, and word processing. The main problem with the automated classification of text information is that a document is written in a human language and is therefore not structured data.

Solutions for classifying English and Russian texts are numerous. However, the authors have found no research that describes algorithms for developing a classifier of texts written in the Ukrainian language, or the peculiarities of such texts.

Purpose. To specify the peculiarities of the automated classification of texts written in the Ukrainian language.

Results. BrUC is the only openly accessible corpus of Ukrainian texts that can be used to develop algorithms and methods for classifying texts in the Ukrainian language.

To develop classifiers of Ukrainian texts, the following methods and algorithms have been used: Random Forest Classifier, Support Vector Machines, Naive Bayes Classifier, and Logistic Regression. All of these classifiers are trained by supervised learning: a ready-made set of already classified texts, here taken from BrUC, serves as the training data.
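As an illustration of supervised learning on labeled texts, the following is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. It is not the authors' implementation: the whitespace tokenizer, the function names (`tokenize`, `train_naive_bayes`, `predict`), the two topic labels, and the four toy training texts are assumptions made for the example and are not taken from BrUC.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase + whitespace split. Real Ukrainian text would
    # need proper tokenization and, most likely, lemmatization.
    return text.lower().split()


def train_naive_bayes(labeled_texts):
    """Count documents per class and words per class from (text, label) pairs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_texts:
        class_docs[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return class_docs, word_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-posterior for the given text."""
    class_docs, word_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        # Log prior of the class.
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # words never seen in training are ignored
            # Add-one (Laplace) smoothed log likelihood of the word.
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set (illustrative only, not from BrUC).
TRAIN = [
    ("футбол матч команда гол", "sport"),
    ("команда виграла матч", "sport"),
    ("уряд закон парламент", "politics"),
    ("парламент ухвалив закон", "politics"),
]

model = train_naive_bayes(TRAIN)
print(predict(model, "команда забила гол"))
print(predict(model, "новий закон"))
```

The other three methods follow the same train-then-predict pattern on the same labeled set; only the model fitted to the word counts differs.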

Conclusions. The model for classifying Ukrainian texts based on support vector machines has demonstrated the best results. Its mean accuracy is 0.80.
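The abstract does not state how the mean accuracy was obtained. One common reading is the average of per-split accuracies over several train/test splits, and the sketch below assumes exactly that. The function names `accuracy` and `mean_accuracy` and the toy labels are hypothetical and serve only to show the arithmetic.

```python
from statistics import mean


def accuracy(true_labels, predicted_labels):
    """Share of documents whose predicted label equals the true label."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def mean_accuracy(folds):
    """Average the accuracy over (true_labels, predicted_labels) pairs."""
    return mean(accuracy(true, predicted) for true, predicted in folds)


# Hypothetical results of two evaluation splits (illustrative only).
folds = [
    (["a", "b", "a", "b"], ["a", "b", "b", "b"]),  # 3 of 4 correct
    (["a", "a"], ["a", "a"]),                      # 2 of 2 correct
]
print(mean_accuracy(folds))
```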

Full text available in Russian.

Keywords: classifier of text documents, corpus of documents, method of supervised learning.

  1. Alekseenko, L.A., Darchuk, N.P., Zuban, O.M., 2001. “Methodology for the Creation of the Automated System of Morpheme-Word Formation Analysis (ACMSA) of the words of the Ukrainian language”. Scientific heritage of Prof. S.V. Semchynsky. Collection of scientific works. K., part 1, pp. 38–49. (In Ukrainian).
  2. Brown Corpus of the Ukrainian Language, https://github.com/brown-uk/corpus/ (In Ukrainian).
  3. Large electronic dictionary of the Ukrainian language, https://github.com/brown-uk/dict_uk/ (In Ukrainian).
  4. Russian-Ukrainian dictionary, https://r2u.org.ua/
  5. LanguageTool API NLP UK. https://github.com/brown-uk/nlp_uk/.
  6. Starko, V.F., 2014. “Formation of the Brown Corpus of the Ukrainian Language”. Linguistic and conceptual pictures of the world, 48, pp. 415–421. (In Ukrainian).
  7. Starko, V.F., 2013. “Classical Approach to Categorization in Language-Cognitive Studies”. Linguistic and conceptual pictures of the world, 43(4), pp. 117–123. (In Ukrainian).
  8. Jones, K.S., 2004. “A statistical interpretation of the term specificity and its application in retrieval”. Journal of Documentation. MCB University Press, 60 (5), pp. 493-502.
    https://doi.org/10.1108/00220410410560573
  9. Pirotte, F., Sunar, F., Piragnolo, M., 2016. “Benchmark of machine learning methods for the classification of a Sentinel-2 image”. Int. Archives of Photogrammetry, Remote Sensing & Spatial Information Sciences, 41, pp. 335-340.
    https://doi.org/10.5194/isprs-archives-XLI-B7-335-2016
  10. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P., 2007. Section 16.5. Support Vector Machines. Numerical Recipes: The Art of Scientific Computing (3rd ed.). New York: Cambridge University Press, 1262 p.
  11. Russell, S., Norvig, P., 2003. Artificial Intelligence: A Modern Approach (2nd ed.). Prentice Hall, 1112 p.
  12. Omid’s Logistic Regression tutorial. http://www.omidrouhani.com/research/logisticregression/html/logisticregression.htm

Received 15.01.2019