Control Systems and Computers, N4, 2024, Article 4

Control Systems and Computers, 2024, Issue 4 (308), pp. 

UDC 681.3.062

Marchenko Oleksandr O., Doctor of Sciences (Physics and Mathematics), Professor, Head of the Department, International Research and Training Center for Information Technologies and Systems NAS and MES of Ukraine, Glushkov ave., 40, Kyiv, 03187, Ukraine, omarchenko@univ.kiev.ua

Nasirov Emil M., PhD (Physics and Mathematics), Senior Researcher, International Research and Training Center for Information Technologies and Systems NAS and MES of Ukraine, Glushkov ave., 40, Kyiv, 03187, Ukraine, enasirov@gmail.com

Volosheniuk Dmytro O., PhD (Technical Sciences), Head of the Laboratory, International Research and Training Center for Information Technologies and Systems NAS and MES of Ukraine, Glushkov ave., 40, Kyiv, 03187, Ukraine, p-h-o-e-n-i-x@ukr.net

Building the Ukrainian-language Training Dataset for Determining the Sentiment Analysis of Texts

Introduction. The volume of news items, social network pages, and online chats grows every day, and with it the amount of emotionally charged information. The number of information threats is growing as well. Under these conditions, building systems that determine the emotional coloring of texts becomes highly relevant.

Purpose. Emotional messages can be detected and classified with artificial intelligence, in particular with neural network methods. Training a neural network requires a training set of texts whose emotional coloring has already been assessed. Such labeled training sets exist for news and other texts in English; however, no accessible training set of Ukrainian-language news and texts has been created so far.

Methods. Statistical methods of sentiment analysis, applied with an extended tonality vocabulary, are used to detect the tonality of texts.
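
As an illustration only, vocabulary-based tonality detection can be sketched as follows. This is a minimal sketch and not the authors' method: the lexicon entries, their weights, the threshold, and the function names `tokenize`, `score_text`, and `label_text` are all hypothetical. It assumes each vocabulary word carries a signed polarity weight and that a text is labeled by the sum of the weights of its words.

```python
import re

# Hypothetical toy tonality vocabulary: word -> signed polarity weight.
# These entries and weights are illustrative assumptions, not the article's data.
TONALITY_LEXICON = {
    "добрий": 1.0,     # "good"
    "чудовий": 2.0,    # "wonderful"
    "поганий": -1.0,   # "bad"
    "жахливий": -2.0,  # "terrible"
}


def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def score_text(text, lexicon=TONALITY_LEXICON):
    """Sum the polarity weights of all tokens found in the vocabulary."""
    return sum(lexicon.get(token, 0.0) for token in tokenize(text))


def label_text(text, threshold=0.5):
    """Map the summed score to a coarse label; the threshold is an assumption."""
    score = score_text(text)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for sample in ["Чудовий день", "Жахливий і поганий сервіс", "Сьогодні вівторок"]:
        print(sample, "->", label_text(sample))
```

The sketch matches exact word forms only. Ukrainian is highly inflected, so a practical system would presumably need lemmatization or a vocabulary that lists inflected forms, which is one possible reason to extend the vocabulary.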

Results. An extended tonality vocabulary of the Ukrainian language was built. A large corpus of 5,318,783 Ukrainian-language texts of various types, annotated with their emotional coloring, was built; experts assessed the accuracy of the annotation at 98%.

Conclusion. The resulting text corpus can be used to train and test neural networks for sentiment analysis of Ukrainian-language texts.

Keywords: artificial intelligence, computational linguistics.