Control Systems and Computers, N2, 2017, Article 3

DOI: https://doi.org/10.15407/usim.2017.02.038

Upr. sist. maš., 2017, Issue 2 (268), pp. 38-45.

UDC 004.934

Sazhok Mykola M., Ph.D. (Eng.), International Research and Training Center for Information Technologies and Systems NAS and MES of Ukraine (IRTC IT and S), Ukraine, Kiev, E-mail: sazhok@gmail.com

Speech Information Technologies and Systems

Introduction. Speech technologies and systems have become part of the contemporary world and are helping to transform society. The Generative Model, proposed in Ukraine in the 1960s, became the basis of the most productive modern techniques for speech recognition, and for pattern recognition in general.

Purpose. To analyze and generalize theoretical and applied progress in order to characterize the state of the art, identify trends, and suggest directions for further research and development.

Scope of application. The scope of speech technology extends continuously through new speech-signal sources, multilingualism, growing user expectations, and IT progress. Voice input and text-to-speech systems free the user's hands and eyes, leading to natural man-machine communication in which the user cooperates with the cybernetic system. An opposite example, where no such cooperation is provided, is an automatic broadcast monitoring system.

Methods. The contemporary formulation of the Generative Model is presented, with a focus on its acoustic component. Recently adapted mathematical tools that enable effective modelling of context dependency and pattern hierarchy are reviewed. For decades, the feature-space regions in which a basic speech segment is observed have been successfully approximated by Gaussian Mixture Models; the subsequently applied Deep Learning techniques improve the approximation quality. Ways to model multilingual aspects are described. The general model of a speech recognition system is presented and its key applications are described.

Conclusion. The gap between computer and human speech processing has been significantly reduced for certain tasks. Extended (e.g. structural) context dependency and feature-space modelling have shown their effectiveness and are promising directions for further research.

Keywords: speech signal recognition, speech understanding, text-to-speech, spoken dialog systems, generative model.


  1. Vintsyuk, T.K., 1987. Analysis, recognition and semantic interpretation of speech signals. Kiev: Naukova Dumka, 264 p.
  2. http://voxalead.labs.exalead.com/
  3. http://tech.ebu.ch/docs/events/metadata15/Petr Vitek and Pavel Ircing_CT_UWB.pdf
  4. Furui, S., 2005. “50 years of progress in speech and speaker recognition”. Proc. of the 10th Int. Conf. “Speech and Computer”. Patras, Greece, pp. 1–9.
  5. Mohri, M., Pereira, F., Riley, M., 2008. Speech recognition with weighted finite-state transducers. Springer Handbook on Speech Processing and Speech Communication. Berlin, Heidelberg: Springer, pp. 559–584.
  6. Gales, M., 2007. Discriminative models for speech recognition. ITA Workshop, Univ. of San Diego, USA, Feb. 2007.
  7. Gales, M., Young, S., 2007. “The Application of Hidden Markov Models in Speech Recognition”. Foundations and Trends in Signal Processing, 1(3), pp. 195–304.
    https://doi.org/10.1561/2000000004
  8. Robeyko, V.V., Sazhok, M.M., 2011. “Multivariate Multilevel Model for the Conversion of Orthographic Text to Phonemic Transcription”. Artificial Intelligence, 4, pp. 117–125.
  9. Povey, D., Ghoshal, A., Boulianne, G. et al., 2011. “The Kaldi Speech Recognition Toolkit”. Proc. of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.
  10. Vintsiuk, T., Sazhok, M., 2005. “Multi-Level Multi-Decision Models for ASR”. Proc. of the 10th Int. Conf. on Speech and Computer – SpeCom’2005, Patras, Greece, 17–19 Oct., pp. 69–76.
  11. Dahl, G., Dong, Yu., Deng, Li. et al., 2011. “Context-Dependent Pre-Trained Deep Neural Networks for Large Vocabulary Speech Recognition”. IEEE Trans. Speech and Audio Proc., Special Issue on Deep Learning for Speech Processing.
  12. Robeyko, V., Sazhok, M., 2012. “Spontaneous Speech Recognition Based on Composite Real-Time Acoustic Models”. Artificial Intelligence, 4, pp. 253–263.
  13. Sazhok, M., Yatsenko, V., 2010. “A system for understanding on the basis of interpretation of speech signals within subject areas”. Proc. of the Int. Conf. UkrObraz’2010, Oct. 25–29, 2010, pp. 103–106.
  14. Pilipenko, V.V., Bidnyuk, S.A., Selyukh, R.A. et al., 2013. “Building scenarios of formalized oral dialogue on the example of ordering tickets for railway trains”. Upravlausie sistemy i masiny, 4, pp. 71–75.
  15. Vasilyeva, N.B., Sukhoruchkina, O.N., Yatsenko, V.V., 2015. “Features of building a model of user speech communication with a multifunctional mobile service robot”. Upravlausie sistemy i masiny, 6, pp. 16–22, 28.
  16. Sazhok, N.N., Robeyko, V.V., Fedorin, D.Ya. et al., 2015. “A system for converting television and radio broadcasting to text for the Ukrainian language”. Upravlausie sistemy i masiny, 6, pp. 66–73.
  17. Vasilieva, N.B., Pilipenko, V.V., Radutskiy, O.M. et al., 2010. “Creation of an acoustic corpus of Ukrainian broadcast speech”. Processing of Signals and Images and Pattern Recognition: Proc. of the 10th Int. Conf. UkrObraz’2010, Kyiv, Oct. 25–29, 2010, pp. 55–58.
  18. Vintsyuk, T.K., Ludovik, T.V., Sazhok, M.M. et al., 2002. “Automatic reading of Ukrainian texts based on a phoneme-triphone model using a natural speech signal”. Processing of Signals and Images and Pattern Recognition: Proc. of the 6th All-Ukrainian Int. Conf. UkrObraz’2002, Kyiv, Oct. 8–12, 2002. Kyiv: UASOIRO, pp. 79–84.

Received 30.04.2017