Control Systems and Computers, N1, 2024, Article 4
https://doi.org/10.15407/csc.2024.01.038
Control Systems and Computers, 2024, Issue 1 (305), pp. 38-49
UDC 004.93
V.O. KHOLIEV, Professor Assistant at Electronic Computers Department, Kharkiv National University of Radio Electronics “KNURE”, Kharkiv, Nauky Ave., 14, 61166, Ukraine, ORCID: https://orcid.org/0000-0002-9148-1561, vladyslav.kholiev@nure.ua
O.Y. BARKOVSKA, Assoc. Professor at Electronic Computers Department, Kharkiv National University of Radio Electronics “KNURE”, Kharkiv, Nauky Ave., 14, 61166, Ukraine, ORCID: https://orcid.org/0000-0001-7496-4353, olesia.barkovska@nure.ua
IMPROVED SPEAKER RECOGNITION SYSTEM USING AUTOMATIC LIP RECOGNITION
The paper addresses the problem of speech recognition using additional sources besides the voice itself, in conditions where the quality or availability of audio information is inadequate (for example, in the presence of noise or additional speakers). This is achieved by using automatic lip recognition (ALR) methods, which rely on non-acoustic biosignals generated by the human body during speech production. Applications of this approach include medical applications, as well as processing voice commands in languages with poor audio conditions. The aim of this work is to create a speech recognition system based on a combination of speaker lip recognition and context prediction. To achieve this goal, the following tasks were performed: substantiating a system for recognizing voice commands of a silent speech interface (SSI) based on a combination of two neural network architectures; implementing a viseme recognition model based on a CNN architecture; and implementing an encoder-decoder architecture for an LSTM recurrent neural network model that analyzes and predicts the context of a speaker’s speech. The developed system was tested on a chosen dataset. Across 7 experiments, the average recognition error of the proposed ALR system under different conditions ranges from 4.34% to 5.12% for the character error rate (CER) and from 5.52% to 6.06% for the word error rate (WER), which is an advantage over the LipNet project, which additionally processes audio data, for the original data without noise.
Keywords: SSI; ALR; AV-ASR; silent speech interface; automatic lip recognition; RNN; LSTM; speech recognition.
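The CER and WER figures reported in the abstract are conventionally defined as an edit distance between the reference and the recognized text, divided by the reference length (in characters or words, respectively). The following is a minimal sketch of that conventional definition, not the authors' evaluation code; the function names and the sample sentences are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix to each hyp prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical command sentence, made up for illustration only.
    reference = "place blue at f two now"
    hypothesis = "place blue at f too now"
    print(f"WER = {wer(reference, hypothesis):.2%}")
    print(f"CER = {cer(reference, hypothesis):.2%}")
```

Because one wrong character makes a whole word wrong, WER is usually at least as high as CER on the same output, which is consistent with the ranges quoted in the abstract.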
Received 24.02.2024