Control Systems and Computers, N4, 2024, Article 5
https://doi.org/10.15407/csc.2024.04.039
Control Systems and Computers, 2024, Issue 4 (308), pp. 39-49.
Yevhen Mrozek, PhD Student, Department of Speech Recognition and Synthesis, International Research and Training Center for Information Technologies and Systems NAS and MES of Ukraine, 40, Akademika Glushkova Avenue, Kyiv, Ukraine, 03187, ORCID: https://orcid.org/0009-0008-4989-5016, zekamrozek@gmail.com
ANALYSIS OF MODERN APPROACHES TO SPEECH RECOGNITION TASKS
Introduction. The need for modern approaches to speech recognition stems from the rapid development of artificial intelligence and the demand for more accurate and faster human-computer interaction in areas such as voice assistants, translation, and automation. The direction is becoming increasingly relevant as the volume of generated audio data grows and real-time processing becomes essential, particularly in the Ukrainian context, where multiple languages and dialects coexist. Several approaches to speech recognition, analysis, and transcription currently exist, including methods based on neural networks, speaker diarization techniques, noise removal, and data structuring. However, the challenge of building a universal solution that serves multilingual environments and effectively handles unstructured audio data remains open.
Purpose. To review existing tools and algorithms for solving speech recognition tasks, with particular attention to Ukrainian.
Methods. Speech recognition, deep learning, transformers.
Results. Theoretical foundations of approaches and models for speech recognition were considered for building a knowledge base for a multilingual spoken dialogue system. Effective examples of improving transcription accuracy for languages with limited data were also explored, along with potential steps to enhance system speed. Potential datasets for model training were discussed.
Conclusion. A structured review of modern methods for processing and analyzing multilingual audio files was provided, outlining their advantages, disadvantages, and unresolved issues.
Keywords: speech recognition, neural networks, machine learning.
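Transcription accuracy for the systems reviewed above is conventionally reported as word error rate (WER), the word-level edit distance between the recognizer output and a reference transcript, normalized by the reference length. A minimal, self-contained sketch of the standard Levenshtein-based computation (illustrative only, not taken from the article):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    # Dynamic-programming table for word-level Levenshtein distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                     # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                     # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("привіт як справи", "привіт що справи")` gives one substitution over three reference words, i.e. about 0.33. Production evaluations typically also normalize case and punctuation before comparison.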
- Jurafsky, D., Martin, J. Speech and Language Processing. 7 Jan. 2023. [online]. Available at: <https://web.stanford.edu/~jurafsky/slp3/A.pdf> [Accessed 1 Aug. 2024].
- Gales, M., Young, S. (2007). "The Application of Hidden Markov Models in Speech Recognition". Foundations and Trends in Signal Processing, vol. 1, no. 3, pp. 195-304. https://doi.org/10.1561/2000000004. [online]. Available at: <https://mi.eng.cam.ac.uk/~mjfg/mjfg_NOW.pdf> [Accessed 4 Aug. 2024].
- Jurafsky, D., Martin, J. Speech and Language Processing: Automatic Speech Recognition and Text-to-Speech. [online]. Available at: <https://web.stanford.edu/~jurafsky/slp3/16.pdf> [Accessed 20 Aug. 2024].
- Vaswani, A., et al. “Attention Is All You Need”. ArXiv.org, 12 June 2017, [online] Available at: <https://arxiv.org/abs/1706.03762> [Accessed 20 Aug. 2024].
- Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I. (2023, July). "Robust speech recognition via large-scale weak supervision". In International Conference on Machine Learning, PMLR, pp. 28492-28518.
- Nouza, J., Zdansky, J., Cerva, P., Silovsky, J. (2010). "Challenges in speech processing of Slavic languages (case studies in speech recognition of Czech and Slovak)". Development of Multimodal Interfaces: Active Listening and Synchrony: Second COST 2102 International Training School, Dublin, Ireland, March 23-27, 2009, Revised Selected Papers, pp. 225-241. https://doi.org/10.1007/978-3-642-12397-9_19
- 24 Channel. "What language do Ukrainians speak at home: survey." 24 Channel, 17 Aug. 2021. [online]. Available at: <24tv.ua/yakoyu-movoyu-ukrayintsi-spilkuyutsya-vdoma-opituvannya-ukrayina-novini_n1715078> [Accessed 10 Jun. 2024].
- Shubham, K. "Whisper Deployment Decisions: Part I – Evaluating Latency, Costs, and Performance Metrics." Medium, ML6team, 21 July 2023. [online]. Available at: <blog.ml6.eu/whisper-deployment-decisions-part-i-evaluating-latency-costs-and-performance-metrics-d07f6edc9ec0> [Accessed 12 Sept. 2024].
- Gandhi, S., von Platen, P., & Rush, A. M. (2023). Distil-whisper: Robust knowledge distillation via large-scale pseudo labelling. arXiv preprint arXiv:2311.00430. [online], Available at: <https://arxiv.org/abs/2311.00430> [Accessed 1 Sept. 2024].
- Ferraz, T. P., Boito, M. Z., Brun, C., Nikoulina, V. (2024). "Multilingual DistilWhisper: Efficient Distillation of Multi-Task Speech Models Via Language-Specific Experts". In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 10716-10720. https://doi.org/10.1109/ICASSP48485.2024.10447520
- Bartelds, M., San, N., McDonnell, B., Jurafsky, D., Wieling, M. (2023). "Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation". arXiv preprint. [online]. Available at: <https://arxiv.org/abs/2305.10951> [Accessed 26 Aug. 2024].
Received 13.10.2024