Abstract
Recent models for Visual Speech Recognition (VSR) have shown remarkable progress over the last few years. They have, however, been applied mainly to datasets such as Lip Reading Sentences 3 (LRS3), LRS2, or Lombard GRID, but not yet to social robots. Since social robots struggle to recognize speech in acoustically challenging and crowded environments, we believe such models are promising tools for real-time interaction with users. This paper presents a feasibility study on integrating speech recognition (SR) in audio, visual (lip-reading), and audio-visual modalities into social robots. To this end, this paper contributes a pipeline that detects an active speaker based on lip movement, post-processes the audio and video footage, and runs inference with the state-of-the-art Auto-AVSR model. In a user study (N=26), we evaluated the feasibility of audio, visual, and mixed-modality speech recognition on a Pepper robot. We demonstrate the feasibility of using singular and mixed modalities with speech-to-text inference in natural interaction. The results show that it is feasible to deploy such models on social robots in a controlled, noiseless, and non-interactive environment. Additionally, the results revealed that instructing participants to emphasize their lip movements significantly improved speech-to-text inference results. Our work provides initial insights into the benefits and challenges of using VSR, ASR, and AVSR for human-robot interaction (HRI).
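The abstract describes the active-speaker detection step only at a high level. The sketch below illustrates one plausible realization of a lip-movement gate using MediaPipe Face Mesh, a real library for facial landmark detection; the landmark indices are standard Face Mesh indices, but the aperture heuristic, the `movement_threshold` value, and the function names are assumptions for illustration, not the paper's actual method. A clip flagged as active speech would then be passed on to post-processing and Auto-AVSR inference.

```python
# Minimal sketch of lip-movement-based active-speaker gating, assuming a
# MediaPipe Face Mesh front end. The heuristic and threshold are illustrative
# assumptions; the paper does not specify this implementation.
import cv2
import mediapipe as mp
import numpy as np

def lip_aperture(landmarks):
    # Face Mesh indices: 13 = upper inner lip, 14 = lower inner lip,
    # 78 / 308 = left / right mouth corners.
    top, bottom = landmarks[13], landmarks[14]
    left, right = landmarks[78], landmarks[308]
    width = np.hypot(right.x - left.x, right.y - left.y)
    # Vertical lip opening normalized by mouth width, so the measure is
    # roughly invariant to distance from the camera.
    return np.hypot(top.x - bottom.x, top.y - bottom.y) / max(width, 1e-6)

def is_active_speaker(frames, movement_threshold=0.02):
    """Flag a clip as containing active speech when the normalized lip
    aperture varies across frames (threshold is an assumed heuristic)."""
    apertures = []
    with mp.solutions.face_mesh.FaceMesh(static_image_mode=False) as mesh:
        for frame in frames:  # frames: BGR images, e.g. read with cv2
            result = mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.multi_face_landmarks:
                lm = result.multi_face_landmarks[0].landmark
                apertures.append(lip_aperture(lm))
    return len(apertures) > 1 and float(np.std(apertures)) > movement_threshold
```

In a pipeline like the one described, this gate would decide which video segments (and their aligned audio) are worth cropping and forwarding to the AVSR model, rather than running inference on every frame.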
| Original language | English |
| --- | --- |
| Publication status | Published - 2024 |
| Event | 33rd IEEE International Conference on Robot and Human Interactive Communication, RO-MAN 2024, Pasadena Convention Center, Pasadena, United States. Duration: 26 Aug 2024 → 30 Aug 2024. https://www.ro-man2024.org/ |
Conference
| Conference | 33rd IEEE International Conference on Robot and Human Interactive Communication, RO-MAN 2024 |
| --- | --- |
| Country/Territory | United States |
| City | Pasadena |
| Period | 26/08/24 → 30/08/24 |
| Internet address | https://www.ro-man2024.org/ |
Keywords
- Applications of Social Robots
- Multimodal Interaction and Conversational Skills
- Detecting and Understanding Human Activity