Audio-Visual Speech Recognition for Human-Robot Interaction: A Feasibility Study

Sander Goetzee, Konstantin Mihhailov, Roel van de Laar, Kim Baraka, Koen Hindriks*

*Corresponding author for this work

Research output: Contribution to Conference › Paper › Academic


Abstract

Models for Visual Speech Recognition (VSR) have shown remarkable progress over the last few years. They have, however, been applied mainly to datasets such as Lip Reading Sentences 3 (LRS3), LRS2, or Lombard GRID, and have not yet been deployed on social robots. As social robots struggle to recognize speech in acoustically challenging and crowded environments, we believe such models are promising tools for real-time interaction with users. This paper presents a feasibility study focusing on the integration of speech recognition (SR) using mixed modalities (audio, visual lip-reading, and audio-visual) in social robots. To this end, this paper contributes a pipeline that detects the active speaker based on lip movement, post-processes the audio and video footage, and performs inference with the state-of-the-art Auto-AVSR model. In a user study (N=26), we evaluated the feasibility of audio, visual, and mixed-modality speech recognition on a Pepper robot. We demonstrate the feasibility of using singular and mixed modalities with speech-to-text inference in natural interaction. The results show that it is feasible to deploy such models on social robots in a controlled, noiseless, and non-interactive environment. Additionally, the results revealed that instructing participants to emphasize their lip movements significantly improved speech-to-text inference results. Our work provides initial insights into the benefits and challenges of using VSR, ASR, and AVSR for HRI.
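The active-speaker detection step described in the abstract can be illustrated with a minimal sketch: among the tracked faces, pick the one whose mouth region shows the most frame-to-frame motion. The function names, the motion-energy heuristic, and the threshold below are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def mouth_motion_energy(frames: np.ndarray) -> float:
    """Mean absolute frame-to-frame difference over a mouth-region crop.

    frames: array of shape (T, H, W), grayscale pixel values for one face.
    """
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))
    return float(diffs.mean())

def active_speaker(rois: dict, threshold: float = 1.0):
    """Return the face ID whose mouth region moves most, if above threshold.

    rois maps a face ID to its stacked mouth-region crops. Returns None
    when no face shows enough lip movement (nobody is speaking).
    """
    scores = {fid: mouth_motion_energy(f) for fid, f in rois.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

# Synthetic demo: one "talking" mouth with changing pixels,
# one "silent" mouth that stays constant.
rng = np.random.default_rng(0)
talking = rng.integers(0, 255, size=(10, 32, 32)).astype(np.uint8)
silent = np.full((10, 32, 32), 128, dtype=np.uint8)
print(active_speaker({"face_0": silent, "face_1": talking}))  # face_1
```

In a full pipeline, the selected face's video crop and the microphone audio would then be post-processed and passed to the AVSR model for transcription.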
Original language: English
Publication status: Published - 2024
Event: 33rd IEEE International Conference on Robot and Human Interactive Communication, RO-MAN 2024 - Pasadena Convention Center, Pasadena, United States
Duration: 26 Aug 2024 - 30 Aug 2024
https://www.ro-man2024.org/

Conference

Conference: 33rd IEEE International Conference on Robot and Human Interactive Communication, RO-MAN 2024
Country/Territory: United States
City: Pasadena
Period: 26/08/24 - 30/08/24
Internet address: https://www.ro-man2024.org/

Keywords

  • Applications of Social Robots
  • Multimodal Interaction and Conversational Skills
  • Detecting and Understanding Human Activity

