Rahul Sharma and Shrikanth Narayanan. Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection. IEEE Open Journal of Signal Processing, 1–9, 2023.


Abstract

Active speaker detection in videos addresses associating a source face, visible in the video frames, with the underlying speech in the audio modality. The two primary sources of information to derive such a speech-face relationship are i) visual activity and its interaction with the speech signal and ii) co-occurrences of speakers' identities across modalities in the form of face and speech. The two approaches have their limitations: the audio-visual activity models get confused with other frequently occurring vocal activities, such as laughing and chewing, while the speakers' identity-based methods are limited to videos having enough disambiguating information to establish a speech-face association. Since the two approaches are independent, we investigate their complementary nature in this work. We propose a novel unsupervised framework to guide the speakers' cross-modal identity association with the audio-visual activity for active speaker detection. Through experiments on entertainment media videos from two benchmark datasets, the AVA active speaker dataset (movies) and the Visual Person Clustering Dataset (TV shows), we show that a simple late fusion of the two approaches enhances the active speaker detection performance.
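The "simple late fusion" mentioned in the abstract can be illustrated with a minimal sketch: combining per-face scores from the two independent approaches at the score level. This is a hypothetical illustration, not the paper's implementation; the function name, score ranges, and the mixing weight `alpha` are all assumptions for exposition.

```python
# Hypothetical sketch of score-level late fusion for active speaker detection.
# Assumes each visible face track gets two scores in [0, 1]: one from an
# audio-visual activity model and one from cross-modal identity association.
# The mixing weight `alpha` is illustrative, not a value from the paper.

def late_fusion(av_activity_score: float,
                identity_score: float,
                alpha: float = 0.5) -> float:
    """Convex combination of the two independent modality scores."""
    return alpha * av_activity_score + (1.0 - alpha) * identity_score

# A face with strong audio-visual activity evidence but weaker identity
# evidence still receives a moderately high fused score.
fused = late_fusion(0.9, 0.6)
```

Because the two scores come from independent models, a convex combination lets one source compensate where the other is unreliable (e.g. identity evidence when visual activity is ambiguous due to laughing or chewing).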

BibTeX Entry

@ARTICLE{10102534,
  author={Sharma, Rahul and Narayanan, Shrikanth},
  journal={IEEE Open Journal of Signal Processing},
  title={Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection},
  year={2023},
  pages={1-9},
  abstract={Active speaker detection in videos addresses associating a source face, visible in the video frames, with the underlying speech in the audio modality. The two primary sources of information to derive such a speech-face relationship are i) visual activity and its interaction with the speech signal and ii) co-occurrences of speakers' identities across modalities in the form of face and speech. The two approaches have their limitations: the audio-visual activity models get confused with other frequently occurring vocal activities, such as laughing and chewing, while the speakers' identity-based methods are limited to videos having enough disambiguating information to establish a speech-face association. Since the two approaches are independent, we investigate their complementary nature in this work. We propose a novel unsupervised framework to guide the speakers' cross-modal identity association with the audio-visual activity for active speaker detection. Through experiments on entertainment media videos from two benchmark datasets, the AVA active speaker dataset (movies) and the Visual Person Clustering Dataset (TV shows), we show that a simple late fusion of the two approaches enhances the active speaker detection performance.},
  doi={10.1109/OJSP.2023.3267269},
  link={http://sail.usc.edu/publications/files/Sharma-OJSP-2023.pdf},
  ISSN={2644-1322},
}

Generated by bib2html.pl (written by Patrick Riley) on Fri Mar 22, 2024 09:15:39