“When” and “where” are fundamental pillars of computational media intelligence: a holistic understanding of a scene requires locating the action of interest in space and time. In this project, we develop a system that automatically detects active speakers in space and time, specifically in media content.
This work is inspired by our preliminary work [1], where we presented a cross-modal system for voice activity detection that observes only the visual frames. A thorough analysis showed that the learned embeddings can localize human bodies and faces. The cross-modal architecture is shown below:
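Before turning to that diagram, the following is a minimal sketch of the cross-modal idea: a visual-only encoder trained to predict voice activity using labels derived from the audio track. The module names, layer sizes, and training loop are illustrative assumptions, not the actual architecture from [1].

```python
# Sketch only (assumed names/sizes): a visual-only encoder that predicts
# voice activity for a short stack of frames, supervised by labels
# derived from the audio track (cross-modal supervision).
import torch
import torch.nn as nn

class VisualVAD(nn.Module):
    def __init__(self, num_frames: int = 5):
        super().__init__()
        # 3D convolutions over (time, height, width) of grayscale frames
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # global pooling -> clip embedding
        )
        self.classifier = nn.Linear(32, 1)  # speech / no-speech logit

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, 1, num_frames, H, W) grayscale frame stacks
        emb = self.encoder(clips).flatten(1)
        return self.classifier(emb).squeeze(-1)

# One training step against audio-derived VAD labels
model = VisualVAD()
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()

clips = torch.randn(8, 1, 5, 112, 112)          # dummy frame stacks
vad_labels = torch.randint(0, 2, (8,)).float()  # labels from an audio VAD

loss = loss_fn(model(clips), vad_labels)
loss.backward()
optim.step()
```

In this kind of setup, no visual annotation is needed: the audio stream provides the supervision, and spatial localization of speakers can then be read off the intermediate visual activations, as analyzed in [1].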