Active speaker detection

Automatic visual detection of persons actively speaking on screen

Project description

“When” and “where” are the fundamental pillars of computational media intelligence: a holistic understanding of a scene requires locating the action of interest in both space and time. In this project, we develop a system that automatically detects active speakers in space and time, specifically for media content.


This project builds on our preliminary work [1], in which we presented a cross-modal system that performs voice activity detection from the visual frames alone. A thorough analysis showed that the learned embeddings can localize human bodies and faces. The cross-modal architecture is shown below:

Demonstration of the system output.

The heatmaps represent the system's intermediate output and highlight the salient regions in the visual frames that pertain to active speakers. Active speakers are shown in green bounding boxes and all other faces in blue.
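
As a rough illustration of how a saliency heatmap could be turned into the green and blue boxes described above, the following sketch averages the heatmap values inside each face box and compares the result with a threshold. The mean pooling, the threshold of 0.5, the box format, and the names `mean_saliency` and `color_faces` are all assumptions made for this example; they are not taken from the project and may differ from what the actual system does.

```python
from typing import List, Tuple

# Hypothetical box format: (x0, y0, x1, y1), with x1 and y1 exclusive.
Box = Tuple[int, int, int, int]


def mean_saliency(heatmap: List[List[float]], box: Box) -> float:
    """Average heatmap value inside a face box (0.0 for an empty box)."""
    x0, y0, x1, y1 = box
    values = [v for row in heatmap[y0:y1] for v in row[x0:x1]]
    return sum(values) / len(values) if values else 0.0


def color_faces(heatmap: List[List[float]], boxes: List[Box],
                threshold: float = 0.5) -> List[str]:
    """Label each face 'green' (assumed active speaker) or 'blue' (other face).

    The threshold is an illustrative assumption, not a value from the project.
    """
    return ["green" if mean_saliency(heatmap, b) >= threshold else "blue"
            for b in boxes]


# Toy 4x6 heatmap: high saliency on the left, low on the right.
heatmap = [
    [0.9, 0.8, 0.1, 0.0, 0.0, 0.1],
    [0.9, 0.9, 0.1, 0.0, 0.1, 0.0],
    [0.8, 0.7, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.1, 0.0, 0.0, 0.0, 0.0],
]
faces = [(0, 0, 2, 3), (4, 0, 6, 3)]
colors = color_faces(heatmap, faces)
```

In this toy input the left face overlaps the high-saliency region and the right face does not, so the two boxes receive different colors. A real pipeline would also need a face detector and temporal smoothing across frames, neither of which is sketched here.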