Subtitle Aligned Movie (SAM) Corpus

A large-scale dataset for speech activity detection in movies

Project Description

The SAM Corpus consists of speech segments from over 500 top-grossing Hollywood movies spanning the years 2014-19. Easily scalable techniques are employed to align movie audio to subtitles, which are then used to develop speech activity detection (SAD) models.

Gentle alignment

Speech boundaries in movie audio are demarcated using an automatic alignment tool called Gentle. Gentle aligns a text transcript to the corresponding audio at the phoneme level, yielding accurate word boundaries for successfully aligned words. We use subtitles extracted from OpenSubtitles as the text input to Gentle. Around 70% of all words in the transcripts are successfully aligned using this method, resulting in over 650K aligned audio words.
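As a rough sketch of this step: Gentle (https://github.com/lowerquality/gentle) can be run as a local HTTP server, and the snippet below strips an .srt file down to plain text and posts it, together with the audio, to that server's transcription endpoint. The file names, the SRT-cleaning helper, and the default port are illustrative assumptions about a typical Gentle setup, not the corpus's released tooling.

    import re
    import requests

    def srt_to_text(srt_path):
        """Strip cue numbers, timestamps, and markup from an .srt file,
        leaving the plain transcript Gentle expects."""
        lines = []
        for line in open(srt_path, encoding="utf-8", errors="ignore"):
            line = line.strip()
            if not line or line.isdigit() or "-->" in line:
                continue  # skip cue indices and timing lines
            lines.append(re.sub(r"<[^>]+>", "", line))  # drop markup tags
        return " ".join(lines)

    def align(audio_path, srt_path, server="http://localhost:8765"):
        """POST audio + transcript to a local Gentle server and return
        the word-level alignments that succeeded (roughly 70% of words)."""
        resp = requests.post(
            server + "/transcriptions?async=false",
            files={
                "audio": open(audio_path, "rb"),
                "transcript": srt_to_text(srt_path).encode("utf-8"),
            },
        )
        resp.raise_for_status()
        return [w for w in resp.json()["words"] if w["case"] == "success"]

    words = align("movie.wav", "movie.srt")  # each word has 'start'/'end' times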

Inter-Pausal Units (IPUs)

IPUs are short segments of contiguous speech separated by pauses. IPU construction is governed by two threshold values: pause duration (PT) and segment length (ST). After Gentle alignment, consecutive aligned words are grouped into IPUs: two consecutive words belong to the same IPU if they are no more than PT seconds apart, and only IPUs at least ST seconds long are retained. Constructing IPUs in this fashion enables "segment"-level processing (sketched below), which is not only computationally efficient but also captures context useful for several downstream tasks.
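A minimal sketch of this grouping, assuming word alignments shaped like Gentle's JSON output (dicts with 'start' and 'end' times). The ST default of 1.25 s matches the corpus statistics below; the PT value of 0.5 s is purely an illustrative assumption.

    def group_ipus(words, pt=0.5, st=1.25):
        """Greedily merge consecutive words separated by less than `pt`
        seconds of pause, then keep segments at least `st` seconds long.
        NOTE: pt=0.5 is an assumed placeholder, not the corpus's value."""
        ipus, current = [], []
        for w in sorted(words, key=lambda w: w["start"]):
            if current and w["start"] - current[-1]["end"] > pt:
                ipus.append(current)  # pause exceeded: close the running IPU
                current = []
            current.append(w)
        if current:
            ipus.append(current)
        spans = [(ipu[0]["start"], ipu[-1]["end"]) for ipu in ipus]
        return [(s, e) for s, e in spans if e - s >= st]  # apply ST threshold

    ipus = group_ipus(words)  # `words` from the alignment step above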

Dataset stats

500+ Movies: data from over 500 top-grossing Hollywood movies

225 Hours: 225 hours of Gentle-aligned speech

275K+ IPUs: over 275K IPUs of duration 1.25 s or longer

Network Architecture


CNN architectures for Speech Activity Detection [1]

Visualization of the CNN-GAP model's activation maps

The visualization of the activation maps from the CNN-GAP model architecture shows that, for speech segments, the model attends to lower-frequency regions corresponding to the first few harmonics. For noise regions, on the other hand, the model attends to higher frequencies, which correspond to non-speech sounds.
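For concreteness, below is a hedged PyTorch sketch of a CNN-with-global-average-pooling (CNN-GAP) classifier and the class activation map (CAM) computation behind this kind of visualization. The layer sizes, the two-class output, and the 64-bin spectrogram input are illustrative assumptions; see [1] for the architectures actually used.

    import torch
    import torch.nn as nn

    class CNNGAP(nn.Module):
        """Toy CNN-GAP model: conv features -> global average pooling ->
        linear speech/non-speech classifier. Sizes are assumptions."""
        def __init__(self, n_classes=2):
            super().__init__()
            self.features = nn.Sequential(          # input: (B, 1, freq, time)
                nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            )
            self.gap = nn.AdaptiveAvgPool2d(1)      # global average pooling
            self.fc = nn.Linear(128, n_classes)

        def forward(self, x):
            fmaps = self.features(x)                # (B, 128, H, W)
            logits = self.fc(self.gap(fmaps).flatten(1))
            return logits, fmaps

    def class_activation_map(model, x, cls):
        """Weight the final conv maps by the fc weights of class `cls`
        (the standard CAM recipe); high values mark the time-frequency
        regions that drove the decision."""
        _, fmaps = model(x)
        return torch.einsum("c,bchw->bhw", model.fc.weight[cls], fmaps)

    model = CNNGAP()
    spec = torch.randn(1, 1, 64, 100)               # fake log-spectrogram
    cam = class_activation_map(model, spec, cls=1)  # map for the "speech" class
    print(cam.shape)                                # torch.Size([1, 16, 25])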

Results

Summary

Automatic, easily scalable methods are used to extract reliable speech data from a large number of movies for training SAD models. Using novel CNN architectures, we show state-of-the-art performance on two benchmark movie datasets [1].

Publications

[1] Hebbar, R., Somandepalli, K., & Narayanan, S., "Robust speech activity detection in movie audio: Data resources and experimental evaluation," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.