Gender Identification System for Movie Audio

We developed a robust gender identification (GI) system to enable speaking-time estimation from movie audio. GI from audio is usually a two-step process: first, Speech Activity Detection (SAD) detects speech segments; then, GI on those segments yields the gender labels. The proposed system implements parallel SAD and GI systems. Both tasks are challenging because of the variety of acoustic scenes and background noises present in movies. Hence, both systems are trained specifically on audio acquired in such diverse acoustic conditions.
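
To make the two-step flow concrete, here is a minimal, self-contained Python sketch of speaking-time estimation. The functions `run_sad` and `run_gi` are hypothetical stand-ins for the trained models, not the project's actual API, and the fixed segments are placeholder values:

```python
# Sketch of the two-step pipeline: SAD finds speech segments,
# GI labels each segment, and per-gender durations are accumulated.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float  # seconds
    end: float

def run_sad(audio) -> list[Segment]:
    """Step 1: detect speech segments (stubbed with fixed spans here)."""
    return [Segment(1.0, 3.5), Segment(7.2, 9.0)]

def run_gi(audio, seg: Segment) -> str:
    """Step 2: label the speaker gender of one speech segment (stubbed)."""
    return "female"

def estimate_speaking_time(audio) -> dict[str, float]:
    totals = {"male": 0.0, "female": 0.0}
    for seg in run_sad(audio):
        totals[run_gi(audio, seg)] += seg.end - seg.start
    return totals

print(estimate_speaking_time(audio=None))  # {'male': 0.0, 'female': 5.3}
```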

System Schematic

Figure: Schematic of the components of the audio gender identification system.

SAD Pipeline

Training Data and Features

95 of the top 100 movies of 2014 are used for SAD development. Of these, we selected 13 movies from diverse genres as our validation set, while the remaining 82 movies were used for training.

The training data was created by extracting speech and non-speech segments from the movies using subtitle information. We used VobSub2SRT to extract subtitles from the movies, and then used the forced aligner Gentle to align words in the movie audio with the text transcript obtained from the subtitle file. These word segments serve as our speech data, and the regions between any two subtitle segments serve as non-speech data. 23 log-mel filterbank energies are used as input features to the neural network.
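
The sketch below illustrates this data preparation, assuming `librosa` and `numpy`; it is not the project's released code. The file path, segment timestamps, and frame settings (25 ms window, 10 ms hop, 16 kHz audio) are hypothetical placeholders:

```python
import librosa

SR = 16000          # assumed sampling rate
N_MELS = 23         # 23 log-mel filterbank energies, as in the text
FRAME_LEN = 0.025   # assumed 25 ms analysis window
FRAME_HOP = 0.010   # assumed 10 ms hop

def logmel_features(wav_path, start_s, end_s):
    """Compute 23 log-mel filterbank energies for one audio segment."""
    y, sr = librosa.load(wav_path, sr=SR, offset=start_s,
                         duration=end_s - start_s)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=N_MELS,
        n_fft=int(FRAME_LEN * sr), hop_length=int(FRAME_HOP * sr))
    return librosa.power_to_db(mel).T   # shape: (num_frames, 23)

def nonspeech_gaps(subtitle_spans):
    """Regions between consecutive subtitle segments act as non-speech."""
    spans = sorted(subtitle_spans)
    return [(a_end, b_start)
            for (_, a_end), (b_start, _) in zip(spans, spans[1:])
            if b_start > a_end]

# Speech examples come from Gentle's word-level alignments; non-speech
# examples come from the gaps between subtitle segments.
word_spans = [(10.31, 10.62), (10.70, 11.05)]   # hypothetical alignments
sub_spans = [(10.0, 12.5), (15.0, 17.2)]        # hypothetical subtitles
speech_feats = [logmel_features("movie.wav", s, e) for s, e in word_spans]
nonspeech_feats = [logmel_features("movie.wav", s, e)
                   for s, e in nonspeech_gaps(sub_spans)]
```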

System architecture

We propose a BLSTM network for SAD. The network architecture is: INP[(23,16)] → BLSTM[150] → FC[256] → FC[128] → FC[64] → σ(FC[2])
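
A PyTorch sketch of this architecture is shown below; it is an assumption that the original work can be expressed this way (a different toolkit may have been used), and pooling the final time step before the fully connected stack is our own illustrative choice:

```python
import torch
import torch.nn as nn

class SADNet(nn.Module):
    """INP[(23,16)] -> BLSTM[150] -> FC[256] -> FC[128] -> FC[64] -> sigma(FC[2])"""

    def __init__(self, n_feats=23, hidden=150):
        super().__init__()
        # BLSTM[150]: bidirectional, so each time step outputs 2 * 150 = 300 dims
        self.blstm = nn.LSTM(input_size=n_feats, hidden_size=hidden,
                             batch_first=True, bidirectional=True)
        self.fc = nn.Sequential(
            nn.Linear(2 * hidden, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 2),          # sigma(FC[2]): 2 output units
        )

    def forward(self, x):
        # x: (batch, 16 frames, 23 log-mel features)
        out, _ = self.blstm(x)
        # use the last time step's representation for the window decision
        return torch.sigmoid(self.fc(out[:, -1, :]))

model = SADNet()
probs = model(torch.randn(8, 16, 23))  # -> (8, 2) speech/non-speech scores
```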

GI Pipeline

Training Data

Features

System architecture

Results

Table 1: Results.