Motivation
Speaker recognition is the process of automatically determining a person's identity by analyzing spoken phrases. Many applications have been considered for speaker recognition: voice-controlled secure access, voice-based customization of services, surveillance, forensic investigation, banking transactions over a telephone line or other types of network, voice mail, and so on. Speaker recognition technology is expected to make our daily lives more convenient. One speaker recognition application is speaker indexing, the process of determining who is talking when. It is an integral element of speech data monitoring and content-based data mining applications. Consider, for example, applications such as meeting or teleconference monitoring, archiving, and browsing. A key motivation is that attending all relevant meetings face to face is often tedious or impossible. Multimedia meeting or teleconference monitors and browsers can be useful for obtaining meeting information, such as who is saying what and when, remotely through on-line or off-line systems. Specifically, these applications commonly include a speaker indexing process, that is, the segmentation and classification of data with respect to speakers.
General Diagram of the Unsupervised Speaker Indexing with Generic Models

The first step is front-end analysis, where the incoming audio samples are classified into foreground speech and other background audio (noise) types. Generally, audio data can be categorized into four broad classes: speech, music, environmental noise, and silence. In speaker indexing, we only need speech/non-speech discrimination. Background noise or music, when present, is likely to overlap with speech, and such corrupted speech is not easily discriminated from noise. Since it is critical that we not lose any speech data, the focus of the classification is to minimize false rejection, even at the cost of some false acceptance. Speech/non-speech discrimination usually relies on the zero-crossing rate and short-time energy. Speech signals usually have a higher silence ratio and a higher level of variation in zero-crossing rate than music. From the result of audio segmentation, we collect all segments classified as speech, whether clean or corrupted by music or other background noise.

Only the speech data are used for the next step, speaker change detection. In this step, the system sequentially detects whether a speaker change occurs in the middle of a speech analysis frame, without assuming any specific knowledge about the speakers. Once speaker change detection determines the boundaries, all the data between consecutive speaker change points are used for speaker clustering.

In the clustering step, we use speaker models from a predetermined generic model set. After clustering, the speaker-independent generic model is adapted into an appropriate speaker-dependent model during the indexing process. The adapted model either replaces the original, pre-adaptation model or is inserted back into the generic model set. When new audio samples after the boundary of the current speaker come into the system, the previous steps are repeated until all data are exhausted.
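The two front-end features named above, short-time energy and zero-crossing rate, can be illustrated with a minimal sketch. This is not the implementation used in this work. The frame length, hop size, energy floor, and the synthetic test signals are all assumptions chosen for illustration, as are the helper names. The sketch computes, for a segment, the silence ratio and the variance of the zero-crossing rate, the two quantities described above as typically higher for speech than for music.

```python
import math
import random


def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping frames of frame_len samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def segment_features(samples, frame_len=400, hop=160, energy_floor=1e-3):
    """Silence ratio and ZCR variance for one segment.

    frame_len=400 and hop=160 correspond to 25 ms / 10 ms at 16 kHz;
    these values and energy_floor are illustrative assumptions.
    """
    frames = frame_signal(samples, frame_len, hop)
    energies = [short_time_energy(f) for f in frames]
    zcrs = [zero_crossing_rate(f) for f in frames]
    silent = sum(1 for e in energies if e < energy_floor)
    return {
        "silence_ratio": silent / len(energies),
        "zcr_variance": variance(zcrs),
    }


SAMPLE_RATE = 16000  # assumed sampling rate in Hz
rng = random.Random(0)

# Music-like toy signal: one second of a steady 440 Hz tone.
tone_samples = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
                for n in range(SAMPLE_RATE)]

# Speech-like toy signal: alternating 100 ms stretches of a low-pitched
# tone ("voiced"), uniform noise ("unvoiced"), and silence (pauses).
speech_samples = []
for _ in range(3):
    speech_samples += [0.5 * math.sin(2 * math.pi * 150 * n / SAMPLE_RATE)
                       for n in range(1600)]
    speech_samples += [rng.uniform(-0.3, 0.3) for _ in range(1600)]
    speech_samples += [0.0] * 1600

tone = segment_features(tone_samples)
speech_like = segment_features(speech_samples)
print("tone:       ", tone)
print("speech-like:", speech_like)
```

In this toy setting the speech-like signal shows both a higher silence ratio and a larger zero-crossing-rate variance than the steady tone. A real front end would tune any decision threshold on these features toward accepting doubtful segments, since the goal stated above is to minimize false rejection of speech.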

Example of Unsupervised Speaker Indexing (Broadcast News)

Journals
Conference Proceedings/Presentations