The speech and language pipeline comprises several modules: audio pre-processing and feature extraction, voice activity detection (VAD), speaker diarization and role assignment, automatic speech recognition (ASR), and speech and language analysis using unimodal and multimodal approaches.

Audio Pre-Processing and Feature Extraction

The features exploited include filter-bank-based spectral features such as MFCC, PLP, and LPC coefficients, along with prosodic features such as pitch and intensity. When the audio quality is poor, speech enhancement is applied before feature extraction.
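
As a minimal sketch of this filter-bank front end, the following computes log mel filter-bank energies with NumPy; the frame length, hop, and filter count are illustrative defaults, and taking a DCT of the output would yield MFCCs.

```python
import numpy as np

def mel_filterbank_energies(signal, sr=16000, frame_len=400, hop=160,
                            n_fft=512, n_mels=26):
    """Log mel filter-bank energies, the front end behind MFCC features."""
    # Slice the waveform into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular filters spaced evenly on the mel scale.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log compression; a DCT over the mel axis would give MFCCs.
    return np.log(power @ fbank.T + 1e-10)
```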

highlighted publications
  • Tiantian Feng, Amrutha Nadarajan, Colin Vaz, Brandon M. Booth, Shrikanth S. Narayanan, "TILES Audio Recorder: An unobtrusive wearable solution to track audio activity", WearSys '18: 4th ACM Workshop on Wearable Systems and Applications, Munich, Germany, 2018.

Voice Activity Detection (VAD)

The VAD module marks regions of speech in an audio stream. Several detection approaches are employed across research projects, including simple energy-based VAD, CNN-based VAD, and DNN-based VAD pre-trained on RATS/ASpIRE data and adapted to in-domain conditions. The model is chosen per project according to its conditions and constraints.
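
A minimal energy-based VAD, the simplest of the approaches above, can be sketched as follows; thresholding relative to the loudest frame is an illustrative choice, not the configuration of any deployed pipeline.

```python
import numpy as np

def energy_vad(signal, frame_len=400, hop=160, threshold_db=-35.0):
    """Mark a frame as speech when its energy exceeds a threshold
    relative to the loudest frame (a minimal energy-based VAD)."""
    n_frames = 1 + max(len(signal) - frame_len, 0) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    energy_db = 10.0 * np.log10(np.mean(signal[idx] ** 2, axis=1) + 1e-12)
    # Boolean speech/non-speech decision per frame.
    return energy_db > energy_db.max() + threshold_db
```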

highlighted publications
  • Amrutha Nadarajan, Krishna Somandepalli, Shrikanth Narayanan, "Speaker Agnostic Foreground Speech Detection From Audio Recordings in Workplace Settings From Wearable Recorders", Proceedings of ICASSP, 2019.
  • Rajat Hebbar, Krishna Somandepalli, Shrikanth Narayanan, "Robust Speech Activity Detection in Movie Audio: Data Resources and Experimental Evaluation", Proceedings of ICASSP, 2019.

Speaker Diarization and Role Assignment

Speaker diarization determines who spoke when in an audio stream, a task made challenging by overlapping speech, rapid speaker changes, and non-stationary noise. Both a pre-trained x-vector diarization model and information bottleneck (IB) based speaker diarization are implemented in different speech pipelines. Role assignment after diarization determines the most likely speaker from a collection of enrolled speakers by comparing each speaker's full ASR transcript against pre-built language models. In addition, multimodal approaches combining audio and language embeddings are used to detect speaker changes and improve module performance.
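
The speaker-change detection step can be sketched as a cosine-distance comparison between consecutive segment embeddings (e.g. x-vectors); the threshold here is an illustrative assumption, not a value from the deployed pipelines.

```python
import numpy as np

def detect_speaker_changes(embeddings, threshold=0.4):
    """Flag a speaker change between consecutive segment embeddings
    when their cosine distance exceeds a threshold -- a sketch of the
    change-detection step that precedes speaker clustering."""
    # L2-normalize so the dot product equals cosine similarity.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dist = 1.0 - np.sum(e[:-1] * e[1:], axis=1)  # cosine distance
    return dist > threshold  # True => change between segments i and i+1
```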

highlighted publications
  • Nikolaos Flemotomos, Panayiotis Georgiou, Shrikanth Narayanan, "Linguistically Aided Speaker Diarization Using Speaker Role Information", Proc. Odyssey 2020 The Speaker and Language Recognition Workshop, pp. 117-124, 2020.
  • Nikolaos Flemotomos, Panayiotis Georgiou, David Atkins, Shrikanth Narayanan, "Role Specific Lattice Rescoring for Speaker Role Recognition From Speech Recognition Outputs", Proceedings of ICASSP, 2019.
  • Nikolaos Flemotomos, Zhuohao Chen, David Atkins, Shrikanth Narayanan, "Role Annotated Speech Recognition for Conversational Interactions", Proceedings of the IEEE Workshop on Spoken Language Technology, 2018.
  • Nikolaos Flemotomos, Pavlos Papadopoulos, James Gibson, Shrikanth S. Narayanan, "Combined Speaker Clustering and Role Recognition in Conversational Speech", Proceedings of InterSpeech, Hyderabad, India, 2018.
  • Manoj Kumar, Pavlos Papadopoulos, Ruchir Travadi, Daniel Bone, Shrikanth Narayanan, "Improving Semi-Supervised Classification for Low-Resource Speech Interaction Applications", Proceedings of ICASSP, 2018.

Automatic Speech Recognition (ASR)

The ASR system produces N-best lists along with decoding confidence scores, with the aim of robust and accurate speech-to-text conversion across different scenarios. For example, in the CARE project the child ASR models are DNN-HMM hybrids based on a time-delay neural network (TDNN) with self-attention, trained on 2700 hours of prompted and spontaneous child speech (700 hrs clean, 2000 hrs noise- and reverb-augmented). In the AAA project, the ASR model for psychology counseling sessions is trained on a large collection of public datasets using the Kaldi ASpIRE recipe.
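
Downstream modules re-rank these N-best lists; the kind of rescoring used with role-specific language models can be sketched as a weighted combination of acoustic and LM scores. The `Hypothesis` fields and the weight below are illustrative assumptions, not the pipeline's actual interface.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    am_score: float   # acoustic-model log-likelihood from decoding
    lm_score: float   # language-model log-probability (e.g. role-specific LM)

def rescore_nbest(nbest, lm_weight=0.8):
    """Re-rank an ASR N-best list by a weighted sum of acoustic and
    language-model scores; swapping in a role-specific LM turns this
    into a simple form of role-aware rescoring."""
    return sorted(nbest,
                  key=lambda h: h.am_score + lm_weight * h.lm_score,
                  reverse=True)
```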

highlighted publications
  • Manoj Kumar, Daniel Bone, Kelly McWilliams, Shanna Williams, Thomas Lyon, Shrikanth Narayanan, "Multi-scale Context Adaptation for Improving Child Automatic Speech Recognition in Child-Adult Spoken Interactions", Proceedings of Interspeech, 2017.

Speech and Language Analysis

SAIL focuses on understanding human behavior through both speech and language. Beyond language processing of the textual information, much of this work analyzes and models multimodal behavioral signals from both text and speech prosody, including pitch, loudness, pauses, word duration, and voice quality.
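
One common way to combine such signals is early fusion: summarize utterance-level prosodic contours with simple statistics and concatenate them with a lexical embedding. In this sketch the particular statistics and the embedding are placeholders, not the features used in any specific publication.

```python
import numpy as np

def fuse_features(pitch, intensity, lexical_embedding):
    """Early fusion of prosody and text: reduce the per-frame pitch and
    intensity contours to utterance-level statistics, then concatenate
    them with a lexical (text) embedding for a downstream classifier."""
    prosody = np.array([np.mean(pitch), np.std(pitch),
                        np.mean(intensity), np.std(intensity)])
    return np.concatenate([prosody, lexical_embedding])
```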

highlighted publications
  • Karan Singla, Zhuohao Chen, David Atkins, Shrikanth Narayanan, "Towards end-2-end learning for predicting behavior codes from spoken utterances in psychotherapy conversations", Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 3797-3803, 2020.
  • Zhuohao Chen, Karan Singla, James Gibson, Dogan Can, Zac Imel, David Atkins, Panayiotis Georgiou, Shrikanth Narayanan, "Prediction of Therapist Behaviors in Addiction Counseling by Exploiting Class Confusions", Proceedings of ICASSP, 2019.
  • Victor Martinez, Krishna Somandepalli, Karan Singla, Anil Ramakrishna, Yalda Uhls, Shrikanth Narayanan, "Violence Rating Prediction from Movie Scripts", Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019.
  • Karan Singla, Zhuohao Chen, Nikolaos Flemotomos, James Gibson, Dogan Can, David Atkins, Shrikanth S. Narayanan, "Using Prosodic and Lexical Information for Learning Utterance-level Behaviors in Psychotherapy", Proceedings of InterSpeech, Hyderabad, India, 2018.
  • Nikolaos Flemotomos, Victor Martinez, James Gibson, David Atkins, Torrey Creed, Shrikanth S. Narayanan, "Language Features for Automated Evaluation of Cognitive Behavior Psychotherapy Sessions", Proceedings of InterSpeech, Hyderabad, India, 2018.
  • Rajat Hebbar, Krishna Somandepalli, Shrikanth S. Narayanan, "Improving Gender Identification in Movie Audio using Cross-Domain Data", Proceedings of InterSpeech, Hyderabad, India, 2018.
  • Karan Singla, Dogan Can, Shrikanth S. Narayanan, "A Multi-task Approach to Learning Multilingual Representations", Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 2018.
  • Che-Wei Huang, Shrikanth Narayanan, "Characterizing Types of Convolution in Deep Convolutional Recurrent Neural Networks for Robust Speech Emotion Recognition", IEEE Transactions on Affective Computing, 2018.
  • James Gibson, Dogan Can, Panayiotis Georgiou, David Atkins, Shrikanth Narayanan, "Attention Networks for Modeling Behavior in Addiction Counseling", Proceedings of Interspeech, 2017.
  • Anil Ramakrishna, Victor Martinez, Nikolaos Malandrakis, Karan Singla, Shrikanth Narayanan, "Linguistic analysis of differences in portrayal of movie characters", Proceedings of the Association for Computational Linguistics, 2017.
  • Md Nasir, Naveen Kumar, Panayiotis Georgiou, Shrikanth S. Narayanan, "Robust Multichannel Gender Classification from Speech in Movie Audio", Proceedings of Interspeech, 2016.
  • Naveen Kumar, Tanaya Guha, Che Wei Huang, Colin Vaz, Shrikanth S. Narayanan, "Novel Affective Features for Multiscale Prediction of Emotion in Music", Proceedings of 2016 IEEE Workshop on Multimedia Signal Processing, 2016.
  • Anil Ramakrishna, Nikolaos Malandrakis, Elizabeth Staruk, Shrikanth S. Narayanan, "A quantitative analysis of gender differences In movies using psycholinguistic normatives", Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, pp. 1996-2001, 2015.