The speech and language pipeline comprises several modules: audio pre-processing and feature extraction, voice activity detection (VAD), speaker diarization and role assignment, automatic speech recognition (ASR), and speech and language analysis using unimodal and multimodal approaches.

Audio Pre-Processing and Feature Extraction

The features exploited include filter-bank-based spectral features such as MFCC, PLP, and LPC coefficients, along with prosodic features such as pitch and intensity. When the audio quality is poor, speech enhancement is applied before feature extraction.
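
As a minimal sketch of this filter-bank front end, the following computes log mel filter-bank energies with NumPy; the frame length, hop, and filter count are illustrative defaults, and taking a DCT of the output would yield MFCCs.

```python
import numpy as np

def mel_filterbank_energies(signal, sr=16000, frame_len=400, hop=160,
                            n_fft=512, n_mels=26):
    """Log mel filter-bank energies, the front end behind MFCC features."""
    # Slice the waveform into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular filters spaced evenly on the mel scale.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log compression; a DCT over the mel axis would give MFCCs.
    return np.log(power @ fbank.T + 1e-10)
```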

highlighted publications
  • Tiantian Feng, Amrutha Nadarajan, Colin Vaz, Brandon M. Booth, Shrikanth S. Narayanan, "TILES Audio Recorder: An unobtrusive wearable solution to track audio activity", WearSys '18: 4th ACM Workshop on Wearable Systems and Applications, Munich, Germany, 2018.

Voice Activity Detection (VAD)

The VAD module marks regions of speech in an audio stream. Several detection approaches are employed across research projects, including simple energy-based VAD, CNN-based VAD, and DNN-based VAD pre-trained on RATS/ASpIRE data and adapted to in-domain conditions. The model is chosen per project according to its conditions and constraints.
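
A minimal energy-based VAD, the simplest of the approaches above, can be sketched as follows; thresholding relative to the loudest frame is an illustrative choice, not the configuration of any deployed pipeline.

```python
import numpy as np

def energy_vad(signal, frame_len=400, hop=160, threshold_db=-35.0):
    """Mark a frame as speech when its energy exceeds a threshold
    relative to the loudest frame (a minimal energy-based VAD)."""
    n_frames = 1 + max(len(signal) - frame_len, 0) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    energy_db = 10.0 * np.log10(np.mean(signal[idx] ** 2, axis=1) + 1e-12)
    # Boolean speech/non-speech decision per frame.
    return energy_db > energy_db.max() + threshold_db
```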

highlighted publications
  • Amrutha Nadarajan, Krishna Somandepalli, Shrikanth Narayanan, "Speaker Agnostic Foreground Speech Detection From Audio Recordings in Workplace Settings From Wearable Recorders", Proceedings of ICASSP, 2019.
  • Rajat Hebbar, Krishna Somandepalli, Shrikanth Narayanan, "Robust Speech Activity Detection in Movie Audio: Data Resources and Experimental Evaluation", Proceedings of ICASSP, 2019.

Speaker Diarization and Role Assignment

Speaker diarization determines who spoke when in an audio stream, a task made challenging by overlapping speech, rapid speaker changes, and non-stationary noise. Both a pre-trained x-vector diarization model and information bottleneck (IB) based speaker diarization are implemented in different speech pipelines. Role assignment after diarization determines the most likely speaker from a collection of enrolled speakers by comparing each speaker's full ASR transcript against pre-built language models. In addition, multimodal approaches combining audio and language embeddings are used to detect speaker changes and improve module performance.
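
The speaker-change detection step can be sketched as a cosine-distance comparison between consecutive segment embeddings (e.g. x-vectors); the threshold here is an illustrative assumption, not a value from the deployed pipelines.

```python
import numpy as np

def detect_speaker_changes(embeddings, threshold=0.4):
    """Flag a speaker change between consecutive segment embeddings
    when their cosine distance exceeds a threshold -- a sketch of the
    change-detection step that precedes speaker clustering."""
    # L2-normalize so the dot product equals cosine similarity.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dist = 1.0 - np.sum(e[:-1] * e[1:], axis=1)  # cosine distance
    return dist > threshold  # True => change between segments i and i+1
```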

highlighted publications
  • Nikolaos Flemotomos, Panayiotis Georgiou, Shrikanth Narayanan, "Linguistically Aided Speaker Diarization Using Speaker Role Information", Proc. Odyssey 2020 The Speaker and Language Recognition Workshop, pp. 117-124, 2020.
  • Nikolaos Flemotomos, Panayiotis Georgiou, David Atkins, Shrikanth Narayanan, "Role Specific Lattice Rescoring for Speaker Role Recognition From Speech Recognition Outputs", Proceedings of ICASSP, 2019.
  • Nikolaos Flemotomos, Zhuohao Chen, David Atkins, Shrikanth Narayanan, "Role Annotated Speech Recognition for Conversational Interactions", Proceedings of the IEEE Workshop on Spoken Language Technology, 2018.
  • Nikolaos Flemotomos, Pavlos Papadopoulos, James Gibson, Shrikanth S. Narayanan, "Combined Speaker Clustering and Role Recognition in Conversational Speech", Proceedings of InterSpeech, Hyderabad, India, 2018.
  • Manoj Kumar, Pavlos Papadopoulos, Ruchir Travadi, Daniel Bone, Shrikanth Narayanan, "Improving Semi-Supervised Classification for Low-Resource Speech Interaction Applications", Proceedings of ICASSP, 2018.

Automatic Speech Recognition (ASR)

The ASR system produces N-best lists along with decoding confidence scores, with the aim of robust and accurate speech-to-text conversion across different scenarios. For example, in the CARE project the child ASR models are DNN-HMM hybrids based on a time-delay neural network (TDNN) with self-attention, trained on 2700 hours of prompted and spontaneous child speech (700 hrs clean, 2000 hrs noise- and reverb-augmented). In the AAA project, the ASR model for psychology counseling sessions is trained on a large collection of public datasets using the Kaldi ASpIRE recipe.
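
Downstream modules re-rank these N-best lists; the kind of rescoring used with role-specific language models can be sketched as a weighted combination of acoustic and LM scores. The `Hypothesis` fields and the weight below are illustrative assumptions, not the pipeline's actual interface.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    am_score: float   # acoustic-model log-likelihood from decoding
    lm_score: float   # language-model log-probability (e.g. role-specific LM)

def rescore_nbest(nbest, lm_weight=0.8):
    """Re-rank an ASR N-best list by a weighted sum of acoustic and
    language-model scores; swapping in a role-specific LM turns this
    into a simple form of role-aware rescoring."""
    return sorted(nbest,
                  key=lambda h: h.am_score + lm_weight * h.lm_score,
                  reverse=True)
```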

highlighted publications
  • Manoj Kumar, Daniel Bone, Kelly McWilliams, Shanna Williams, Thomas Lyon, Shrikanth Narayanan, "Multi-scale Context Adaptation for Improving Child Automatic Speech Recognition in Child-Adult Spoken Interactions", Proceedings of Interspeech, 2017.

Speech and Language Analysis

SAIL focuses on understanding human behavior through both speech and language. Beyond language processing of the textual information, much of this work analyzes and models multimodal behavioral signals from both text and speech prosody, including pitch, loudness, pauses, word duration, and voice quality.
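
One common way to combine such signals is early fusion: summarize utterance-level prosodic contours with simple statistics and concatenate them with a lexical embedding. In this sketch the particular statistics and the embedding are placeholders, not the features used in any specific publication.

```python
import numpy as np

def fuse_features(pitch, intensity, lexical_embedding):
    """Early fusion of prosody and text: reduce the per-frame pitch and
    intensity contours to utterance-level statistics, then concatenate
    them with a lexical (text) embedding for a downstream classifier."""
    prosody = np.array([np.mean(pitch), np.std(pitch),
                        np.mean(intensity), np.std(intensity)])
    return np.concatenate([prosody, lexical_embedding])
```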

highlighted publications
  • Karan Singla, Zhuohao Chen, David Atkins, Shrikanth Narayanan, "Towards end-2-end learning for predicting behavior codes from spoken utterances in psychotherapy conversations", Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 3797-3803, 2020.
  • Zhuohao Chen, Karan Singla, James Gibson, Dogan Can, Zac Imel, David Atkins, Panayiotis Georgiou, Shrikanth Narayanan, "Prediction of Therapist Behaviors in Addiction Counseling by Exploiting Class Confusions", Proceedings of ICASSP, 2019.
  • Victor Martinez, Krishna Somandepalli, Karan Singla, Anil Ramakrishna, Yalda Uhls, Shrikanth Narayanan, "Violence Rating Prediction from Movie Scripts", Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019.
  • Karan Singla, Zhuohao Chen, Nikolaos Flemotomos, James Gibson, Dogan Can, David Atkins, Shrikanth S. Narayanan, "Using Prosodic and Lexical Information for Learning Utterance-level Behaviors in Psychotherapy", Proceedings of InterSpeech, Hyderabad, India, 2018.
  • Nikolaos Flemotomos, Victor Martinez, James Gibson, David Atkins, Torrey Creed, Shrikanth S. Narayanan, "Language Features for Automated Evaluation of Cognitive Behavior Psychotherapy Sessions", Proceedings of InterSpeech, Hyderabad, India, 2018.
  • Rajat Hebbar, Krishna Somandepalli, Shrikanth S. Narayanan, "Improving Gender Identification in Movie Audio using Cross-Domain Data", Proceedings of InterSpeech, Hyderabad, India, 2018.
  • Karan Singla, Dogan Can, Shrikanth S. Narayanan, "A Multi-task Approach to Learning Multilingual Representations", Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 2018.
  • Che-Wei Huang, Shrikanth Narayanan, "Characterizing Types of Convolution in Deep Convolutional Recurrent Neural Networks for Robust Speech Emotion Recognition", IEEE Transactions on Affective Computing, 2018.
  • James Gibson, Dogan Can, Panayiotis Georgiou, David Atkins, Shrikanth Narayanan, "Attention Networks for Modeling Behavior in Addiction Counseling", Proceedings of Interspeech, 2017.
  • Anil Ramakrishna, Victor Martinez, Nikolaos Malandrakis, Karan Singla, Shrikanth Narayanan, "Linguistic analysis of differences in portrayal of movie characters", Proceedings of the Association for Computational Linguistics, 2017.
  • Md Nasir, Naveen Kumar, Panayiotis Georgiou, Shrikanth S. Narayanan, "Robust Multichannel Gender Classification from Speech in Movie Audio", Proceedings of Interspeech, 2016.
  • Naveen Kumar, Tanaya Guha, Che Wei Huang, Colin Vaz, Shrikanth S. Narayanan, "Novel Affective Features for Multiscale Prediction of Emotion in Music", Proceedings of 2016 IEEE Workshop on Multimedia Signal Processing, 2016.
  • Anil Ramakrishna, Nikolaos Malandrakis, Elizabeth Staruk, Shrikanth S. Narayanan, "A quantitative analysis of gender differences In movies using psycholinguistic normatives", Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, pp. 1996-2001, 2015.