The speech and language pipeline comprises several modules: audio pre-processing and feature extraction, voice activity detection (VAD), speaker diarization and role assignment, automatic speech recognition (ASR), and speech and language analysis using unimodal and multimodal approaches.

Audio Pre-Processing and Feature Extraction

The features exploited include filter-bank-based spectral features such as MFCC, PLP, and LPC coefficients, along with prosodic features such as pitch and intensity. When the audio quality is poor, speech enhancement is applied before feature extraction.
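
As a minimal sketch of this filter-bank front end, the following computes log mel filter-bank energies with NumPy; the frame length, hop, and filter count are illustrative defaults, and taking a DCT of the output would yield MFCCs.

```python
import numpy as np

def mel_filterbank_energies(signal, sr=16000, frame_len=400, hop=160,
                            n_fft=512, n_mels=26):
    """Log mel filter-bank energies, the front end behind MFCC features."""
    # Slice the waveform into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular filters spaced evenly on the mel scale.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log compression; a DCT over the mel axis would give MFCCs.
    return np.log(power @ fbank.T + 1e-10)
```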

highlighted publications
  • Tiantian Feng, Amrutha Nadarajan, Colin Vaz, Brandon M. Booth, Shrikanth S. Narayanan, "TILES Audio Recorder: An unobtrusive wearable solution to track audio activity", WearSys '18: 4th ACM Workshop on Wearable Systems and Applications, Munich, Germany, 2018.

Voice Activity Detection (VAD)

The VAD module marks regions of speech in an audio stream. Several detection approaches are employed across research projects, including simple energy-based VAD, CNN-based VAD, and DNN-based VAD pre-trained on RATS/ASpIRE data and adapted to in-domain conditions. The model is chosen per project according to its conditions and constraints.
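
A minimal energy-based VAD, the simplest of the approaches above, can be sketched as follows; thresholding relative to the loudest frame is an illustrative choice, not the configuration of any deployed pipeline.

```python
import numpy as np

def energy_vad(signal, frame_len=400, hop=160, threshold_db=-35.0):
    """Mark a frame as speech when its energy exceeds a threshold
    relative to the loudest frame (a minimal energy-based VAD)."""
    n_frames = 1 + max(len(signal) - frame_len, 0) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    energy_db = 10.0 * np.log10(np.mean(signal[idx] ** 2, axis=1) + 1e-12)
    # Boolean speech/non-speech decision per frame.
    return energy_db > energy_db.max() + threshold_db
```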

highlighted publications
  • Amrutha Nadarajan, Krishna Somandepalli, Shrikanth Narayanan, "Speaker Agnostic Foreground Speech Detection From Audio Recordings in Workplace Settings From Wearable Recorders", Proceedings of ICASSP, 2019.
  • Rajat Hebbar, Krishna Somandepalli, Shrikanth Narayanan, "Robust Speech Activity Detection in Movie Audio: Data Resources and Experimental Evaluation", Proceedings of ICASSP, 2019.

Speaker Diarization and Role Assignment

Speaker diarization determines who spoke when in an audio stream, a task made challenging by overlapping speech, rapid speaker changes, and non-stationary noise. Both a pre-trained x-vector diarization model and information bottleneck (IB) based speaker diarization are implemented in different speech pipelines. Role assignment after diarization determines the most likely speaker from a collection of enrolled speakers by comparing each speaker's full ASR transcript against pre-built language models. In addition, multimodal approaches combining audio and language embeddings are used to detect speaker changes and improve module performance.
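
The speaker-change detection step can be sketched as a cosine-distance comparison between consecutive segment embeddings (e.g. x-vectors); the threshold here is an illustrative assumption, not a value from the deployed pipelines.

```python
import numpy as np

def detect_speaker_changes(embeddings, threshold=0.4):
    """Flag a speaker change between consecutive segment embeddings
    when their cosine distance exceeds a threshold -- a sketch of the
    change-detection step that precedes speaker clustering."""
    # L2-normalize so the dot product equals cosine similarity.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dist = 1.0 - np.sum(e[:-1] * e[1:], axis=1)  # cosine distance
    return dist > threshold  # True => change between segments i and i+1
```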

highlighted publications
  • Nikolaos Flemotomos, Panayiotis Georgiou, Shrikanth Narayanan, "Linguistically Aided Speaker Diarization Using Speaker Role Information", Proc. Odyssey 2020 The Speaker and Language Recognition Workshop, pp. 117-124, 2020.
  • Nikolaos Flemotomos, Panayiotis Georgiou, David Atkins, Shrikanth Narayanan, "Role Specific Lattice Rescoring for Speaker Role Recognition From Speech Recognition Outputs", Proceedings of ICASSP, 2019.
  • Nikolaos Flemotomos, Zhuohao Chen, David Atkins, Shrikanth Narayanan, "Role Annotated Speech Recognition for Conversational Interactions", Proceedings of the IEEE Workshop on Spoken Language Technology, 2018.
  • Nikolaos Flemotomos, Pavlos Papadopoulos, James Gibson, Shrikanth S. Narayanan, "Combined Speaker Clustering and Role Recognition in Conversational Speech", Proceedings of InterSpeech, Hyderabad, India, 2018.
  • Manoj Kumar, Pavlos Papadopoulos, Ruchir Travadi, Daniel Bone, Shrikanth Narayanan, "Improving Semi-Supervised Classification for Low-Resource Speech Interaction Applications", Proceedings of ICASSP, 2018.

Automatic Speech Recognition (ASR)

The ASR system produces N-best lists along with decoding confidence scores, with the aim of robust and accurate speech-to-text conversion across different scenarios. For example, in the CARE project the child ASR models are DNN-HMM hybrids based on a time-delay neural network (TDNN) with self-attention, trained on 2700 hours of prompted and spontaneous child speech (700 hrs clean, 2000 hrs noise- and reverb-augmented). In the AAA project, the ASR model for psychology counseling sessions is trained on a large collection of public datasets using the Kaldi ASpIRE recipe.
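
Downstream modules re-rank these N-best lists; the kind of rescoring used with role-specific language models can be sketched as a weighted combination of acoustic and LM scores. The `Hypothesis` fields and the weight below are illustrative assumptions, not the pipeline's actual interface.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    am_score: float   # acoustic-model log-likelihood from decoding
    lm_score: float   # language-model log-probability (e.g. role-specific LM)

def rescore_nbest(nbest, lm_weight=0.8):
    """Re-rank an ASR N-best list by a weighted sum of acoustic and
    language-model scores; swapping in a role-specific LM turns this
    into a simple form of role-aware rescoring."""
    return sorted(nbest,
                  key=lambda h: h.am_score + lm_weight * h.lm_score,
                  reverse=True)
```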

highlighted publications
  • Manoj Kumar, Daniel Bone, Kelly McWilliams, Shanna Williams, Thomas Lyon, Shrikanth Narayanan, "Multi-scale Context Adaptation for Improving Child Automatic Speech Recognition in Child-Adult Spoken Interactions", Proceedings of Interspeech, 2017.

Speech and Language Analysis

SAIL focuses on understanding human behavior through both speech and language. Beyond language processing of the textual information, much of this work analyzes and models multimodal behavioral signals from both text and speech prosody, including pitch, loudness, pauses, word duration, and voice quality.
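
One common way to combine such signals is early fusion: summarize utterance-level prosodic contours with simple statistics and concatenate them with a lexical embedding. In this sketch the particular statistics and the embedding are placeholders, not the features used in any specific publication.

```python
import numpy as np

def fuse_features(pitch, intensity, lexical_embedding):
    """Early fusion of prosody and text: reduce the per-frame pitch and
    intensity contours to utterance-level statistics, then concatenate
    them with a lexical (text) embedding for a downstream classifier."""
    prosody = np.array([np.mean(pitch), np.std(pitch),
                        np.mean(intensity), np.std(intensity)])
    return np.concatenate([prosody, lexical_embedding])
```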

highlighted publications
  • Karan Singla, Zhuohao Chen, David Atkins, Shrikanth Narayanan, "Towards end-2-end learning for predicting behavior codes from spoken utterances in psychotherapy conversations", Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 3797-3803, 2020.
  • Zhuohao Chen, Karan Singla, James Gibson, Dogan Can, Zac Imel, David Atkins, Panayiotis Georgiou, Shrikanth Narayanan, "Prediction of Therapist Behaviors in Addiction Counseling by Exploiting Class Confusions", Proceedings of ICASSP, 2019.
  • Victor Martinez, Krishna Somandepalli, Karan Singla, Anil Ramakrishna, Yalda Uhls, Shrikanth Narayanan, "Violence Rating Prediction from Movie Scripts", Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019.
  • Karan Singla, Zhuohao Chen, Nikolaos Flemotomos, James Gibson, Dogan Can, David Atkins, Shrikanth S. Narayanan, "Using Prosodic and Lexical Information for Learning Utterance-level Behaviors in Psychotherapy", Proceedings of InterSpeech, Hyderabad, India, 2018.
  • Nikolaos Flemotomos, Victor Martinez, James Gibson, David Atkins, Torrey Creed, Shrikanth S. Narayanan, "Language Features for Automated Evaluation of Cognitive Behavior Psychotherapy Sessions", Proceedings of InterSpeech, Hyderabad, India, 2018.
  • Rajat Hebbar, Krishna Somandepalli, Shrikanth S. Narayanan, "Improving Gender Identification in Movie Audio using Cross-Domain Data", Proceedings of InterSpeech, Hyderabad, India, 2018.
  • Karan Singla, Dogan Can, Shrikanth S. Narayanan, "A Multi-task Approach to Learning Multilingual Representations", Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 2018.
  • Che-Wei Huang, Shrikanth Narayanan, "Characterizing Types of Convolution in Deep Convolutional Recurrent Neural Networks for Robust Speech Emotion Recognition", IEEE Transactions on Affective Computing, 2018.
  • James Gibson, Dogan Can, Panayiotis Georgiou, David Atkins, Shrikanth Narayanan, "Attention Networks for Modeling Behavior in Addiction Counseling", Proceedings of Interspeech, 2017.
  • Anil Ramakrishna, Victor Martinez, Nikolaos Malandrakis, Karan Singla, Shrikanth Narayanan, "Linguistic analysis of differences in portrayal of movie characters", Proceedings of the Association for Computational Linguistics, 2017.
  • Md Nasir, Naveen Kumar, Panayiotis Georgiou, Shrikanth S. Narayanan, "Robust Multichannel Gender Classification from Speech in Movie Audio", Proceedings of Interspeech, 2016.
  • Naveen Kumar, Tanaya Guha, Che Wei Huang, Colin Vaz, Shrikanth S. Narayanan, "Novel Affective Features for Multiscale Prediction of Emotion in Music", Proceedings of 2016 IEEE Workshop on Multimedia Signal Processing, 2016.
  • Anil Ramakrishna, Nikolaos Malandrakis, Elizabeth Staruk, Shrikanth S. Narayanan, "A quantitative analysis of gender differences In movies using psycholinguistic normatives", Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, pp. 1996-2001, 2015.