Tiantian Feng and Shrikanth Narayanan. Foundation Model Assisted Automatic Speech Emotion Recognition: Transcribing, Annotating, and Augmenting. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 12116–12120, April 2024.

Download

[PDF] 

Abstract

Significant advances are being made in speech emotion recognition (SER) using deep learning models. Nonetheless, training SER systems remains challenging, requiring both time and costly resources. Like many other machine learning tasks, acquiring datasets for SER requires substantial data annotation efforts, including transcription and labeling. These annotation processes present challenges when attempting to scale up conventional SER systems. Recent developments in foundational models have had a tremendous impact, giving rise to applications such as ChatGPT. These models have enhanced human-computer interactions including bringing unique possibilities for streamlining data collection in fields like SER. In this research, we explore the use of foundational models to assist in automating SER from transcription and annotation to augmentation. Our study demonstrates that these models can generate transcriptions to enhance the performance of SER systems that rely solely on speech data. Furthermore, we note that annotating emotions from transcribed speech remains a challenging task. However, combining outputs from multiple LLMs enhances the quality of annotations. Lastly, our findings suggest the feasibility of augmenting existing speech emotion datasets by annotating unlabeled speech samples.
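The abstract notes that combining outputs from multiple LLMs improves annotation quality. The paper's exact fusion method is not described here; the sketch below assumes a simple majority vote over per-utterance emotion labels, with all annotator names and labels hypothetical:

```python
from collections import Counter

def combine_llm_annotations(per_utterance_labels):
    """Majority-vote fusion: for each utterance, take the emotion label
    most frequently assigned across the LLM annotators."""
    merged = []
    for labels in per_utterance_labels:  # one label per LLM annotator
        winner, _ = Counter(labels).most_common(1)[0]
        merged.append(winner)
    return merged

# Hypothetical example: three LLM annotators labeling two utterances
llm_a = ["happy", "sad"]
llm_b = ["happy", "neutral"]
llm_c = ["angry", "sad"]
print(combine_llm_annotations(zip(llm_a, llm_b, llm_c)))  # ['happy', 'sad']
```

A real pipeline would also need a tie-breaking rule (e.g., falling back to a designated annotator or a confidence score), which a plain majority vote leaves unspecified.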

BibTeX Entry

@INPROCEEDINGS{10448130,
  author={Feng, Tiantian and Narayanan, Shrikanth},
  booktitle={ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Foundation Model Assisted Automatic Speech Emotion Recognition: Transcribing, Annotating, and Augmenting},
  year={2024},
  pages={12116--12120},
  abstract={Significant advances are being made in speech emotion recognition (SER) using deep learning models. Nonetheless, training SER systems remains challenging, requiring both time and costly resources. Like many other machine learning tasks, acquiring datasets for SER requires substantial data annotation efforts, including transcription and labeling. These annotation processes present challenges when attempting to scale up conventional SER systems. Recent developments in foundational models have had a tremendous impact, giving rise to applications such as ChatGPT. These models have enhanced human-computer interactions including bringing unique possibilities for streamlining data collection in fields like SER. In this research, we explore the use of foundational models to assist in automating SER from transcription and annotation to augmentation. Our study demonstrates that these models can generate transcriptions to enhance the performance of SER systems that rely solely on speech data. Furthermore, we note that annotating emotions from transcribed speech remains a challenging task. However, combining outputs from multiple LLMs enhances the quality of annotations. Lastly, our findings suggest the feasibility of augmenting existing speech emotion datasets by annotating unlabeled speech samples.},
  keywords={Training;Emotion recognition;Annotations;Speech recognition;Speech enhancement;Signal processing;Data models;Speech;Emotion recognition;Foundation model;Large Language Model},
  doi={10.1109/ICASSP48485.2024.10448130},
  ISSN={2379-190X},
  link = {https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10448130},
  month={April},
}

Generated by bib2html.pl (written by Patrick Riley) on Fri Mar 22, 2024 09:15:39