Rajat Hebbar, Digbalay Bose, and Shrikanth Narayanan. SEAR: Semantically-Grounded Audio Representations. In Proceedings of the 31st ACM International Conference on Multimedia (MM '23), pp. 2785–2794. Association for Computing Machinery, New York, NY, USA, 2023.


Abstract

Audio supports visual story-telling in movies through the use of different sounds. These sounds are often tied to different visual elements, including foreground entities, the interactions between them, as well as background context. Visual captions provide a condensed view of an image, providing a natural language description of entities and the relationships between them. In this work, we utilize visual captions to semantically ground audio representations in a self-supervised setup. We leverage state-of-the-art vision-language models to augment movie datasets with visual captions at scale, to the order of 9.6M captions, to learn audio representations from over 2500 hours of movie data. We evaluate the utility of the learned representations and show state-of-the-art performance on two movie understanding tasks, genre and speaking-style classification, outperforming video-based methods and audio baselines. Finally, we show that the learned model can be transferred in a zero-shot manner through application in both movie understanding tasks and general action recognition.

BibTeX Entry

@inproceedings{HebbarACM-MM2023,
author = {Hebbar, Rajat and Bose, Digbalay and Narayanan, Shrikanth},
title = {SEAR: Semantically-Grounded Audio Representations},
year = {2023},
isbn = {9798400701085},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3581783.3612592},
doi = {10.1145/3581783.3612592},
abstract = {Audio supports visual story-telling in movies through the use of different sounds. These sounds are often tied to different visual elements, including foreground entities, the interactions between them, as well as background context. Visual captions provide a condensed view of an image, providing a natural language description of entities and the relationships between them. In this work, we utilize visual captions to semantically ground audio representations in a self-supervised setup. We leverage state-of-the-art vision-language models to augment movie datasets with visual captions at scale, to the order of 9.6M captions, to learn audio representations from over 2500 hours of movie data. We evaluate the utility of the learned representations and show state-of-the-art performance on two movie understanding tasks, genre and speaking-style classification, outperforming video-based methods and audio baselines. Finally, we show that the learned model can be transferred in a zero-shot manner through application in both movie understanding tasks and general action recognition.},
booktitle = {Proceedings of the 31st ACM International Conference on Multimedia},
pages = {2785–2794},
numpages = {10},
keywords = {audio representation learning, cross-modal supervision, movie understanding, self-supervised learning},
location = {Ottawa ON, Canada},
link = {https://dl.acm.org/doi/pdf/10.1145/3581783.3612592},
series = {MM '23}
}

Generated by bib2html.pl (written by Patrick Riley) on Fri Mar 22, 2024 09:15:39