Digbalay Bose, Rajat Hebbar, Tiantian Feng, Krishna Somandepalli, Anfeng Xu, and Shrikanth Narayanan. MM-AU: Towards Multimodal Understanding of Advertisement Videos. In Proceedings of the 31st ACM International Conference on Multimedia, pp. 86–95, MM '23, Association for Computing Machinery, New York, NY, USA, 2023.


Abstract

Advertisement videos (ads) play an integral part in the domain of Internet e-commerce, as they amplify the reach of particular products to a broad audience or can serve as a medium to raise awareness about specific issues through concise narrative structures. The narrative structures of advertisements involve several elements like reasoning about the broad content (topic and the underlying message) and examining fine-grained details involving the transition of perceived tone due to the sequence of events and interaction among characters. In this work, to facilitate the understanding of advertisements along the three dimensions of topic categorization, perceived tone transition, and social message detection, we introduce a multimodal multilingual benchmark called MM-AU comprising 8.4K videos (147 hours) curated from multiple web-based sources. We explore multiple zero-shot reasoning baselines through the application of large language models on the ad transcripts. Further, we demonstrate that leveraging signals from multiple modalities, including audio, video, and text, in multimodal transformer-based supervised models leads to improved performance compared to unimodal approaches.

BibTeX Entry

@inproceedings{BoseACM-MM2023,
author = {Bose, Digbalay and Hebbar, Rajat and Feng, Tiantian and Somandepalli, Krishna and Xu, Anfeng and Narayanan, Shrikanth},
title = {MM-AU: Towards Multimodal Understanding of Advertisement Videos},
year = {2023},
isbn = {9798400701085},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3581783.3612371},
doi = {10.1145/3581783.3612371},
abstract = {Advertisement videos (ads) play an integral part in the domain of Internet e-commerce, as they amplify the reach of particular products to a broad audience or can serve as a medium to raise awareness about specific issues through concise narrative structures. The narrative structures of advertisements involve several elements like reasoning about the broad content (topic and the underlying message) and examining fine-grained details involving the transition of perceived tone due to the sequence of events and interaction among characters. In this work, to facilitate the understanding of advertisements along the three dimensions of topic categorization, perceived tone transition, and social message detection, we introduce a multimodal multilingual benchmark called MM-AU comprised of 8.4 K videos (147hrs) curated from multiple web-based sources. We explore multiple zero-shot reasoning baselines through the application of large language models on the ads transcripts. Further, we demonstrate that leveraging signals from multiple modalities, including audio, video, and text, in multimodal transformer-based supervised models leads to improved performance compared to unimodal approaches.},
booktitle = {Proceedings of the 31st ACM International Conference on Multimedia},
pages = {86--95},
numpages = {10},
keywords = {advertisements, multimodal learning, media understanding},
location = {Ottawa ON, Canada},
link = {https://dl.acm.org/doi/pdf/10.1145/3581783.3612371},
series = {MM '23}
}
