Why video ads?
Video advertisements (ads) or TV commercials have become an indispensable tool for marketing. Media advertising spending in the United States in 2017 was about 206 billion USD. Companies not only invest heavily in advertising; several also generate revenue from ads. For example, the ad revenue of Alphabet, Inc. rose from about 43 billion to 95 billion USD between 2012 and 2017. Given the sheer number of ads being produced, it has become crucial to develop tools for scalable, automatic analysis of ads. Automatic analysis of ads also poses an interesting problem for learning multimodal representations. A promising direction of research is the development of deep neural network autoencoders to obtain inter-modal and intra-modal representations.
Why crossmodal learning?
Given large amounts of unlabeled data, we can exploit the co-occurring video and audio streams, each with its own feature space, to learn audio-to-video and video-to-audio embeddings that capture crossmodal relationships.
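To make the audio-to-video direction concrete, the following is a minimal sketch rather than the system described here. It assumes a single tanh hidden layer, a mean-squared-error loss, and full-batch gradient descent, none of which are specified in the text. The function name `train_crossmodal_ae`, the dimensions, and the random arrays standing in for audio and video features are all hypothetical.

```python
import numpy as np

def train_crossmodal_ae(src, tgt, hidden_dim=8, lr=0.1, epochs=500, seed=0):
    """Train a one-hidden-layer network that predicts `tgt` features from
    `src` features; the hidden layer serves as the crossmodal embedding.

    src: (num_segments, src_dim), tgt: (num_segments, tgt_dim).
    Returns an `encode` function and the per-epoch loss history.
    """
    rng = np.random.default_rng(seed)
    w_enc = rng.normal(scale=0.1, size=(src.shape[1], hidden_dim))
    b_enc = np.zeros(hidden_dim)
    w_dec = rng.normal(scale=0.1, size=(hidden_dim, tgt.shape[1]))
    b_dec = np.zeros(tgt.shape[1])
    losses = []
    for _ in range(epochs):
        # Forward pass: source modality -> embedding -> target modality.
        hidden = np.tanh(src @ w_enc + b_enc)
        pred = hidden @ w_dec + b_dec
        err = pred - tgt
        losses.append(float(np.mean(err ** 2)))
        # Backward pass for the mean-squared-error loss.
        g_pred = 2.0 * err / err.size
        g_hidden = g_pred @ w_dec.T
        g_pre = g_hidden * (1.0 - hidden ** 2)  # derivative of tanh
        w_dec -= lr * (hidden.T @ g_pred)
        b_dec -= lr * g_pred.sum(axis=0)
        w_enc -= lr * (src.T @ g_pre)
        b_enc -= lr * g_pre.sum(axis=0)

    def encode(x):
        return np.tanh(x @ w_enc + b_enc)

    return encode, losses

# Synthetic stand-ins for time-aligned audio and video segment features.
rng = np.random.default_rng(1)
audio = rng.normal(size=(200, 5))
video = np.tanh(audio @ rng.normal(size=(5, 7)))

encode, losses = train_crossmodal_ae(audio, video)
audio_to_video_embedding = encode(audio)  # shape (200, 8)
```

Swapping the two arguments gives the video-to-audio direction. No labels are involved, since the co-occurring stream is the training target.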
Proposed Method: Crossmodal Autoencoders
In this work, we propose a system to obtain segment-level unimodal and joint representations. These features are concatenated, and then averaged across the duration of an ad to obtain a single multimodal representation. The autoencoders are trained using segments generated by time-aligning frames between the audio and video modalities with forward and backward context.
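The segment construction and ad-level pooling described above can be sketched as follows. This is an illustration under stated assumptions, not the exact pipeline: the context width, the edge padding at the start and end of an ad, and the function names `make_segments` and `ad_level_representation` are all hypothetical, and random arrays stand in for the autoencoder outputs.

```python
import numpy as np

def make_segments(frames, context=2):
    """Stack each frame with `context` frames of backward and forward context.

    frames: (num_frames, feat_dim), already time-aligned across modalities.
    Edges are padded by repeating the first/last frame (an assumption).
    Returns an array of shape (num_frames, (2 * context + 1) * feat_dim).
    """
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    width = 2 * context + 1
    windows = [padded[i:i + width].reshape(-1) for i in range(frames.shape[0])]
    return np.stack(windows)

def ad_level_representation(segment_features):
    """Concatenate segment-level features from several systems along the
    feature axis, then average over the duration of the ad."""
    joint = np.concatenate(segment_features, axis=1)
    return joint.mean(axis=0)

# Hypothetical time-aligned frame features for one ad (10 frames each).
rng = np.random.default_rng(0)
audio_frames = rng.normal(size=(10, 4))
video_frames = rng.normal(size=(10, 6))

audio_segments = make_segments(audio_frames)  # shape (10, 20)
video_segments = make_segments(video_frames)  # shape (10, 30)

# In the full system, each segment would pass through the trained
# autoencoders; here random arrays stand in for their outputs.
unimodal = rng.normal(size=(10, 8))
crossmodal = rng.normal(size=(10, 8))
ad_vector = ad_level_representation([unimodal, crossmodal])  # shape (16,)
```

Averaging over time gives a fixed-length vector regardless of ad duration, so ads of different lengths can be passed to the same classifier.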
To assess the multimodal representations, we consider the tasks of classifying attributes of an ad, such as funny, exciting, sentiment, and vertical (topic), in a publicly available dataset of 2,720 ads.
For this purpose, we train the segment-level autoencoders on a larger, unlabeled dataset of 9,740 ads, independently of the test set.
As shown above, the multimodal AE features capture complementary information from the audio and video modalities, which makes them effective for content analysis tasks such as those described above.
To understand why our embeddings, learnt in an unsupervised fashion, perform well, we examined what the unimodal, audio-to-video, and video-to-audio representations capture by computing the correlation of the segment-level features across the different systems. The correlation pattern for one video ad is shown below: