Multimodal representation using segment-level autoencoders

Introduction

Video advertisements (ads) or TV commercials have become an indispensable tool for marketing. Media advertising spending in the United States for the year 2017 was about 206 billion USD. Companies not only invest heavily in advertising, but several companies also generate revenue from ads. For example, the ad revenue for Alphabet, Inc. rose from about 43 to 95 billion USD between 2012 and 2017.

Considering the sheer number of ads being produced, it has become crucial to develop tools for scalable and automatic analysis of ads. Automatic analysis of ads poses an interesting problem for learning multimodal representations. A promising direction of research is the development of deep neural network autoencoders to obtain inter-modal and intra-modal representations.
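
As a rough illustration of this idea, the sketch below shows per-modality autoencoders whose bottlenecks give intra-modal representations, and a joint autoencoder over concatenated audio and video features whose bottleneck gives an inter-modal representation. This is a minimal example, not the architecture used in this work; all layer sizes and feature dimensions are placeholders.

```python
# Minimal sketch, not the architecture used in this work: per-modality
# autoencoders give intra-modal codes, and a joint autoencoder over the
# concatenated audio+video input gives an inter-modal code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Autoencoder(nn.Module):
    def __init__(self, input_dim, code_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, code_dim), nn.ReLU())
        self.decoder = nn.Linear(code_dim, input_dim)

    def forward(self, x):
        code = self.encoder(x)              # bottleneck = learned representation
        return self.decoder(code), code

audio_dim, video_dim, code_dim = 128, 512, 64
audio_ae = Autoencoder(audio_dim, code_dim)              # intra-modal (audio)
video_ae = Autoencoder(video_dim, code_dim)              # intra-modal (video)
joint_ae = Autoencoder(audio_dim + video_dim, code_dim)  # inter-modal (joint)

audio = torch.randn(8, audio_dim)   # a batch of audio segment features
video = torch.randn(8, video_dim)   # time-aligned video segment features
both = torch.cat([audio, video], dim=1)

recon_a, z_a = audio_ae(audio)
recon_v, z_v = video_ae(video)
recon_j, z_j = joint_ae(both)

# Unsupervised training objective: reconstruction error of each autoencoder.
loss = (F.mse_loss(recon_a, audio)
        + F.mse_loss(recon_v, video)
        + F.mse_loss(recon_j, both))
```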

Proposed Model

In this work, we propose a system to obtain segment-level unimodal and joint representations. These features are concatenated, and then averaged across the duration of an ad to obtain a single multimodal representation. The autoencoders are trained using segments generated by time-aligning frames between the audio and video modalities with forward and backward context.
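
The sketch below illustrates the two steps described above under stated assumptions: segments are formed by pairing time-aligned audio and video frames together with a few frames of forward and backward context, and the segment-level unimodal and joint codes are concatenated and averaged over time to yield one multimodal vector per ad. Alignment by frame index, the context width, and all dimensions are illustrative choices, not the exact settings used here.

```python
# Hedged sketch of segment construction and ad-level pooling; alignment by
# frame index, the context width, and all dimensions are assumptions.
import numpy as np

def make_segments(audio_frames, video_frames, context=2):
    """Pair time-aligned audio/video frames, each stacked with `context`
    frames of backward and forward context."""
    n = min(len(audio_frames), len(video_frames))   # crude alignment by index
    segments = []
    for t in range(context, n - context):
        a = np.concatenate(audio_frames[t - context:t + context + 1])
        v = np.concatenate(video_frames[t - context:t + context + 1])
        segments.append((a, v))
    return segments

def ad_level_representation(z_audio, z_video, z_joint):
    """Concatenate the segment-level codes and average across the ad's
    duration to obtain a single multimodal vector."""
    per_segment = np.concatenate([z_audio, z_video, z_joint], axis=1)
    return per_segment.mean(axis=0)

# Toy example: 100 audio frames (dim 40) and 100 video frames (dim 512).
audio_frames = [np.random.randn(40) for _ in range(100)]
video_frames = [np.random.randn(512) for _ in range(100)]
segments = make_segments(audio_frames, video_frames, context=2)

# With hypothetical 64-dimensional codes from each segment-level autoencoder:
z_a, z_v, z_j = (np.random.randn(len(segments), 64) for _ in range(3))
print(ad_level_representation(z_a, z_v, z_j).shape)   # -> (192,)
```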

To assess the multimodal representations, we consider the tasks of classifying an ad as funny or exciting on a publicly available dataset of 2,720 ads. For this purpose, we train the segment-level autoencoders on a larger, unlabeled dataset of 9,740 ads, agnostic of the test set.
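
A minimal sketch of this evaluation protocol is shown below, assuming the frozen autoencoders have already produced one ad-level multimodal vector per labeled ad; the choice of classifier (logistic regression) and the cross-validation setup are illustrative, not necessarily what is used in this work.

```python
# Illustrative evaluation sketch: train a simple binary classifier per task
# on the ad-level multimodal representations. The features and labels below
# are random placeholders standing in for the real extracted data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

n_ads, feat_dim = 2720, 192
X = np.random.randn(n_ads, feat_dim)            # ad-level multimodal vectors
y_funny = np.random.randint(0, 2, size=n_ads)   # binary "funny" labels

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y_funny, cv=5, scoring="accuracy")
print("funny-vs-not accuracy: %.3f" % scores.mean())
```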

Our experiments show that: 1) the multimodal representations outperform joint and unimodal representations, 2) the different representations we learn are complementary to each other, and 3) the segment-level multimodal representations perform better than classical autoencoders and cross-modal representations within the context of the two classification tasks. We obtain an improvement of about 5% in classification accuracy compared to a competitive baseline.