Knowledge Transfer and Boosting Approach for Affect Prediction in Movies

Table of Contents

  1. Introduction
  2. Dataset
  3. Features
  4. Methodology
    1. Baseline
    2. Knowledge Transfer
    3. Gradient Boosting
  5. Results

Introduction

We intend to predict continuous valence and arousal affect labels for movies in the LIRIS-ACCEDE dataset, released as part of MediaEval 2016. Since only a limited amount of in-domain data is available, we first learn models on a larger dataset of short video clips and adapt them to make predictions on the data of interest (Knowledge Transfer, or KT). Gradient Boosting (GB) further updates these predictions based on models learnt on in-domain data. We achieve Concordance Correlation Coefficient (CCC) values of 0.13 and 0.27 for valence and arousal prediction, against baselines of 0.12 and 0.11 respectively.
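For reference, the CCC used throughout can be computed as follows; this is the standard definition, sketched here with numpy.

```python
import numpy as np

def ccc(pred, gold):
    """Concordance Correlation Coefficient between two 1-D series."""
    mp, mg = pred.mean(), gold.mean()
    vp, vg = pred.var(), gold.var()
    cov = ((pred - mp) * (gold - mg)).mean()  # biased covariance, consistent with var()
    return 2.0 * cov / (vp + vg + (mp - mg) ** 2)
```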

Dataset

The LIRIS-ACCEDE dataset consists of 30 movies varying in length from 3 to 28 minutes. Ten annotators rate each movie for the affective dimensions of arousal and valence at a rate of 1 sample per second, on a scale from -1 to 1. The final ratings are obtained as the frame-wise mean of the annotations across annotators.
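As a small illustration of how the final labels are derived (array shapes here are hypothetical):

```python
import numpy as np

# Hypothetical layout: 10 annotators, one rating per second in [-1, 1],
# e.g. a 10-minute movie gives 600 per-second ratings per annotator.
annotations = np.random.uniform(-1, 1, size=(10, 600))

# Final per-second label: mean across annotators for each one-second frame.
labels = annotations.mean(axis=0)  # shape (600,)
```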

Features

We use an assembly of audio-visual features, shown in Table 1 below. Since the features and annotations have different frame rates, we synchronise them by computing statistics over a temporal window that is shifted by 1 second, thereby obtaining one feature vector per second. The statistics computed are: mean, median, standard deviation, kurtosis, lower quartile, upper quartile, minimum, maximum and range. We choose the length of the temporal window by maximising the mutual information between the computed features and the affective dimension labels of the training set; a sketch of this windowing step follows Table 1.

| Source | Features | FPS | Toolbox |
|--------|----------|-----|---------|
| Visual | Luminance, Intensity, Optical Flow | 30 | OpenCV |
| Audio | MFCC, Voicing Probability, Harmonic-to-Noise Ratio, Zero Crossing Rate, Fundamental Frequency, Log Energy, Chroma features (12 semitones) | 100 | openSMILE |

Table 1: Audio-visual features
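The windowing and window-length selection can be sketched as follows, assuming the per-frame features are grouped per second; the candidate window lengths and helper names are illustrative, not the exact values used.

```python
import numpy as np
from scipy.stats import kurtosis
from sklearn.feature_selection import mutual_info_regression

def window_stats(frames):
    """The nine statistics computed over one temporal window of frame-level features."""
    q25, q75 = np.percentile(frames, [25, 75], axis=0)
    return np.concatenate([
        frames.mean(axis=0), np.median(frames, axis=0), frames.std(axis=0),
        kurtosis(frames, axis=0), q25, q75,
        frames.min(axis=0), frames.max(axis=0),
        frames.max(axis=0) - frames.min(axis=0),
    ])

def featurise(frames_per_sec, win_len):
    """Shift a win_len-second window by 1 s to get one feature vector per second."""
    return np.array([window_stats(np.concatenate(frames_per_sec[t:t + win_len]))
                     for t in range(len(frames_per_sec) - win_len + 1)])

def best_window(frames_per_sec, labels, candidates=(2, 4, 8, 16)):
    """Pick the window length maximising average MI with the affect labels."""
    scores = {}
    for w in candidates:
        X = featurise(frames_per_sec, w)
        y = labels[w - 1:]  # align each label with the end of its window
        scores[w] = mutual_info_regression(X, y).mean()
    return max(scores, key=scores.get)
```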

Methodology

Baseline

We initially set up a baseline using a set of three regression schemes: linear, ridge and neural network, each with and without feature selection. During feature selection, we remove features whose absolute correlation coefficient with the labels falls below a certain threshold (tuned on the development set). All experiments use a leave-one-movie-out cross-validation strategy, with 25 movies for training, 4 for validation and 1 for testing; a sketch of this setup is shown below. The results are reported in Table 2.
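A minimal sketch of the correlation-thresholded feature selection inside the leave-one-movie-out loop; the threshold default is a placeholder (it is tuned on the validation movies in the actual setup), linear regression stands in for the three schemes, and `ccc` is the helper defined earlier.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def select_features(X, y, threshold):
    """Boolean mask: keep features whose |Pearson r| with the labels meets the threshold."""
    corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.abs(corr) >= threshold

def leave_one_movie_out(movie_X, movie_y, threshold=0.1):
    """movie_X/movie_y: lists with one (seconds x features) array / label vector per movie."""
    scores = []
    for test in range(len(movie_X)):
        train = [m for m in range(len(movie_X)) if m != test]
        X_tr = np.vstack([movie_X[m] for m in train])
        y_tr = np.concatenate([movie_y[m] for m in train])
        mask = select_features(X_tr, y_tr, threshold)
        model = LinearRegression().fit(X_tr[:, mask], y_tr)
        scores.append(ccc(model.predict(movie_X[test][:, mask]), movie_y[test]))
    return float(np.mean(scores))
```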

Knowledge Transfer

During KT, we train models on a different dataset, discrete LIRIS-ACCEDE, which contains 9800 short video clips with an average duration of 10 seconds. Each clip carries a single valence and arousal label ranging from 1 to 5. We train a regression model using the same features as in Table 1, computed over the entire duration of each clip. The regression scheme is empirically chosen to be ridge regression, tuned with 3-fold cross-validation. To apply the KT model, we recompute the feature statistics for the movies, but over a window length of 10 seconds to match the clip durations in the discrete dataset. We learn a linear scaling of the KT model predictions on the training set, which is then applied to the test set for evaluation.
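The KT step might look as follows; the alpha grid and argument names are assumptions for illustration, not values from the paper.

```python
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.model_selection import GridSearchCV

def knowledge_transfer(clip_X, clip_y, movie_X_train, movie_y_train, movie_X_test):
    """Train on discrete clips, then linearly rescale predictions to the movie range.

    clip_X/clip_y: one feature vector and one label (1-5) per short clip.
    movie_X_*: movie feature statistics recomputed over 10-second windows.
    """
    # Ridge regressor chosen empirically; alpha tuned with 3-fold cross-validation.
    kt = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=3)
    kt.fit(clip_X, clip_y)

    # Linear scaling learnt on the training movies maps clip-scale outputs
    # to the [-1, 1] movie annotation scale; applied unchanged at test time.
    scale = LinearRegression().fit(kt.predict(movie_X_train).reshape(-1, 1),
                                   movie_y_train)
    return scale.predict(kt.predict(movie_X_test).reshape(-1, 1))
```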

Gradient Boosting

We use a gradient boosting approach similar to the one proposed by Gupta et al. We minimise mean squared error using linear regression on the feature statistics after introducing a temporal delay. We summarise the results of the KT and GB models in Table 2.
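A sketch of the boosting loop under stated assumptions: squared-error gradient boosting with linear-regression base learners fitted to residuals, the KT predictions serving as the initial model (as noted in the Results section), and a fixed temporal delay between features and labels; the delay, learning rate and number of rounds are placeholders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def gradient_boost(X, y, kt_pred, n_rounds=10, lr=0.1, delay=2):
    """Squared-error boosting: each learner fits the current residuals."""
    # Temporal delay (assumed >= 1): predict y[t] from features at t - delay seconds.
    X, y, kt_pred = X[:-delay], y[delay:], kt_pred[delay:]
    pred = kt_pred.copy()          # KT model acts as the first base learner
    learners = []
    for _ in range(n_rounds):
        resid = y - pred           # negative gradient of 0.5 * squared error
        h = LinearRegression().fit(X, resid)
        pred += lr * h.predict(X)
        learners.append(h)
    return learners, pred
```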

Results

| Model | Valence | Arousal |
|-------|---------|---------|
| Linear Regression | 0.06 | 0.06 |
| Linear Regression + FS | 0.07 | 0.11 |
| Ridge Regression | 0.04 | 0.03 |
| Ridge Regression + FS | 0.04 | 0.10 |
| Neural Networks | 0.12 | 0.02 |
| Knowledge Transfer | 0.13 | 0.22 |
| Gradient Boosting | 0.13 | 0.27 |

Table 2: CCC values for the baseline, KT and GB models (FS: feature selection)

From the results, we observe that the KT model on its own outperforms the baseline: despite the domain mismatch, the larger dataset is more representative, which improves performance on the data of interest. Training further with GB, using the KT model as the first base learner, improves arousal performance.