MovieCLIP: Visual Scene Recognition in Movies

Digbalay Bose¹

Rajat Hebbar¹

Krishna Somandepalli²

Haoyang Zhang¹

Yin Cui²

Kree-Cole McLaughlin²

Huisheng Wang²

Shrikanth Narayanan¹

¹Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California

²Google

WACV 2023

[Paper]

[Supplementary]

[GitHub]

Overview diagram highlighting the challenges associated with visual scene recognition in movies: (a) domain mismatch between natural scene images and movie frames for the living room class; (b) movie-centric visual scene classes, such as prison and control room, that are absent from existing taxonomies; (c) change in visual scene between shots in the same movie clip.

Abstract

Longform media such as movies have complex narrative structures, with events spanning a rich variety of ambient visual scenes. Domain specific challenges associated with visual scenes in movies include transitions, person coverage, and a wide array of real-life and fictional scenarios. Existing visual scene datasets in movies have limited taxonomies and don't consider the visual scene transition within movie clips. In this work, we address the problem of visual scene recognition in movies by first automatically curating a new and extensive movie-centric taxonomy of 179 scene labels derived from movie scripts and auxiliary web-based video datasets. Instead of manual annotations which can be expensive, we use CLIP to weakly label 1.12 million shots from 32K movie clips based on our proposed taxonomy. We provide baseline visual models trained on the weakly labeled dataset called MovieCLIP and evaluate them on an independent dataset verified by human raters. We show that leveraging features from models pretrained on MovieCLIP benefits downstream tasks such as multi-label scene and genre classification of web videos and movie trailers.

Visual scene taxonomy

The t-SNE plot of the 179 scene classes is shown in the above figure, after projecting the 384-dimensional representations to 2 dimensions. Groups of visual scene classes that are semantically close together are marked by circles. Examples include scene classes associated with:

Study, like classroom, school and kindergarten (dark brown circle)
Natural landforms, like desert, mountain and valley (brown circle)
Water bodies, like lake, river, pond, pool and waterfall (dark blue circle)

The share of the different label sources, i.e., Movie Sluglines, HVU, Common labels and Human expert, is shown in the following figure:

MovieCLIP dataset

Shot statistics

Shots were extracted using PySceneDetect from the movie clips available in the Condensed Movies [1] dataset. The shot statistics are listed as follows:

CLIP based visual scene labeling

An overview of the CLIP-based visual scene labeling pipeline is shown in the above figure. For tagging individual frames using CLIP, we use a background-specific prompt: "A photo of a {label}, a type of background location", where label is a visual scene class from our proposed taxonomy.

If a shot contains $T$ frames, we use CLIP's visual encoder to extract frame-wise visual embeddings $v_{t}\ (t=1,...,T)$. For each of the $L$ scene labels in our taxonomy, we use CLIP's text encoder to extract embeddings $e_{l}\ (l=1,2,...,L)$ for the background-specific prompts. We use the label-wise (prompt-specific) text embeddings and the frame-wise visual embeddings to obtain a similarity score matrix $S$, whose entries $S_{lt}$ are the cosine similarities between $e_{l}$ and $v_{t}$:

\begin{equation}\label{simmatrixscore} S_{lt}=\frac{e_{l}^{T}v_{t}}{\lVert e_{l} \rVert_{2} \lVert v_{t} \rVert_{2}} \end{equation}

Since the visual content within a shot remains fairly unchanged, we compute an aggregate shot-specific score for each scene label by temporal average pooling over the similarity matrix $S$. The shot-specific score for the $l$th visual scene label, called $\texttt{CLIPSceneScore}_{\texttt{l}}$, is computed as follows:

\begin{equation} \texttt{CLIPSceneScore}_{\texttt{l}}=\frac{\sum_{t=1}^{T}S_{lt}}{T} \label{score pooling} \end{equation}
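The scoring described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`build_prompts`, `cosine_similarity`, `clip_scene_scores`) are hypothetical, and the toy 2-dimensional vectors stand in for the embeddings that CLIP's text and visual encoders would produce.

```python
import math

# Background-specific prompt template quoted in the text above.
PROMPT_TEMPLATE = "A photo of a {label}, a type of background location"


def build_prompts(labels):
    """Fill the prompt template for every scene label in the taxonomy."""
    return [PROMPT_TEMPLATE.format(label=label) for label in labels]


def cosine_similarity(u, v):
    """Cosine similarity between two vectors, i.e. one entry S_lt."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_scene_scores(text_embeddings, frame_embeddings):
    """Average S_lt over the T frames of a shot, one score per label."""
    num_frames = len(frame_embeddings)
    scores = []
    for e_l in text_embeddings:
        row = [cosine_similarity(e_l, v_t) for v_t in frame_embeddings]
        scores.append(sum(row) / num_frames)
    return scores


# Toy stand-ins (assumption): L = 2 prompt embeddings, T = 2 frame embeddings.
prompts = build_prompts(["kitchen", "prison"])
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
frame_embeddings = [[1.0, 0.0], [1.0, 1.0]]
scores = clip_scene_scores(text_embeddings, frame_embeddings)
```

In practice, the text embeddings would be computed once for all prompts and reused across shots, since only the frame embeddings change from shot to shot.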

Prompt design choices

Examples of various prompt templates and the associated CLIPSceneScores for the top-k visual scene labels. Here, k=5.

Based on the above templates, prompts associated with generic background information tend to associate visual scene labels with movie shots at higher confidence than people-centric prompts. Further, including the contextual phrase "type of background location" tends to yield higher $\texttt{CLIPSceneScore}$ values for the top-1 visual scene labels than the phrase "a type of location". When people-centric prompts are used, the CLIP-based labeling scheme can result in incorrect associations, like interrogation room in Figure (a) and cockpit in Figure (c).
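Ranking labels by score to obtain the top-k list for a shot can be sketched as follows. The helper name `top_k_labels` is hypothetical, and the scores in the example are placeholder values chosen only to exercise the function; they are not results from the dataset.

```python
def top_k_labels(scores_by_label, k=5):
    """Return the k (label, score) pairs with the highest scores, best first."""
    ranked = sorted(scores_by_label.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Placeholder scores (assumption), keyed by scene label.
example_scores = {
    "kitchen": 0.31,
    "cockpit": 0.12,
    "living room": 0.27,
    "prison": 0.05,
    "office": 0.22,
    "desert": 0.02,
}
top5 = top_k_labels(example_scores, k=5)
top1_label, top1_score = top5[0]
```

Comparing the top-1 score produced by different prompt templates for the same shot is one way to reproduce the kind of comparison shown in the figure.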

Analysis of CLIP Labeling

Sample frames from the movie shots labeled by CLIP with high confidence ($\texttt{CLIPSceneScore} \geq 0.6$). Labels are shown in yellow.

Sample frames from the movie shots labeled by CLIP with low confidence. Labels are shown in yellow.

The above figure showcases the diversity within various visual scene classes in the curated taxonomy. 15 sample classes from our scene taxonomy are shown.

Dataset Access

Please refer to the DATASET Preparation section for details on how to access the dataset. For any questions, please contact dbose@usc.edu directly.

References

[1] M. Bain, A. Nagrani, A. Brown, A. Zisserman. Condensed Movies: Story Based Retrieval with Contextual Embeddings. ACCV 2020.

Acknowledgements

This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.