Robust multiview face clustering
Projects | | Links: code | paper
Self supervised mining of face images for robust face clustering in movies.
What is the need for robust face clustering in videos?
Robust face clustering is a key step towards computational understanding of visual character portrayals in media i.e., automatically extracting "who" appears "where" statistics. A first step toward developing such tools is the ability to automatically identify the characters in the visual modality i.e., the "who". This is a difficult task because characters are often portrayed with varying degrees of complexity with respect to appearance and design in different media formsWhat are the challenges in video content?
Face recognition and clustering in videos in the absence of domain-matched training data remains a challenging problem. In TV shows and movies, we need to robustly identify the characters in the presence of multiple "visual distractors" such as changes in appearance, background imagery, facial expression, size (resolution), view points (pose), illumination, partial detection (occlusion), and in some cases, age.Our idea to mine weakly labeled face tracks
We use self-supervised methods to harvest movie face tracks by exploiting temporally co-occurring faces in a video. Faces that belong to the same track are segmented into ‘tracklets’ that are used as positive samples. Faces from co-occurring tracks are used as hard-negatives.
Data details:
The table below summarizes the face tracks extracted per our method. We use this dataset for adapting pre-trained face embeddings for movies.
Data/statistic | Count |
---|---|
No. Hollywood movies | 240 |
Total no. tracks | 335,845 (~1390 face tracks per movie) |
Total no. co-occurring tracks | 169,201 (~726 face tracks per movie) |
Total no. faces | 10.2 million faces |
Faces per track (mean, std.) | 63.3, 9.1 |
Benchmarking dataset
We augmented current video datasets used for benchmarking with more racially diverse datasets.
A few exemplars from the Hidden Figures (2014) movie is shown below:
The summary of the complete benchmarking dataset we created is shown below:
movie_name | Mean faces/track | No. tracks | No. characters | cluster density range |
---|---|---|---|---|
Hidden Figures (2016) | 59.82 | 1407 | 24 | 0.57 - 20.11 |
About Last Night (2014) | 38.17 | 1565 | 10 | 1.02 - 31.37 |
Maleficent (2014) | 64.07 | 694 | 10 | 2.02 - 36.60 |
Dumb and Dumber To (2014) | 70.77 | 1457 | 10 | 0.82 - 38.23 |
Notting Hill (1999) | 79.15 | 1435 | 12 | 1.25 - 40.77 |
We showed that adapting pre-trained face embeddings using the weakly labeled faces improve over face verification and face clustering perfrmance in movies. Please see our preprint paper for more details on experiments and results.