We present a framework for estimating race from faces in movie data. Please refer to a detailed survey “Learning race from face: A survey”, Fu S. et. al., 2014 for a discussion on studies on recognizing race from faces. The definitions and taxonomy of race from a computer vision perspective are borrowed from this survey. The list of sources we used can be found here.

Table of contents

  1. Data preparation
  2. Sample size
  3. Model architecture
  4. Performance evaluation
    i. Confusion matrix and accuracy
    ii. ROC curves
  5. What is the network learning?

Data and model architecture

We compiled face data across opensource databases which have clear definitions of labeling races and are consistent with our nomenclature. Additionally, we annotated race for IMDb identities as described in Ramakrishna et. al., 2017, ACL. Sources which required institutional EULA agreements were excluded due to the processing delays involved.

Data preparation

  • All face images were resized to 100x100 and were converted to grayscale. Since a good number of images in our database were grayscale images, we converted all images to grayscale images
  • Skin color has been shown to be one of the least important factors in race prediction (See survey)
  • All images were aligned to achieve in-plane rotation to align landmarks of eyes and nose
  • Left-right flipping of the images was performed for data augmentation
  • Test images for evaluation were chosen to have no overlap of the person’s identity with the training data as well as to be variable with respect to pose, illumination, background and occlusions
  • Due to lack of data from the nativeamerica/pacificislander class, we only considered the other five classes in our prediction model
  • Summary of the face database
race num_faces num_images
eastasian 14472 28133
caucasian 267478 529741
latino/hispanic 10912 21374
asian-indian 28132 55473
african/african-american 22646 44627
nativeamerican/pacificislander 580 1069
TOTALs 344220 680417

Average faces for each race

Model architecture

A simple modified VGG-like architecture was adopted for a 5-class race classification model. The architecture is as shown below.
racenet

Performance Evaluation

  • Class-wise system performance is tabulated below. Overall system performance accuracy was 82.6% for a training setup of mini-batch size 32, trained for 50 epochs.
Race african asianindian caucasian eastasian latino
accuracy (%) 83.2 85.9 93.2 87.4 65.6

Confusion matrix

  • Note that the model is most confused between latino and caucasian faces which is consistent with previous work as well the structural similarity of the faces between these two races.
    conf mat
  • Receiver operating characteristics (ROC) curves for race prediction
    roc

What is the network learning

We were inspired by visualizing filter activations in a CNN as described here and extended the concept to identify the input (or distorted input) that minimizes the output loss for each class. In this, we initialize the input with an average face and minimize the categorical cross entropy loss for each class and visualize the changes to the input that achieves this minimization. The resulting modified inputs are shown below.
Not surprisingly, the modified inputs resemble the per-class average faces and are quite distinct. The results were similar when the input was initialized by average face from a specific class, rather than across all images.

under the hood