Establishing Ground Truth in Human Behavior Research

19 June 2017

By Brandon Booth

Welcome to the very first SAIL lab blog post!


Research in human behavior is often confounded by "human factors": gathering quality information from people is hard (NP-hard?!).

Say we are interested in teaching a machine to predict how engaging a video game experience is to a human player at different moments in time. Regardless of which machine learning approach we use, we are going to need several examples of engaging and non-engaging game play. How do we measure this?

Questions like this are at the root of studies in human behavior because they cannot be answered without help from people. At some point, we need actual humans to provide an assessment of the level of engagement. There are several ways to collect this information:

Each of these approaches introduces bias and warrants a whole separate post to discuss the types of annotation artifacts produced. Today we will focus just on the mechanics of these assessments. Namely, how do we obtain information about engagement, or any other mental construct, from people?

Assessment of Ratings versus Rankings

Here is an example observer assessment task (featuring you as the observer!): On a scale from 1 to 10 rate how goofy Einstein looks in each of these pictures.

Goofy pictures of Einstein

Easy? Not for me. This simple task leaves me wondering whether the first image shows laughter rather than goofiness, and, more generally, which number I should pick once I decide an image is fairly goofy. Maybe you had an easier time with it, but are you confident you would choose the same answers tomorrow?

Now try something else: compare just the first two images on the left and select the goofier one.

Much easier, right? Two things happened when we changed the question. First, you only had to make a binary decision instead of picking one of ten different values, so there were fewer options and the choice was simpler. Second, you didn't have to decide how much more or less goofy one image was than the other; you only needed to report that one was goofier. This is just one simple example, but it brings us to an important point:

Ranking is easier than rating

This observation is far from new to psychologists but is finding its way into the affective computing and behavior signal processing research communities [1][2][3].

Why is this important? If we ask people to provide results by comparing things rather than assigning values, then we usually get more accurate results.
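To make this concrete, here is a minimal sketch of how pairwise judgments could be aggregated into a ranking. The image names and annotator responses are hypothetical, and win counting is just one simple aggregation scheme (more principled models, such as Bradley-Terry, exist):

```python
from collections import Counter

def rank_from_pairs(comparisons):
    """Aggregate pairwise judgments into a ranking by win count.

    comparisons: list of (winner, loser) tuples, each recording
    which of two items an annotator judged "goofier".
    Returns the items sorted from most to least preferred.
    """
    wins = Counter(winner for winner, _ in comparisons)
    items = {item for pair in comparisons for item in pair}
    return sorted(items, key=lambda item: wins[item], reverse=True)

# Hypothetical annotator responses over three Einstein images:
judgments = [("img1", "img2"), ("img1", "img3"), ("img3", "img2")]
print(rank_from_pairs(judgments))  # ['img1', 'img3', 'img2']
```

Each annotator only ever answers the easy binary question, yet a full ordering over the images falls out of the aggregate.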

Triplet comparisons

So instead I ask you to compare images. I could ask you, for each category and each pair of images, to pick the image that belongs more strongly in that category, but you would still have to wrestle with the semantic ambiguity on every comparison.

A better approach is to remove the semantic ambiguity and handle it later. How? I can ask you to examine three images, one reference image and two candidate images, and tell me which of the two candidate images is most similar to the reference image.

Back to the Einstein example. Use the left image as a reference and tell me which of the middle and right images is most similar.

I haven't performed this experiment at scale to make any statistical claims, but I bet most people would say the right image is more similar: the smiles are more alike, his tongue is not visible, and his hair is poofy instead of swept back.

What does this accomplish? Notice I didn't ask you to make any categorical judgments about goofiness or any other category. You had to make a few pairwise comparisons, but you didn't have to manage the ambiguity, a tradeoff. This type of similarity measurement (on triplets of images) can be used to first group the images, and then categories can be assigned later to each group rather than to each individual image. Not only does this make the assessment task easier for you, it allows the images to be grouped naturally without imposing a particular point of view (e.g. goofiness), and it leads to better results.
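As a rough illustration of the idea, here is a sketch of grouping items from triplet answers. Everything here is hypothetical (the image names, the answers, and the threshold), and merging frequently-confirmed pairs with a simple union-find is a deliberately naive stand-in for the ordinal embedding methods discussed next:

```python
from collections import Counter

def group_from_triplets(triplets, items, threshold=1):
    """Group items using triplet judgments of the form
    (reference, closer, farther): 'closer' was judged more
    similar to 'reference' than 'farther' was.

    Pairs confirmed at least `threshold` times are merged
    into the same group via union-find.
    """
    votes = Counter()
    for ref, closer, _ in triplets:
        votes[frozenset((ref, closer))] += 1

    parent = {item: item for item in items}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for pair, count in votes.items():
        if count >= threshold:
            a, b = tuple(pair)
            parent[find(a)] = find(b)  # union the two groups

    groups = {}
    for item in items:
        groups.setdefault(find(item), set()).add(item)
    return list(groups.values())

# Hypothetical triplet answers for the three Einstein images:
answers = [("left", "right", "middle"), ("right", "left", "middle")]
print(group_from_triplets(answers, ["left", "middle", "right"]))
# e.g. [{'left', 'right'}, {'middle'}]
```

Once the groups emerge, a single semantic label per group (goofy, serious, and so on) replaces hundreds of ambiguous per-image ratings.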

Next up...

An exploration and discussion of the category of mathematical techniques used to sort objects (like images) given these comparisons on triplets: Ordinal Embedding.


  1. Angeliki Metallinou and Shrikanth Narayanan. Annotation and processing of continuous emotional attributes: challenges and opportunities. In 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, 1–8. IEEE, 2013.
  2. Georgios N. Yannakakis and Héctor P. Martínez. Ratings are overrated! Frontiers in ICT, 2:13, 2015.
  3. Georgios N. Yannakakis and John Hallam. Ranking vs. preference: a comparative study of self-reporting. In Affective Computing and Intelligent Interaction: 4th International Conference, 437–446. Springer, 2011.