Establishing Ground Truth in Human Behavior Research

19 June 2017

By Brandon Booth

Welcome to the very first SAIL lab blog post!


Research in human behavior is often confounded by "human factors." This is a fancy way of saying that getting quality information out of people is hard (NP hard?!). Let's say we are interested in teaching a machine to predict how engaging a video game is at different moments in time. Regardless of which machine learning approach we use, we are going to need several examples of engaging and non-engaging game play. How do we measure this?

Questions like this are at the root of studies in human behavior because they cannot be answered without help from people. At some point, we need actual humans to provide an assessment of the level of engagement. There are several ways to collect this information: we could ask the players themselves to report how engaged they feel (self-report), ask observers to judge engagement as they watch, or have annotators label recordings of the game play after the fact.

Each of these approaches introduces bias and warrants a whole separate post to discuss the types of annotation artifacts produced. Today we will focus just on the mechanics of these assessments. Namely, how do we obtain information about engagement (or any other mental construct) from people?

Assessment via Ratings versus Rankings

One type of assessment asks the subject/observer/annotator to rate the construct of interest on a numeric scale. Let's take a fun example different from our video game player scenario:

[Image: three goofy pictures of Einstein]

On a scale from 1 to 10, rate how goofy Einstein looks in each of these pictures. Easy? Not for me! Doesn't the first one look more like laughter than being goofy? How do you rate this on a goofiness scale? The point is, this is tricky.
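To make the difficulty concrete, here is a minimal Python sketch of what rating data might look like and how scale ambiguity shows up as disagreement. The annotator counts and the ratings themselves are made up purely for illustration.

```python
import numpy as np

# Hypothetical 1-10 "goofiness" ratings from five annotators for the
# three Einstein pictures. The numbers are made up for illustration.
ratings = np.array([
    [7, 4, 9, 6, 8],   # picture 1 (the laugh)
    [5, 5, 6, 4, 5],   # picture 2
    [9, 7, 10, 8, 9],  # picture 3
])

# Per-picture mean and spread: a large standard deviation suggests the
# annotators do not share a common interpretation of the rating scale.
print("means:   ", ratings.mean(axis=1))
print("std devs:", ratings.std(axis=1))
```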

Now try looking at the first two images on the left and picking the goofier one. Much easier, right? This is just one simple example, but it brings us to an important point:

Ranking is easier than rating

This observation is far from new to psychologists but is finding its way into the affective computing and behavior signal processing research communities [metallinou2013annotation; yannakakis2015ratings; Yannakakis2011].

Why is this important? If we ask people to compare items rather than assign them absolute values, we generally get more consistent and accurate annotations. Of course, asking for comparison annotations comes with its own set of problems...
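Before getting to those problems, here is a toy Python sketch of how comparison annotations can be turned into a ranking. The judgments are made up, and simple win-counting stands in for more principled models such as Bradley-Terry.

```python
# Hypothetical pairwise judgments: (winner, loser) means an annotator
# found the first picture goofier than the second.
comparisons = [
    ("pic3", "pic1"), ("pic3", "pic2"), ("pic1", "pic2"),
    ("pic3", "pic1"), ("pic1", "pic2"), ("pic3", "pic2"),
]

items = ["pic1", "pic2", "pic3"]

# Count how often each picture wins its comparisons and sort by that score.
# This simple win count is the easiest way to recover an ordering;
# probabilistic models (e.g. Bradley-Terry) handle noisy judgments better.
wins = {item: 0 for item in items}
for winner, _ in comparisons:
    wins[winner] += 1

ranking = sorted(items, key=lambda item: wins[item], reverse=True)
print(ranking)  # ['pic3', 'pic1', 'pic2']
```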

Triplet comparisons

Drawing comparisons between pairs of images is easy when the scale has a unique and intuitive interpretation, but difficult when there is any ambiguity.

If you look at the first and last images in our Einstein example, you may decide one is goofier than the other or you might have trouble deciding whether the laugh is goofier than the smile. In this case different evaluators may provide different answers because there is ambiguity in the interpretation of "goofiness."

Let's try another approach: Which of the middle or right images is more similar to the one on the left?

I bet most of us would agree the one on the right is more similar. This type of similarity measurement (on 3-tuples or triplets of images) seems much more natural in cases where the assessment scale (goofiness) is ambiguous. Moreover, the results can still be used to establish an ordering of all images along the goofiness scale.
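To sketch what this looks like as data, here is a toy Python example (the triplets and positions are made up) that encodes each judgment as an (anchor, closer, farther) constraint and checks a candidate placement of the pictures on the goofiness axis against those constraints.

```python
# Hypothetical triplet judgments: (anchor, closer, farther) means annotators
# found `closer` more similar to `anchor` than `farther` is.
triplets = [
    ("pic1", "pic3", "pic2"),
    ("pic3", "pic1", "pic2"),
    ("pic2", "pic1", "pic3"),
]

def count_violations(positions, triplets):
    """Count the triplet constraints violated by a candidate 1-D placement."""
    violated = 0
    for anchor, closer, farther in triplets:
        if abs(positions[anchor] - positions[closer]) >= abs(positions[anchor] - positions[farther]):
            violated += 1
    return violated

# A made-up placement of the three pictures along the goofiness axis.
candidate = {"pic2": 0.0, "pic1": 0.6, "pic3": 1.0}
print(count_violations(candidate, triplets))  # 0 -> consistent with every judgment
```

A method for recovering such a placement essentially searches for positions (in one or more dimensions) that violate as few of these constraints as possible.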

Next up...

The broader family of mathematical techniques used to sort objects (like images) given these triplet comparisons is called Ordinal Embedding. We will explore it in detail in a future post.