Media Understanding Virtual Workshop III

Context & Environment

Workshop Summary

Workshop III focuses on understanding context and environment.


Keynote Speaker: Dr. Tapaswi

There is growing interest in AI in building socially intelligent robots. This requires machines to be able to read people’s emotions, motivations, and other factors that affect behavior. Towards this goal, I will introduce MovieGraphs, a dataset that provides detailed, graph-based annotations of social situations depicted in movie clips. I will show how common sense can emerge by analyzing various social aspects of movies, and also describe applications including querying videos with graphs, ordered interaction understanding, and reason prediction. Finally, I will present our recent work on analyzing whether machines can recognize and benefit from the joint interplay of interactions and relationships between movie characters.

Lightning Talk Abstracts

Title: Do Datasets Have Politics? Disciplinary Values in Computer Vision Dataset Development. Data is a crucial component of machine learning. The field is reliant on data to train, validate, and test models. With increased technical capabilities, machine learning research has boomed in both academic and industry settings, and one major focus has been on computer vision. Computer vision is a popular domain of machine learning increasingly pertinent to real-world applications, from facial recognition in policing to object detection for autonomous vehicles. Given computer vision’s propensity to shape machine learning research and impact human life, we seek to understand disciplinary practices around dataset documentation: how data is collected, curated, annotated, and packaged into datasets for computer vision researchers and practitioners to use for model tuning and development. Specifically, we examine what dataset documentation communicates about the underlying values of vision data and the larger practices and goals of computer vision as a field. To conduct this study, we collected a corpus of about 500 computer vision datasets, from which we sampled 114 dataset publications across different vision tasks. Through both a structured and a thematic content analysis, we document a number of values around accepted data practices, what makes desirable data, and the treatment of humans in the dataset construction process. We discuss how computer vision dataset authors value efficiency at the expense of care; universality at the expense of contextuality; impartiality at the expense of positionality; and model work at the expense of data work. Many of the silenced values we identify sit in opposition to social computing practices. We conclude with suggestions on how to better incorporate silenced values into the dataset creation and curation process.
Title: Feeling it in our fingers; how technology impacts our embodied experience of climate change. What climate change really is and how it operates is usually not completely understood by the general population, leading to misunderstanding and confusion about how human actions affect it. James Hansen (NASA) theorized that this is because it has, until recently, been more of an “abstract” concept involving invisible gases, with only select effects felt by the vast majority of people. In recent years, the impacts have become much more embodied, as major heat waves, flooding, and wildfires, whose effects are felt in our bodies, have become more frequent around the world. Embodied experiences are often understood as being much more formative, which leads to the conclusion that climate change should be becoming less of an abstract concept for people who have experienced these events. In our modern lives, understanding the context of the environment around us usually involves technological mediation, which is often decried as adding a barrier between us and what is really experienced in our bodies. Therefore, in thinking through how technology can impact our embodied experiences of climate change, I want to explore how it both obstructs and enables more tangible experiences, leading to design practices that can embrace those experiences.
Title: STORi – contextualizing the power of storytelling in kids’ books and exploring AI to diversify characters and stories. A “cradle to career” exploration of how stories and characters in children’s books hold a unique power in shaping children’s imagination, inclinations, and future careers. Fewer than 5% of kids’ books feature characters of color in lead roles; for STEM books, the figure is less than 1%. Could the lack of diverse characters in early childhood correlate with gender gaps in the future workforce? In this talk, Komal talks about STORi, a 20% project at Google that tackles the lack of representation in books and explores multi-modal AI approaches as interventions to diversify characters and make stories more inclusive. This is applied and experimental work; we share ideas, successes with early prototypes, and future directions.
Title: EMOTIons in Context Dataset. We attempt to perceive the emotions of people on a daily basis (partners, children, colleagues, and anyone with whom we interact). Estimating a person’s emotional state is therefore essential to being more caring and to handling a situation in a more skilful and sensitive manner. Recognizing emotions requires understanding the visual scene in which a person is immersed. This talk is about the EMOTIC (EMOTIons in Context) dataset, a database of images of people in real environments, annotated with their apparent emotions. Annotations include a list of 26 emotion categories and 3 common continuous dimensions: Valence, Arousal, and Dominance. Images in the database are annotated using the Amazon Mechanical Turk (AMT) platform. The talk will describe how EMOTIC was created and the information available. This dataset can help to open new horizons in creating systems able to recognize rich information about people’s apparent emotional states.
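
As a rough illustration of the annotation structure described above (discrete categories plus three continuous dimensions per person), one record could be sketched as follows. This is an assumption-laden sketch, not the dataset's actual file format: the class name `PersonAnnotation`, the placeholder category names, and the example values and their scale are all hypothetical.

```python
from dataclasses import dataclass

# Placeholder vocabulary: the dataset defines 26 emotion categories,
# but their real names are not reproduced here (hypothetical labels).
CATEGORY_VOCAB = [f"category_{i}" for i in range(26)]


@dataclass
class PersonAnnotation:
    """Hypothetical record for one annotated person in one image."""
    categories: list   # subset of CATEGORY_VOCAB chosen by annotators
    valence: float     # continuous dimension (scale is an assumption)
    arousal: float     # continuous dimension (scale is an assumption)
    dominance: float   # continuous dimension (scale is an assumption)

    def multi_hot(self):
        """Encode the chosen categories as a 26-element 0/1 vector."""
        chosen = set(self.categories)
        return [1 if name in chosen else 0 for name in CATEGORY_VOCAB]

    def vad(self):
        """Return the three continuous dimensions as a tuple."""
        return (self.valence, self.arousal, self.dominance)


# Made-up example values, for illustration only.
example = PersonAnnotation(
    categories=["category_3", "category_7"],
    valence=6.0,
    arousal=4.0,
    dominance=5.0,
)
vector = example.multi_hot()
```

A multi-hot vector is used here because a person may be assigned several categories at once, while the continuous dimensions are kept as separate numeric targets.
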
Title: Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations. Majority voting and averaging are common approaches employed to resolve annotator disagreements and derive single ground truth labels from multiple annotations. However, annotators may systematically disagree with one another, often reflecting their individual biases and values, especially in the case of subjective tasks such as detecting affect, aggression, and hate speech. Annotator disagreements may capture important nuances in such tasks that are often ignored when aggregating annotations into a single ground truth. To address this, we investigate the efficacy of multi-annotator models. In particular, our multi-task based approach treats predicting each annotator’s judgements as a separate subtask, while sharing a common learned representation of the task. We show that this approach yields the same or better performance than aggregating labels in the data prior to training across seven different binary classification tasks. Our approach also provides a way to estimate uncertainty in predictions, which we demonstrate correlates better with annotation disagreements than traditional methods. Being able to model uncertainty is especially useful in deployment scenarios where knowing when not to make a prediction is important.
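
To make the contrast above concrete, the following minimal sketch compares majority-vote aggregation with keeping per-annotator information. It is not the authors' model: no multi-task network is trained here, the per-annotator "head" probabilities are made-up numbers, and using their variance as an uncertainty score is an assumption chosen for illustration.

```python
from collections import Counter
from statistics import pvariance


def majority_vote(labels):
    """Collapse binary labels to one label; ties resolve to 1 (arbitrary choice)."""
    counts = Counter(labels)
    return 1 if counts[1] >= counts[0] else 0


def disagreement(labels):
    """Fraction of annotators whose label differs from the majority label."""
    majority = majority_vote(labels)
    return sum(1 for label in labels if label != majority) / len(labels)


def head_uncertainty(head_probs):
    """Population variance of per-annotator predicted probabilities for one item."""
    return pvariance(head_probs)


# Hypothetical annotations: each row is an item, each column an annotator.
annotations = [
    [1, 1, 1],
    [1, 0, 1],
    [0, 0, 1],
]

# Hypothetical outputs of three annotator-specific prediction heads.
head_outputs = [
    [0.9, 0.9, 0.9],
    [0.8, 0.2, 0.8],
    [0.1, 0.1, 0.9],
]

aggregated = [majority_vote(row) for row in annotations]
disagreements = [disagreement(row) for row in annotations]
uncertainties = [head_uncertainty(row) for row in head_outputs]
```

In this toy data, the majority vote discards the fact that the second and third items were contested, whereas the spread of the per-annotator predictions is larger exactly on those items.
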

Lightning Talk Abstract

Digbalay Bose: Title: Understanding Context in Movies: Taxonomies, Benchmarks, and Challenges. Abstract: Movies provide narrative structure for understanding long-term interactions between characters and their surrounding context. A key element of context understanding involves reasoning about the background location where the character interactions take place in different segments of a movie. Associated downstream applications include analysis of media portrayal of groups in different locations, location-based scene retrieval systems, and character-centric analysis for various locations. In this talk, we look at existing taxonomies and benchmarks used for understanding background locations in movies and auxiliary video datasets, followed by their use in constructing a large-scale holistic taxonomy.