What are multiple views and modalities?
Data sampled by observing an event with different instruments, each capturing a different presentation of that event. For example, consider the task of identifying the wake word "OK Google" or "Alexa" regardless of age, accent, acoustic background and other factors: the different variations of a given wake word are multiple views of the underlying task, i.e., recognizing the wake word. Another example is understanding the topic of a video using sound design, visuals and language as multiple modalities.
Why do multiview learning?
Humans learn by combining information from multiple senses, and so can machines. Learning from different devices or instruments observing multiple aspects of an underlying event can help build robust machine learning models.
Open research questions addressed by this work:
- How can we model multiple views/modalities in parallel?
- How can we develop methods that scale to a large number (>2) of views?
- Need for view-agnostic methods, i.e., methods that assume view correspondence but no details of view acquisition
- Developing mechanisms for multimodal "disentanglement"
- How can we dissociate modality-specific information from information shared across modalities?
- Need for generalizable unsupervised and self-supervised methodologies
Central idea of this work:
- The inherent variability of a semantic class can be uncorrelated across the multiple views in the data
- Maximizing multiview correlation transforms high-dimensional input data streams into a low-dimensional subspace shared across views
- By factoring out the variability arising from the many views of an event, the estimated shared subspace is naturally discriminative of the semantic classes
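The ideas above can be made concrete with a small numerical sketch. The following is a minimal illustration of maximizing multiview correlation for the two-view case, not this work's exact objective: the canonical correlations between two row-aligned views are the singular values of the whitened cross-covariance, and shared latent structure shows up as correlations near 1. The function names and toy data are our own.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def canonical_correlations(X, Y, eps=1e-6):
    """Canonical correlations between two row-aligned views (n_samples x d)."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = X.shape[0] - 1
    Sxx = Xc.T @ Xc / n + eps * np.eye(X.shape[1])   # within-view covariances,
    Syy = Yc.T @ Yc / n + eps * np.eye(Y.shape[1])   # lightly regularized
    Sxy = Xc.T @ Yc / n                              # cross-view covariance
    # Singular values of the whitened cross-covariance = canonical correlations.
    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.linalg.svd(T, compute_uv=False)

# Toy data: two views generated from the same 2-D latent "event".
rng = np.random.default_rng(0)
z = rng.standard_normal((500, 2))
X = z @ rng.standard_normal((2, 5)) + 0.05 * rng.standard_normal((500, 5))
Y = z @ rng.standard_normal((2, 4)) + 0.05 * rng.standard_normal((500, 4))
rho = canonical_correlations(X, Y)
```

Here the two shared latent dimensions yield two canonical correlations close to 1, while the remaining, view-specific noise dimensions stay near 0; projecting onto the top canonical directions recovers a low-dimensional shared subspace.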
We evaluated our methodology on several datasets in the domains of audio, video and images. Below are demos of these applications:
Click on the panels below to explore the learnt embeddings on different datasets
Wake word classification
From the speech commands dataset, our method learns to discriminate the 30 commands regardless of who the speaker is, in a semi-supervised manner.
Text-dependent speaker ID
From the speech commands dataset, our method learns to cluster the speakers regardless of which of the 30 commands they spoke.
Domestic activity classification
Our method models multi-channel audio segments acquired by a microphone array in a house. The multiview shared subspace robustly disentangles channel information from class information to classify domestic activities such as cooking and cleaning.
Multiview object recognition
Embeddings learnt from multiple views of an object successfully classify 40 object categories imaged from 12 and 40 different viewpoints.
The visualizations above were developed using this tool
Why does this work?
The objective we maximize, and the evolution of the eigenvalues of the objective through the training process, are illustrated below:
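To make the eigenvalue picture concrete, here is a hedged sketch (our own construction, not necessarily this work's exact training objective) of a MAXVAR-style multiview eigenvalue computation: with J views, form the covariance R of the concatenated view embeddings and its block-diagonal part D, and solve the generalized eigenproblem R v = lambda D v. Each eigenvalue lies in [0, J], and as training makes the views more correlated, the top eigenvalues climb toward J.

```python
import numpy as np

def multiview_eigenvalues(views, eps=1e-6):
    """Eigenvalues of the generalized problem R v = lam D v, where R is the
    covariance of the concatenated views and D its block-diagonal part.
    Each eigenvalue lies in [0, J] for J views; J means perfect correlation."""
    centered = [V - V.mean(0) for V in views]
    Z = np.hstack(centered)
    n = Z.shape[0] - 1
    R = Z.T @ Z / n
    D = np.zeros_like(R)
    i = 0
    for C in centered:                      # fill in the within-view blocks
        d = C.shape[1]
        D[i:i + d, i:i + d] = C.T @ C / n + eps * np.eye(d)
        i += d
    lam = np.linalg.eigvals(np.linalg.solve(D, R))
    return np.sort(lam.real)[::-1]

# Three noisy views of the same 2-D latent event. Shrinking the noise plays
# the role of training aligning the views: the top eigenvalue moves toward J = 3.
rng = np.random.default_rng(1)
z = rng.standard_normal((400, 2))
noisy = [z + 1.0 * rng.standard_normal((400, 2)) for _ in range(3)]
aligned = [z + 0.05 * rng.standard_normal((400, 2)) for _ in range(3)]
top_noisy = multiview_eigenvalues(noisy)[0]
top_aligned = multiview_eigenvalues(aligned)[0]
```

With strong noise the pairwise view correlation is about 0.5, so the top eigenvalue sits near 1 + (J-1) * 0.5 = 2; with well-aligned views it approaches 3, which is the behavior the eigenvalue curves during training illustrate.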