Computer Vision & Machine Learning Researcher
Hey! My name is Rahul Sharma. I am a Ph.D. candidate at Ming Hsieh Institute of Electrical and Computer Engineering, University of Southern California. I work with Prof. Shrikanth Narayanan and am part of the Signal Analysis and Interpretation Lab.
My research interests lie in multimodal signal processing, with an emphasis on the visual signal (computer vision), to understand human actions and behavior in multimedia content. I am also keenly interested in semi-supervised systems and the notion of weaker-than-full supervision.
Before USC, I obtained my Bachelor's and Master's degrees in Electrical Engineering from the Indian Institute of Technology, Kanpur. For my master's thesis, I worked on developing a computational framework to quantify a speaker's performance in the context of public speaking. At IITK, I was advised by Dr. Tanya Guha and Dr. Gaurav Sharma.
An objective understanding of media depictions, such as inclusive portrayals of how much someone is heard and seen on screen in film and television, requires machines to automatically discern who is talking, when, how, and where. Media content is rich in multiple modalities, such as visuals and audio, which can be used to learn speaker activity in videos. In this work, we present visual representations that carry implicit information about when and where someone is talking. We propose a crossmodal neural network for audio speech event detection using the visual frames. We use the learned representations for two downstream tasks: i) audio-visual voice activity detection and ii) active speaker localization in video frames. We present a state-of-the-art audio-visual voice activity detection system and demonstrate that the learned embeddings can effectively localize active speakers in the visual frames.
Paper 1 (ICIP'19): PDF | Poster
Paper 2 (arXiv preprint)
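A minimal sketch of the idea, assuming a ResNet-18 visual encoder, a GRU over time, and audio-derived voice-activity labels as the training target; these are illustrative choices, not the exact architecture from the papers:

# Sketch: predict frame-level speech activity from visual frames alone,
# supervised by voice-activity labels derived from the audio stream.
import torch
import torch.nn as nn
import torchvision.models as models

class VisualVAD(nn.Module):
    def __init__(self, hidden_dim=256):
        super().__init__()
        backbone = models.resnet18()                 # assumed visual encoder
        backbone.fc = nn.Identity()                  # 512-d embedding per frame
        self.encoder = backbone
        self.temporal = nn.GRU(512, hidden_dim, batch_first=True,
                               bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, 1)   # speech / non-speech

    def forward(self, frames):                       # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        feats, _ = self.temporal(feats)
        return self.classifier(feats).squeeze(-1)    # per-frame speech logits

# Training against audio-based labels makes the visual embeddings implicitly
# encode when someone is talking; their spatial activations can then be used
# to localize the active speaker.
model = VisualVAD()
logits = model(torch.randn(2, 16, 3, 112, 112))
loss = nn.BCEWithLogitsLoss()(logits, torch.randint(0, 2, (2, 16)).float())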
Due to its ability to visualize and measure the dynamics of vocal tract shaping during speech production, real-time magnetic resonance imaging (rtMRI) has emerged as a prominent research tool. The ability to track different articulators such as the tongue, lips, velum, and pharynx is a crucial step toward automating further scientific and clinical analysis. Recently, various researchers have addressed the problem of detecting articulatory boundaries, but existing approaches are primarily limited to static-image-based methods. In this work, we propose to use information from temporal dynamics together with spatial structure to detect articulatory boundaries in rtMRI videos. We train a convolutional LSTM network to detect and label the articulatory contours. We compare the produced contours against reference labels generated by iteratively fitting a manually created subject-specific template. We observe that the proposed method outperforms purely image-based methods, especially for the difficult-to-track articulators involved in airway constriction formation during speech.
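A compact sketch of the convolutional-LSTM idea, combining spatial structure with temporal dynamics to label articulator contours. The cell below is a standard ConvLSTM formulation; channel counts, the single-layer design, and class count are assumptions for illustration, not the paper's exact configuration.

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        # One convolution produces all four gates (input, forget, output, cell).
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class ContourNet(nn.Module):
    # Per-pixel articulator labels (e.g. tongue, lips, velum, pharynx, background).
    def __init__(self, n_classes=5, hid_ch=32):
        super().__init__()
        self.cell = ConvLSTMCell(1, hid_ch)
        self.head = nn.Conv2d(hid_ch, n_classes, 1)

    def forward(self, video):                    # video: (batch, time, 1, H, W)
        b, t, _, hgt, wdt = video.shape
        h = video.new_zeros(b, self.cell.hid_ch, hgt, wdt)
        c = torch.zeros_like(h)
        outs = []
        for step in range(t):
            h, c = self.cell(video[:, step], (h, c))
            outs.append(self.head(h))
        return torch.stack(outs, dim=1)          # (batch, time, classes, H, W)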
An effort to automatically understand the key characteristics of the behavior of children with ASD relative to typically developing (TD) children, using videos of home and clinical sessions. The system classifies the people present in the scene into child and interlocutor, and then computes features representing the child-interlocutor dynamics. To quantify the child-interlocutor interaction, we compute the physical proximity between them and their gaze directions, and study their dynamics over time along with additional speaker diarization information.
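A hedged sketch of one such dynamics feature, physical proximity over time computed from per-frame bounding boxes; the detection and child-vs-interlocutor classification are assumed to happen upstream, and the box format, names, and summary statistics here are illustrative.

import numpy as np

def box_center(box):
    # box = (x1, y1, x2, y2) in pixels
    x1, y1, x2, y2 = box
    return np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])

def proximity_series(child_boxes, adult_boxes):
    # Per-frame Euclidean distance between child and interlocutor centers.
    return np.array([
        np.linalg.norm(box_center(c) - box_center(a))
        for c, a in zip(child_boxes, adult_boxes)
    ])

# Simple summaries of how the proximity evolves over a session
# (illustrative placeholders for the actual interaction features).
dist = proximity_series(
    child_boxes=[(100, 80, 160, 220), (110, 82, 170, 224)],
    adult_boxes=[(300, 60, 380, 260), (290, 60, 372, 258)],
)
features = {"mean_dist": dist.mean(), "dist_trend": np.gradient(dist).mean()}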
In this work, we analyze eye-tracking data from children to study the key characteristics of the visual patterns demonstrated by children with Cortical Visual Impairment (CVI) compared with control subjects. We compute saliency maps over the presented stimuli to generate a representation of the eye-tracks.
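A minimal sketch of turning eye-tracking fixations into a saliency map over a stimulus: accumulate fixation points and smooth them with a Gaussian. The image size, smoothing width, and function names are assumptions for illustration.

import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_saliency(fixations, height, width, sigma=25.0):
    # fixations: iterable of (x, y) gaze points in pixel coordinates
    heat = np.zeros((height, width), dtype=np.float32)
    for x, y in fixations:
        if 0 <= int(y) < height and 0 <= int(x) < width:
            heat[int(y), int(x)] += 1.0
    sal = gaussian_filter(heat, sigma=sigma)      # smooth fixation counts
    return sal / sal.max() if sal.max() > 0 else sal

# The resulting maps can then be compared between CVI and control groups.
saliency = fixation_saliency([(320, 240), (330, 250), (600, 100)], 480, 640)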
We investigate the importance of human-centered visual cues for predicting the popularity of a public lecture. We construct a large database of more than 1800 TED talk videos and leverage the corresponding (online) viewers' ratings from YouTube as a measure of the popularity of the TED talks. Visual cues related to facial and physical appearance, facial expressions, and pose variations are learned using convolutional neural networks (CNNs) connected to an attention-based long short-term memory (LSTM) network to predict the video popularity. The proposed overall network is end-to-end trainable and achieves state-of-the-art prediction accuracy, indicating that the visual cues alone contain highly predictive information about the popularity of a talk. We also demonstrate qualitatively that the network learns a human-like attention mechanism, which is particularly useful for interpretability, i.e., how attention varies with time and across different visual cues as a function of their relative importance.
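A hedged sketch of the attention-over-time idea: per-frame visual features (appearance, expression, and pose cues from CNNs) are aggregated by an LSTM, and a learned attention weighting over time produces the popularity prediction. Dimensions and layer choices are illustrative, not the exact architecture.

import torch
import torch.nn as nn

class AttnPopularity(nn.Module):
    def __init__(self, feat_dim=512, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)          # scalar score per time step
        self.out = nn.Linear(hidden, 1)           # popularity rating

    def forward(self, feats):                     # feats: (batch, time, feat_dim)
        h, _ = self.lstm(feats)                   # (batch, time, hidden)
        w = torch.softmax(self.attn(h), dim=1)    # attention over time
        pooled = (w * h).sum(dim=1)               # attention-weighted summary
        return self.out(pooled).squeeze(-1), w.squeeze(-1)

# The returned attention weights make the prediction interpretable: they show
# which moments of the talk the model considers most informative.
model = AttnPopularity()
score, attention = model(torch.randn(4, 100, 512))   # 100 sampled frames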
This work proposes a trajectory clustering-based approach for segmenting flow patterns in high-density crowd videos. The goal is to produce a pixel-wise segmentation of a video sequence (static camera), where each segment corresponds to a different motion pattern. Unlike previous studies that use only motion vectors, we extract full trajectories so as to capture the complete temporal evolution of each region (block) in a video sequence. The extracted trajectories are dense, complex, and often overlapping. A novel clustering algorithm is developed to group these trajectories, taking into account information about the trajectories' shape, location, and the density of trajectory patterns in a spatial neighborhood. Once the trajectories are clustered, final motion segments are obtained by grouping the resulting trajectory clusters on the basis of their area of overlap and average flow direction. The proposed method is validated on a set of crowd videos that are commonly used in this field. In comparison with several state-of-the-art techniques, our method achieves better segmentation performance.
Paper: PDF | Poster
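A rough sketch of the trajectory-clustering idea: each block's trajectory is compared with others using a distance that mixes location and shape, and the trajectories are then grouped by agglomerative clustering. The resampling assumption, weighting, and clustering choices here are illustrative, not the paper's exact algorithm.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def traj_distance(a, b, w_loc=1.0, w_shape=1.0):
    # a, b: (T, 2) arrays of (x, y) positions sampled at the same T frames
    loc = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))            # location term
    shape = np.linalg.norm((a - a.mean(axis=0)) - (b - b.mean(axis=0)),
                           axis=1).mean()                            # shape term
    return w_loc * loc + w_shape * shape

def cluster_trajectories(trajs, n_clusters=4):
    # Pairwise distances over all extracted trajectories, then average-linkage
    # agglomerative clustering on the precomputed matrix (scikit-learn >= 1.2).
    n = len(trajs)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = traj_distance(trajs[i], trajs[j])
    model = AgglomerativeClustering(n_clusters=n_clusters,
                                    metric="precomputed", linkage="average")
    return model.fit_predict(dist)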
If you'd like to discuss anything about my work, shoot me an email!