USC

University of Southern California

Viterbi School of Engineering

Signal Analysis And Interpretation Laboratory

Research

 

Research in the Speech Analysis and Interpretation Laboratory spans both fundamental and applied aspects of speech processing and spoken language communication. The research is interdisciplinary: in addition to signal processing and communication theory, our work relies on knowledge from a variety of fields including AI/Computer Science, Linguistics, and Psychology.

Research projects involve graduate and undergraduate students as well as faculty and industry collaborators. Although SAIL students come primarily from Electrical Engineering or Computer Science, students from the Annenberg School of Communication and the Linguistics department also participate in our research through collaborative projects. SAIL, which is part of IMSC (Integrated Media Systems Center), an NSF-ERC, has ties with the USC Information Sciences Institute and the Institute for Creative Technologies.

SAIL addresses a variety of problems in speech research -- ranging from design, implementation and optimization of speech recognition/synthesis algorithms and conversational multimedia systems to fundamental experiments and modeling related to human vocal tract anatomy, physiology, acoustics and dynamics.

(real-time MRI)

 

Sponsors

Research in SAIL is supported by a variety of federal grants: NSF, DARPA, NIH, US Army/STRICOM.
The lab has also received support from the Powell Foundation and a Zumberge grant for interdisciplinary research.
Details on specific funded projects can be found on the projects page.
In addition, SAIL has received grants, equipment, software and other collaborative support from a number of industry sponsors, including
Lockheed Martin, HRL Labs, Lucent Technologies, Speechworks International, AT&T, Sun Microsystems and Intel.

 

Overview of Current Research Activities

Details of specific research activities are listed below. SAIL publications can be found on the publications page.

    Transonics - SpeechLinks

    US federal agencies are seeking new technologies to support multinational, multilingual operations. One significant challenge in such operations is cross-language communication between speakers of different native languages. DARPA and other DoD research groups envision enabling cross-language communication via computer-mediated translation of speech between the native languages of the speakers. The University of Southern California will develop SpeechLinks Solutions prototypes enabling human-human communication via automated two-way speech-to-speech language translation of English-Farsi and English-Dari.

    Members: Murtaza Bulut, Emil Ettelaie, Kyu Jeong Han, Ozlem Kalinli, Tom Murray, JongHo Shin, Shiva Sundaram, Andreas Tsiartas, Pezhman Zarifian, Shokufeh Farazmand, Panayiotis Georgiou, Shrikanth Narayanan

    Transonics projects page


    Emotion Research

    Communicative channels such as gestures, facial expressions and speech are jointly invoked to convey an intended message, which includes not only the spoken content (verbal channel) but also implicit cues (non-verbal channel) that enrich human communication. Notable among these cues are emotional states, which play a crucial role in interpersonal human interaction. Recent findings suggest that rational and intelligent communication between humans is closely related to emotions. It is therefore essential that emotions be integrated into human-machine interfaces so that they are more in tune with users' needs and preferences. Toward creating such interfaces, it is first necessary to study how emotions modulate and interact with the verbal and non-verbal channels in human communication. This is the main goal of our research group. The areas we are studying are emotion recognition in speech, expressive speech synthesis, multimodal analysis of human expressions, expressive speech production, emotions in text, expressive human-robot interfaces, human perception of simulated emotions, and scoring users' confidence and certainty during problem solving in a multimodal environment.

    Members: Murtaza Bulut, Carlos Busso, Jeannette Chang, Abe Kazemzadeh, Chi-Chun (Jeremy) Lee, Emily Mower, Samuel Kim, Sungbok Lee, Shrikanth Narayanan

    Emotion group page


    Pronunciation Group

    Tball Project: Technology-based assessment of language and literacy

    This sub-group aims to automatically assess the literacy and pronunciation skills of pre-literate children and nonnative adult speakers, analyzing developmentally-based tests such as letter naming, letter sounding, isolated word reading, story reading, and listening comprehension. A large component of this research involves understanding how humans (especially experts) perceive and assess pronunciation and reading. With the help of teachers and linguists, we administer subjective human evaluations to gain insight into how to adapt to a speaker's accent, and ultimately, how to rate the speaker's performance.

    Recently, we have designed specialized grammars and adapted speech recognition and pronunciation evaluation methods to identify whether a child is speaking fluently. We have also incorporated Bayesian networks to automatically score pronunciations in the isolated word-reading task; this system's performance approaches expert inter-evaluator agreement.

    Members: Abe Kazemzadeh, Joseph Tepperman, Matthew Black, Matteo Gerosa, Sungbok Lee, Shrikanth Narayanan

    Tball projects page


    SPAN: Speech Production and Articulation kNowledge

    The SPAN Group bridges multiple interdisciplinary departments, labs and projects at USC. It brings together faculty and students from the Viterbi School of Electrical Engineering, the College's Department of Linguistics, and the Department of Computer Science.

    An appropriate understanding of how language is produced in space and time directly informs, and is informed by, our knowledge of the phonological representation of speech, a representation that we take to be intrinsically articulatory and dynamic. The SPAN Group is interested in using cutting-edge imaging and signal processing technologies to understand language production from its cognitive conception to its biomechanical execution to its signal properties. Our group uses articulator movement tracking, real-time imaging of the vocal tract, and state-of-the-art movement, image, and acoustic analysis techniques, including those found in automatic speech recognition paradigms.

    Members: Erik Bresch, Jon-Fredrik Nielsen, Yoon Kim, Daylen Riggs, Abhinav Sethy, Shrikanth Narayanan, Krishna Nayak, Sungbok Lee, Dani Byrd, Richard Leahy

    SPAN group page


    Smart Room: Tracking of Meeting Dynamics

    The University of Southern California Smart Room is designed primarily to enable identification and tracking of the dynamics and engagement of participants in meetings. The basic features of interest for this task are the relative positions of meeting participants, their head orientations, speaker activity and speaker identity. The resulting annotation (meeting type, interaction dynamics, speaker activity) forms the basis for a second task of interest: meeting summarization and content retrieval.
    In our previous work [1] [2] we presented a multimodal system that fuses four different modalities (a ceiling multi-camera tracking system, a 360-degree face detection system, a microphone array and a speaker identification system). Our most recent development [3] incorporates a mixture particle filter algorithm for active speaker tracking and a sequential change-point detection algorithm for speaker detection in a track-before-detect scenario.
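
    As a rough illustration of sequential change-point detection, the sketch below uses a one-sided CUSUM test to flag the moment a stream of frame energies rises above its background level, as might happen when a participant starts speaking. CUSUM is chosen here only as a familiar example; the text does not say which detector the Smart Room system uses, and the function name, parameters and toy energy values are all assumptions made for illustration.

```python
def cusum_onset(samples, baseline_mean, drift, threshold):
    """One-sided CUSUM: return the index at which an upward shift is flagged.

    baseline_mean, drift and threshold are hypothetical tuning parameters;
    returns None if the cumulative statistic never exceeds the threshold.
    """
    statistic = 0.0
    for index, value in enumerate(samples):
        # Accumulate evidence of a rise above the baseline; clamp at zero so
        # that long quiet stretches do not build up negative credit.
        statistic = max(0.0, statistic + (value - baseline_mean - drift))
        if statistic > threshold:
            return index
    return None


# Toy energy track: 20 silent frames followed by 10 active frames.
energy = [0.0] * 20 + [1.0] * 10
onset = cusum_onset(energy, baseline_mean=0.0, drift=0.5, threshold=2.0)
```

    The detector raises its alarm a few frames after the true change because it waits for enough accumulated evidence; lowering the threshold shortens that delay at the cost of more false alarms.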

    Our research focuses primarily on the development of sequential Monte Carlo filtering techniques for multimodal, multi-participant tracking. We are interested in developing methods that explore different representations of the posterior distribution for efficient tracking of a variable number of participants.
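
    To make the idea of sequential Monte Carlo filtering concrete, here is a minimal bootstrap particle filter that tracks a single position in one dimension from noisy observations. It is a textbook sketch, not the lab's mixture particle filter: the random-walk motion model, the Gaussian observation model, and every name and parameter value below are assumptions chosen for illustration, and the multi-participant, multimodal aspects are left out entirely.

```python
import math
import random


def particle_filter(observations, n_particles=500, process_std=0.5,
                    obs_std=1.0, seed=0):
    """Bootstrap particle filter for a 1-D random-walk position model.

    Returns the posterior-mean position estimate after each observation.
    """
    rng = random.Random(seed)
    # Initialise particles around the first observation.
    particles = [observations[0] + rng.gauss(0.0, obs_std)
                 for _ in range(n_particles)]
    estimates = []
    for z in observations:
        # 1. Predict: propagate each particle through the motion model.
        particles = [p + rng.gauss(0.0, process_std) for p in particles]
        # 2. Update: weight each particle by the Gaussian likelihood of z.
        weights = [math.exp(-0.5 * ((z - p) / obs_std) ** 2)
                   for p in particles]
        total = sum(weights)
        if total == 0.0:
            # All likelihoods underflowed; fall back to uniform weights.
            weights = [1.0 / n_particles] * n_particles
        else:
            weights = [w / total for w in weights]
        # 3. Estimate: posterior mean as the weighted particle average.
        estimates.append(sum(w * p for w, p in zip(weights, particles)))
        # 4. Resample: draw particles in proportion to their weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


# Simulate a participant moving at constant speed, observed with noise.
sim = random.Random(1)
truth = [0.2 * t for t in range(50)]
noisy = [x + sim.gauss(0.0, 1.0) for x in truth]
estimates = particle_filter(noisy)
```

    The set of weighted particles is one possible representation of the posterior distribution; tracking a variable number of participants would need a richer representation than this single-target sketch provides.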

    Future research plans include exploring the potential of the Smart Room as an educational tool for enhancing meeting productivity and for analyzing social interaction patterns.

    Members: Viktor Rozgic, Carlos Busso, Samuel Kim, Kyu Han, Panayiotis Georgiou, Shrikanth Narayanan


    MURI: Human-like Speech Processing

    Our objective is to derive meaningful representations of both suprasegmental and segmental events in speech, followed by modeling and automatic detection of these events for improved speech recognition. At the segmental level, we are interested in deriving new articulatory representations that can offer new insights into human speech production; at the suprasegmental level, we are interested in the detection and modeling of prosodic events that convey vital cues in human speech communication. In summary, we are interested in the analysis of signal representations at both the segmental and suprasegmental levels, core algorithms for exploiting these representations, and their application in automatic speech recognition.

    We use electromagnetic articulography (EMA) to collect articulatory measurements of the vocal tract and exploit various representations of these data to improve phone recognition and overall speech recognition. We also use lexical, syntactic and acoustic correlates of prosody in different modeling frameworks to automatically detect pitch accents, boundary tones and dialog acts and to enrich the input speech stream with the corresponding tags. Such an approach can characterize human-like speech at multiple scales and can thus improve overall speech recognition.

    Members: Jorge Silva, Vivek Kumar Rangarajan Sridhar, Viktor Rozgic, Sankaranarayanan Ananthakrishnan, Shrikanth Narayanan


    Audio and Environmental Sound Processing

    Conventional content-based audio processing methods try to identify individual objects in a clip. While this works for a limited set of sources, it ignores the inherent relationship between acoustic sources that arises from the acoustic properties they share. The relationship takes the form of a many-to-one mapping between the assigned class labels (the direct names of the sources) and their acoustic properties. A given source, say “Nail Hammered on Bench” (its direct name), shares similar acoustic properties with “Knocking on Door”, which in turn shares its properties with “Footsteps on Tarmac”. In such a scenario, an audio retrieval system should retrieve the other two clips when queried with any one of them. A perceptually intuitive scheme like this is difficult to achieve with content-based methods, since they create an individual class label for each of these sources. This approach is also not easily scalable: as the number of identifiable sources grows, the complexity of the system increases manyfold. It also requires reorganizing the data and retraining the audio processing system for every new domain of application.

    Another aspect of our research is the investigation of ambient environmental sounds and events. Environmental sounds are neither speech nor music; they consist mainly of unstructured natural sounds that we hear every day. Our goal is to design and build an acoustic environment recognition system. We are interested in exploring new feature extraction, machine learning and pattern recognition algorithms related to this problem.

    Members: Shiva Sundaram, Selina Chu, Shrikanth Narayanan


    Music Information Retrieval

    A User-centric Content-based approach to Indexing, Query and Retrieval of Music through Signal Processing and Knowledge-based Methods

    Due to advances in computer and network technologies, the development of efficient data storage and retrieval techniques has received much attention in recent years. Music Information Retrieval (MIR) is one example of a technology that focuses on identifying desired music data within large music collections. The query input to such systems may be of various types, such as modes of natural human interaction (humming, singing, recorded audio samples) or metadata (lyrics, genres, artists). Given the metadata, retrieval can be straightforward; string matching algorithms of the kind used in web search engines can handle these tasks. On the other hand, when the input query is in the form of audio, signal processing algorithms and music-knowledge-based techniques need to be incorporated.

    Music retrieval engines usually have two tasks to achieve: a) transcription, in which a symbolic representation of the audio is created, and b) retrieval, in which the representations of the input query and the database entries are compared for matching.

    Our target domain covers both monophonic and polyphonic music. For monophonic music, we are building a Query by Humming system, which accepts a user's humming as input and searches a database for similar melodies. The main challenge here is the variability in the input generated by users. For polyphonic music, the challenge is the complex structure of the audio, which prevents a very accurate mapping of the audio signal into a symbolic representation. Variability may also come from different expressive performances, different orchestral setups (different instruments) and so on. To achieve robust performance under these conditions, we propose statistical techniques.
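
    The retrieval step for monophonic queries can be sketched as follows, assuming transcription has already produced a sequence of MIDI note numbers. The sketch compares melodies by their semitone intervals, which makes the match independent of the key the user hums in, and ranks database entries by edit distance to tolerate a few wrong notes. This simple string-matching approach stands in for the statistical techniques mentioned above, which are not described here; the function names and the two-tune database are hypothetical.

```python
def intervals(pitches):
    """Convert MIDI note numbers to successive semitone intervals.

    Intervals are unchanged by transposition, so a tune hummed in a
    different key still yields the same sequence.
    """
    return [b - a for a, b in zip(pitches, pitches[1:])]


def edit_distance(a, b):
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x with y
        prev = curr
    return prev[-1]


def best_match(query_pitches, database):
    """Return the title whose interval sequence is closest to the query."""
    query = intervals(query_pitches)
    return min(database,
               key=lambda title: edit_distance(query,
                                               intervals(database[title])))


# Hypothetical two-tune database, stored as MIDI note numbers.
database = {
    "Twinkle Twinkle Little Star": [60, 60, 67, 67, 69, 69, 67],
    "Ode to Joy": [64, 64, 65, 67, 67, 65, 64, 62],
}

# The first tune hummed a whole tone higher, with one wrong note.
hummed = [62, 62, 69, 69, 71, 70, 69]
match = best_match(hummed, database)
```

    A single wrong note disturbs two adjacent intervals, which is one reason real user input is hard to match and why more robust, statistically motivated similarity measures are attractive.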

    Members: Erdem Unal, Shrikanth Narayanan

    Music group page


    Auditory Implants and Perception

    Cochlear implants (CIs) have successfully provided hearing to profoundly deaf people who could not benefit from hearing aids. The average speech communication performance of CI users has improved significantly over the years and has almost reached a plateau in quiet conditions. However, this improvement in average performance has been accompanied by a remarkable increase in inter-subject differences in performance. Furthermore, speech perception in CI users has been observed to have some unique patterns. For the next generation of CIs, it is important to use such performance variability as a tool to optimize speech perception performance and to further understand the exact cause of the problem. Our research focuses on optimizing speech perception in cochlear implants through signal processing approaches, taking the particular characteristics of speech perception in CI users into account.

    Members: Chuping Liu, Qian-Jie Fu, Shrikanth Narayanan