Research in the Speech Analysis and Interpretation Laboratory spans both fundamental and applied aspects of speech processing and spoken language communication. The research is interdisciplinary: in addition to signal processing and communication theory, our work relies on knowledge from a variety of fields including AI/Computer Science, Linguistics, and Psychology.
Research projects involve graduate and undergraduate students as well as faculty and industry collaborators. Although SAIL students come primarily from Electrical Engineering or Computer Science, students from the Annenberg School of Communication and the Linguistics department also participate in our research through collaborative projects. SAIL, which is part of IMSC (Integrated Media Systems Center), an NSF-ERC, has ties with the USC Information Sciences Institute and the Institute for Creative Technologies.
SAIL addresses a variety of problems in speech research -- ranging from design, implementation and optimization of speech recognition/synthesis algorithms and conversational multimedia systems to fundamental experiments and modeling related to human vocal tract anatomy, physiology, acoustics and dynamics.
Sponsors
Research in SAIL is supported by a variety of federal grants: NSF, DARPA, NIH, US Army/STRICOM.
The lab has also received support from the Powell Foundation and a Zumberge grant for interdisciplinary research.
Details on specific funded projects can be found on the projects page.
In addition, SAIL has received grants, equipment, software and other collaborative support from a number of industry sponsors, including Lockheed Martin, HRL Labs, Lucent Technologies, Speechworks International, AT&T, SUN Microsystems and Intel.
Overview of Current Research Activities
Details of specific research activities are listed below. SAIL publications can be found on the publications page.
US federal agencies are seeking new technologies to support multinational, multilingual operations. One significant challenge in such operations is cross-language communication between speakers of different native languages. DARPA and other DoD research groups envision enabling cross-language communication via computer-mediated translation of speech between the native languages of the speakers. The University of Southern California will develop SpeechLinks Solutions prototypes enabling human-human communication via automated two-way speech-to-speech language translation of English-Farsi and English-Dari.
Members: Murtaza Bulut, Emil Ettelaie, Kyu Jeong Han, Ozlem Kalinli, Tom Murray, JongHo Shin, Shiva Sundaram, Andreas Tsiartas, Pezhman Zarifian, Shokufeh Farazmand, Panayiotis Georgiou, Shrikanth Narayanan
Communicative channels, such as gestures, facial expressions and speech, are jointly invoked to convey and express an intended message, which includes
not only the spoken content (verbal channel), but also implicit cues (non-verbal channel) that enrich human communication. Notable among these
cues are the emotional states, which play a crucial role in inter-personal human interaction. Recent findings suggest that rational and intelligent
communication between humans is closely related to emotions. Therefore, emotions must be integrated into human-machine interfaces if those
interfaces are to be more in tune with the users' needs and preferences. Toward creating such interfaces, it is first necessary to study how
emotions modulate and interact with the verbal and non-verbal channels in human communication. This is the main goal of our research group.
The areas that we are studying are emotion recognition in speech, expressive speech synthesis, multi-modal analysis of human expressions,
expressive speech production, emotions in text, expressive human-robot interfaces, human perception of simulated emotions, and scoring of
user confidence and certainty during problem solving in a multimodal environment.
Members: Murtaza Bulut, Carlos Busso, Jeannette Chang, Abe Kazemzadeh, Chi-Chun (Jeremy) Lee, Emily Mower, Samuel Kim, Sungbok Lee, Shrikanth Narayanan
This sub-group aims to automatically assess the literacy and pronunciation
skills of pre-literate children and nonnative adult speakers, analyzing developmentally-based tests such as letter naming,
letter sounding, isolated word reading, story reading, and listening comprehension. A large
component of this research involves understanding how humans
(especially experts) perceive and assess pronunciation and reading. With the help of teachers and linguists, we
administer subjective human evaluations to gain insight into how to
adapt to a speaker's accent, and ultimately, how to rate the speaker's performance.
Recently, we have designed specialized grammars and adapted speech
recognition and pronunciation evaluation methods to determine
whether a child is speaking fluently. We have also incorporated Bayesian networks to automatically
score pronunciations in the isolated word-reading task; this system's performance approaches the level of expert inter-evaluator
agreement.
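To make the comparison with inter-evaluator agreement concrete, the sketch below computes Cohen's kappa between two sets of ratings. The text above does not say which agreement statistic is used, so kappa is an assumption here, and the function name and the accept/reject labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Fraction of items on which the two raters give the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: 1 = pronunciation accepted, 0 = rejected.
expert_1 = [1, 1, 0, 1, 0, 1, 1, 0]
expert_2 = [1, 1, 0, 1, 1, 1, 1, 0]
system = [1, 1, 0, 0, 0, 1, 1, 0]

print("expert vs expert:", cohens_kappa(expert_1, expert_2))
print("system vs expert:", cohens_kappa(system, expert_1))
```

Under this assumption, a system "approaches" inter-evaluator agreement when its kappa against an expert is close to the kappa between two experts.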
Members: Abe Kazemzadeh, Joseph Tepperman, Matthew Black, Matteo Gerosa, Sungbok Lee, Shrikanth Narayanan
The SPAN Group bridges multiple interdisciplinary departments, labs and projects at USC. It brings together faculty and students from the Viterbi School of Electrical Engineering, the College's Department of Linguistics, and the Department of Computer Science.
An appropriate understanding of how language is produced in space and time directly informs, and is informed by, our knowledge of the phonological representation of speech, a representation that we take to be intrinsically articulatory and dynamic. The SPAN Group is interested in using cutting-edge imaging and signal processing technologies to understand language production from its cognitive conception to its biomechanical execution to its signal properties. Our group uses articulator movement tracking, real-time imaging of the vocal tract, and state-of-the-art movement, image, and acoustic analysis techniques, including those found in automatic speech recognition paradigms.
Members: Erik Bresch, Jon-Fredrik Nielsen, Yoon Kim, Daylen Riggs, Abhinav Sethy, Shrikanth Narayanan, Krishna Nayak, Sungbok Lee, Dani Byrd, Richard Leahy
The University of Southern California Smart Room is designed to identify and track the dynamics and engagement of participants in meetings. The basic features of interest for this task are the relative positions of meeting participants, their head orientations, speaker activity and speaker identity. The resulting annotation (meeting type, interaction dynamics, speaker activity) forms the basis for the second task of interest: meeting summarization and content retrieval.
In our previous work [1][2] we presented a multimodal system that fuses four different modalities (a ceiling multi-camera tracking system, a 360-degree face detection system, a microphone array and a speaker identification system). Our most recent development [3] incorporates a mixture particle filter algorithm for active speaker tracking and a sequential change-point detection algorithm for speaker detection in a track-before-detect scenario.
In our research we focus primarily on the development of sequential Monte Carlo filtering techniques for multi-modal, multi-participant tracking. We are interested in developing methods that explore different representations of the posterior distribution for efficient tracking of a variable number of participants.
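As a rough illustration of sequential Monte Carlo filtering, the sketch below tracks a single participant's position along one axis with a generic bootstrap particle filter. It is not the mixture particle filter of [3]: the random-walk motion model, the Gaussian observation model, the noise parameters and all function names are assumptions made for this example.

```python
import math
import random


def gaussian_likelihood(observation, position, noise_std):
    """Unnormalized Gaussian likelihood of an observation given a particle position."""
    z = (observation - position) / noise_std
    return math.exp(-0.5 * z * z)


def particle_filter_step(particles, observation, rng, motion_std=0.2, obs_std=0.5):
    """One predict / weight / resample cycle of a bootstrap particle filter."""
    # Predict: move every particle with an assumed random-walk motion model.
    predicted = [p + rng.gauss(0.0, motion_std) for p in particles]
    # Weight: score each particle by how well it explains the observation.
    weights = [gaussian_likelihood(observation, p, obs_std) for p in predicted]
    if sum(weights) == 0.0:
        # Every weight underflowed to zero: fall back to uniform weights.
        weights = [1.0] * len(predicted)
    # Resample: draw a new particle set in proportion to the weights.
    return rng.choices(predicted, weights=weights, k=len(predicted))


def track_position(observations, n_particles=500, seed=0, init_range=(0.0, 10.0)):
    """Return the posterior-mean position estimate after each observation."""
    rng = random.Random(seed)
    particles = [rng.uniform(*init_range) for _ in range(n_particles)]
    estimates = []
    for observation in observations:
        particles = particle_filter_step(particles, observation, rng)
        estimates.append(sum(particles) / len(particles))
    return estimates


# Hypothetical noisy position readings (in meters) for one participant.
readings = [4.8, 5.1, 5.0, 5.3, 5.2, 5.6, 5.5, 5.9]
for step, estimate in enumerate(track_position(readings)):
    print(step, round(estimate, 2))
```

The particle set is one possible representation of the posterior distribution; handling a variable number of participants would require extending the state, which this sketch does not attempt.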
Future research plans include exploring the potential of the Smart Room as an educational tool for enhancing meeting productivity and for analyzing social interaction patterns.
Members: Viktor Rozgic, Carlos Busso, Samuel Kim, Kyu Han, Panayiotis Georgiou, Shrikanth Narayanan
MURI: Human-like Speech Processing
Our objective is to derive meaningful representations of both suprasegmental and segmental events in speech, followed by modeling and automatic detection of these events for improved speech recognition. At the segmental level, we are interested in deriving new articulatory representations that can offer new insights into human speech production; at the suprasegmental level, we are interested in the detection and modeling of prosodic events that convey vital cues in human speech communication. In summary, we are interested in the analysis of signal representations at both the segmental and suprasegmental levels, core algorithms for exploiting these representations, and their application in automatic speech recognition.
We use electromagnetic articulography (EMA) to collect articulatory measurements of the vocal tract and exploit various representations of this data to improve phone recognition and overall speech recognition. We also use lexical, syntactic and acoustic correlates of prosody in different modeling frameworks to automatically detect pitch accents, boundary tones and dialog acts and to enrich the input speech stream with the corresponding tags. Such an approach can characterize human-like speech at multiple scales and can thus improve overall speech recognition.
Conventional content-based audio processing methods try to identify individual objects in a clip. While this works for a limited set of sources, it ignores the inherent relationship between acoustic sources that arises from the acoustic properties they share. The relationship takes the form of a many-to-one mapping between the assigned class labels (the direct names of the sources) and their acoustic properties. A given source, say “Nail Hammered on Bench” (its direct name), shares similar acoustic properties with “Knocking on Door”, which in turn shares its properties with “Footsteps on Tarmac”. In such a scenario, an audio retrieval system should retrieve the other two clips when queried with any one of them. A perceptually intuitive scheme like this is difficult to achieve with content-based methods, since they create an individual class label for each of these sources. The content-based approach is also not easily scalable: as the number of identifiable sources grows, the complexity of the system increases many times over. It also requires reorganizing the data and retraining the audio processing system for every new application domain.
Another aspect of our research is the investigation of ambient environmental sounds and events. Environmental sounds are neither speech nor music, but consist mainly of the unstructured natural sounds that we hear every day. Our goal is to design and build an acoustic environment recognition system. We are interested in exploring new feature extraction, machine learning and pattern recognition algorithms related to this problem.
A User-centric Content-based approach to Indexing, Query and Retrieval of Music through Signal Processing and Knowledge-based Methods
Due to advances in computer and network technologies, the development of efficient data storage and retrieval techniques
has received much attention in recent years. Music Information Retrieval (MIR) is one example of a technology that
focuses on identifying desired music data within large music collections. The query input to such systems may be of various
types, such as modes of natural human interaction (humming, singing, recorded audio samples) or metadata (lyrics, genres,
artists). Given the metadata, retrieval can be straightforward; the string matching algorithms used in web search engines
are capable of these kinds of tasks. On the other hand, when the input query is in the form of audio, signal processing
algorithms and music-knowledge-based techniques need to be incorporated.
Music retrieval engines usually have two tasks to achieve: a) transcription, in which a symbolic representation of the audio is created, and b) retrieval, in which the representations of the input query and
the database entries are compared for matching.
Our target domain covers both monophonic and polyphonic music. For monophonic music, we
are building a Query by Humming system, which accepts a user's humming as the input and
searches a database for similar melodies. The main challenge here is the
variability in the input generated by users. For polyphonic music, the challenge is the complex
structure of the audio, which prevents us from achieving a very accurate mapping of the audio signal into a symbolic
representation. Variability may also come from different expressive performances,
different orchestral setups (for example, different instruments), and so on. To achieve robust performance under
these conditions, we propose statistical techniques.
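The retrieval step of a Query by Humming system can be sketched as follows, assuming the transcription step has already turned each melody into a list of MIDI note numbers. Dynamic time warping over pitch intervals is used here only as a simple stand-in for the statistical techniques mentioned above; the toy database, the query and all function names are hypothetical.

```python
import math


def to_intervals(pitches):
    """Convert MIDI pitches to semitone intervals, which ignore the sung key."""
    return [b - a for a, b in zip(pitches, pitches[1:])]


def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two numeric sequences."""
    n, m = len(seq_a), len(seq_b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(seq_a[i - 1] - seq_b[j - 1])
            # Allow repeated or skipped notes by extending any neighboring path.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def rank_melodies(query_pitches, database):
    """Return melody names ordered from best to worst match to the query."""
    query = to_intervals(query_pitches)
    scored = [
        (dtw_distance(query, to_intervals(pitches)), name)
        for name, pitches in database.items()
    ]
    return [name for _, name in sorted(scored)]


# Toy database of opening phrases as MIDI note numbers (illustrative only).
TOY_DATABASE = {
    "twinkle": [60, 60, 67, 67, 69, 69, 67],
    "ode_to_joy": [64, 64, 65, 67, 67, 65, 64, 62],
    "frere_jacques": [60, 62, 64, 60, 60, 62, 64, 60],
}

# A hummed query: "twinkle" sung a whole tone higher, with one note repeated.
hummed = [62, 62, 69, 69, 69, 71, 71, 69]
print(rank_melodies(hummed, TOY_DATABASE)[0])  # prints "twinkle"
```

Working on intervals rather than absolute pitches absorbs one source of user variability (singing in a different key), and the warping absorbs another (repeated or dropped notes); pitch errors and tempo drift in real humming would need a more robust model.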
Cochlear implants (CIs) have successfully provided hearing to profoundly deaf people who could not benefit from hearing aids. The average speech communication performance of CI
users has improved significantly over the years and has almost reached a plateau in quiet conditions. However, this improvement in average performance has been accompanied by a remarkable increase
in inter-subject differences in performance. Furthermore, speech perception with CIs has been observed to show some unique patterns. In the next generation of CIs,
it is important to use such performance variability as a tool to optimize speech perception performance, and to further understand the exact cause of the problem. Our research
focuses on optimizing speech perception in cochlear implants through signal processing approaches, while taking the particular characteristics of speech perception with CIs into account.