Introduction

The SpeechLinks project focuses on creating robust, widely-deployable and cost-effective technology solution for enabling and enhancing cross-lingual spoken language interaction between people who do not share a common language. One of our target applications focuses on communication between healthcare personnel who speak English only and patients with limited English proficiency (sponsored by the NSF). The interaction represents machine-mediated interpersonal communication, which is different from conventional human-machine dialogs.

The state-of-the-art in speech to speech (S2S) system design that enables such cross-lingual interactions is characterized by (1) a pipelined architecture of speech recognition, machine translation and speech synthesis that relies primarily on lexical information, largely ignoring other rich information present in speech and spoken discourse, (2) minimal interaction between the constituent components beyond the pipelining and (3) not adequately taking advantage of the humans in the loop for automatically learning, adapting and collaboratively managing the interaction just as an experienced interpreter would do. Overcoming these fundamental limitations requires improving robust intelligence at all levels - signal, system, and human - and directly leads to the research questions of this project.

Our central premise is that the mediating system should utilize the rich context, beyond what is conveyed by words, while being cognizant of and working along with the participating humans for enabling successful cross-lingual interactions in terms of improved information transfer, communication efficiency and social co-presence. We propose to capture, model and transfer rich contextual information conveyed through speech prosody, discourse, and user state behavior to aid robust translation and expressive synthesis in the target language.

In addition to the above, the SpeechLinks team also works on the design and portability issues of S2S systems in collaboration with BBN Technologies as part of the DAPRA TRANSTAC program.

Top-level Overview of an S2S System

The SpeechLinks translation device operates by recognizing the user's speech input and converting it into text. The text is subsequently translated, and re-synthesized back into the target language.

The Automatic Speech Recognition (ASR) subsystem that produces N-best lists along with the decoding confidence scores. The ASR operates in real time in the two languages of interest. The ASR uses models of the acoustic patterns of human speech trained by example recordings and statistical knowledge about the structure of the language trained by example transcripts.

The Dialog Manager (DM) acts as the heart of the system, redirecting messages accordingly. Every output of the speech recognizer, tagged with time stump, serial number, and other important information is received by the DM, and usually is redirected for display to the Graphical User Interface (GUI), and for translation to the Machine Translation unit. However, the DM also has the options of rejecting recognition results based on several factors, such as confidence reported by the ASR.

The Machine Translation (MT) unit operates in two modes: The classifier attempts to assign a concept to an utterance, so for example a sentence such as "Umm and do you have any headaches" would become the concept "Do you have a headache". The classifier can perform really accurately in the case of utterances that are in domain, in other words, for those sentences that we trained the system exhaustively to work with. However out of domain utterances are translated in a statistical manner by the Statistical Machine Translation (SMT) unit. The SMT breaks the sentence into segments and translates piece by piece, reconstructing the result on the destination language.

Finally, a unit selection based Text To Speech synthesizer (TTS) provides the spoken output. The TTS operates in several modes: for utterance results of the classifier, the naturally sounding human recording of the utterance is played out. If the utterance was statistically classified, then the unit of the synthesizer drops from the utterance level to the word level and a concatenation of words is played out. Finally, for words that we have no human recordings of, synthesis of the individual words, and of the sentence takes place at the diphone level.

PEOPLE

SAIL Faculty

External Collaborators

USC Annenberg School of Communication: Prof. Margaret McLaughlin
USC Keck School of Medicine: Dr. Baezconde-Garbanati
AT&T Research Labs: Dr. Srinivas Bangalore
BBN Technologies: Dr. John Makhoul, Prem Natarajan, Rohit Prasad and Dr. Vivek Sridhar

SAIL Students

PUBLICATIONS

Click Here

DEMO

An old version of the SpeechLinks system: low resolution (4.82MB) or high resolution (35.8MB).

NEWS

The first data collection for the NSF-sponsored "Context enriched speech-to-speech translation" was held on 30^th January 2010 in the Electrical Engineering Building at USC.
"The Doctor Can Understand You Now", US News and World Report Science, September 23, 2009.