Introduction
US federal agencies are seeking new technologies to support multinational, multilingual operations.
One significant challenge in such operations is cross-language communication between speakers of
different native languages. DARPA and other DoD research groups envision enabling cross-language
communication via computer-mediated translation of speech between the native languages of the speakers.
The University of Southern California will develop SpeechLinks prototypes that enable human-to-human
communication via automated two-way speech-to-speech translation between English-Farsi and English-Dari.
SpeechLinks Solution
The SpeechLinks translation device operates by recognizing the user's speech input and converting it into text. The text is then
translated and re-synthesized as speech in the target language. A
more detailed description of the system's functionality follows:
The Automatic Speech Recognition (ASR) subsystem produces n-best
lists along with decoding confidence scores. The ASR operates in
real time in two languages: English and Persian (Farsi). The English
ASR operates on a vocabulary of over 22,000 words and the Persian ASR
on over 9,000 words. The ASR uses models of the acoustic patterns of human
speech, trained on example recordings, and statistical knowledge about
the structure of the language, trained on example transcripts.
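To make the ASR output concrete, here is a minimal sketch of what an n-best list with decoding confidence scores could look like. The data structure and function names are illustrative assumptions, not the actual SpeechLinks API.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str          # recognized word sequence
    confidence: float  # decoding confidence score, assumed in [0, 1]

def n_best(hypotheses, n=5):
    """Return the top-n hypotheses, sorted by decoding confidence."""
    return sorted(hypotheses, key=lambda h: h.confidence, reverse=True)[:n]

# Toy example: three competing hypotheses for one utterance
hyps = [
    Hypothesis("do you have a headache", 0.91),
    Hypothesis("do you have any headaches", 0.87),
    Hypothesis("do you heavy headache", 0.34),
]
best = n_best(hyps, n=2)
```

Downstream components (such as the Dialog Manager) can then inspect both the word sequences and their confidence scores.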
The Dialog Manager (DM) acts as the heart of the system, routing messages accordingly. Every output of the speech recognizer, tagged
with a time stamp, serial number, and other important information, is
received by the DM and usually redirected to the
Graphical User Interface for display and to the Machine
Translation unit for translation. However, the DM can also reject
recognition results based on several factors, such as the confidence
reported by the ASR.
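The DM's routing and rejection logic can be sketched as follows. The message fields, component interfaces, and threshold value are assumptions for illustration; the real system's tagging scheme and rejection criteria may differ.

```python
REJECT_THRESHOLD = 0.5  # assumed value; the actual system may tune this

def route(message, gui, mt):
    """Reject a low-confidence recognition result, or forward it
    to the GUI (for display) and the MT unit (for translation)."""
    if message["confidence"] < REJECT_THRESHOLD:
        return "rejected"
    gui.append(message)  # display to the user
    mt.append(message)   # queue for translation
    return "forwarded"

# Toy example: one tagged recognizer output passing through the DM
gui, mt = [], []
msg = {"serial": 42, "timestamp": 12.7,
       "text": "do you have a headache", "confidence": 0.91}
status = route(msg, gui, mt)
```

Centralizing this decision in the DM keeps the GUI and MT unit free of confidence handling.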
The Machine Translation (MT) unit operates in two modes. The
Classifier attempts to assign a concept to an utterance; for
example, a sentence such as "Umm and do you have any headaches" would
become the concept "Do you have a headache". The classifier
performs very accurately on in-domain utterances, that is, on
the sentences the system was exhaustively trained
on (about 1,200 utterances). Out-of-domain
utterances are instead translated statistically by the
Statistical Machine Translation (SMT) unit. The SMT breaks the
sentence into segments, translates it piece by piece, and reconstructs
the result in the target language.
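The two-mode design above can be sketched as a classifier lookup with an SMT fallback. The concept table and the word-level translation table below are toy stand-ins for the trained models, and the Persian glosses are illustrative only.

```python
# Toy concept table: maps in-domain paraphrases to a canonical concept
CONCEPTS = {
    "umm and do you have any headaches": "Do you have a headache",
}

# Toy word-level table standing in for the SMT's learned phrase models
WORD_TABLE = {"where": "koja", "is": "ast", "the": "", "pain": "dard"}

def translate(utterance):
    """Try the concept classifier first; fall back to piece-by-piece
    statistical translation for out-of-domain input."""
    concept = CONCEPTS.get(utterance.lower())
    if concept is not None:
        return concept, "classifier"
    # out of domain: translate segment by segment and reconstruct
    pieces = [WORD_TABLE.get(w, w) for w in utterance.lower().split()]
    return " ".join(p for p in pieces if p), "smt"
```

Returning which mode produced the output matters downstream: as described below, the TTS chooses its synthesis strategy based on it.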
Finally, a unit-selection-based Text-To-Speech synthesizer (TTS)
provides the spoken output. The TTS operates in several modes: for
utterances resolved by the classifier, a natural-sounding human
recording of the utterance is played out. If the utterance was
translated statistically, the synthesis unit drops from
the utterance level to the word level, and a concatenation of word
recordings is played out. Finally, for words with no human recordings,
synthesis of the individual words, and of the sentence, takes place at
the diphone level.
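This three-level back-off can be sketched as follows. The recording inventories are hypothetical placeholders for the system's actual audio databases.

```python
# Hypothetical inventories of available human recordings
UTTERANCE_RECORDINGS = {"do you have a headache"}
WORD_RECORDINGS = {"do", "you", "have", "a", "headache", "where", "is"}

def tts_plan(utterance):
    """Return (unit, level) pairs: play a whole-utterance recording if
    one exists; otherwise concatenate word recordings, dropping to
    diphone synthesis for words with no recording."""
    if utterance in UTTERANCE_RECORDINGS:
        return [(utterance, "utterance")]
    plan = []
    for word in utterance.split():
        level = "word" if word in WORD_RECORDINGS else "diphone"
        plan.append((word, level))
    return plan
```

The back-off trades naturalness for coverage: whole-utterance recordings sound best, while diphone synthesis guarantees that any word can be spoken.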
PEOPLE
USC Faculty
USC Postdocs and Students
- Murtaza Bulut, Ph.D. student
- Emil Ettelaie , Ph.D. student
- Kyu Jeong Han, Ph.D. student
- Ozlem Kalinli , Ph.D. student
- Tom Murray, Ph.D. student
- JongHo Shin, Ph.D. student
- Shiva Sundaram , Ph.D. student
- Andreas Tsiartas, Ph.D. student
- Pezhman Zarifian, Undergraduate student
- Shokufeh Farazmand, Undergraduate student
PAPERS
- Jong Ho Shin, Panayiotis Georgiou, and Shrikanth Narayanan. Analyzing the
multimodal behaviors of users of a speech-to-speech translation device by using
concept matching scores. In Proceedings of IEEE Multimedia Signal Processing Workshop, Chania, Greece,
October 2007.
- Shiva Sundaram and Shrikanth Narayanan. Experiments in automatic genre
classification of full-length music tracks using audio activity rate. In Proceedings of IEEE Multimedia Signal Processing Workshop,
Chania, Greece, October 2007.
- Ozlem Kalinli and Shrikanth Narayanan. Early auditory processing inspired
features for robust automatic speech recognition. In Proceedings of EUSIPCO, Poznan, Poland, September 2007.
URL: http://sail.usc.edu/publications/ozlem_eusipco07.pdf.
- Murtaza Bulut, Sungbok Lee, and Shrikanth Narayanan. Analysis of emotional
speech prosody in terms of part of speech tags. In Proceedings of InterSpeech ICSLP, Antwerp, Belgium, August
2007. URL: http://sail.usc.edu/publications/murtaza_icslp07.pdf.
- Kyu Jeong Han and Shrikanth Narayanan. A robust stopping criterion for
agglomerative hierarchical clustering in a speaker diarization system. In Proceedings of InterSpeech ICSLP, Antwerp, Belgium, August
2007. URL: http://sail.usc.edu/publications/kyuhan_icslp07.pdf.
- Ozlem Kalinli and Shrikanth Narayanan. A saliency-based auditory attention model
with applications to unsupervised prominent syllable detection in speech. In
Proceedings of InterSpeech ICSLP, Antwerp, Belgium,
August 2007. URL: http://sail.usc.edu/publications/ozlem_icslp07.pdf.
- Sankaranarayanan Ananthakrishnan and Shrikanth Narayanan. Prosody-enriched
lattices for improved syllable recognition. In Proceedings of InterSpeech
ICSLP, Antwerp, Belgium, August 2007.
- Shiva Sundaram and Shrikanth Narayanan. Analysis of audio clustering using word
descriptions. In Proceedings of ICASSP, Honolulu,
Hawaii, April 2007. URL: http://sail.usc.edu/publications/shiva_cluster_icassp07.pdf.
- Sankaranarayanan Ananthakrishnan and Shrikanth Narayanan. Improved speech
recognition using acoustic and lexical correlates of pitch accent in a n-best
rescoring framework. In Proceedings of ICASSP,
Honolulu, Hawaii, April 2007. URL: http://sail.usc.edu/publications/ananthakrishnan_icassp07.pdf.
- Abhinav Sethy, Shrikanth Narayanan, and Bhuvana Ramabhadran. Data driven
approach for language model adaptation using stepwise relative entropy
minimization. In Proceedings of ICASSP, Honolulu,
Hawaii, April 2007. URL: http://sail.usc.edu/publications/sethy_icassp07.pdf.
- Murtaza Bulut, Sungbok Lee, and Shrikanth Narayanan. A statistical approach for
modeling prosody features using pos tags for emotional speech synthesis. In
Proceedings of ICASSP, Honolulu, Hawaii, April 2007.
URL: http://sail.usc.edu/publications/bulut_icassp07.pdf.
- Shiva Sundaram and Shrikanth Narayanan. Discriminating two types of noise
sources using cortical representation and dimension reduction technique. In
Proceedings of ICASSP, Honolulu, Hawaii, April 2007.
URL: http://sail.usc.edu/publications/shiva_noise_icassp07.pdf.
- JongHo Shin, Panayiotis Georgiou, and Shrikanth Narayanan. User modeling in a
speech translation driven mediated interaction setting. In Proceedings of First International Workshop on
Human-Centered Multimedia (ACM Multimedia),
Santa Barbara, CA, October 2006. URL: http://sail.usc.edu/publications/JongHo_acm06_hcm.pdf.
- Shiva Sundaram and Shrikanth Narayanan. An attribute-based approach to audio
description applied to segmenting vocal sections in popular music songs. In
Proceedings of MMSP, Victoria, Canada, October 2006.
URL: http://sail.usc.edu/publications/sundaram_MMSP06.pdf.
- Shiva Sundaram and Shrikanth Narayanan. Vector-based representation and
clustering of audio using onomatopoeia words. In Proceedings
of AAAI 2006 Fall Symposium, Aurally Informed
Performance: Integrating Machine Listening and Auditory Presentation in Robotic Systems, Arlington, VA,
October 2006. URL: http://sail.usc.edu/publications/sundaram_AAAI06.pdf.
- Shankar Ananthakrishnan and Shrikanth Narayanan. Combining acoustic, lexical,
and syntactic evidence for automatic unsupervised prosody labeling. In Proceedings of InterSpeech ICSLP, Pittsburgh, PA, September
2006. URL: http://sail.usc.edu/publications/ananthak_icslp06.pdf.
- Emil Ettelaie, Panayiotis Georgiou, and Shrikanth Narayanan. Cross-lingual
dialog model for speech to speech translation. In Proceedings of InterSpeech
ICSLP, Pittsburgh, PA, September 2006. URL: http://sail.usc.edu/publications/emil-icslp2006.pdf.
- Abhinav Sethy, Panayiotis Georgiou, and Shrikanth Narayanan. Text data
acquisition for domain-specific language models. In Proceedings of EMNLP, Sydney, Australia, July 2006.
URL: http://sail.usc.edu/publications/abhinav_emnlp06.pdf.
- Panayiotis Georgiou, Abhinav Sethy, JongHo Shin, and Shrikanth Narayanan. An
english-persian automatic speech translator: Recent developments in domain
portability and user modeling. In Proceedings of International Conference on Intelligent Systems And
Computing: Theory And Applications, Ayia Napa,
Cyprus, July 2006. URL: http://sail.usc.edu/publications/georgiou_isyc2006_s2s.pdf.
- Abhinav Sethy, Panayiotis Georgiou, and Shrikanth Narayanan. Selecting relevant
text subsets from webdata for building topic specific language models. In Proceedings of HLT, New York City, New York, June 2006.
URL: http://sail.usc.edu/publications/abhinav_hlt06.ps.
- Shrikanth Narayanan, Panayiotis Georgiou, Abhinav Sethy, Dagen Wang, Shankar
Ananthakrishnan, Emil Ettelaie, Horacio Franco, Kristin Precoda, Dimitra
Vergyri, Jing Zheng, Wen Wang, Ramana Rao Gadde, Martin Graciarena, Victor
Abrash, and Colleen Richey. Speech recognition engineering issues in speech to
speech translation system design for low resource languages and domains. In
Proceedings of ICASSP, Toulouse, France, May 2006.
URL: http://sail.usc.edu/publications/shri_icassp06.pdf.
- Shankar Ananthakrishnan, Srinivas Bangalore, and Shrikanth Narayanan. Automatic
diacritization of arabic transcripts for automatic speech recognition. In Proceedings of International Conference On Natural Language Processing, Kanpur, India, December 2005.
URL: http://sail.usc.edu/publications/shankar_ICON-2005-Arabic-Diacritization-Final.pdf.
- Abhinav Sethy, Panayiotis Georgiou, and Shrikanth Narayanan. Building topic specific language models from webdata using competitive models. In Proc. of EUROSPEECH, Interspeech, Lisbon, Portugal, 2005.
- Dagen Wang and Shrikanth Narayanan. Piecewise linear stylization of pitch via wavelet analysis. In Proc. of EUROSPEECH, Interspeech, Lisbon, Portugal, 2005.
- Dagen Wang and Shrikanth Narayanan. Speech rate estimation via temporal correlation and selected sub-band correlation. In Proc. ICASSP, Philadelphia, PA, March 2005.
- Dagen Wang and Shrikanth Narayanan. An unsupervised quantitative measure for word prominence in spontaneous speech. In Proc. ICASSP, Philadelphia, PA, March 2005.
- Shankar Ananthakrishnan and Shrikanth Narayanan. An automatic prosody recognizer using a coupled multi-stream acoustic model and a syntactic-prosodic language model. In Proc. ICASSP, Philadelphia, PA, March 2005.
- Joseph Tepperman and Shrikanth Narayanan. Automatic syllable stress detection using prosodic features for pronunciation evaluation of language learners. In Proc. ICASSP, Philadelphia, PA, March 2005.
- Emil Ettelaie, Sudeep Gandhe, Panayiotis Georgiou, Scott Millward, Kevin Knight, Daniel Marcu, Shrikanth Narayanan, David Traum, Robert Belvin, and Howard E. Neely. Transonics: A practical speech-to-speech translator for english-farsi medical dialogues. In Proc. ACL, Ann Arbor, MI, 2005.
- Abhinav Sethy and S. Narayanan. Measuring convergence in language model estimation using relative entropy. In Proceedings of ICSLP, Jeju, Korea, October 2004.
- Panayiotis G. Georgiou, Hooman Shirani Mehr, and Shrikanth S. Narayanan. Context dependent statistical augmentation of persian transcripts. In Proceedings of ICSLP, Jeju, Korea, October 2004.
- Robert Belvin, Win May, Shrikanth Narayanan, Panayiotis Georgiou, and Shadi Ganjavi. Creation of a doctor-patient dialogue corpus using standardized patients. In Proc. LREC, Lisbon, Portugal, 2004.
- S. Narayanan, S. Ananthakrishnan, R. Belvin, E. Ettaile, S. Gandhe, S. Ganjavi, P. G. Georgiou, C. M. Hein, S. Kadambe, K. Knight, D. Marcu, H. E. Neely, N. Srinivasamurthy, D. Traum, and D. Wang. The transonics spoken dialogue translator: An aid for english-persian doctor-patient interviews. In AAAI Fall Symposium, 2004.
- Dagen Wang and Shrikanth Narayanan. A multi-pass linear fold algorithm for sentence boundary detection using prosodic cues. In Proc. ICASSP, Montreal, Canada, May 2004.
- Shadi Ganjavi, Panayiotis Georgiou, and Shrikanth Narayanan. Ascii based transcription systems with the arabic script: The case of persian. In Proc. IEEE ASRU, St. Thomas, U.S. Virgin Islands, December 2003.
- S. Narayanan, S. Ananthakrishnan, R. Belvin, E. Ettaile, S. Ganjavi, P. Georgiou, C. Hein, S. Kadambe, K. Knight, D. Marcu, H. Neely, N. Srinivasamurthy, D. Traum, and D. Wang. Transonics: A speech to speech system for english-persian interactions. In Proc. IEEE ASRU, St. Thomas, U.S. Virgin Islands, December 2003.
- Naveen Srinivasamurthy and Shrikanth Narayanan. Language-adaptive persian speech recognition. In Proc. Eurospeech, Geneva, Switzerland, 2003.
Screenshots
DEMOS
Previous version of SpeechLinks system
- Transonics 2.0 demo is available in two versions: 4.82MB and 35.8MB.