Introduction
US federal agencies are seeking new technologies to support multinational, multilingual operations.
One significant challenge in such operations is cross-language communication between speakers of
different native languages. DARPA and other DoD research groups envision enabling cross-language
communication via computer-mediated translation of speech between the native languages of the speakers.
The University of Southern California will develop SpeechLinks prototypes that enable human-to-human
communication via automated two-way speech-to-speech translation between English-Farsi and English-Dari.
SpeechLinks Solution
The SpeechLinks translation device operates by recognizing the user's speech input and converting it into text. The text is then
translated and re-synthesized as speech in the target language. A
more detailed description of the system's functionality follows:
The Automatic Speech Recognition (ASR) subsystem produces n-best
lists along with decoding confidence scores. The ASR operates in
real time in two languages: English and Persian (Farsi). The English
ASR operates on a vocabulary of over 22,000 words and the Persian ASR
on over 9,000 words. The ASR uses models of the acoustic patterns of human
speech, trained on example recordings, and statistical knowledge about
the structure of the language, trained on example transcripts.
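To make the ASR output concrete, here is a minimal sketch of what an n-best list with decoding confidence scores could look like. The data structure and function names are illustrative assumptions, not the actual SpeechLinks API.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str          # recognized word sequence
    confidence: float  # decoding confidence score, assumed in [0, 1]

def n_best(hypotheses, n=5):
    """Return the top-n hypotheses, sorted by decoding confidence."""
    return sorted(hypotheses, key=lambda h: h.confidence, reverse=True)[:n]

# Toy example: three competing hypotheses for one utterance
hyps = [
    Hypothesis("do you have a headache", 0.91),
    Hypothesis("do you have any headaches", 0.87),
    Hypothesis("do you heavy headache", 0.34),
]
best = n_best(hyps, n=2)
```

Downstream components (such as the Dialog Manager) can then inspect both the word sequences and their confidence scores.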
The Dialog Manager (DM) acts as the heart of the system, routing messages accordingly. Every output of the speech recognizer, tagged
with a time stamp, serial number, and other important information, is
received by the DM and usually redirected to the
Graphical User Interface for display and to the Machine
Translation unit for translation. However, the DM can also reject
recognition results based on several factors, such as the confidence
reported by the ASR.
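The DM's routing and rejection logic can be sketched as follows. The message fields, component interfaces, and threshold value are assumptions for illustration; the real system's tagging scheme and rejection criteria may differ.

```python
REJECT_THRESHOLD = 0.5  # assumed value; the actual system may tune this

def route(message, gui, mt):
    """Reject a low-confidence recognition result, or forward it
    to the GUI (for display) and the MT unit (for translation)."""
    if message["confidence"] < REJECT_THRESHOLD:
        return "rejected"
    gui.append(message)  # display to the user
    mt.append(message)   # queue for translation
    return "forwarded"

# Toy example: one tagged recognizer output passing through the DM
gui, mt = [], []
msg = {"serial": 42, "timestamp": 12.7,
       "text": "do you have a headache", "confidence": 0.91}
status = route(msg, gui, mt)
```

Centralizing this decision in the DM keeps the GUI and MT unit free of confidence handling.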
The Machine Translation (MT) unit operates in two modes. The
Classifier attempts to assign a concept to an utterance; for
example, a sentence such as "Umm and do you have any headaches" would
become the concept "Do you have a headache". The classifier
performs very accurately on in-domain utterances, that is, on
the sentences the system was exhaustively trained
on (about 1,200 utterances). Out-of-domain
utterances are instead translated statistically by the
Statistical Machine Translation (SMT) unit. The SMT breaks the
sentence into segments, translates it piece by piece, and reconstructs
the result in the target language.
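The two-mode design above can be sketched as a classifier lookup with an SMT fallback. The concept table and the word-level translation table below are toy stand-ins for the trained models, and the Persian glosses are illustrative only.

```python
# Toy concept table: maps in-domain paraphrases to a canonical concept
CONCEPTS = {
    "umm and do you have any headaches": "Do you have a headache",
}

# Toy word-level table standing in for the SMT's learned phrase models
WORD_TABLE = {"where": "koja", "is": "ast", "the": "", "pain": "dard"}

def translate(utterance):
    """Try the concept classifier first; fall back to piece-by-piece
    statistical translation for out-of-domain input."""
    concept = CONCEPTS.get(utterance.lower())
    if concept is not None:
        return concept, "classifier"
    # out of domain: translate segment by segment and reconstruct
    pieces = [WORD_TABLE.get(w, w) for w in utterance.lower().split()]
    return " ".join(p for p in pieces if p), "smt"
```

Returning which mode produced the output matters downstream: as described below, the TTS chooses its synthesis strategy based on it.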
Finally, a unit-selection-based Text-To-Speech synthesizer (TTS)
provides the spoken output. The TTS operates in several modes: for
utterances resolved by the classifier, a natural-sounding human
recording of the utterance is played out. If the utterance was
translated statistically, the synthesis unit drops from
the utterance level to the word level, and a concatenation of word
recordings is played out. Finally, for words with no human recordings,
synthesis of the individual words, and of the sentence, takes place at
the diphone level.
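This three-level back-off can be sketched as follows. The recording inventories are hypothetical placeholders for the system's actual audio databases.

```python
# Hypothetical inventories of available human recordings
UTTERANCE_RECORDINGS = {"do you have a headache"}
WORD_RECORDINGS = {"do", "you", "have", "a", "headache", "where", "is"}

def tts_plan(utterance):
    """Return (unit, level) pairs: play a whole-utterance recording if
    one exists; otherwise concatenate word recordings, dropping to
    diphone synthesis for words with no recording."""
    if utterance in UTTERANCE_RECORDINGS:
        return [(utterance, "utterance")]
    plan = []
    for word in utterance.split():
        level = "word" if word in WORD_RECORDINGS else "diphone"
        plan.append((word, level))
    return plan
```

The back-off trades naturalness for coverage: whole-utterance recordings sound best, while diphone synthesis guarantees that any word can be spoken.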
PEOPLE
USC Faculty
USC Postdocs and Students
- Murtaza Bulut, Ph.D. student
- Emil Ettelaie , Ph.D. student
- Kyu Jeong Han, Ph.D. student
- Ozlem Kalinli , Ph.D. student
- Tom Murray, Ph.D. student
- JongHo Shin, Ph.D. student
- Shiva Sundaram , Ph.D. student
- Andreas Tsiartas, Ph.D. student
- Pezhman Zarifian, Undergraduate student
- Shokufeh Farazmand, Undergraduate student
PAPERS
- Jong Ho Shin, Panayiotis Georgiou, and Shrikanth Narayanan. Analyzing the
multimodal behaviors of users of a speech-to-speech translation device by using
concept matching scores. In Proceedings of IEEE Multimedia Signal Processing Workshop, Chania, Greece,
October 2007.
- Shiva Sundaram and Shrikanth Narayanan. Experiments in automatic genre
classification of full-length music tracks using audio activity rate. In Proceedings of IEEE Multimedia Signal Processing Workshop,
Chania, Greece, October 2007.
- Ozlem Kalinli and Shrikanth Narayanan. Early auditory processing inspired
features for robust automatic speech recognition. In Proceedings of EUSIPCO, Poznan, Poland, September 2007.
URL: http://sail.usc.edu/publications/ozlem_eusipco07.pdf.
- Murtaza Bulut, Sungbok Lee, and Shrikanth Narayanan. Analysis of emotional
speech prosody in terms of part of speech tags. In Proceedings of InterSpeech ICSLP, Antwerp, Belgium, August
2007. URL: http://sail.usc.edu/publications/murtaza_icslp07.pdf.
- Kyu Jeong Han and Shrikanth Narayanan. A robust stopping criterion for
agglomerative hierarchical clustering in a speaker diarization system. In Proceedings of InterSpeech ICSLP, Antwerp, Belgium, August
2007. URL: http://sail.usc.edu/publications/kyuhan_icslp07.pdf.
- Ozlem Kalinli and Shrikanth Narayanan. A saliency-based auditory attention model
with applications to unsupervised prominent syllable detection in speech. In
Proceedings of InterSpeech ICSLP, Antwerp, Belgium,
August 2007. URL: http://sail.usc.edu/publications/ozlem_icslp07.pdf.
- Sankaranarayanan Ananthakrishnan and Shrikanth Narayanan. Prosody-enriched
lattices for improved syllable recognition. In Proceedings of InterSpeech
ICSLP, Antwerp, Belgium, August 2007.
- Shiva Sundaram and Shrikanth Narayanan. Analysis of audio clustering using word
descriptions. In Proceedings of ICASSP, Honolulu,
Hawaii, April 2007. URL: http://sail.usc.edu/publications/shiva_cluster_icassp07.pdf.
- Sankaranarayanan Ananthakrishnan and Shrikanth Narayanan. Improved speech
recognition using acoustic and lexical correlates of pitch accent in a n-best
rescoring framework. In Proceedings of ICASSP,
Honolulu, Hawaii, April 2007. URL: http://sail.usc.edu/publications/ananthakrishnan_icassp07.pdf.
- Abhinav Sethy, Shrikanth Narayanan, and Bhuvana Ramabhadran. Data driven
approach for language model adaptation using stepwise relative entropy
minimization. In Proceedings of ICASSP, Honolulu,
Hawaii, April 2007. URL: http://sail.usc.edu/publications/sethy_icassp07.pdf.
- Murtaza Bulut, Sungbok Lee, and Shrikanth Narayanan. A statistical approach for
modeling prosody features using pos tags for emotional speech synthesis. In
Proceedings of ICASSP, Honolulu, Hawaii, April 2007.
URL: http://sail.usc.edu/publications/bulut_icassp07.pdf.
- Shiva Sundaram and Shrikanth Narayanan. Discriminating two types of noise
sources using cortical representation and dimension reduction technique. In
Proceedings of ICASSP, Honolulu, Hawaii, April 2007.
URL: http://sail.usc.edu/publications/shiva_noise_icassp07.pdf.
- JongHo Shin, Panayiotis Georgiou, and Shrikanth Narayanan. User modeling in a
speech translation driven mediated interaction setting. In Proceedings of First International Workshop on
Human-Centered Multimedia (ACM Multimedia),
Santa Barbara, CA, October 2006. URL: http://sail.usc.edu/publications/JongHo_acm06_hcm.pdf.
- Shiva Sundaram and Shrikanth Narayanan. An attribute-based approach to audio
description applied to segmenting vocal sections in popular music songs. In
Proceedings of MMSP, Victoria, Canada, October 2006.
URL: http://sail.usc.edu/publications/sundaram_MMSP06.pdf.
- Shiva Sundaram and Shrikanth Narayanan. Vector-based representation and
clustering of audio using onomatopoeia words. In Proceedings
of AAAI 2006 Fall Symposium, Aurally Informed
Performance: Integrating Machine Listening and Auditory Presentation in Robotic Systems, Arlington, VA,
October 2006. URL: http://sail.usc.edu/publications/sundaram_AAAI06.pdf.
- Shankar Ananthakrishnan and Shrikanth Narayanan. Combining acoustic, lexical,
and syntactic evidence for automatic unsupervised prosody labeling. In Proceedings of InterSpeech ICSLP, Pittsburgh, PA, September
2006. URL: http://sail.usc.edu/publications/ananthak_icslp06.pdf.
- Emil Ettelaie, Panayiotis Georgiou, and Shrikanth Narayanan. Cross-lingual
dialog model for speech to speech translation. In Proceedings of InterSpeech
ICSLP, Pittsburgh, PA, September 2006. URL: http://sail.usc.edu/publications/emil-icslp2006.pdf.
- Abhinav Sethy, Panayiotis Georgiou, and Shrikanth Narayanan. Text data
acquisition for domain-specific language models. In Proceedings of EMNLP, Sydney, Australia, July 2006.
URL: http://sail.usc.edu/publications/abhinav_emnlp06.pdf.
- Panayiotis Georgiou, Abhinav Sethy, JongHo Shin, and Shrikanth Narayanan. An
english-persian automatic speech translator: Recent developments in domain
portability and user modeling. In Proceedings of International Conference on Intelligent Systems And
Computing: Theory And Applications, Ayia Napa,
Cyprus, July 2006. URL: http://sail.usc.edu/publications/georgiou_isyc2006_s2s.pdf.
- Abhinav Sethy, Panayiotis Georgiou, and Shrikanth Narayanan. Selecting relevant
text subsets from webdata for building topic specific language models. In Proceedings of HLT, New York City, New York, June 2006.
URL: http://sail.usc.edu/publications/abhinav_hlt06.ps.
- Shrikanth Narayanan, Panayiotis Georgiou, Abhinav Sethy, Dagen Wang, Shankar
Ananthakrishnan, Emil Ettelaie, Horacio Franco, Kristin Precoda, Dimitra
Vergyri, Jing Zheng, Wen Wang, Ramana Rao Gadde, Martin Graciarena, Victor
Abrash, and Colleen Richey. Speech recognition engineering issues in speech to
speech translation system design for low resource languages and domains. In
Proceedings of ICASSP, Toulouse, France, May 2006.
URL: http://sail.usc.edu/publications/shri_icassp06.pdf.
- Shankar Ananthakrishnan, Srinivas Bangalore, and Shrikanth Narayanan. Automatic
diacritization of arabic transcripts for automatic speech recognition. In Proceedings of International Conference On Natural Language Processing, Kanpur, India, December 2005.
URL: http://sail.usc.edu/publications/shankar_ICON-2005-Arabic-Diacritization-Final.pdf.
- Abhinav Sethy, Panayiotis Georgiou, and Shrikanth Narayanan. Building topic specific language models from webdata using competitive models. In Proc. of EUROSPEECH, Interspeech, Lisbon, Portugal, 2005.
- Dagen Wang and Shrikanth Narayanan. Piecewise linear stylization of pitch via wavelet analysis. In Proc. of EUROSPEECH, Interspeech, Lisbon, Portugal, 2005.
- Dagen Wang and Shrikanth Narayanan. Speech rate estimation via temporal correlation and selected sub-band correlation. In Proc. ICASSP, Philadelphia, PA, March 2005.
- Dagen Wang and Shrikanth Narayanan. An unsupervised quantitative measure for word prominence in spontaneous speech. In Proc. ICASSP, Philadelphia, PA, March 2005.
- Shankar Ananthakrishnan and Shrikanth Narayanan. An automatic prosody recognizer using a coupled multi-stream acoustic model and a syntactic-prosodic language model. In Proc. ICASSP, Philadelphia, PA, March 2005.
- Joseph Tepperman and Shrikanth Narayanan. Automatic syllable stress detection using prosodic features for pronunciation evaluation of language learners. In Proc. ICASSP, Philadelphia, PA, March 2005.
- Emil Ettelaie, Sudeep Gandhe, Panayiotis Georgiou, Scott Millward, Kevin Knight, Daniel Marcu, Shrikanth Narayanan, David Traum, Robert Belvin, and Howard E. Neely. Transonics: A practical speech-to-speech translator for english-farsi medical dialogues. In Proc. ACL, Ann Arbor, MI, 2005.
- Abhinav Sethy and S. Narayanan. Measuring convergence in language model estimation using relative entropy. In Proceedings of ICSLP, Jeju, Korea, October 2004.
- Panayiotis G. Georgiou, Hooman Shirani Mehr, and Shrikanth S. Narayanan. Context dependent statistical augmentation of persian transcripts. In Proceedings of ICSLP, Jeju, Korea, October 2004.
- Robert Belvin, Win May, Shrikanth Narayanan, Panayiotis Georgiou, and Shadi Ganjavi. Creation of a doctor-patient dialogue corpus using standardized patients. In Proc. LREC, Lisbon, Portugal, 2004.
- S. Narayanan, S. Ananthakrishnan, R. Belvin, E. Ettaile, S. Gandhe, S. Ganjavi, P. G. Georgiou, C. M. Hein, S. Kadambe, K. Knight, D. Marcu, H. E. Neely, N. Srinivasamurthy, D. Traum, and D. Wang. The transonics spoken dialogue translator: An aid for english-persian doctor-patient interviews. In AAAI Fall Symposium, 2004.
- Dagen Wang and Shrikanth Narayanan. A multi-pass linear fold algorithm for sentence boundary detection using prosodic cues. In Proc. ICASSP, Montreal, Canada, May 2004.
- Shadi Ganjavi, Panayiotis Georgiou, and Shrikanth Narayanan. Ascii based transcription systems with the arabic script: The case of persian. In Proc. IEEE ASRU, St. Thomas, U.S. Virgin Islands, December 2003.
- S. Narayanan, S. Ananthakrishnan, R. Belvin, E. Ettaile, S. Ganjavi, P. Georgiou, C. Hein, S. Kadambe, K. Knight, D. Marcu, H. Neely, N. Srinivasamurthy, D. Traum, and D. Wang. Transonics: A speech to speech system for english-persian interactions. In Proc. IEEE ASRU, St. Thomas, U.S. Virgin Islands, December 2003.
- Naveen Srinivasamurthy and Shrikanth Narayanan. Language-adaptive persian speech recognition. In Proc. Eurospeech, Geneva, Switzerland, 2003.
Screenshots
DEMOS
Previous version of SpeechLinks system
- Transonics 2.0 demo is available in two versions: 4.82MB and 35.8MB.