Welcome to EE 619: Advanced Topics in Automatic Speech Recognition.
This course will help you learn how modern automatic speech recognition
(ASR) systems are built and how they work. The emphasis will be on
statistical methods and modeling techniques. You will learn about
Hidden Markov Models as generative models for speech (including HMM
training, evaluation, and decoding algorithms), acoustic modeling using
HMMs, front end processing for robustness, statistical language models,
and dialogue modeling. Finally, you will see how these techniques can
be brought together to construct complex, useful applications, such as
speech translation systems, multimodal information processing, and even
speech synthesis. Course details follow.
Course Goals: Theory, Practice, and Research in ASR and speech processing
Meeting Times: Friday 12:00-2:50pm
Location: TBD (to be determined)
Instructor: Shri Narayanan
Office hours: Wednesday 9:00-11:00am
Location: EEB 430
E-mail: shri@sipi.usc.edu
Teaching Assistant: Jangwon Kim
Office hours: Thursday 3-5pm
Location: EEB 413
E-mail: jangwon@usc.edu
Pre-requisites
- Probability (EE 464) and Speech Processing (EE 519).
- Knowledge of text-processing tools such as sed, and either Perl or Python, will help a lot.
- Familiarity with the Linux/UNIX environment will be a definite plus.
Syllabus
0. Engineering spoken language systems - overview and problems
- Review of probability, estimation, and information theory
1. Overview of speech recognition problem
- Front-end signal processing, acoustic and language modeling, and search
- Background material: probability, statistics, information theory, pattern recognition
2. Hidden Markov Models for speech recognition
- Search for the most likely state sequence
- Parameter estimation for HMMs.
3. Acoustic modeling
- Context-dependent and context-independent models
- Pronunciation modeling
4. Implementing a continuous speech recognizer
- The speech decoder: beam search, review of the state of the art in performance
- Search techniques
5. Robust speech recognition
- Front-end signal processing methods: cepstral subtraction, perceptually motivated approaches
- Parallel model combination techniques
- Vocal-tract normalization
- Supervised and unsupervised adaptation (LDA, MLLR, etc.)
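One of the simplest front-end robustness ideas touched on above, subtracting the long-term cepstral mean, can be sketched in a few NumPy lines. The frame matrix shape and the example values are assumptions for illustration.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Remove the per-utterance mean of each cepstral coefficient.

    A stationary convolutional channel adds a constant offset in the
    cepstral domain, so subtracting the long-term mean suppresses it.
    cepstra: array of shape (num_frames, num_coeffs).
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# Hypothetical example: 4 frames of 3 cepstral coefficients, with a
# constant channel offset added to every frame.
clean = np.array([[ 1.0,  0.0, -1.0],
                  [ 0.5,  0.5,  0.0],
                  [-0.5, -0.5,  1.0],
                  [-1.0,  0.0,  0.0]])
channel = np.array([2.0, -1.0, 0.5])  # constant per-coefficient offset
noisy = clean + channel

# After mean subtraction the channel offset is gone: clean and noisy
# utterances map to the same normalized features.
print(np.allclose(cepstral_mean_subtraction(noisy),
                  cepstral_mean_subtraction(clean)))  # True
```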
6. Language modeling
- Languages and grammars
- Stochastic language models (N-grams)
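The N-gram models in this unit reduce, in the simplest maximum-likelihood case, to counting and dividing. A minimal bigram sketch (the toy corpus is hypothetical, and real models add smoothing for unseen events):

```python
from collections import Counter

def bigram_mle(corpus):
    """Maximum-likelihood bigram probabilities P(w2 | w1), with
    sentence-boundary markers. corpus: list of token lists."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks[:-1])           # count bigram histories
        bigrams.update(zip(toks[:-1], toks[1:]))
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

# Tiny hypothetical command corpus.
corpus = [["call", "home"], ["call", "mom"], ["call", "home", "now"]]
p = bigram_mle(corpus)
print(p[("<s>", "call")])   # 1.0  (every sentence starts with "call")
print(p[("call", "home")])  # 2/3
```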
7. Natural language understanding
- Parsing and robust parsing
- Semantic representation for machine understanding
8. Spoken dialog systems
- Design principles and dialog system types
- Architecture issues
- Mathematical modeling of spoken dialog systems
- The Markov decision process model
- Evaluation and testing of spoken dialog systems
9. Speaker recognition
10. Additional application topics:
- Audio-visual speech/speaker recognition
- Distributed speech recognition
- Speech-to-speech translation
- Speech recognition techniques (especially HMMs) for speech synthesis
- Music transcription
Course Materials
a) Reference texts:
* Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, X. Huang, A. Acero, and H.-W. Hon, Prentice Hall, 2001
* The HTK Book, Steve Young et al., Cambridge University. Download from http://htk.eng.cam.ac.uk/docs/docs.html [you need to register first]
* Fundamentals of Speech Recognition, Rabiner and Juang, Prentice Hall, 1993
* Statistical Methods for Speech Recognition, F. Jelinek, MIT Press, 1997
* Speech and Language Processing, Jurafsky and Martin, Prentice Hall, 2000
* Automatic Speech and Speaker Recognition: Advanced Topics, Chin-Hui Lee, Frank Soong, and K. Paliwal (eds.), Kluwer, 1996
b) Other References
Conference Proceedings
* ICASSP: International Conference on Acoustics, Speech, and Signal Processing
* Interspeech: merger of the International Conference on Spoken Language Processing (ICSLP) and Eurospeech (each formerly biennial)
* ASRU: Automatic Speech Recognition and Understanding Workshop
* Workshops of ESCA (European Speech Communication Association) and DARPA
Journals
* IEEE Transactions on Speech and Audio Processing
* IEEE Transactions on Pattern Analysis and Machine Intelligence
* Speech Communication
* IEEE Transactions on Multimedia
* Computer Speech and Language
* Computational Linguistics
Open Source Software
* HTK (HMM Toolkit), http://htk.eng.cam.ac.uk/, Cambridge University
* Sphinx, http://www.speech.cs.cmu.edu/speech/sphinx, CMU
* Kaldi, http://kaldi.sourceforge.net/about.html
Course Outline
The course will be organized in a seminar style, and some sessions may
be hands-on lab work. There will be no homework, but one
mini-project-style assignment will be given to familiarize you with
building a speech recognizer using the HTK toolkit. In addition to this
assignment, you are expected to work on a term project using the Kaldi
speech recognition toolkit. The project is expected to be fairly
comprehensive; a list of possible ideas will be handed out within the
first few weeks. The project requires a midterm progress presentation
as well as a final report and presentation. Dates to note:
Midterm presentation: TBD
Final presentation: TBD
There will be no other exams.
Grading Policy: Class assignment 25%, course project 75%
Last updated: 16 January 2013