
Digital Speech Processing for Multimedia
- Recognition, Coding, Synthesis
Prof. Narayanan (shri@sipi.usc.edu)
1. Refer to the handout for course-related information, approximate timeline for topics, etc.
2. Why Speech?
Most natural and efficient means for communication
-- to express intent, ideas, desires
Some characteristics of the speech signal:
- produced by humans ("biological" signal arising out
of a physical process, with "meaning" encapsulated)
- inter (i.e., between) and intra (i.e., within)
speaker variability; multilingual
3. Speech Processing?
Basic Aim: How to quantify and represent information in
the speech signal?
-- application to synthesis, coding, recognition, speech
enhancement, aids for the handicapped, etc.
Refer to Figure 1 for an overview of the speech production and perception process.
Speech processing is one of the main
application areas of signal processing
-- course aims to provide techniques and tools
for R&D in this area.
4. The field of speech research is multidisciplinary
- brings together signal processing with systems,
communications, computer science (e.g., AI, computational linguistics),
psychology, linguistics, acoustics, biomedical engineering,...
5. Specific application area of speech technology:
Conversational Human-Machine interfaces
Why needed?
- To provide ubiquitous, easy-to-use, low-cost communication/computing
for everyone, everywhere: speech is a natural modality
+ eyes free, hands free, provides random/direct access to information
(no menu navigation), no typing involved
Demo: Listen to a conversation between a human and a machine during a travel task application (Courtesy, AT&T DARPA-Communicator, circa July 2000): shows the use of synthesis, recognition, machine understanding of speech, and intelligent management of interaction
Compare speech with other modalities (anecdotal figures!):
INPUT:
speech: 50-250 wpm (words per minute)
typing: 20-90 wpm
pointing (mouse, touch, stylus): 10-40 wpm
writing: 25 wpm
OUTPUT:
speech/audio: sequential, non-persistent, medium-low bandwidth
text (w/ highlights): low bandwidth, persistent
graphics: high bandwidth (can be persistent)
What does it take to build a conversational system?
(not necessarily all of them, depending on the complexity of your application)
- automatic speech recognition (ASR)
- text-to-speech synthesis (TTS)
- natural language understanding (NLU)
- natural language generation (concept to text/speech)
- speaker verification/recognition
- dialog/discourse management
- information retrieval
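As a rough illustration of how these components might chain together in a single dialog turn, here is a toy sketch. Everything in it is hypothetical: the function names, the keyword-based "understanding", and the two-intent travel/weather domain are stand-ins chosen for illustration, not the design of any real system. The "audio" is already text, and the reply text would be handed to a TTS component in a full system.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    text: str    # output of the (stubbed) recognizer
    intent: str  # label produced by the toy NLU step
    reply: str   # system response text, ready to pass to a TTS engine


def recognize(audio: str) -> str:
    """Stand-in for ASR: the 'audio' is already a text string here."""
    return audio.lower().strip()


def understand(text: str) -> str:
    """Stand-in for NLU: keyword spotting over a tiny hypothetical domain."""
    if "flight" in text or "fly" in text:
        return "book_flight"
    if "weather" in text:
        return "get_weather"
    return "unknown"


def manage_dialog(intent: str) -> str:
    """Stand-in for dialog management: pick the next system action."""
    actions = {"book_flight": "ask_destination", "get_weather": "ask_city"}
    return actions.get(intent, "ask_rephrase")


def generate(action: str) -> str:
    """Stand-in for language generation: template lookup (concept to text)."""
    templates = {
        "ask_destination": "Where would you like to fly to?",
        "ask_city": "Which city do you want the weather for?",
        "ask_rephrase": "Sorry, could you rephrase that?",
    }
    return templates[action]


def run_turn(audio: str) -> Turn:
    """Run one user turn through ASR -> NLU -> dialog manager -> generation."""
    text = recognize(audio)
    intent = understand(text)
    reply = generate(manage_dialog(intent))
    return Turn(text=text, intent=intent, reply=reply)


if __name__ == "__main__":
    print(run_turn("I want to FLY somewhere next week"))
```

Note that an unrecognized request falls through to a "rephrase" prompt: even in this toy, the dialog manager is where error recovery lives.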
Some current application areas:
* Information retrieval and negotiation
- airline/train information/reservations
- banking, stock/weather/news
- news on demand
- web browsing: voice (web) portals
* Command and control (desktop, PDAs)
* voice dialing, directory assistance, yellow/white pages
* learning/educational aids, games
Problems and Issues?
* Expectations are high: people are quite good at spoken communication,
and expect the same from machines
- even a three-year-old does continuous-speech, large-vocabulary,
speaker-independent recognition in noisy environments pretty well.
* Underlying ASR, NLU technology is error prone (data driven)
- not robust, especially in challenging situations
(e.g., noisy acoustic environments, out-of-domain speech, etc.),
i.e., input is highly variable -- various sources of signal degradation:
+ environment: noisy, multi-speaker (airports, streets, office, ...)
+ speaker variability
+ channel: telephone (wired and wireless), IP, satellite
+ microphone variability
* Technology is still immature: it is a still-evolving field
* Quickly developing applications: portability is still a major issue
* Testing and evaluating speech systems is an open research topic
6. A brief history of speech technology
* Why speech in the first place? One view (many abound), from
Sir Richard Paget (1930) in Human Speech:
"What drove man to the invention of speech was, as I imagine, not so much
the need of expressing his thoughts (for that might have been done quite
satisfactorily by bodily gesture) as the difficulty of talking with his
hands full"
* Studies in speech production - how speech is produced by humans - are centuries old.
* Much progress in speech technology has been made since the advent of electrical communication, with knowledge drawn from multiple disciplines fueling this progress.
* "Broad" milestones:
o TALKING MACHINES
+ Talking statues (oracles) of ancient Rome and Greece [Fig. 2]
+ Kratzenstein (1779): synthesis of vowel sounds using a set of acoustic
resonators excited by a vibrating reed [Fig. 3]
+ von Kempelen (1791): a complex mechanical system of reeds and resonators.
It had little credibility because a human had been found operating his
chess-playing automaton.
Sir Charles Wheatstone reproduced it later, in the 1830s [Fig. 4]
+ Dudley of Bell Labs (1939) demonstrated the first electrical synthesizer.
o TRANSMISSION OF MESSAGES
Acoustic waves propagate spherically, not good for long distances
+ electrical telegraph, dots-dashes (Morse code): mid 19th century
+ telephony: breakthrough - transmission of voice over electrical wires (1876).
Not suitable for long distances (amplifiers not around yet)
+ Transcontinental telephony using electromechanical repeaters (1915)
+ Transatlantic radio telephony (1927). Not enough bandwidth.
Fueled research in coding (vocoders)
+ Transatlantic telephone over cable (1956) (use of submersible amplifiers)
+ Present day: voice over wireless, over IP (both wired and wireless)
Problems: quality degradation, latency
o SIGNAL PROCESSING REVOLUTION (DSP)
+ Interest in spoken human-machine interactions
- ASR and TTS research
o PRESENT-FUTURE
+ Better and more reliable ASR, TTS, natural language understanding, translation of spontaneous spoken language
+ Multimodal communication technology
+ Better quality coders with really low bit rates (especially for mobile, cellular and personal communication systems)
7. A brief review of the state of the art
o Coding: compressing information in the speech (audio) signal for transmission and storage
* International standardization helped immensely in furthering coding development [Figure 5a, 5b]
* Lower bit rates result in degradation in the perceived quality of coded speech [Figure 6]. There is a quality-degradation knee around 8 kbps.
* Perceptually motivated techniques -- based on models of human hearing -- have been found very useful,
e.g., PAC - perceptual audio coding
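To put the 8 kbps figure in perspective, the arithmetic below compares it with uncompressed telephone-band PCM. The reference format (8 kHz sampling, 8 bits per sample, i.e., 64 kbps) is an assumption made here for illustration and is not taken from the figures; the function names are likewise hypothetical.

```python
def pcm_bit_rate(sample_rate_hz: int, bits_per_sample: int) -> int:
    """Bit rate (bits/s) of uncompressed single-channel PCM."""
    return sample_rate_hz * bits_per_sample


def compression_ratio(reference_bps: float, coded_bps: float) -> float:
    """How many times smaller the coded stream is than the reference."""
    return reference_bps / coded_bps


def storage_bytes(bit_rate_bps: float, seconds: float) -> float:
    """Bytes needed to store `seconds` of speech at a given bit rate."""
    return bit_rate_bps * seconds / 8


if __name__ == "__main__":
    # Assumed reference: telephone-band PCM, 8 kHz x 8 bits = 64 kbps.
    reference = pcm_bit_rate(8000, 8)
    coded = 8000  # the ~8 kbps "knee" mentioned above
    print("reference bit rate (bps):", reference)
    print("compression ratio:", compression_ratio(reference, coded))
    print("bytes per minute at 8 kbps:", storage_bytes(coded, 60))
```

Under these assumptions, a coder operating at the knee delivers roughly an eightfold reduction relative to the PCM reference; going lower buys further savings only at a visible cost in perceived quality.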
o Synthesis: generate machine response in conversational human-machine interaction
+ Speech production model based [Figure 7]
- Tries to imitate the physical process of human speech production.
- Tedious to derive the rules that set the control parameters.
- Language dependent.
- Poor voice quality -- attributed to approximations in the source-filter model
- But provides control over synthesis features such as prosody
+ Concatenative synthesis (data-driven)
- excellent quality for limited domains
- very successful recently
- prosodic control problematic (open research topic)
- e.g., Edinburgh Festival synthesis, AT&T Next Generation synthesis
o Recognition: map acoustic speech signal to phones (or words) [Figure 8]
Word Error Rate (WER) provides a measure of ASR performance.
Sentence (i.e., "word string") accuracy decreases
exponentially as the string length increases, and falls as WER increases.
Some idea of the error rates for benchmark tasks is given in Figures 9, 10, 11.
Alleviation (used presently):
- By semantic processing (extracting meaning)
- Dialog for error control and recovery
- Multimodal communication
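The relationship between WER and sentence accuracy noted above can be made concrete with a small sketch. It computes WER as the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words, and then estimates sentence accuracy as (1 - WER)^N for an N-word string. The second formula assumes that word errors are independent and equally likely, which is a simplification introduced here for illustration; real recognizer errors are correlated. Function names are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words (standard Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def sentence_accuracy(wer: float, n_words: int) -> float:
    """Probability that all n_words are correct, ASSUMING independent errors."""
    if not 0.0 <= wer <= 1.0:
        raise ValueError("this estimate only makes sense for 0 <= WER <= 1")
    return (1.0 - wer) ** n_words


if __name__ == "__main__":
    wer = word_error_rate("show me flights to boston",
                          "show me flight to austin")
    print("WER:", wer)
    for n in (1, 5, 10, 20):
        print(n, "words at 10% WER ->", round(sentence_accuracy(0.10, n), 3))
```

Even a seemingly modest 10% WER leaves only about a one-in-three chance of getting a 10-word string entirely right under this model, which is why semantic processing and dialog-based recovery matter in practice.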