EE519
Digital Speech Processing for Multimedia: Coding, Recognition and Synthesis
Fall 2004



EE519 : Intro Lecture (week 1.1)

Digital Speech Processing for Multimedia
 - Recognition, Coding, Synthesis

Prof. Narayanan (shri@sipi.usc.edu)

1. Refer to the handout for course-related information, approximate timeline for topics, etc.

2.  Why Speech?

Most natural and efficient means for communication
  -- to express intent, ideas, desires

Some characteristics of the speech signal:
    - produced by humans (a "biological" signal arising out of a physical process, with "meaning" encapsulated)
    - inter (i.e., between) and intra (i.e., within) speaker variability; multilingual

3. Speech Processing?

   Basic Aim: How to quantify and represent information in the speech signal?
   -- applications to synthesis, coding, recognition, speech enhancement, aids for the handicapped, etc.

    Refer to Figure 1 for an overview of the speech production and perception process.

     Speech processing is one of the main application areas of signal processing
     -- course aims to provide techniques and tools for R&D in this area.

4. The field of speech research is multidisciplinary
    - brings together signal processing with systems, communications, computer science (e.g., AI, computational linguistics), psychology, linguistics, acoustics, biomedical engineering, ...
 

5. Specific application area of speech technology:
     Conversational Human-Machine interfaces

     Why needed?
     - To provide ubiquitous, easy-to-use, low-cost communication/computing for everyone, everywhere: speech is a natural modality
       + eyes free, hands free, provides random/direct access to information (no menu navigation), no typing involved

     Demo: Listen to a conversation between a human and a machine during a travel task application (Courtesy, AT&T DARPA-Communicator, circa July 2000): shows the use of synthesis, recognition, machine understanding of speech, and intelligent management of interaction

     Compare speech with other modalities (anecdotal figures!):
      INPUT
          speech: 50-250 wpm (words per minute)
          typing: 20-90 wpm
          pointing (mouse,touch,stylus): 10-40 wpm
          writing: 25 wpm
      OUTPUT:
        speech/audio: sequential, non-persistent, medium-low bandwidth
        text (w/ highlights): low bandwidth, persistent
        graphics: high bandwidth (can be persistent)

    What does it take to build a conversational system?
        (not necessarily all of them, depending on the
         complexity of your application)

        automatic speech recognition (ASR)
        text-to-speech synthesis (TTS)
        natural language understanding (NLU)
        natural language generation (concept to text/speech)
        speaker verification/recognition
        dialog/discourse management
        information retrieval

     Some current application areas:
          * Information retrieval and negotiation
          - airline/train information/reservations
          - banking, stock/weather/news
          - news on demand
          - web browsing: voice (web) portals

          * Command and control (desktop, PDAs)
          * voice dialing, directory assistance, yellow/white pages
          * learning/educational aids, games

      Problems and Issues?
        * Expectations are high: people are quite good at
          spoken communication, and expect the same from machines
          - even a three-year-old performs continuous-speech, large-vocabulary,
            speaker-independent recognition in noisy environments
            pretty well.
        * The underlying ASR and NLU technology is error-prone (data-driven)
          - not robust, especially in challenging situations
            (e.g., noisy acoustic environments, out-of-domain speech, etc.)
          i.e., the input is highly variable -- various sources of signal degradation:
           + environment: noisy, multi-speaker (airports, streets, office,..)
           + speaker variability
           + channel: telephone (wired and wireless), IP, satellite
           + microphone variability
        * Technology is still immature: the field is still evolving
        * Developing applications quickly: portability is still a major
          issue
        * Testing and evaluating speech systems is an open research topic
 

6. A brief history of speech technology
        * Why speech in the first place? One view (many abound), from
        Sir Richard Paget (1930) in Human Speech:
       "What drove man to the invention of speech was, as I imagine, not so much the need of expressing his thoughts (for that might have been done quite satisfactorily by bodily gesture) as the difficulty of talking with his hands full"

        * Studies in speech production (how speech is produced by humans) are centuries old.

        * Much progress in speech technology has come with the advent of electrical communication, fueled by knowledge drawn from multiple disciplines.

        * "Broad" milestones:

        o TALKING MACHINES

        + Talking statues (oracles) of ancient Rome and Greece [Fig. 2]
        + Kratzenstein (1779) synthesis of vowel sounds  using a set of acoustic resonators excited by a vibrating reed.  [Fig 3]
        + von Kempelen (1791): a complex mechanical system of reeds and resonators. It had little credibility because a human had been found operating his chess-playing automaton.
          Sir Charles Wheatstone reproduced it later, around 1837 [Fig 4]

        + Dudley of Bell Labs (1939) demonstrated the first electrical synthesizer.

        o TRANSMISSION OF MESSAGES

          Acoustic waves propagate spherically, not good for long distances

         + electrical telegraph, dots-dashes (Morse code): mid 19th century
         + telephony: breakthrough - transmission of voice over electrical wires (1876). Not suitable for long distances (amplifiers not around yet)
         + Transcontinental telephony using electromechanical repeaters (1915)
         + Transatlantic radio telephony (1927). Not enough bandwidth.
           Fueled research in coding (Vocoders)
         + Transatlantic telephone over cable (1956)
           (use of submersible amplifiers)
         + Present day: voice over wireless, over IP (both wired and wireless)
           Problems: quality degradation, latency

         o SIGNAL PROCESSING REVOLUTION (DSP)

         + Interest in Spoken human-machine interactions
           - ASR and TTS research

         o PRESENT-FUTURE

           Better and more reliable ASR, TTS, natural language understanding, translation of spontaneous spoken language
           Multimodal communication technology
           Better quality coders with really low bit rates (especially for mobile, cellular and personal communication systems)
 

 7. A brief review of the state of the art

          o Coding: compressing information in the speech (audio) signal for transmission and storage

          * International standardization helped immensely in furthering coding development  [Figure  5a, 5b]

          * Lower bit rates result in degradation in the perceived quality of coded speech [Figure 6]. There is a quality-degradation knee around 8 kbps.

          * Perceptually motivated techniques -- based on models of human hearing -- have proven very useful
            e.g., PAC - perceptual audio coding
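
To put the 8 kbps knee in perspective, the short sketch below computes the bit rate of uncompressed PCM speech and the compression ratio a coder needs to reach a target rate. The 8 kHz sampling rate and 8 bits per sample are assumptions (typical of digital telephony, as in ITU-T G.711), not figures from these notes, and the function names are illustrative.

```python
def pcm_bit_rate_kbps(sample_rate_hz: int, bits_per_sample: int) -> float:
    """Bit rate of uncompressed single-channel PCM, in kbps."""
    return sample_rate_hz * bits_per_sample / 1000.0


def compression_ratio(source_kbps: float, target_kbps: float) -> float:
    """How many times smaller the coded stream is than the source stream."""
    if target_kbps <= 0:
        raise ValueError("target bit rate must be positive")
    return source_kbps / target_kbps


# Assumed telephone-band speech: 8000 samples/s, 8 bits/sample.
telephone_kbps = pcm_bit_rate_kbps(8000, 8)          # 64.0 kbps
ratio_at_knee = compression_ratio(telephone_kbps, 8)  # 8.0

if __name__ == "__main__":
    print(f"PCM telephone speech: {telephone_kbps} kbps")
    print(f"Compression needed to reach 8 kbps: {ratio_at_knee}x")
```

Under these assumptions, operating at the knee means representing the signal with one eighth of the bits used by plain PCM.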
 

          o  Synthesis: Generate machine response in conversational human-machine interaction

             + Speech production  model based [Figure 7]
                Tries to imitate physical process of human speech production.
                Deriving the rules for the control parameters is tedious.
                Language dependent.
                Poor voice quality -- attributed to approximations in source-filter model
                But provides control over synthesis in  features such as prosody

             + Concatenative Synthesis (data-driven)
                excellent quality for limited domains
                very successful recently
                prosodic control problematic (open research topic)

                e.g., Edinburgh Festival synthesis,
                      AT&T Next Generation synthesis
 

           o Recognition: map acoustic speech signal to phones (or words)
               [Figure 8]

              Word Error Rate (WER) provides a measure of ASR performance.
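
A minimal sketch of how WER is commonly computed: the word-level edit (Levenshtein) distance between the reference transcript and the recognizer output, divided by the number of reference words. The function name and the example sentences are illustrative assumptions, not taken from the course material.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution in five reference words.
    print(word_error_rate("show me flights to boston",
                          "show me flight to boston"))
```

Because insertions count as errors, WER computed this way can exceed 1.0 when the recognizer outputs many extra words.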

              The probability of recognizing a whole sentence (i.e., "word string")
              correctly decreases exponentially as the string length increases,
              and drops further as WER increases.
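
The exponential decay can be sketched under a simplifying assumption that is not stated in the notes: word errors are independent, and WER is treated as the per-word error probability. The probability that an N-word string is entirely correct is then roughly (1 - WER)^N. The function name is illustrative.

```python
def sentence_accuracy(wer: float, num_words: int) -> float:
    """Approximate probability that all num_words words are recognized correctly.

    Assumes independent word errors with per-word error probability `wer`.
    """
    if not 0.0 <= wer <= 1.0:
        raise ValueError("wer must lie in [0, 1] to be read as a probability")
    if num_words < 0:
        raise ValueError("num_words must be non-negative")
    return (1.0 - wer) ** num_words


if __name__ == "__main__":
    # Accuracy falls off quickly with string length, even at a modest WER.
    for n in (1, 5, 10, 20):
        print(n, round(sentence_accuracy(0.05, n), 3))
```

With a 5% WER, for example, a 10-word string comes out fully correct only about 60% of the time under this model.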

              Error rates for some benchmark tasks are shown in
              Figures 9, 10, and 11.

              Alleviation (used presently):
               Semantic processing (extracting meaning)
               Dialog for error control and recovery
               Multimodal communication