Audio-visual integration of speech

(see Massaro and Stork, 1998)

 

Visual and auditory information are integrated into a single amodal phonetic percept.

The syllable (gestural pattern) that is perceived is the one that is maximally consistent with all the available evidence.

 

(Massaro and Stork formalize this using a Bayesian decision rule; a sketch of such a rule follows the table below.)

Example:

Video: ga

Audio: ba

Hypothesis    Consistent with Video?    Consistent with Audio?
ba            NO                        YES
da            passably                  passably
ga            YES                       NO

The winning percept is da, the only hypothesis at least passably consistent with both channels; this is the classic McGurk fusion (audio ba + video ga is heard as da).
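A minimal Python sketch of such a Bayesian decision rule, applied to this example. The numeric likelihoods are invented stand-ins for YES, passably, and NO; they are not values from Massaro and Stork (1998), and the uniform prior and channel independence are simplifying assumptions.

    # Toy Bayesian fusion for the example above: video "ga", audio "ba".
    # Likelihoods are invented stand-ins: YES = 0.90, passably = 0.40, NO = 0.05.
    hypotheses = ["ba", "da", "ga"]
    p_video = {"ba": 0.05, "da": 0.40, "ga": 0.90}   # video shows "ga"
    p_audio = {"ba": 0.90, "da": 0.40, "ga": 0.05}   # audio is "ba"

    # With a uniform prior and conditionally independent channels, the
    # posterior over syllables is proportional to the product of likelihoods.
    scores = {h: p_video[h] * p_audio[h] for h in hypotheses}
    total = sum(scores.values())
    posterior = {h: scores[h] / total for h in hypotheses}

    for h in hypotheses:
        print(f"{h}: {posterior[h]:.2f}")                 # ba 0.18, da 0.64, ga 0.18
    print("percept:", max(posterior, key=posterior.get))  # -> da

The product rule penalizes any hypothesis that flatly contradicts either channel, so da wins even though it is the best match for neither channel alone.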

Audio and visual information are typically complementary

Some gestures are represented very clearly in the visual signal, but less clearly in the audio signal:

"place of articulation"
(constrictor used for oral closure gesture)

Lip closures are very distinct from tongue tip or tongue body closures in the visual signal (ba vs. da or ga).

Visual information can sometimes also distinguish tongue tip from tongue body closures (da vs. ga).

Acoustic differences in formant transitions between lip, tongue tip, and tongue body closures are very subtle.

Some gestures are very weakly represented in the visual signal (or not represented at all), but are well represented in the audio signal, e.g., voicing: ba and pa look identical on the lips but differ clearly in the acoustics.
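The same product rule illustrates this complementarity. In the sketch below (likelihoods again invented for illustration), video resolves place of articulation but not voicing, while audio resolves voicing but carries only subtle place cues; for a token of ba, neither channel alone is decisive, but their combination is.

    # Toy illustration of complementary channels for a token of "ba".
    # Video cannot separate "pa" from "ba" (both are lip closures) but rules
    # out the tongue tip closure of "da"; audio hears voicing clearly (ruling
    # out "pa") but place cues ("ba" vs. "da") are subtle.
    # All likelihood values are invented for illustration.
    hypotheses = ["pa", "ba", "da"]
    p_video = {"pa": 0.48, "ba": 0.48, "da": 0.04}
    p_audio = {"pa": 0.06, "ba": 0.50, "da": 0.44}

    def normalize(scores):
        total = sum(scores.values())
        return {h: s / total for h, s in scores.items()}

    fused = normalize({h: p_video[h] * p_audio[h] for h in hypotheses})
    print({h: round(p, 2) for h, p in fused.items()})   # ba ~ 0.84
    print("percept:", max(fused, key=fused.get))        # -> ba

Each channel alone leaves a near-tie (pa vs. ba for video, ba vs. da for audio); only the fused posterior identifies the syllable with confidence.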
