Audio-visual integration of speech

(see Massaro and Stork, 1998)

 

Visual and auditory information are integrated into a single amodal phonetic percept.

The syllable (gestural pattern) that is perceived is the one that is maximally consistent with all the available evidence.

 

(Massaro and Stork formalize this using a Bayesian decision rule; a sketch of such a rule follows the table below.)

Example:

Video: ga

Audio: ba

Hypothesis    Consistent with Video?    Consistent with Audio?
ba            NO                        YES
da            passably                  passably
ga            YES                       NO

The winning percept is da, the only hypothesis at least passably consistent with both channels; this is the classic McGurk fusion (audio ba + video ga is heard as da).
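A minimal Python sketch of such a Bayesian decision rule, applied to this example. The numeric likelihoods are invented stand-ins for YES, passably, and NO; they are not values from Massaro and Stork (1998), and the uniform prior and channel independence are simplifying assumptions.

    # Toy Bayesian fusion for the example above: video "ga", audio "ba".
    # Likelihoods are invented stand-ins: YES = 0.90, passably = 0.40, NO = 0.05.
    hypotheses = ["ba", "da", "ga"]
    p_video = {"ba": 0.05, "da": 0.40, "ga": 0.90}   # video shows "ga"
    p_audio = {"ba": 0.90, "da": 0.40, "ga": 0.05}   # audio is "ba"

    # With a uniform prior and conditionally independent channels, the
    # posterior over syllables is proportional to the product of likelihoods.
    scores = {h: p_video[h] * p_audio[h] for h in hypotheses}
    total = sum(scores.values())
    posterior = {h: scores[h] / total for h in hypotheses}

    for h in hypotheses:
        print(f"{h}: {posterior[h]:.2f}")                 # ba 0.18, da 0.64, ga 0.18
    print("percept:", max(posterior, key=posterior.get))  # -> da

The product rule penalizes any hypothesis that flatly contradicts either channel, so da wins even though it is the best match for neither channel alone.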

Audio and visual information are typically complementary

Some gestures are represented very clearly in the visual signal, but less clearly in the audio signal:

"place of articulation"
(constrictor used for oral closure gesture)

Lip closures are very distinct from tongue tip or tongue body closures in the visual signal (ba vs. da or ga).

Visual information can sometimes also distinguish tongue tip from tongue body closures (da vs. ga).

Acoustic differences in formant transitions between lip, tongue tip, and tongue body closures are very subtle.

Some gestures are very weakly represented in the visual signal (or not represented at all), but are well represented in the audio signal, e.g., voicing: ba and pa look identical on the lips but differ clearly in the acoustics.
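The same product rule illustrates this complementarity. In the sketch below (likelihoods again invented for illustration), video resolves place of articulation but not voicing, while audio resolves voicing but carries only subtle place cues; for a token of ba, neither channel alone is decisive, but their combination is.

    # Toy illustration of complementary channels for a token of "ba".
    # Video cannot separate "pa" from "ba" (both are lip closures) but rules
    # out the tongue tip closure of "da"; audio hears voicing clearly (ruling
    # out "pa") but place cues ("ba" vs. "da") are subtle.
    # All likelihood values are invented for illustration.
    hypotheses = ["pa", "ba", "da"]
    p_video = {"pa": 0.48, "ba": 0.48, "da": 0.04}
    p_audio = {"pa": 0.06, "ba": 0.50, "da": 0.44}

    def normalize(scores):
        total = sum(scores.values())
        return {h: s / total for h, s in scores.items()}

    fused = normalize({h: p_video[h] * p_audio[h] for h in hypotheses})
    print({h: round(p, 2) for h, p in fused.items()})   # ba ~ 0.84
    print("percept:", max(fused, key=fused.get))        # -> ba

Each channel alone leaves a near-tie (pa vs. ba for video, ba vs. da for audio); only the fused posterior identifies the syllable with confidence.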
