Visual and auditory information are integrated into a single amodal phonetic percept.
The syllable (gestural pattern) that is perceived is the one that is maximally consistent with all the available evidence.
(Massaro and Stork formalize this using a Bayesian decision rule.)
Hypothesis   Consistent with Video?   Consistent with Audio?
ba           NO                       YES
da           passably                 passably
ga           YES                      NO
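The "maximally consistent" choice in the table above can be sketched numerically. This is a minimal illustration, not the formulation given in the source: it assumes that each modality assigns every candidate syllable a support value between 0 and 1, that the two sources are combined by multiplying and then normalizing, and that the perceived syllable is the one with the highest combined value. The function name `integrate` and all numbers are hypothetical, chosen only to mirror the NO / passably / YES entries.

```python
def integrate(audio, visual):
    """Combine per-syllable support from two sources.

    Assumption: the sources are treated as independent, so their
    support values are multiplied and then normalized to sum to 1.
    """
    joint = {syll: audio[syll] * visual[syll] for syll in audio}
    total = sum(joint.values())
    return {syll: value / total for syll, value in joint.items()}


# Hypothetical support values mirroring the table:
# YES -> 0.90, passably -> 0.50, NO -> 0.05
audio = {"ba": 0.90, "da": 0.50, "ga": 0.05}
visual = {"ba": 0.05, "da": 0.50, "ga": 0.90}

posterior = integrate(audio, visual)

# The percept is the syllable best supported by both sources together.
best = max(posterior, key=posterior.get)
print(best)
```

With these illustrative values, neither modality alone favors da, but da is the only hypothesis that both modalities tolerate, so it wins once the evidence is combined.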
Audio and visual information are typically complementary
Some gestures are very clearly represented in the visual signal, but less clearly in the audio signal:
"place of articulation"
(constrictor used for oral closure gesture)
Lip closures are very distinct from tongue tip or tongue body closures in the visual signal (ba vs. da or ga).
Visual information can sometimes distinguish tongue tip from tongue body closures (da vs. ga).
Acoustic differences in formant transitions between lip, tongue tip, and tongue body closures are very subtle.
Some gestures are very weakly represented in the visual signal (or not represented at all), but are well represented in the audio signal:
- velic opening gesture (ba vs. ma)
- laryngeal gesture (ba vs. pa)