Consonant and vowel gestures are
How are these actions combined into larger structures (words)?
Synchronization
The simplest way of coordinating movements of multiple body parts is in-phase (0°).
No learning is required.

Anti-phase (180°) coordination is also simple and doesn't require learning, but it is less stable than in-phase coordination.
Spontaneous shifts from anti-phase to in-phase occur as the rate of oscillation increases.
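The stability difference between the two modes can be illustrated with a small simulation. This is only a sketch under an assumption the notes do not state: relative phase is taken to follow the Haken-Kelso-Bunz equation d(phi)/dt = -a sin(phi) - 2b sin(2 phi), with the ratio b/a shrinking as the rate of oscillation grows. The function name settle_phase and the parameter values are hypothetical, chosen only for illustration.

```python
import math

def settle_phase(ratio, phi0=math.pi - 0.1, dt=0.01, steps=20000):
    """Euler-integrate d(phi)/dt = -sin(phi) - 2*ratio*sin(2*phi).

    Assumed model (not from the notes): a = 1 and b = ratio, where a
    smaller ratio stands in for a faster rate of oscillation.
    Returns the final relative phase in degrees.
    """
    phi = phi0  # start slightly off anti-phase (180 degrees)
    for _ in range(steps):
        phi += dt * (-math.sin(phi) - 2.0 * ratio * math.sin(2.0 * phi))
    return math.degrees(phi)

slow = settle_phase(ratio=1.0)  # b/a above 1/4: anti-phase is stable
fast = settle_phase(ratio=0.1)  # b/a below 1/4: anti-phase is unstable

print(f"slow rate settles near {slow:.1f} degrees")
print(f"fast rate settles near {fast:.1f} degrees")
```

In this model the in-phase state is stable for every ratio, while the anti-phase state is stable only when b/a exceeds 1/4, so a system started near 180° stays there at the slow setting and drifts to 0° at the fast one.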
Consequence of CV synchronization: Parallel Transmission
Gestures for the consonant and vowel in a CV syllable are normally produced in parallel (partially overlapping in time) rather than strictly sequentially.
Evidence for CV synchronization
We can observe the times during which consonant and vowel gestures are produced by measuring movements of different constrictors.
This is done by attaching markers to a subject's articulators (e.g., lips, tongue tip, tongue body) and tracking their motion, which tells us when a given constriction is formed and released.
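As a rough illustration of how timing might be read off a marker trajectory, the sketch below estimates when a movement begins from a one-dimensional position trace. Everything here is an assumption for illustration: the traces are synthetic, the names movement_onset and synthetic_trace are hypothetical, and the criterion (speed first reaching 20% of its peak) is one possible convention, not a method given in these notes.

```python
import math

def movement_onset(position, sample_rate_hz, threshold=0.2):
    """Return the time (s) at which speed first reaches threshold * peak speed.

    Returns None if the trace never moves.
    """
    # Finite-difference speed between consecutive samples.
    speed = [abs(b - a) * sample_rate_hz for a, b in zip(position, position[1:])]
    peak = max(speed)
    if peak == 0:
        return None
    for i, s in enumerate(speed):
        if s >= threshold * peak:
            return i / sample_rate_hz
    return None

def synthetic_trace(start_s, duration_s, sample_rate_hz=200, total_s=1.0):
    """Made-up marker trace: rest at 0, then a smooth cosine ramp up to 1."""
    trace = []
    for i in range(int(total_s * sample_rate_hz)):
        t = i / sample_rate_hz
        u = min(max((t - start_s) / duration_s, 0.0), 1.0)
        trace.append(0.5 * (1.0 - math.cos(math.pi * u)))
    return trace

# Hypothetical CV syllable: a lip (consonant) movement and a tongue body
# (vowel) movement that start at nearly the same time.
lip = synthetic_trace(start_s=0.30, duration_s=0.20)
tongue_body = synthetic_trace(start_s=0.32, duration_s=0.30)

lip_onset = movement_onset(lip, 200)
tb_onset = movement_onset(tongue_body, 200)
lag = tb_onset - lip_onset
print(f"lip onset {lip_onset:.3f} s, tongue body onset {tb_onset:.3f} s, lag {lag:.3f} s")
```

A small lag between the two onsets, relative to the duration of either movement, is the kind of pattern that would be consistent with the consonant and vowel gestures being produced in parallel.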
Techniques for articulatory tracking: