Semantically-grounded Representations of Audio with Visually Augmented captioNs 2024
Large language model-aided generation of descriptive, visual-context-enhanced audio captions for audio-language pretraining. A generation-augmented pretraining objective for learning semantically grounded audio representations, with zero-shot applications.
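
As a rough illustration of what such an audio-language objective could look like, the sketch below combines a symmetric contrastive loss over paired audio and caption embeddings with a caption-token cross-entropy term, and shows zero-shot classification by picking the nearest label embedding. The summary above does not specify the actual objective, so the function names, the temperature of 0.07, the `gen_weight` parameter, and the additive combination of the two terms are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is a matched pair.

    The temperature value is an assumed default, not taken from the source.
    """
    logits = l2_normalize(audio_emb) @ l2_normalize(text_emb).T / temperature
    idx = np.arange(logits.shape[0])
    audio_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_audio = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (audio_to_text + text_to_audio)


def caption_loss(token_logits, target_ids):
    """Mean token cross-entropy for caption generation.

    token_logits: (batch, seq_len, vocab); target_ids: (batch, seq_len) ints.
    """
    logp = log_softmax(token_logits, axis=-1)
    picked = np.take_along_axis(logp, target_ids[..., None], axis=-1)
    return -picked.mean()


def combined_loss(audio_emb, text_emb, token_logits, target_ids, gen_weight=1.0):
    """Hypothetical combination: contrastive term plus weighted generative term."""
    return contrastive_loss(audio_emb, text_emb) + gen_weight * caption_loss(
        token_logits, target_ids
    )


def zero_shot_classify(audio_emb, label_text_emb):
    """Assign each clip the label whose text embedding is most similar (cosine)."""
    sims = l2_normalize(audio_emb) @ l2_normalize(label_text_emb).T
    return sims.argmax(axis=1)
```

In this sketch the embeddings and token logits would come from audio and text encoders and a caption decoder, none of which are modeled here; only the loss arithmetic and the zero-shot decision rule are shown.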