Scope of the Database
The VAM corpus was extracted from 12 hours of recordings of the German TV talk show "Vera am Mittag" ("Vera at Noon"). These recordings were segmented into broadcasts, dialogue acts, and utterances. The audio-visual speech corpus contains spontaneous, emotional speech recorded from unscripted, authentic discussions between the guests of the talk show. Such data may be of interest to research groups working on spontaneous speech analysis, emotion recognition in both speech and facial expression, natural language understanding, and robust speech recognition. The variety of German regional accents present in the data may also be of interest from a linguistic viewpoint.
Emotion Labels
In addition to the audio-visual data and the segmented utterances, we provide emotion labels for a large part of the data. The emotion labels are given on a continuous-valued scale for three emotion primitives: valence (positive vs. negative), activation (calm vs. excited), and dominance (weak vs. strong). Several human evaluators assessed the data individually, using Self-Assessment Manikins as the rating method (see publications).
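As an illustration, one utterance's annotation under this scheme could be represented as follows. This is a minimal sketch: the class and field names are made up here, and the [-1, 1] value range is an assumption about how the continuous scale is normalized, not an official schema of the corpus.

```python
from dataclasses import dataclass

@dataclass
class EmotionLabel:
    """Continuous-valued emotion primitives for one utterance (illustrative)."""
    valence: float     # negative (-1.0) ... positive (+1.0)
    activation: float  # calm (-1.0) ... excited (+1.0)
    dominance: float   # weak (-1.0) ... strong (+1.0)

    def __post_init__(self):
        # Reject values outside the assumed normalized range.
        for name in ("valence", "activation", "dominance"):
            value = getattr(self, name)
            if not -1.0 <= value <= 1.0:
                raise ValueError(f"{name} out of range: {value}")

# Example: a mildly positive, calm, slightly dominant utterance
label = EmotionLabel(valence=0.3, activation=-0.5, dominance=0.2)
```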
Database Structure
The data is structured as follows:
1. VAM-Video
The VAM-Video corpus contains the audio-visual signals of the 12 broadcasts (shows). The following two corpus parts were extracted from the VAM-Video corpus. Unfortunately, almost all of the VAM-Video data has been lost and is no longer available. The remaining subset of 78 utterances (5 speakers) is provided as MPEG files (352x288 pixels, 25 fps).
2. VAM-Audio
This part of the corpus contains the audio signal only. In total, 947 utterances are contained in the VAM-Audio corpus. The data is organized speaker-wise for 47 individual speakers (11 male / 36 female). For each speaker, the data is sub-structured into sentences. Due to the origin of the data, the amount of data per speaker varies from 4 to 46 utterances. For each sentence we provide one wav file. The wav files were recorded at 16 kHz sampling rate and 16-bit resolution as stereo signals.
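A small sketch of how one might sanity-check and walk this layout, using only the Python standard library. The directory and file names (`speaker01/`, `sent001.wav`, etc.) are assumptions for illustration; only the audio format (16 kHz, 16-bit, stereo) and the one-wav-per-sentence organization come from the description above.

```python
import wave
from pathlib import Path

def check_vam_wav(path):
    """Verify a wav file matches the documented VAM-Audio format
    (16 kHz sampling rate, 16-bit samples, stereo) and return its
    duration in seconds."""
    with wave.open(str(path), "rb") as w:
        assert w.getframerate() == 16000, "expected 16 kHz"
        assert w.getsampwidth() == 2, "expected 16-bit samples"
        assert w.getnchannels() == 2, "expected stereo"
        return w.getnframes() / w.getframerate()

def iter_speaker_utterances(corpus_root):
    """Walk the speaker-wise layout: one directory per speaker,
    one wav file per sentence (hypothetical naming scheme)."""
    for speaker_dir in sorted(Path(corpus_root).iterdir()):
        if speaker_dir.is_dir():
            yield speaker_dir.name, sorted(speaker_dir.glob("*.wav"))
```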
The emotion was assessed by several independent human evaluators: 17 evaluators for talk-show speakers 1-19 and 6 evaluators for speakers 20-47. Each individual evaluator's assessment is provided (*.eva files). In addition, fused emotion evaluation results are provided (*.ewe files).
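The fusion of individual ratings into a single value per utterance can be sketched as a correlation-weighted mean, in the spirit of the evaluator-weighted estimator described in the publications. This is a simplified illustration: the exact procedure used to produce the *.ewe files may differ, and the file formats themselves are not shown here.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length rating sequences.
    Returns 0.0 for degenerate (constant) sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0
    return cov / (sx * sy)

def evaluator_weighted_estimate(ratings):
    """Fuse per-evaluator ratings into one value per utterance.

    ratings: list of per-evaluator lists, one rating per utterance.
    Each evaluator is weighted by the correlation of their ratings
    with the plain across-evaluator mean; negative weights are
    clipped to zero so an anti-correlated rater is ignored.
    """
    n_utt = len(ratings[0])
    mean = [sum(r[i] for r in ratings) / len(ratings) for i in range(n_utt)]
    weights = [max(0.0, pearson(r, mean)) for r in ratings]
    total = sum(weights)
    return [sum(w * r[i] for w, r in zip(weights, ratings)) / total
            for i in range(n_utt)]
```

With this weighting, an evaluator whose ratings run opposite to the consensus contributes nothing to the fused value, while evaluators who track the consensus closely dominate it.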
To distinguish the first set of recordings, which was labeled by the larger number of evaluators, from the second set, we suggest calling the utterances by speakers 1-19 "VAM-Audio I" and the rest "VAM-Audio II". The corpus size of VAM-Audio is 177 MB.
3. VAM-Faces
This part of the corpus contains 1867 facial images of the speakers, extracted from the audio-visual recordings of the VAM-Video corpus. The data is organized speaker-wise for a subset of 20 speakers. For each sentence we provide several facial images as png files with a resolution of 352x288 pixels.
The emotional content was labeled using the emotion primitives valence, activation, and dominance (*.eva files). In addition, emotion category labels were given (*.cat files).
The corpus size on disk is 255 MB.
Detailed Description
"", Michael Grimm, Kristian Kroschel, and Shrikanth Narayanan. Proceedings IEEE International Conference on Multimedia and Expo (ICME), Hannover, 2008, pp. 865-868
Suggested reading: Grimm et al., "Primitives-Based Evaluation and Estimation of Emotions in Speech", Speech Communication, Volume 49, Issues 10-11, October-November 2007, pp. 787-800.
(c) 2004 Speech Analysis & Interpretation Laboratory
3710 S. McClintock Ave, RTH 320
Los Angeles, CA 90089, U.S.A