Scope of the Database
The VAM corpus was extracted from 12 hours of recordings of the German TV talk show "Vera am Mittag" ("Vera at Noon"). These recordings were segmented into broadcasts, dialogue acts, and utterances. The audio-visual speech corpus contains spontaneous, emotional speech recorded from unscripted, authentic discussions between the guests of the talk show. Such data may be of interest to research groups working on spontaneous speech analysis, emotion recognition in both speech and facial expression, natural language understanding, and robust speech recognition. The variety of German regional accents present in the data may also be of interest from a linguistic viewpoint.
Emotion Labels
In addition to the audio-visual data and the segmented utterances, we provide emotion labels for a large part of the data. The emotion labels are given on a continuous-valued scale for three emotion primitives: valence (positive vs. negative), activation (calm vs. excited), and dominance (weak vs. strong). Several human evaluators assessed the data individually, using Self-Assessment Manikins as the rating method (see publications).
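As an illustration, one utterance's annotation under this scheme could be represented as follows. This is a minimal sketch: the class and field names are made up here, and the [-1, 1] value range is an assumption about how the continuous scale is normalized, not an official schema of the corpus.

```python
from dataclasses import dataclass

@dataclass
class EmotionLabel:
    """Continuous-valued emotion primitives for one utterance (illustrative)."""
    valence: float     # negative (-1.0) ... positive (+1.0)
    activation: float  # calm (-1.0) ... excited (+1.0)
    dominance: float   # weak (-1.0) ... strong (+1.0)

    def __post_init__(self):
        # Reject values outside the assumed normalized range.
        for name in ("valence", "activation", "dominance"):
            value = getattr(self, name)
            if not -1.0 <= value <= 1.0:
                raise ValueError(f"{name} out of range: {value}")

# Example: a mildly positive, calm, slightly dominant utterance
label = EmotionLabel(valence=0.3, activation=-0.5, dominance=0.2)
```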
Database Structure
The data is structured as follows:
1. VAM-Video
The VAM-Video corpus contains the audio-visual signals of the 12 broadcasts (shows). The following two corpus parts were extracted from the VAM-Video corpus. Unfortunately, almost all of the VAM-Video data has been lost and is no longer available. The remaining subset of 78 utterances (5 speakers) is provided as MPEG files (352x288 pixels, 25 fps).
2. VAM-Audio
This part of the corpus contains the audio signal only. In total, 947 utterances are contained in the VAM-Audio corpus. The data is organized speaker-wise for 47 individual speakers (11 male / 36 female). For each speaker, the data is sub-structured into sentences. Due to the origin of the data, the amount of data per speaker varies from 4 to 46 utterances. For each sentence we provide one wav file. The wav files were recorded at 16 kHz sampling rate and 16-bit resolution as stereo signals.
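A small sketch of how one might sanity-check and walk this layout, using only the Python standard library. The directory and file names (`speaker01/`, `sent001.wav`, etc.) are assumptions for illustration; only the audio format (16 kHz, 16-bit, stereo) and the one-wav-per-sentence organization come from the description above.

```python
import wave
from pathlib import Path

def check_vam_wav(path):
    """Verify a wav file matches the documented VAM-Audio format
    (16 kHz sampling rate, 16-bit samples, stereo) and return its
    duration in seconds."""
    with wave.open(str(path), "rb") as w:
        assert w.getframerate() == 16000, "expected 16 kHz"
        assert w.getsampwidth() == 2, "expected 16-bit samples"
        assert w.getnchannels() == 2, "expected stereo"
        return w.getnframes() / w.getframerate()

def iter_speaker_utterances(corpus_root):
    """Walk the speaker-wise layout: one directory per speaker,
    one wav file per sentence (hypothetical naming scheme)."""
    for speaker_dir in sorted(Path(corpus_root).iterdir()):
        if speaker_dir.is_dir():
            yield speaker_dir.name, sorted(speaker_dir.glob("*.wav"))
```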
The emotion was assessed by several independent human evaluators: 17 evaluators for talk-show speakers 1-19 and 6 evaluators for speakers 20-47. Each individual evaluator's assessment is provided (*.eva files). In addition, fused emotion evaluation results are provided (*.ewe files).
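The fusion of individual ratings into a single value per utterance can be sketched as a correlation-weighted mean, in the spirit of the evaluator-weighted estimator described in the publications. This is a simplified illustration: the exact procedure used to produce the *.ewe files may differ, and the file formats themselves are not shown here.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length rating sequences.
    Returns 0.0 for degenerate (constant) sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0
    return cov / (sx * sy)

def evaluator_weighted_estimate(ratings):
    """Fuse per-evaluator ratings into one value per utterance.

    ratings: list of per-evaluator lists, one rating per utterance.
    Each evaluator is weighted by the correlation of their ratings
    with the plain across-evaluator mean; negative weights are
    clipped to zero so an anti-correlated rater is ignored.
    """
    n_utt = len(ratings[0])
    mean = [sum(r[i] for r in ratings) / len(ratings) for i in range(n_utt)]
    weights = [max(0.0, pearson(r, mean)) for r in ratings]
    total = sum(weights)
    return [sum(w * r[i] for w, r in zip(weights, ratings)) / total
            for i in range(n_utt)]
```

With this weighting, an evaluator whose ratings run opposite to the consensus contributes nothing to the fused value, while evaluators who track the consensus closely dominate it.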
To distinguish the first set of recordings, which was labeled by the larger number of evaluators, from the second set, we suggest calling the utterances by speakers 1-19 "VAM-Audio I" and the rest "VAM-Audio II". The corpus size of VAM-Audio is 177 MB.
3. VAM-Faces
This part of the corpus contains 1867 facial images of the speakers, extracted from the audio-visual recordings of the VAM-Video corpus. The data is organized speaker-wise for a subset of 20 speakers. For each sentence we provide several facial images as png files with a resolution of 352x288 pixels.
The emotional content was labeled using the emotion primitives valence, activation, and dominance (*.eva files). In addition, emotion category labels were given (*.cat files).
The corpus size on disk is 255 MB.
Detailed Description
"", Michael Grimm, Kristian Kroschel, and Shrikanth Narayanan. Proceedings IEEE International Conference on Multimedia and Expo (ICME), Hannover, 2008, pp. 865-868
Suggested reading: Grimm et al., "Primitives-Based Evaluation and Estimation of Emotions in Speech", Speech Communication, Volume 49, Issues 10-11, October-November 2007, pp. 787-800.
(c) 2004 Speech Analysis & Interpretation Laboratory
3710 S. McClintock Ave, RTH 320
Los Angeles, CA 90089, U.S.A