Description

The corpus comprises a total of 1109 sentences uttered by 14 native English speakers (6 males and 8 females). A real-time 3D scanner and a professional microphone were used to capture the facial movements and the speech of the speakers. The dense dynamic face scans were acquired at 25 frames per second, and the RMS error of the 3D reconstruction is about 0.5 mm. To ease automatic speech segmentation, we carried out the recordings in an anechoic room, with walls covered by sound-absorbing materials, as shown in the picture.

Each sentence was recorded twice:

  • First, the speaker read the sentence from text, with a neutral expression.
  • Then, the speaker watched a clip extracted from a feature film in which the sentence is acted by professional actors and the context is highly emotional. After rating the emotions induced by the video, the speaker repeated the sentence.

Related publications:

  • G. Fanelli, J. Gall, H. Romsdorfer, T. Weise and L. Van Gool, A 3D Audio-Visual Corpus of Affective Communication, IEEE Transactions on Multimedia, Vol. 12, No. 6, pp. 591-598, October 2010 (PDF).
  • G. Fanelli, J. Gall, H. Romsdorfer, T. Weise and L. Van Gool, Acquisition of a 3D Audio-Visual Corpus of Affective Speech, ETH BIWI Tech Report No. 270 (PDF).
  • G. Fanelli, J. Gall, H. Romsdorfer, T. Weise and L. Van Gool, 3D Vision Technology for Capturing Multimodal Corpora: Chances and Challenges, LREC Workshop on Multimodal Corpora, Malta, May 2010 (PDF).
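For readers who want to reproduce the quoted accuracy figure on their own data, the RMS reconstruction error mentioned above is the root-mean-square of the per-vertex Euclidean distances between corresponding 3D points. A minimal sketch, assuming two hypothetical corresponding point sets in millimetres (the array names and shapes are illustrative, not part of the corpus release):

```python
import numpy as np

def rms_error(reference: np.ndarray, reconstruction: np.ndarray) -> float:
    """Root-mean-square of per-vertex Euclidean distances between two (N, 3) point sets."""
    distances_sq = np.sum((reference - reconstruction) ** 2, axis=1)
    return float(np.sqrt(distances_sq.mean()))

# Illustrative check: a reconstruction shifted by 0.5 mm along one axis
# yields an RMS error of exactly 0.5 mm.
reference = np.zeros((100, 3))
reconstruction = reference + np.array([0.5, 0.0, 0.0])
print(rms_error(reference, reconstruction))  # 0.5
```

In this formulation, an RMS error of about 0.5 mm means the reconstructed vertices deviate from the true surface by roughly half a millimetre on average, in the quadratic-mean sense.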
