Audiovisual Database for Simultaneous Speakers Voice Activity Detection and Sound Source Localization

The files below are multimodal recordings for testing voice activity detection (VAD) and sound source localization (SSL) algorithms. They were first introduced in our paper below; please cite it if you use our dataset.

Simultaneous-Speaker Voice Activity Detection and Localization Using Mid-Fusion of SVM and HMMs. V. P. Minotto, C. R. Jung, B. Lee. IEEE Transactions on Multimedia 16 (4), 1032-1044

The audio files were recorded at 44,100 Hz, and the video files at either 10fps or 20fps. To process audio and video data as synchronously as possible, use 4096 audio samples per image frame in the 10fps case and 2048 in the 20fps case.
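As a minimal sketch of this pairing, the Python helper below returns the range of audio samples to use with a given image frame. It assumes that frame k is captured at k / fps seconds and that its audio window starts at that instant; the placement of the window is not specified here, so treat that choice as an assumption. The names `WINDOW_SAMPLES` and `audio_window_for_frame` are hypothetical and not part of the dataset.

```python
SAMPLE_RATE_HZ = 44100

# Window lengths recommended above, keyed by video frame rate.
WINDOW_SAMPLES = {10: 4096, 20: 2048}


def audio_window_for_frame(frame_index, fps):
    """Return (start, stop) audio sample indices for one image frame.

    Assumption (not stated by the dataset): frame k is captured at
    k / fps seconds, and its audio window starts at that instant.
    """
    if fps not in WINDOW_SAMPLES:
        raise ValueError(f"unsupported frame rate: {fps}")
    if frame_index < 0:
        raise ValueError("frame_index must be non-negative")
    # Integer arithmetic avoids floating-point drift on long sequences.
    start = frame_index * SAMPLE_RATE_HZ // fps
    return start, start + WINDOW_SAMPLES[fps]
```

Under this assumption one frame period spans 4410 samples at 10fps and 2205 at 20fps, so each power-of-two window covers most of its frame period without overlapping the next one.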

Every recording's name starts with one of the three prefixes below.

One: recordings with only one speaker in the scene.

Two: recordings with two competing speakers.

Three: recordings with three competing speakers.
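The naming convention above can be used to recover the number of speakers directly from a file name. The Python sketch below is one way to do it; the function name `speaker_count` and the example file names are hypothetical, and the comparison is case-insensitive because the exact capitalization used in the distributed files is an assumption.

```python
import os

# Prefixes listed above, mapped to the number of speakers in the scene.
SPEAKERS_BY_PREFIX = {"one": 1, "two": 2, "three": 3}


def speaker_count(filename):
    """Return the number of speakers implied by a recording's name prefix."""
    name = os.path.basename(filename).lower()
    for prefix, count in SPEAKERS_BY_PREFIX.items():
        if name.startswith(prefix):
            return count
    raise ValueError(f"no known prefix in file name: {filename!r}")


# Hypothetical file names, for illustration only.
print(speaker_count("Two_example.wav"))
print(speaker_count("recordings/Three_example.avi"))
```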

The following table shows the available data for download.

Audio Data: eight-channel .wav files captured at 44,100 Hz
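For readers who want to separate the microphone channels, the following Python sketch splits an interleaved multichannel .wav file into one sample array per channel using only the standard library. It assumes the files are plain 16-bit PCM, which is not stated here; files with another sample width or encoding would need a different reader. The function name `read_channels` is hypothetical.

```python
import array
import sys
import wave


def read_channels(path):
    """Split an interleaved 16-bit PCM .wav file into per-channel arrays.

    Assumption (not confirmed by the dataset): samples are 16-bit PCM.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        n_channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    samples = array.array("h", raw)
    if sys.byteorder == "big":
        samples.byteswap()  # .wav sample data is little-endian
    # Interleaved layout: sample i of channel c sits at i * n_channels + c.
    channels = [samples[c::n_channels] for c in range(n_channels)]
    return sample_rate, channels
```

For a file from this dataset, `channels` should then hold eight arrays of equal length, one per microphone.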

Video Data: video files in .avi format; some captured at 10fps and some at 20fps.

Depth Data: depth stream captured using the Kinect sensor. This information is useful for labeling the true position of the speakers, that is, as ground truth for SSL. Note, however, that not all sequences contain a depth stream. Furthermore, the depth streams are distributed as a series of .png images rather than as a video file, because Microsoft’s particular encoding is used.

Physical setup: files created and saved using MATLAB. Loading one into your MATLAB workspace creates a series of variables containing all the information required to reproduce our capture system (microphone positions, camera parameters, etc.).

Ground Truth Data: a series of XML files (one per sequence) containing the ground truth for VAD, SSL, and speaker labeling of the scene. For each frame, the files contain the VAD information, world SSL, image SSL, and user label. The ReadMe.xml file offers a comprehensive example of the files’ structure.