Audiovisual Database for Simultaneous Speakers Voice Activity Detection and
Sound Source Localization
The files below are multimodal recordings for testing voice activity
detection (VAD) and sound source localization (SSL) algorithms. They were
first introduced in our paper below; please cite it if you use our dataset.
Simultaneous-Speaker Voice Activity Detection
and Localization Using Mid-Fusion of SVM and HMMs. V. P. Minotto, C. R. Jung,
B. Lee. IEEE Transactions on Multimedia 16 (4), 1032-1044
The audio files were recorded at 44,100 Hz; the video files were recorded at
10 fps in some cases and 20 fps in others. One video frame therefore spans
4410 audio samples at 10 fps and 2205 at 20 fps. To process audio and video
data as synchronously as possible, use 4096 audio samples per image frame in
the 10 fps case and 2048 in the 20 fps case.
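The pairing of image frames with audio windows can be sketched as follows. This
is only an illustration, not the procedure used in the paper: the function name
audio_window_for_frame is hypothetical, and anchoring each window at the nominal
start time of its frame is an assumption, since the text gives only the window
lengths.

```python
SAMPLE_RATE_HZ = 44100

# Audio window length per image frame, as stated above, keyed by video fps.
WINDOW_SAMPLES = {10: 4096, 20: 2048}


def audio_window_for_frame(frame_index, fps, sample_rate=SAMPLE_RATE_HZ):
    """Return (start, stop) audio sample indices for one video frame.

    Assumption: the window starts at the frame's nominal start time.
    """
    if fps not in WINDOW_SAMPLES:
        raise ValueError(f"unsupported frame rate: {fps}")
    # Integer arithmetic keeps the start index exact (4410 or 2205 per frame).
    start = frame_index * sample_rate // fps
    return start, start + WINDOW_SAMPLES[fps]


if __name__ == "__main__":
    for fps in (10, 20):
        print(fps, [audio_window_for_frame(i, fps) for i in range(3)])
```

With this anchoring, each window is slightly shorter than the frame period, so
a few hundred samples between consecutive windows are left unused.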
Every recording name starts with one of the three prefixes below.
One: recordings with only one speaker in the scene.
Two: recordings with two competing speakers.
Three: recordings with three competing speakers.
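When batch-processing the recordings, the number of speakers can be read off
the file name. The sketch below assumes the prefixes appear exactly as written
above (One, Two, Three) at the start of the file name; the file names used in
the example are hypothetical, not actual dataset names.

```python
import os

# Prefix-to-speaker-count mapping taken from the list above.
SPEAKERS_BY_PREFIX = {"One": 1, "Two": 2, "Three": 3}


def speaker_count(recording_path):
    """Return the number of speakers implied by a recording's name prefix."""
    name = os.path.basename(recording_path)
    for prefix, count in SPEAKERS_BY_PREFIX.items():
        if name.startswith(prefix):
            return count
    raise ValueError(f"no known prefix in recording name: {name!r}")


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for path in ("One_example.wav", "recordings/Three_example.avi"):
        print(path, speaker_count(path))
```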
The following table shows the available data
for download.
Audio Data: eight-channel .wav files captured at 44,100 Hz.
Video Data: video files in .avi format, some captured at 10 fps and some at 20 fps.
Depth Data: depth stream captured with the Kinect sensor. This information is
useful for labeling the true positions of the speakers, that is, as ground
truth for SSL. Note, however, that not all sequences contain a depth stream.
The depth streams are distributed as series of .png images rather than as
video files, because Microsoft’s particular encoding is used.
Physical Setup: files created and saved with MATLAB. Loading one into your
MATLAB workspace creates a series of variables that contain all the
information required to reproduce our capture system (microphone positions,
camera parameters, etc.).
Ground Truth Data: a series of XML files (one per sequence) holding the
ground truth for VAD, SSL, and speaker labeling of the scene. For each frame,
the files contain the VAD, world SSL, image SSL, and user label information.
The ReadMe.xml file offers a comprehensive example of the files’ structure.
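Before processing a downloaded audio file, it can be worth checking that it
matches the format described above (eight channels, 44,100 Hz). The sketch
below uses Python's standard wave module; the function names are hypothetical.
It also assumes the files are uncompressed PCM, which the description above
does not state; wave cannot read other encodings, and files with an extensible
header need Python 3.12 or later.

```python
import wave

# Audio format stated in the data description above.
EXPECTED_CHANNELS = 8
EXPECTED_RATE_HZ = 44100


def describe_wav(path):
    """Read basic header information from a PCM .wav file."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": wav.getframerate(),
            "samples_per_channel": wav.getnframes(),
        }


def matches_dataset_format(info):
    """Check the header against the eight-channel, 44,100 Hz description."""
    return (
        info["channels"] == EXPECTED_CHANNELS
        and info["sample_rate_hz"] == EXPECTED_RATE_HZ
    )
```

Dividing samples_per_channel by the per-frame sample count (4410 at 10 fps,
2205 at 20 fps) gives a rough estimate of how many video frames the recording
should contain.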