Audio-visual 3D Tracker (AV3T)

Compact multi-sensor platforms are portable and thus desirable for robotics and personal-assistance tasks. However, their compact size makes person tracking more difficult than with physically distributed sensors.

To address this challenge, we propose an Audio-Visual 3D Tracker (AV3T) [1] for multi-speaker tracking in 3D, which exploits the complementarity of the audio and video signals captured by a small, co-located sensing platform. The key points of the proposed algorithm, illustrated by the conceptual sketch after the list, are:

  • Face-detector-driven Particle Filter (PF) framework
  • Selective visual likelihood that switches between a discriminative (face detection) model and a generative (color spatiogram) model
  • Video-assisted Global Coherence Field (GCF) [2] computation
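
To make the interplay of these components concrete, here is a minimal, MATLAB-style sketch of a single particle-filter update with selective audio-visual likelihoods. It is not the released implementation: face_detect, project_to_image, spatiogram_lik and gcf_lik are hypothetical helpers, and the Gaussian face likelihood, random-walk motion model and multinomial resampling are simplifying assumptions.

    % Illustrative sketch of one AV3T-style particle-filter update (not the released code).
    % face_detect, project_to_image, spatiogram_lik and gcf_lik are hypothetical helpers.
    function [particles, weights] = av_pf_update(particles, weights, frame, audioFrame, cam, sigmaQ)
    % particles: 3 x N candidate 3D mouth positions (world coordinates, metres)
    % weights:   1 x N importance weights carried over from the previous frame
    N = size(particles, 2);

    % 1) Propagate the particles with a simple random-walk motion model
    particles = particles + sigmaQ * randn(3, N);

    % 2) Visual likelihood: discriminative when a face is detected,
    %    generative (color spatiogram) otherwise
    [faceBox, faceFound] = face_detect(frame);           % hypothetical detector wrapper
    pix = project_to_image(particles, cam);              % 2 x N pixel coordinates
    if faceFound
        likV = exp(-0.5 * sum((pix - faceBox.centre).^2, 1) / faceBox.sigma^2);
    else
        likV = spatiogram_lik(frame, pix);               % generative color model
    end

    % 3) Audio likelihood from the GCF evaluated at the particle positions
    likA = gcf_lik(audioFrame, particles);               % assumed non-negative

    % 4) Fuse the modalities and re-normalise the weights
    weights = weights .* likV .* likA;
    weights = weights / sum(weights);

    % 5) Multinomial resampling when the effective sample size collapses
    if 1 / sum(weights.^2) < N / 2
        cdf = cumsum(weights);
        cdf(end) = 1;                                    % guard against round-off
        idx = arrayfun(@(u) find(cdf >= u, 1, 'first'), rand(1, N));
        particles = particles(:, idx);
        weights = ones(1, N) / N;
    end
    end

The 3D estimate at each frame can then be taken, for instance, as the weighted mean of the particles.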

The following figure shows the recording environment of the CAV3D dataset. The yellow markers on the ground are used to calibrate the corner cameras, and a close-up view of the co-located sensor (surrounded by the red ellipse) is shown at the bottom right of the picture. The camera is around 48 cm above the microphone array. The region covered by the camera's field of view lies within the red dashed line. The world coordinate axes x, y and z originate at the top-right room corner and are marked in magenta, green and blue, respectively.


Block Diagram

The block diagram of the proposed audio-visual 3D tracker (AV3T) is shown in the following figure. The light blue blocks compute the visual likelihood, the light magenta blocks compute the audio likelihood, and the light yellow blocks perform the audio-visual tracking; GCC-PHAT stands for Generalized Cross Correlation with PHAse Transform [3].
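
As an illustration of the audio branch, the sketch below computes GCC-PHAT for each microphone pair [3] and accumulates it into a basic Global Coherence Field score [2] at candidate 3D positions. The frame length, TDOA sign convention and handling of out-of-range lags are illustrative assumptions, and the video-assisted restriction of the GCF search region used in AV3T is not modelled here.

    % Sketch of the audio front-end: GCC-PHAT per microphone pair [3],
    % accumulated into a Global Coherence Field value at candidate 3D points [2].
    function gcf = gcf_at_points(frames, micPos, points, fs, c)
    % frames: L x M matrix, one analysis frame per microphone (columns)
    % micPos: 3 x M microphone positions in world coordinates (metres)
    % points: 3 x P candidate 3D positions (e.g. the particles)
    % fs:     sampling rate in Hz;  c: speed of sound in m/s (about 343)
    [L, M] = size(frames);
    P = size(points, 2);
    pairs = nchoosek(1:M, 2);
    gcf = zeros(1, P);

    for k = 1:size(pairs, 1)
        i = pairs(k, 1);  j = pairs(k, 2);

        % GCC-PHAT: cross-power spectrum whitened by its magnitude
        X = fft(frames(:, i), 2 * L);
        Y = fft(frames(:, j), 2 * L);
        C = X .* conj(Y);
        R = fftshift(real(ifft(C ./ (abs(C) + eps))));   % lag 0 at index L + 1

        % Expected TDOA (in samples) of each candidate point for this pair
        tdoa = (vecnorm(points - micPos(:, i)) - vecnorm(points - micPos(:, j))) / c;
        lag  = min(max(round(tdoa * fs) + L + 1, 1), 2 * L);

        % Accumulate the coherence this pair contributes to each candidate point
        gcf = gcf + reshape(R(lag), 1, P);
    end
    gcf = gcf / size(pairs, 1);
    end

Candidate points with high GCF are those whose predicted inter-microphone delays are consistent with the observed cross-correlation peaks.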



Results

We evaluate the proposed AV3T tracker on two datasets:

  • CAV3D: a self-collected and annotated dataset

    • Left panel
      • top row: image tracking result (red cross), mouth ground-truth (green cross) and the updated reference image (the right rectangular crop)
      • middle row: audio, video and audio-visual likelihoods, i.e. the particle weights
      • bottom row: audio, video and audio-visual likelihood maps on the speaker-height plane
    • Right panel
      • top 3 rows: tracking trajectory (red) and speaker ground-truth position (green) in the individual x, y and z world coordinates
      • bottom row: instantaneous 3D error (blue) at time t and the Mean Absolute Error (black) up to time t (see the sketch after this list)

  • AV16.3: a publicly available dataset [4]

    • Left panel
      • top row: image tracking result (red cross), mouth ground-truth (green cross) and the updated reference image (the right rectangular crop)
      • bottom rows: tracking trajectory (red) and speaker ground-truth position (green) in the individual x, y and z world coordinates; instantaneous 3D error (blue) at time t and the Mean Absolute Error (black) up to time t
    • Right panel
      • 1st & 3rd rows:  audio, video and audio-visual likelihoods i.e. particle weights of target 1 & 2, respectively
      • 2nd & 4th rows:  audio, video and audio-visual likelihood maps of target 1 & 2, respectively


Resources

CAV3D dataset

MATLAB source code [link]


References

[1] X. Qian, A. Brutti, O. Lanz, M. Omologo and A. Cavallaro, "Multi-speaker tracking from an audio-visual sensing device," IEEE Transactions on Multimedia, doi: 10.1109/TMM.2019.2902489.

[2] M. Omologo, P. Svaizer, and R. De Mori, “Acoustic transduction,” in Spoken Dialogue with Computer. Academic Press, 1998, ch. 2, pp. 1–46.

[3] C. Knapp and G. Carter, "The generalized correlation method for estimation of time delay," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320–327, Aug 1976.

[4] G. Lathoud, J.-M. Odobez, and D. Gatica-Perez, “AV16.3: an audio-visual corpus for speaker localization and tracking,” in Machine Learning for Multimodal Interaction. Martigny, Switzerland: Springer, Jun 2004.