Speaker tracking in a joint audio-video network
Abstract
Situational awareness is achieved naturally by the human senses of sight and hearing in
combination. System-level automatic scene understanding aims at replicating this human
ability using cooperative microphones and cameras. In this thesis, we integrate and fuse
audio and video signals at different levels of abstraction to detect and track a speaker
in a scenario where people are free to move indoors. Despite the low complexity of the
system, which consists of just four microphone pairs and one camera, results show that the
overall multimodal tracker is more reliable than single-modality systems, tolerating large
occlusions and cross-talk. The system is evaluated on both single-modality and multimodal
tracking. The performance improvement given by the audio-video
integration and fusion is quantified in terms of tracking precision and accuracy as well as
speaker diarisation error rate and precision-recall recognition metrics. Comparing our
results with the closest related works, we achieve a 56% reduction in the computational cost
of audio-only sound source localisation and an 18% improvement in the speaker diarisation
error rate over a speaker-only unit.