Speaker tracking in a joint audio-video network
Situational awareness is achieved naturally by the human senses of sight and hearing working in combination. System-level automatic scene understanding aims to replicate this ability using cooperating microphones and cameras. In this thesis, we integrate and fuse audio and video signals at different levels of abstraction to detect and track a speaker in a scenario where people move freely indoors. Despite the low complexity of the system, which consists of just four microphone pairs and one camera, results show that the overall multimodal tracker is more reliable than single-modality systems, tolerating large occlusions and cross-talk. The system is evaluated in both single-modality and multimodal tracking configurations. The performance improvement provided by audio-video integration and fusion is quantified in terms of tracking precision and accuracy, as well as speaker diarisation error rate and precision-recall recognition metrics. Compared with the closest prior works, we achieve a 56% reduction in the computational cost of audio-only sound source localisation and an 18% improvement in speaker diarisation error rate over an audio-only system.
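The kind of audio-video fusion described above can be sketched in miniature. The following is an illustrative example, not the thesis implementation: it combines a scalar position estimate from an audio localiser and one from a visual tracker by inverse-variance weighting, a common late-fusion scheme. The function name and noise values are hypothetical.

```python
# Toy late-fusion sketch (illustrative only, not the thesis system): fuse an
# audio-derived position estimate with a video-derived one, weighting each by
# the inverse of its variance (i.e. trusting the more confident modality).

def fuse(audio_est, audio_var, video_est, video_var):
    """Inverse-variance weighted fusion of two scalar position estimates.

    Returns the fused estimate and its (reduced) variance.
    """
    w_a = 1.0 / audio_var   # weight of the audio estimate
    w_v = 1.0 / video_var   # weight of the video estimate
    fused = (w_a * audio_est + w_v * video_est) / (w_a + w_v)
    fused_var = 1.0 / (w_a + w_v)
    return fused, fused_var

# When the camera loses the speaker (e.g. under occlusion), video_var grows and
# the fused estimate falls back towards the audio localisation; conversely,
# during cross-talk the audio variance grows and the video estimate dominates.
x, v = fuse(audio_est=1.2, audio_var=0.5, video_est=1.0, video_var=0.1)
print(x, v)
```

This weighting is why a multimodal tracker can tolerate large occlusions and cross-talk better than either single-modality system: the failure modes of the two sensors are largely complementary, so at least one modality usually remains confident.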