Show simple item record

dc.contributor.advisor: Robertson, Neil
dc.contributor.advisor: Hopgood, James
dc.contributor.author: D'Arca, Eleonora
dc.date.accessioned: 2016-10-20T10:39:57Z
dc.date.available: 2016-10-20T10:39:57Z
dc.date.issued: 2015-05
dc.identifier.uri: http://hdl.handle.net/10399/2972
dc.description.abstract: Situational awareness is achieved naturally by the human senses of sight and hearing in combination. System-level automatic scene understanding aims to replicate this human ability using cooperative microphones and cameras. In this thesis, we integrate and fuse audio and video signals at different levels of abstraction to detect and track a speaker in a scenario where people are free to move indoors. Despite the low complexity of the system, which consists of just 4 microphone pairs and 1 camera, results show that the overall multimodal tracker is more reliable than single-modality systems, tolerating large occlusions and cross-talk. The system is evaluated in both single-modality and multimodal tracking configurations. The performance improvement given by the audio-video integration and fusion is quantified in terms of tracking precision and accuracy, as well as speaker diarisation error rate and precision-recall recognition metrics. Compared with the closest related works, we achieve a 56% reduction in the computational cost of audio-only sound source localisation and an 18% improvement in the speaker diarisation error rate over a speaker-only unit. [See the illustrative localisation sketch after this record.]
dc.description.sponsorship: EPSRC
dc.language.iso: en
dc.publisher: Heriot-Watt University
dc.publisher: Engineering and Physical Sciences
dc.rights: All items in ROS are protected by the Creative Commons copyright license (http://creativecommons.org/licenses/by-nc-nd/2.5/scotland/), with some rights reserved.
dc.title: Speaker tracking in a joint audio-video network
dc.type: Thesis
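
As an illustrative aside to the abstract: sound source localisation with microphone pairs, as described there, is commonly built on time-difference-of-arrival (TDOA) estimation per pair. The record does not say which algorithm the thesis uses, so the Python sketch below shows the generic GCC-PHAT technique only; the `gcc_phat` helper, its parameters, and the simulated signals are hypothetical illustrations, not the thesis's method.

```python
# Sketch of TDOA estimation for one microphone pair via GCC-PHAT.
# Assumption: the thesis abstract does not name its localisation
# algorithm; GCC-PHAT is shown here as a standard baseline approach.
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None, interp=16):
    """Estimate the delay (seconds) of `sig` relative to `ref` at `fs` Hz."""
    n = sig.shape[0] + ref.shape[0]          # FFT length for linear correlation
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)                   # cross-power spectrum
    R /= np.abs(R) + 1e-15                   # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=interp * n)       # upsampled cross-correlation
    max_shift = interp * n // 2
    if max_tau is not None:                  # optionally bound the search range
        max_shift = min(int(interp * fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(interp * fs)        # signed delay in seconds

# Usage: a source nearer microphone 1 reaches microphone 2 later,
# giving a positive delay estimate.
fs = 16000
rng = np.random.default_rng(0)
src = rng.standard_normal(fs)                # broadband source signal, 1 s
mic1 = src
mic2 = np.roll(src, 12)                      # mic 2 hears it 12 samples later
tau = gcc_phat(mic2, mic1, fs)               # expect ~ 12 / 16000 s
print(f"estimated TDOA: {tau * 1e3:.2f} ms") # ~0.75 ms
```

PHAT weighting discards spectral magnitude and keeps only phase, which sharpens the correlation peak in reverberant indoor rooms such as the scenario the abstract describes; the `interp` factor gives sub-sample delay resolution. TDOAs from several pairs can then be intersected to yield a source position estimate.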

