Learning to handle miscommunication in multi-modal conversational AI

dc.contributor.advisor: Suglia, Assistant Professor Alessandro
dc.contributor.advisor: Eshghi, Associate Professor Arash
dc.contributor.author: Chiyah-Garcia, Javier
dc.date.accessioned: 2026-01-15T18:31:32Z
dc.date.issued: 2025-04
dc.description.abstract: In human communication, we continuously negotiate shared understanding and deal with misunderstandings as they arise to achieve mutual coordination. However, despite the ubiquity and importance of misunderstandings and repairs in dialogue, conversational AI often struggles to process them effectively, limiting its ability to collaborate with humans through natural language. This thesis explores how to develop robust models for processing miscommunications in situated collaborative tasks. We first collect a dialogue corpus to study human-agent coordination in an ambiguous environment, finding that models struggle to resolve referring expressions. To address this shortcoming, we design and train models to ground referring expressions and detect ambiguities, learning strong multi-modal representations in situated dialogues. We then analyse the signals required for models to learn to handle miscommunications, and propose a cross-modal taxonomy of clarifications to assess the contribution of distinct modalities. Our experiments with different model architectures and training objectives reveal that secondary objectives are essential to integrate multiple modalities (dialogue, visual and relational), leading to models better suited to deal with challenging clarifications in conversations. Finally, we evaluate how generative multi-modal LLMs handle both miscommunications and repairs by releasing a new benchmark, BlockWorld-Repairs, based on human data collections and studies. We then propose alternative training approaches that encourage models to learn from interactive settings, generalising to handle both instructions and subsequent repairs for successful task completion. Throughout this thesis, we highlight the challenges posed by miscommunications and present approaches to develop robust collaborative conversational AI models better adapted to human interaction.
dc.identifier.uri: https://www.ros.hw.ac.uk/handle/10399/5238
dc.language.iso: en
dc.publisher: Heriot-Watt University
dc.publisher: Mathematical and Computer Sciences
dc.title: Learning to handle miscommunication in multi-modal conversational AI
dc.type: Thesis

Files

Original bundle

Name: Chiyah-GarciaJ_0425_macsSS.pdf
Size: 25.08 MB
Format: Adobe Portable Document Format