Suglia, Assistant Professor Alessandro
Eshghi, Associate Professor Arash
Chiyah-Garcia, Javier
2026-01-15
2025-04
https://www.ros.hw.ac.uk/handle/10399/5238

In human communication, we continuously negotiate shared understanding and deal with misunderstandings as they arise to achieve mutual coordination. However, despite the ubiquity and importance of misunderstandings and repairs in dialogue, conversational AI systems often struggle to process them effectively, limiting their ability to collaborate with humans through natural language. This thesis explores how to develop robust models for processing miscommunications in situated collaborative tasks. We first collect a dialogue corpus to study human-agent coordination in an ambiguous environment, finding that models struggle to resolve referring expressions. To address this shortcoming, we design and train models to ground referring expressions and detect ambiguities, learning strong multi-modal representations in situated dialogues. We then analyse the signals required for models to learn to handle miscommunications, and propose a cross-modal taxonomy of clarifications to assess the contribution of distinct modalities. Our experiments with different model architectures and training objectives reveal that secondary objectives are essential to integrate multiple modalities (dialogue, visual and relational), leading to models better suited to deal with challenging clarifications in conversations. Finally, we evaluate how generative multi-modal LLMs handle both miscommunications and repairs by releasing a new benchmark, BlockWorld-Repairs, based on human data collections and studies. We then propose alternative training approaches that encourage models to learn from interactive settings, generalising to handle both instructions and subsequent repairs for successful task completion.
Throughout this thesis, we highlight the challenges posed by miscommunications and present approaches to develop robust collaborative conversational AI models better adapted for human interactions.

en
Learning to handle miscommunication in multi-modal conversational AI
Thesis