Learning to handle miscommunication in multi-modal conversational AI
Abstract
In human communication, we continuously negotiate shared understanding and deal
with misunderstandings as they arise to achieve mutual coordination. However, despite
the ubiquity and importance of misunderstandings and repairs in dialogue, conversational
AI systems often struggle to process them effectively, limiting their ability to collaborate
with humans through natural language.
This thesis explores how to develop robust models for processing miscommunications
in situated collaborative tasks. We first collect a dialogue corpus to study human-agent
coordination in an ambiguous environment, finding that models struggle to resolve referring expressions. To address this shortcoming, we design and train models to ground
referring expressions and detect ambiguities, learning strong multi-modal representations
in situated dialogues. We then analyse the signals required for models to learn to handle miscommunications, and propose a cross-modal taxonomy of clarifications to assess
the contribution of distinct modalities. Our experiments with different model architectures and training objectives reveal that secondary objectives are essential to integrate
multiple modalities (dialogue, visual and relational), leading to models better suited to
deal with challenging clarifications in conversations. Finally, we evaluate how generative multi-modal LLMs handle both miscommunications and repairs by releasing a new
benchmark, BlockWorld-Repairs, based on human data collection and studies. We then
propose alternative training approaches that encourage models to learn from interactive
settings, generalising to handle both instructions and subsequent repairs for successful
task completion. Throughout this thesis, we highlight the challenges posed by miscommunication and present approaches for developing robust collaborative conversational AI
models better adapted to human interaction.