Optimising strategies for learning visually grounded word meanings through interaction
Language grounding is a fundamental problem in AI: it concerns how symbols in natural language (e.g. words and phrases) refer to aspects of the physical environment (e.g. objects and attributes). In this thesis, our ultimate goal is to address an interactive language grounding problem, i.e. learning perceptual (specifically visual) groundings through Natural Language (NL) interaction with humans. Although previous work has made significant progress on language/symbol grounding across different tasks, several limitations remain: (a) groundings are learned only holistically, without understanding the individual parts of the linguistic and non-linguistic context; (b) training data of high quantity and quality is required, with no possibility of on-line error correction; and (c) systems cannot learn continuously and incrementally from the external environment. Most of these limitations are likely to be alleviated if systems can learn symbol groundings, as and when needed, from natural, everyday conversations with humans. To address all of the above limitations at once, this thesis proposes a modular Interactive Multi-modal Framework, which is compositional, optimised, trainable incrementally with small amounts of data, and able to handle natural, spontaneous dialogue. Specifically, we collect real human-human conversations (the BURCHAK corpus) to investigate how humans behave in an interactive learning task; the corpus contains a wide range of dialogue capabilities, strategies, and linguistic phenomena encountered in natural, spontaneous dialogue. This thesis then explores how different capabilities and strategies (drawn from the real data) affect the overall learning/grounding efficiency, i.e. achieving higher recognition accuracy with less human effort in the dialogue.
We found that an agent performs better when it is able to: 1) take the initiative, 2) consider both uncertainty from visual classification and context-dependencies from dialogue, and 3) request further information if necessary. Finally, following these results, we train an optimised multi-modal dialogue agent using Reinforcement Learning to address interactive language grounding against the real data. The agent learns: (1) to perform a form of active learning, i.e. to ask for further information only if necessary, and (2) to process natural, everyday conversations with humans. Here, we integrate our framework with an incremental semantic formalism (the DS-TTR framework) that dynamically presents compositional representations for both linguistic and non-linguistic (visual) context, and is able to process natural, spontaneous conversations (specifically incremental phenomena, such as “self-repair”). These advances bring us closer to addressing the interactive grounding problem, and to bringing robots from the laboratory into the real world, where they will need to speak the same language as human beings.