Optimising strategies for learning visually grounded word meanings through interaction
Abstract
Language Grounding is a fundamental problem in AI, concerning how symbols in Natural
Language (e.g. words and phrases) refer to aspects of the physical environment (e.g. objects
and attributes). In this thesis, our ultimate goal is to address an interactive language
grounding problem, i.e. learning perceptual groundings (specifically vision) through Natural
Language (NL) interaction with humans. Although previous work has shown significant
progress on language/symbol grounding across different tasks, there are still several
limitations and unsolved problems: (a) learning groundings only holistically, without
understanding individual parts of the linguistic and non-linguistic context; (b) requiring
training data of high quantity and quality, without the possibility of on-line error
correction; and (c) not being able to learn continuously and incrementally from the external
environment. Most of these limitations are likely to be alleviated if systems can learn
symbol groundings, as and when needed, from natural, everyday conversations with humans.
To address all of the above limitations at once, this thesis proposes a modular Interactive
Multi-modal Framework, which is compositional, optimised, trainable incrementally with
small amounts of data, and able to handle natural, spontaneous dialogue. Specifically, we
collect real human-human conversations (the BURCHAK corpus) to investigate how humans
behave in an interactive learning task; the corpus contains a wide range of dialogue
capabilities, strategies, and linguistic phenomena encountered in natural, spontaneous
dialogue. This thesis then explores how different capabilities and strategies (observed in
the real data) affect the overall learning/grounding efficiency, i.e. higher recognition
accuracy with less human effort in the dialogue. We found that an agent performs better
when it is able to: 1) take the initiative, 2) consider both the uncertainty of its visual
classifications and context-dependencies in the dialogue, and 3) ask for further information
when necessary. Finally, following these results, we use Reinforcement Learning to train an
optimised multi-modal dialogue agent for interactive language grounding against the real
data. The agent learns: (1) to perform a form of active learning, i.e. to ask for further
information only when necessary, and (2) to process natural, everyday conversations with
humans. Here, we integrate our framework with an incremental semantic formalism (the
DS-TTR framework) that dynamically builds compositional representations of both the
linguistic and non-linguistic (visual) context, and is able to process natural, spontaneous
conversations (in particular incremental phenomena such as "self-repair"). These advances
bring us closer to addressing the interactive grounding problem, and to bringing
robots from the laboratory into the real world, where they will need to speak in the same
language as human beings.
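As a purely illustrative sketch of the "ask only when necessary" behaviour described above (not code from the thesis, which learns this trade-off with Reinforcement Learning rather than hand-coding it), a simple confidence-threshold policy might look as follows; the classifier, the label set, and the threshold value are all assumptions introduced for illustration:

```python
import random

# Assumed threshold; in the thesis this trade-off is optimised via Reinforcement Learning.
CONFIDENCE_THRESHOLD = 0.7


def classify(image_features):
    """Stand-in visual classifier: returns (predicted_label, confidence)."""
    labels = ["red", "green", "blue"]  # hypothetical attribute labels
    scores = [random.random() for _ in labels]
    total = sum(scores)
    probs = [s / total for s in scores]
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs[best]


def next_dialogue_act(image_features):
    """Decide whether to assert a grounding or request help from the tutor."""
    label, confidence = classify(image_features)
    if confidence < CONFIDENCE_THRESHOLD:
        # Uncertain: ask the human tutor for further information (active learning).
        return ("ask", "What colour is this object?")
    # Confident: take the initiative and assert the grounding for confirmation.
    return ("assert", f"This object is {label}, right?")


if __name__ == "__main__":
    act, utterance = next_dialogue_act(image_features=None)
    print(act, "->", utterance)
```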