Visually grounded representation learning using language games for embodied AI
Abstract
The ability to communicate in Natural Language is considered one of the ingredients
that facilitated the development of humans’ remarkable intelligence. Analogously, developing artificial agents that can seamlessly integrate with humans requires them to
understand and use Natural Language, just like we do. Humans use Natural Language to coordinate and communicate relevant information to solve their tasks—they
play so-called “language games”. In this thesis, we explore computational models
of how meanings can materialise in situated and embodied language games. Meanings
are instantiated when language is used to refer to things and to do things in the world. In
these activities, such as “guessing an object in an image” or “following instructions to
complete a task”, perceptual experience can be used to derive grounded meaning representations. Considering that different language games favour the development of specific
concepts, we argue it is detrimental to evaluate agents solely on their ability to solve a single
task. To mitigate this problem, we define GroLLA, a multi-task evaluation framework
for visual guessing games that extends a goal-oriented evaluation with auxiliary tasks
aimed at assessing the quality of the learned representations. By using this framework,
we demonstrate the inability of recent computational models to learn truly multimodal
representations that can generalise to unseen object categories. To overcome this issue,
we propose a representation learning component that derives concept representations
from perceptual experience, obtaining substantial gains over the baselines—especially
when unseen object categories are involved. To demonstrate that guessing games are
a generic procedure for grounded language learning, we present SPIEL, a novel self-play procedure for transferring learned representations to new multimodal tasks. We show that models trained in this way achieve better performance and learn better concept representations than competing models. Thanks to this procedure, artificial agents can learn from interaction using any image-based dataset. Additionally, learning the
meaning of concepts involves understanding how entities interact with other entities in
the world. For this purpose, we use action-based and event-driven language games to
study how an agent can learn visually grounded conceptual representations from dynamic scenes. We design EmBERT, a generic architecture for an embodied agent able
to learn representations useful for completing language-guided action execution tasks in a 3D environment. Finally, learning visually grounded representations can also be achieved by watching others complete a task. Inspired by this idea, we study how to learn
representations from videos that can be used for tackling multimodal tasks such as commentary generation. For this purpose, we define Goal, a highly multimodal benchmark
based on football commentaries that requires models to learn very fine-grained and rich
representations to be successful. We conclude with some future directions for further
progress in computational learning of grounded meaning representations.