Visually grounded representation learning using language games for embodied AI
The ability to communicate in Natural Language is considered one of the ingredients that facilitated the development of humans’ remarkable intelligence. Analogously, developing artificial agents that can seamlessly integrate with humans requires them to understand and use Natural Language, just as we do. Humans use Natural Language to coordinate and communicate relevant information to solve their tasks—they play so-called “language games”. In this thesis, we explore computational models of how meanings can materialise in situated and embodied language games. Meanings are instantiated when language is used to refer to, and to do, things in the world. In these activities, such as “guessing an object in an image” or “following instructions to complete a task”, perceptual experience can be used to derive grounded meaning representations. Considering that different language games favour the development of specific concepts, we argue it is detrimental to evaluate agents on their ability to solve a single task. To mitigate this problem, we define GroLLA, a multi-task evaluation framework for visual guessing games that extends a goal-oriented evaluation with auxiliary tasks aimed at assessing the quality of the learned representations as well. Using this framework, we demonstrate the inability of recent computational models to learn truly multimodal representations that generalise to unseen object categories. To overcome this issue, we propose a representation learning component that derives concept representations from perceptual experience, obtaining substantial gains over the baselines—especially when unseen object categories are involved. To demonstrate that guessing games are a generic procedure for grounded language learning, we present SPIEL, a novel self-play procedure for transferring learned representations to novel multimodal tasks.
We show that models trained in this way obtain better performance and learn better concept representations than competitors. Thanks to this procedure, artificial agents can learn from interaction using any image-based dataset. Additionally, learning the meaning of concepts involves understanding how entities interact with other entities in the world. For this purpose, we use action-based and event-driven language games to study how an agent can learn visually grounded conceptual representations from dynamic scenes. We design EmBERT, a generic architecture for an embodied agent that learns representations useful for completing language-guided action execution tasks in a 3D environment. Finally, visually grounded representations can also be learned by watching others complete a task. Inspired by this idea, we study how to learn representations from videos that can be used to tackle multimodal tasks such as commentary generation. For this purpose, we define GOAL, a highly multimodal benchmark based on football commentaries that requires models to learn very fine-grained and rich representations to be successful. We conclude with some future directions for further progress in the computational learning of grounded meaning representations.