Context modelling for visually grounded response selection and generation
Abstract
With recent progress in deep learning, there has been growing interest in visually
grounded dialogue, which requires an AI agent to hold a meaningful natural-language
conversation with humans about visual content such as images or videos.
This thesis contributes improved context modelling techniques for multimodal,
visually grounded response selection and generation. We show that incorporating
relevant context encodings enables a system to respond more accurately and more
helpfully to user requests. We also show that different types of context encodings
are relevant for different multimodal, visually grounded tasks and datasets.
In particular, this thesis focuses on two scenarios: response generation for
task-based multimodal search and open-domain response selection for image-grounded
conversations. For these tasks, the thesis contributes new models for context
encoding, including knowledge grounding, history encoding, and multimodal fusion.
Across these tasks, the thesis also provides an in-depth critical analysis of the
shortcomings of current models, tasks, and evaluation metrics.