Situated grounding and understanding of structured low-resource expert data
Abstract
Conversational agents are becoming more widespread, ranging from social to goal-oriented to multi-modal dialogue systems. However, for systems with both visual and spatial requirements, such as situated robot planning, developing accurate goal-oriented dialogue systems can be extremely challenging, especially in dynamic environments such as underwater operations or first response. Furthermore, training data-driven algorithms in these domains is difficult because the esoteric nature of the interaction requires expert input. We derive solutions for creating a collaborative multi-modal conversational agent for setting high-level mission goals. We experiment with state-of-the-art deep learning models and techniques and create a new data-driven method (MAPERT) that is capable of processing language instructions by grounding the necessary elements using various types of input data (vision from a map, text and other metadata). The results show that the accuracy of data-driven systems can vary dramatically depending on the task, the type of metadata and the attention mechanisms used. Finally, because we are dealing with low-resource expert data, we adopt a Continual Learning and Human-in-the-Loop methodology, with encouraging results.