With advances in deep learning and the creation of large-volume datasets, progress has become particularly strong on the image classification task. We therefore investigate whether the performance of such architectures generalizes to more complex tasks that require a more holistic approach to scene comprehension.
The presented work focuses on learning spatial and multi-modal representations, and on the foundations of a Visual Turing Test, in which scene understanding is tested by a series of questions about the scene's content.
In our studies, we propose DAQUAR, the first ‘question answering about real-world images’ dataset, together with two methods that address the problem: a symbolic-based and a neural-based visual question answering architecture. The symbolic-based method relies on a semantic parser, a database of visual facts, and a Bayesian formulation that accounts for various interpretations of the visual scene.
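The role of the Bayesian formulation can be sketched as marginalizing over latent interpretations ("worlds") of the scene; the notation below is illustrative rather than the exact formulation used in the work:

```latex
P(A \mid Q, S) \;=\; \sum_{W} P(A \mid Q, W)\, P(W \mid S)
```

Here $A$ is the answer, $Q$ the question, $S$ the observed scene, and $W$ a latent interpretation of the scene, so uncertainty in perception is propagated into the answer distribution.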
The neural-based method is an end-to-end architecture composed of a question encoder, an image encoder, a multimodal embedding, and an answer decoder. This architecture has proven effective in capturing language-based biases and has become a standard component of other visual question answering architectures.
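The encoder–fusion–decoder pipeline can be sketched in a few lines. The sketch below is a minimal stand-in, not the thesis's model: random matrices replace learned parameters, an embedding average replaces a recurrent question encoder, and the image features would in practice come from a CNN.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, EMB, IMG_FEAT, HIDDEN, ANSWERS = 100, 32, 64, 48, 10

# Random stand-ins for learned parameters.
word_emb = rng.normal(size=(VOCAB, EMB))
W_fuse = rng.normal(size=(EMB + IMG_FEAT, HIDDEN)) * 0.1
W_out = rng.normal(size=(HIDDEN, ANSWERS)) * 0.1

def encode_question(token_ids):
    """Question encoder: average of word embeddings (stand-in for an RNN)."""
    return word_emb[token_ids].mean(axis=0)

def answer_distribution(token_ids, image_features):
    """Fuse question and image codes, then decode a distribution over answers."""
    q = encode_question(token_ids)                        # (EMB,)
    fused = np.concatenate([q, image_features]) @ W_fuse  # multimodal embedding
    hidden = np.tanh(fused)
    logits = hidden @ W_out                               # answer decoder
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Toy forward pass over a hypothetical 3-token question.
probs = answer_distribution([3, 17, 42], rng.normal(size=IMG_FEAT))
```

Because every stage is differentiable, such a model is trained end-to-end from (image, question, answer) triples.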
Along with the methods, we also investigate various evaluation metrics that embrace uncertainty in word meaning as well as various interpretations of the scene and the question.
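A metric of this kind, in the spirit of the WUPS score used with DAQUAR, softens exact string matching by crediting each produced word with its best similarity to the target words (and vice versa). The `similarity` argument below is a hypothetical stand-in; in practice a WordNet-based Wu-Palmer similarity would be used.

```python
def wups_like(answer, target, similarity, threshold=0.9):
    """Soft accuracy between two sets of answer words.

    Each word on one side is credited with its best match on the other
    side; similarities below `threshold` are down-weighted, as in the
    thresholded WUPS variant.
    """
    def score(a, b):
        s = similarity(a, b)
        return s if s >= threshold else s * 0.1

    def one_side(src, dst):
        prod = 1.0
        for w in src:
            prod *= max(score(w, t) for t in dst)
        return prod

    return min(one_side(answer, target), one_side(target, answer))

# Hypothetical similarity: exact match only (WordNet-based in practice).
exact = lambda a, b: 1.0 if a == b else 0.0
```

With `exact` as the similarity this reduces to strict set matching; a graded similarity lets near-synonyms earn partial credit instead of being counted as outright errors.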