Sergio - Visual Question Generator

A Visual Question Generation model that uses InceptionV3 to encode the image as a vector and a network inspired by ViT (Vision Transformer) to generate questions related to the image.

Subject

CBIR

Service

Web Dev, NLP, CV

Date

June, 2021

Challenge

The objective of the project was to develop a visual question generator from a dataset containing images and the corresponding natural questions that a human would ask. However, some of the image URLs were no longer valid, and the questions often contained spelling errors.

Solution

We first had to clean the data and make sure the URLs provided for the images were still valid. Once the data was in the required format, we developed an encoder-decoder model: the encoder represented the image as a vector, and the decoder used this vector to generate the question.
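The URL check described above could be sketched roughly as follows. This is only an illustration using Python's standard library: the row layout (dictionaries with `image_url` and `question` keys) and the helper names `url_is_reachable` and `keep_valid_rows` are assumptions, not the project's actual code.

```python
import http.client
import urllib.request


def url_is_reachable(url, timeout=5.0):
    """Return True if a HEAD request to `url` answers with a 2xx status."""
    try:
        # Building the Request can itself raise ValueError for malformed URLs.
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError, http.client.HTTPException):
        # URLError, HTTPError and timeouts are all subclasses of OSError.
        return False


def keep_valid_rows(rows, is_valid=url_is_reachable):
    """Keep only the rows whose (hypothetical) 'image_url' field passes the check."""
    return [row for row in rows if is_valid(row["image_url"])]
```

Passing the checker in as `is_valid` keeps the filtering logic testable without network access. Some servers reject HEAD requests, so a real pipeline might need to fall back to GET.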

The objective of the project was to integrate and test our knowledge of Natural Language Processing and Computer Vision. The key idea behind the project was to create a Deep Learning model that can take an image as input, identify features or objects in the image, and ask naturally occurring questions about it. For instance, when shown an image of a food item, a natural question could be ‘how does it taste’ or ‘who made it’. Instead of doing object detection, we decided to encode the image as a vector that represents the objects and features present in the image. This was achieved using an InceptionV3 network with ImageNet weights.
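One way to obtain such a vector is sketched below. It assumes TensorFlow/Keras, which the write-up does not name, and it uses the globally pooled 2048-dimensional output; the original model may have used a different layer or a spatial feature map. The function names `build_encoder` and `encode_image` are illustrative.

```python
INPUT_SIZE = (299, 299)  # InceptionV3's default input resolution


def build_encoder():
    """Return InceptionV3 without its classifier head, pooled to one vector per image."""
    import tensorflow as tf  # lazy import; using TensorFlow is an assumption

    return tf.keras.applications.InceptionV3(
        include_top=False, weights="imagenet", pooling="avg"
    )


def encode_image(encoder, image_path):
    """Load a JPEG from disk and return its feature vector of shape (2048,)."""
    import tensorflow as tf

    raw = tf.io.read_file(image_path)
    image = tf.image.decode_jpeg(raw, channels=3)
    image = tf.image.resize(image, INPUT_SIZE)
    # Scales pixel values from [0, 255] to the [-1, 1] range InceptionV3 expects.
    image = tf.keras.applications.inception_v3.preprocess_input(image)
    return encoder(tf.expand_dims(image, axis=0))[0]
```

Because the ImageNet weights are frozen and reused as-is, the vectors can be computed once per image and cached before training the question generator.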

The vector was then passed through a network inspired by the Vision Transformer (ViT). This network essentially focused on the important features encoded in the vector, and its output was then used to generate the question. Question generation was done one word at a time, using an LSTM-based RNN. The RNN relied on the information retrieved from the image and the words that had already been generated.
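The word-at-a-time generation can be illustrated with a small greedy decoding loop. This is a sketch under stated assumptions: `next_word_scores` stands in for the trained attention and LSTM network, `toy_scores` is a hard-coded placeholder rather than a real model, and the `<start>`/`<end>` tokens and greedy (highest-score) selection are choices made here for illustration.

```python
START, END = "<start>", "<end>"


def generate_question(image_vector, next_word_scores, max_words=20):
    """Greedily build a question, one word per step.

    `next_word_scores(image_vector, words)` must return a dict mapping each
    candidate word to a score, given the image and the words generated so far.
    """
    words = [START]
    for _ in range(max_words):
        scores = next_word_scores(image_vector, words)
        best = max(scores, key=scores.get)
        if best == END:
            break
        words.append(best)
    return " ".join(words[1:])


def toy_scores(image_vector, words):
    """Placeholder for the trained model: always spells out one fixed question."""
    script = ["who", "made", "it", END]
    target = script[min(len(words) - 1, len(script) - 1)]
    return {word: (1.0 if word == target else 0.0) for word in script}


print(generate_question([0.1, 0.2], toy_scores))  # who made it
```

The loop mirrors the description above: each step sees both the image representation and every word produced so far, and generation stops at the end token or after `max_words` words.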