Visual Question Answering – Large Scale dataset based on human generated captions

Recently, with the advent of promising CNN-based object detection models, tasks related to scene understanding have become quite popular. Several research groups have reported results that are not only impressive but are also claimed to be very close to human performance. Two such tasks are image captioning and visual question answering (visual Turing tests).

Microsoft Research has released a very large image caption dataset covering about 120k images (train/val split), where each image is annotated with 5 captions. For the first few weeks of my PhD I was tempted to work on Visual Question Answering, but I later dropped the idea. It is very interesting to see a lot of work in this area, especially a completely human-annotated question answering dataset from the COCO team itself. During my VQA exploration period I generated a few thousand question/answer pairs from the existing 600k captions. They are not as perfect as human-annotated question/answer pairs, but they could definitely be used to train a supervised Q/A system. I recently came across an arXiv paper that uses a method similar to mine to generate question/answer pairs from existing captions. However, I generated mine using a modified and customized version of the tree transformation algorithm from Michael Heilman’s Question Generation work. My dataset can have answers of up to length 4, so it covers many more cases than the COCO-QA dataset. In the generated dataset, all answers to Boolean questions are True, because the questions are generated from sentences that describe the image. One naïve way of generating negative examples is to replace one of the categories in the question with a random category.
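The naïve negative-example idea can be sketched roughly as follows. This is only an illustration, not the code used to build the dataset: the category list, the function name make_negative, and the string answer "False" are all assumptions made for the example.

```python
import random

# Hypothetical category list (an assumption for this sketch); a real run
# would use the full set of object categories.
CATEGORIES = ["dog", "cat", "car", "bench", "pizza", "umbrella"]

PUNCT = "?.,"


def make_negative(question, categories=CATEGORIES, rng=random):
    """Swap one category word in a Boolean question for a random other one.

    Returns (new_question, "False"), or None if no known category occurs.
    """
    tokens = question.split()
    # Positions of tokens that name a known category, ignoring punctuation.
    hits = [i for i, tok in enumerate(tokens)
            if tok.strip(PUNCT).lower() in categories]
    if not hits:
        return None
    i = rng.choice(hits)
    core = tokens[i].strip(PUNCT)
    replacement = rng.choice([c for c in categories if c != core.lower()])
    # Replace only the word itself so trailing punctuation is preserved.
    tokens[i] = tokens[i].replace(core, replacement, 1)
    return " ".join(tokens), "False"


if __name__ == "__main__":
    print(make_negative("Is the dog sitting on a bench?", rng=random.Random(0)))
```

Note that a random swap can still produce a question whose true answer is True (the image may happen to contain the new category), so negatives made this way are noisy unless they are checked against the image's object annotations.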

Since I’m not working on it anymore, I thought it could be useful for someone looking for image Q/A datasets, so I am sharing the dataset. I have only generated questions whose answers have length 4 or less (write to me if you are looking for longer answers). Every question/answer pair is shared along with the original caption from which it was generated and the mapping to its cocoid.

You can download the dataset here.

So far I have not worked on checking or correcting the grammaticality of these questions. One naive way of checking grammaticality would be to train a simple binary classifier, with human-generated correct questions as positive samples and randomly generated questions or sampled ill-formed wiki questions as negative samples.
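As a rough sketch of how the training data for such a classifier might be assembled: here the "randomly generated" negatives are produced by scrambling the word order of correct questions, which is just one possible choice and an assumption of this example. The function names shuffled_negative and build_training_set are hypothetical, and the classifier itself is left out.

```python
import random


def shuffled_negative(question, rng):
    """Scramble the word order to get a (most likely) ill-formed question."""
    words = question.rstrip("?").split()
    if len(words) < 3:
        return None  # too short to scramble meaningfully
    scrambled = words[:]
    # Retry a few times in case the shuffle reproduces the original order.
    for _ in range(10):
        rng.shuffle(scrambled)
        if scrambled != words:
            return " ".join(scrambled) + "?"
    return None


def build_training_set(good_questions, seed=0):
    """Return a shuffled list of (text, label); 1 = well formed, 0 = scrambled."""
    rng = random.Random(seed)
    data = [(q, 1) for q in good_questions]
    for q in good_questions:
        neg = shuffled_negative(q, rng)
        if neg is not None:
            data.append((neg, 0))
    rng.shuffle(data)
    return data


if __name__ == "__main__":
    for text, label in build_training_set(["What is the man holding?",
                                           "Where is the cat sitting?"]):
        print(label, text)
```

The resulting (text, label) pairs could then be fed to any simple binary classifier, for example one using word n-gram features.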

Please feel free to write to my email if you have any questions.