
VQA v2 Curated Dataset for Spatial Reasoning

Dataset Description

This is a curated, class-balanced subset of the VQA v2 (Visual Question Answering v2.0) dataset, preprocessed for training visual question answering models with an emphasis on spatial reasoning.

Dataset Summary

  • Source: VQA v2 (MSCOCO train2014 split)
  • Task: Visual Question Answering
  • Language: English
  • License: CC BY 4.0 (inherited from VQA v2)

Key Features

✨ Quality-Focused Curation:

  • Filtered out ambiguous yes/no questions
  • Removed vague questions ("what is in the image", etc.)
  • Answer length limited to 5 words / 30 characters
  • Minimum answer frequency threshold (20 occurrences)

🎯 Balanced Distribution:

  • Maximum 600 samples per answer class
  • Prevents model bias toward common answers
  • Ensures diverse question-answer coverage

📊 Dataset Statistics:

  • Total Q-A pairs: ~[Your final count from running the script]
  • Unique answers: ~[Number of unique answer classes]
  • Images: MSCOCO train2014 subset
  • Format: JSON + CSV metadata
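The bracketed counts above can be filled in directly from the generated files; a minimal sketch, assuming the gen_vqa_v2/qa_pairs.json layout described under File Structure below:

import json

with open("gen_vqa_v2/qa_pairs.json", "r") as f:
    data = json.load(f)

# Total Q-A pairs and number of distinct answer classes
unique_answers = {sample["answer"] for sample in data}
print(f"Total Q-A pairs: {len(data)}")
print(f"Unique answers:  {len(unique_answers)}")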

Dataset Structure

Data Fields

Each sample contains:

{
  "image_id": 123456,           // MSCOCO image ID
  "question_id": 789012,        // VQA v2 question ID
  "question": "What color is the car?",
  "answer": "red",              // Most frequent answer from annotators
  "image_path": "images/COCO_train2014_000000123456.jpg"
}

Data Splits

  • Training: no official split is provided; 80-90% of the samples is recommended
  • Validation: hold out the remaining 10-20% (see Usage Example below)

File Structure

gen_vqa_v2/
├── images/                    # MSCOCO train2014 images
│   └── COCO_train2014_*.jpg
├── qa_pairs.json              # Question-answer pairs (JSON)
└── metadata.csv               # Same data in CSV format
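A quick integrity check that every record's image_path resolves to a file on disk can be useful after download; a sketch, assuming the directory layout above:

import json
from pathlib import Path

root = Path("gen_vqa_v2")
with open(root / "qa_pairs.json", "r") as f:
    data = json.load(f)

# Report any Q-A pairs whose image file is missing
missing = [s["question_id"] for s in data if not (root / s["image_path"]).exists()]
print(f"{len(missing)} of {len(data)} samples reference missing images")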

Data Preprocessing

Filtering Criteria

Excluded Answers:

  • Generic responses: yes, no, unknown, none, n/a, cant tell, not sure

Excluded Questions:

  • Ambiguous queries: "what is in the image", "what is this", "what is that", "what do you see"

Answer Constraints:

  • Maximum 5 words per answer
  • Maximum 30 characters per answer
  • Minimum frequency: 20 occurrences across dataset

Balancing Strategy:

  • Maximum 600 samples per answer class
  • Prevents over-representation of common answers (e.g., "white", "2")
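Taken together, these criteria amount to a single filter-then-cap pass over the raw annotations. The sketch below is illustrative only, not the actual genvqa-dataset.py; the function names and sample dict layout are assumptions:

from collections import Counter, defaultdict

MIN_ANSWER_FREQ = 20
MAX_SAMPLES_PER_ANSWER = 600
BAD_ANSWERS = {"yes", "no", "unknown", "none", "n/a", "cant tell", "not sure"}
BAD_QUESTIONS = {"what is in the image", "what is this",
                 "what is that", "what do you see"}

def keep(sample):
    # Apply the exclusion lists and the 5-word / 30-character constraints
    q = sample["question"].lower().rstrip("?").strip()
    a = sample["answer"].lower().strip()
    return (a not in BAD_ANSWERS and q not in BAD_QUESTIONS
            and len(a.split()) <= 5 and len(a) <= 30)

def curate(samples):
    filtered = [s for s in samples if keep(s)]
    # Drop answers rarer than MIN_ANSWER_FREQ
    freq = Counter(s["answer"] for s in filtered)
    filtered = [s for s in filtered if freq[s["answer"]] >= MIN_ANSWER_FREQ]
    # Cap each answer class at MAX_SAMPLES_PER_ANSWER for balance
    seen, curated = defaultdict(int), []
    for s in filtered:
        if seen[s["answer"]] < MAX_SAMPLES_PER_ANSWER:
            seen[s["answer"]] += 1
            curated.append(s)
    return curated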

Preprocessing Script

The dataset was generated using genvqa-dataset.py:

# Key parameters
MIN_ANSWER_FREQ = 20          # Minimum answer occurrences
MAX_SAMPLES_PER_ANSWER = 600  # Class balancing limit

Intended Use

Primary Use Cases

✅ Training VQA Models:

  • Visual question answering systems
  • Multimodal vision-language models
  • Spatial reasoning research

✅ Research Applications:

  • Evaluating spatial understanding in VQA
  • Studying answer distribution bias
  • Benchmarking ensemble architectures

Out-of-Scope Use

❌ Medical diagnosis or safety-critical applications
❌ Surveillance or privacy-invasive systems
❌ Generating misleading or harmful content


Dataset Creation

Source Data

VQA v2 Dataset:

  • Paper: Making the V in VQA Matter
  • Authors: Goyal et al. (2017)
  • Images: MSCOCO train2014
  • Original Size: 443,757 question-answer pairs (train split)

Curation Rationale

This curated subset addresses common VQA training challenges:

  1. Bias Reduction: Limits over-represented answers
  2. Quality Control: Removes ambiguous/uninformative samples
  3. Spatial Focus: Retains questions requiring spatial reasoning
  4. Practical Constraints: Focuses on concise, specific answers

Annotations

Annotations are inherited from VQA v2:

  • 10 free-form answers per question, collected from human annotators
  • Ground truth: the most frequent (majority-vote) answer among the 10 annotations
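Selecting the majority answer is a one-liner with collections.Counter; a sketch, assuming the per-question answer list format used in the official VQA v2 annotation files:

from collections import Counter

def majority_answer(annotations):
    # Most frequent answer string among the annotator responses
    return Counter(a["answer"] for a in annotations).most_common(1)[0][0]

# 10 hypothetical annotator responses for one question
votes = [{"answer": "red"}] * 6 + [{"answer": "maroon"}] * 4
print(majority_answer(votes))  # -> red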

Considerations for Using the Data

Social Impact

This dataset inherits biases from MSCOCO and VQA v2:

  • Geographic bias: Primarily Western/North American scenes
  • Cultural bias: Limited representation of global diversity
  • Object bias: Common objects over-represented

Limitations

⚠️ Known Issues:

  • Answer distribution still skewed toward common objects and counts (e.g., "white", "2")
  • Spatial reasoning questions may be underrepresented
  • Some questions may have multiple valid answers

⚠️ Not Suitable For:

  • Fine-grained visual reasoning (e.g., "How many stripes on the 3rd zebra?")
  • Rare object recognition
  • Non-English languages

Citation

BibTeX

@inproceedings{goyal2017making,
  title={Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering},
  author={Goyal, Yash and Khot, Tejas and Summers-Stay, Douglas and Batra, Dhruv and Parikh, Devi},
  booktitle={CVPR},
  year={2017}
}

Original VQA v2 Dataset

Project page: https://visualqa.org

Additional Information

Dataset Curators

Curated from VQA v2 by [Your Name/Organization]

Licensing

This dataset is released under CC BY 4.0, consistent with the original VQA v2 license.

Contact

For questions or issues, please contact [your email/GitHub].


Usage Example

Loading the Dataset

import json
import pandas as pd
from PIL import Image

# Load metadata
with open("gen_vqa_v2/qa_pairs.json", "r") as f:
    data = json.load(f)

# Or use CSV
df = pd.read_csv("gen_vqa_v2/metadata.csv")

# Access a sample
sample = data[0]
image = Image.open(f"gen_vqa_v2/{sample['image_path']}")
question = sample['question']
answer = sample['answer']

print(f"Q: {question}")
print(f"A: {answer}")

Training Split

from sklearn.model_selection import train_test_split

# 80-20 train-val split
train_data, val_data = train_test_split(data, test_size=0.2, random_state=42)
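Because every answer class occurs at least 20 times, the split can also be stratified on the answer so that class proportions match across train and validation:

# Stratified variant: preserve the answer distribution in both splits
labels = [sample["answer"] for sample in data]
train_data, val_data = train_test_split(
    data, test_size=0.2, random_state=42, stratify=labels
)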

Acknowledgments

  • VQA v2 Team: Goyal et al. for the original dataset
  • MSCOCO Team: Lin et al. for the image dataset
  • Community: Open-source VQA research community