VQA v2 Curated Dataset for Spatial Reasoning
Dataset Description
This is a curated and balanced subset of the VQA v2 (Visual Question Answering v2.0) dataset, specifically preprocessed for training visual question answering models with enhanced spatial reasoning capabilities.
Dataset Summary
- Source: VQA v2 (MSCOCO train2014 split)
- Task: Visual Question Answering
- Language: English
- License: CC BY 4.0 (inherited from VQA v2)
Key Features
✨ Quality-Focused Curation:
- Filtered out ambiguous yes/no questions
- Removed vague questions ("what is in the image", etc.)
- Answer length limited to 5 words / 30 characters
- Minimum answer frequency threshold (20 occurrences)
🎯 Balanced Distribution:
- Maximum 600 samples per answer class
- Prevents model bias toward common answers
- Ensures diverse question-answer coverage
📊 Dataset Statistics:
- Total Q-A pairs: ~[Your final count from running the script]
- Unique answers: ~[Number of unique answer classes]
- Images: MSCOCO train2014 subset
- Format: JSON + CSV metadata
Dataset Structure
Data Fields
Each sample contains:
{
  "image_id": 123456,          // MSCOCO image ID
  "question_id": 789012,       // VQA v2 question ID
  "question": "What color is the car?",
  "answer": "red",             // Most frequent answer from annotators
  "image_path": "images/COCO_train2014_000000123456.jpg"
}
Data Splits
- Training: Main dataset (recommend 80-90% for training)
- Validation: User-defined split (recommend 10-20% for validation)
File Structure
gen_vqa_v2/
├── images/                  # MSCOCO train2014 images
│   └── COCO_train2014_*.jpg
├── qa_pairs.json            # Question-answer pairs (JSON)
└── metadata.csv             # Same data in CSV format
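The exact counts listed under Dataset Statistics depend on the filtering run. A minimal sketch for recomputing them yourself, assuming the file layout above and the field names shown under Data Fields:

import json
import pandas as pd

# Count Q-A pairs, unique answer classes, and unique images.
with open("gen_vqa_v2/qa_pairs.json", "r") as f:
    data = json.load(f)

print(f"Total Q-A pairs: {len(data)}")
print(f"Unique answers: {len({s['answer'] for s in data})}")
print(f"Unique images: {len({s['image_id'] for s in data})}")

# metadata.csv holds the same records, so the row count should agree.
df = pd.read_csv("gen_vqa_v2/metadata.csv")
assert len(df) == len(data)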
Data Preprocessing
Filtering Criteria
Excluded Answers:
- Generic responses:
yes, no, unknown, none, n/a, cant tell, not sure
Excluded Questions:
- Ambiguous queries: "what is in the image", "what is this", "what is that", "what do you see"
Answer Constraints:
- Maximum 5 words per answer
- Maximum 30 characters per answer
- Minimum frequency: 20 occurrences across dataset
Balancing Strategy:
- Maximum 600 samples per answer class
- Prevents over-representation of common answers (e.g., "white", "2")
Preprocessing Script
The dataset was generated using genvqa-dataset.py:
# Key parameters
MIN_ANSWER_FREQ = 20 # Minimum answer occurrences
MAX_SAMPLES_PER_ANSWER = 600 # Class balancing limit
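The full script is not reproduced here; the following is an illustrative sketch of the filtering and balancing logic described above, not the exact implementation (the curate function and raw_pairs input are placeholders):

from collections import Counter, defaultdict

MIN_ANSWER_FREQ = 20          # Minimum answer occurrences
MAX_SAMPLES_PER_ANSWER = 600  # Class balancing limit
MAX_ANSWER_WORDS = 5
MAX_ANSWER_CHARS = 30

EXCLUDED_ANSWERS = {"yes", "no", "unknown", "none", "n/a", "cant tell", "not sure"}
EXCLUDED_QUESTIONS = {"what is in the image", "what is this", "what is that", "what do you see"}

def curate(raw_pairs):
    """raw_pairs: list of dicts with 'question' and 'answer' keys (placeholder input)."""
    # Drop generic answers, vague questions, and overly long answers.
    kept = [
        p for p in raw_pairs
        if p["answer"].lower() not in EXCLUDED_ANSWERS
        and p["question"].lower().rstrip("?").strip() not in EXCLUDED_QUESTIONS
        and len(p["answer"].split()) <= MAX_ANSWER_WORDS
        and len(p["answer"]) <= MAX_ANSWER_CHARS
    ]
    # Keep only answers frequent enough to learn as a class.
    freq = Counter(p["answer"] for p in kept)
    kept = [p for p in kept if freq[p["answer"]] >= MIN_ANSWER_FREQ]
    # Cap each answer class to limit over-represented answers.
    per_answer = defaultdict(int)
    balanced = []
    for p in kept:
        if per_answer[p["answer"]] < MAX_SAMPLES_PER_ANSWER:
            per_answer[p["answer"]] += 1
            balanced.append(p)
    return balanced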
Intended Use
Primary Use Cases
✅ Training VQA Models:
- Visual question answering systems
- Multimodal vision-language models
- Spatial reasoning research
✅ Research Applications:
- Evaluating spatial understanding in VQA
- Studying answer distribution bias
- Benchmarking ensemble architectures
Out-of-Scope Use
❌ Medical diagnosis or safety-critical applications
❌ Surveillance or privacy-invasive systems
❌ Generating misleading or harmful content
Dataset Creation
Source Data
VQA v2 Dataset:
- Paper: Making the V in VQA Matter
- Authors: Goyal et al. (2017)
- Images: MSCOCO train2014
- Original Size: 443,757 question-answer pairs (train split)
Curation Rationale
This curated subset addresses common VQA training challenges:
- Bias Reduction: Limits over-represented answers
- Quality Control: Removes ambiguous/uninformative samples
- Spatial Focus: Retains questions requiring spatial reasoning
- Practical Constraints: Focuses on concise, specific answers
Annotations
Annotations are inherited from VQA v2:
- 10 answers per question from human annotators
- Answer selection: Most frequent answer among annotators
- Consensus: Majority voting for ground truth
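As an illustration of that selection rule, here is a minimal sketch over the raw VQA v2 annotation format, where each annotation carries ten annotator answers under the answers key; tie-breaking here is arbitrary and may differ from the actual preprocessing script:

from collections import Counter

def select_answer(annotation):
    """Pick the most frequent of the ten annotator answers (ties broken arbitrarily)."""
    votes = Counter(a["answer"].strip().lower() for a in annotation["answers"])
    answer, count = votes.most_common(1)[0]
    return answer

# Example annotation shape (abbreviated):
# {"question_id": 789012, "answers": [{"answer": "red"}, {"answer": "red"}, ...]}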
Considerations for Using the Data
Social Impact
This dataset inherits biases from MSCOCO and VQA v2:
- Geographic bias: Primarily Western/North American scenes
- Cultural bias: Limited representation of global diversity
- Object bias: Common objects over-represented
Limitations
⚠️ Known Issues:
- Answer distribution still skewed toward common objects and counts (e.g., "white", "2")
- Spatial reasoning questions may be underrepresented
- Some questions may have multiple valid answers
⚠️ Not Suitable For:
- Fine-grained visual reasoning (e.g., "How many stripes on the 3rd zebra?")
- Rare object recognition
- Non-English languages
Citation
BibTeX
@inproceedings{goyal2017making,
title={Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering},
author={Goyal, Yash and Khot, Tejas and Summers-Stay, Douglas and Batra, Dhruv and Parikh, Devi},
booktitle={CVPR},
year={2017}
}
Original VQA v2 Dataset
- Homepage: https://visualqa.org/
- Paper: https://arxiv.org/abs/1612.00837
- License: CC BY 4.0
Additional Information
Dataset Curators
Curated from VQA v2 by [Your Name/Organization]
Licensing
This dataset is released under CC BY 4.0, consistent with the original VQA v2 license.
Contact
For questions or issues, please contact [your email/GitHub].
Usage Example
Loading the Dataset
import json
import pandas as pd
from PIL import Image
# Load metadata
with open("gen_vqa_v2/qa_pairs.json", "r") as f:
data = json.load(f)
# Or use CSV
df = pd.read_csv("gen_vqa_v2/metadata.csv")
# Access a sample
sample = data[0]
image = Image.open(f"gen_vqa_v2/{sample['image_path']}")
question = sample['question']
answer = sample['answer']
print(f"Q: {question}")
print(f"A: {answer}")
Training Split
from sklearn.model_selection import train_test_split
# 80-20 train-val split
train_data, val_data = train_test_split(data, test_size=0.2, random_state=42)
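If you train with PyTorch, a minimal Dataset wrapper over the split above might look like the following; CuratedVQADataset is an illustrative name, torch is assumed to be installed, and answers are treated as classification labels:

import json
from PIL import Image
from torch.utils.data import Dataset

class CuratedVQADataset(Dataset):
    """Minimal wrapper pairing each image with its question and answer label."""

    def __init__(self, samples, root="gen_vqa_v2", transform=None):
        self.samples = samples          # e.g. train_data or val_data from the split above
        self.root = root
        self.transform = transform
        # Map each answer string to an integer class index.
        self.answer2idx = {a: i for i, a in enumerate(sorted({s["answer"] for s in samples}))}

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        s = self.samples[idx]
        image = Image.open(f"{self.root}/{s['image_path']}").convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, s["question"], self.answer2idx[s["answer"]]

In practice you would build the answer vocabulary once from the training split and reuse it for the validation split, so that class indices stay consistent across both.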
Acknowledgments
- VQA v2 Team: Goyal et al. for the original dataset
- MSCOCO Team: Lin et al. for the image dataset
- Community: Open-source VQA research community