# VQA v2 Curated Dataset for Spatial Reasoning

## Dataset Description

This is a **curated and balanced subset** of the VQA v2 (Visual Question Answering v2.0) dataset, specifically preprocessed for training visual question answering models with enhanced spatial reasoning capabilities.

### Dataset Summary

- **Source**: VQA v2 (MSCOCO train2014 split)
- **Task**: Visual Question Answering
- **Language**: English
- **License**: CC BY 4.0 (inherited from VQA v2)

### Key Features

✨ **Quality-Focused Curation**:
- Filtered out ambiguous yes/no questions
- Removed vague questions ("what is in the image", etc.)
- Answer length limited to 5 words / 30 characters
- Minimum answer frequency threshold (20 occurrences)

🎯 **Balanced Distribution**:
- Maximum 600 samples per answer class
- Prevents model bias toward common answers
- Ensures diverse question-answer coverage

📊 **Dataset Statistics**:
- **Total Q-A pairs**: ~[Your final count from running the script]
- **Unique answers**: ~[Number of unique answer classes]
- **Images**: MSCOCO train2014 subset
- **Format**: JSON + CSV metadata

---

## Dataset Structure

### Data Fields

Each sample contains:

```json
{
  "image_id": 123456,        // MSCOCO image ID
  "question_id": 789012,     // VQA v2 question ID
  "question": "What color is the car?",
  "answer": "red",           // Most frequent answer from annotators
  "image_path": "images/COCO_train2014_000000123456.jpg"
}
```

### Data Splits

- **Training**: Main dataset (recommend 80-90% for training)
- **Validation**: User-defined split (recommend 10-20% for validation)

### File Structure

```
gen_vqa_v2/
├── images/                  # MSCOCO train2014 images
│   └── COCO_train2014_*.jpg
├── qa_pairs.json            # Question-answer pairs (JSON)
└── metadata.csv             # Same data in CSV format
```

---

## Data Preprocessing

### Filtering Criteria

**Excluded Answers**:
- Generic responses: `yes`, `no`, `unknown`, `none`, `n/a`, `cant tell`, `not sure`

**Excluded Questions**:
- Ambiguous queries: "what is in the image", "what is this", "what is that", "what do you see"

**Answer Constraints**:
- Maximum 5 words per answer
- Maximum 30 characters per answer
- Minimum frequency: 20 occurrences across the dataset

**Balancing Strategy**:
- Maximum 600 samples per answer class
- Prevents over-representation of common answers (e.g., "white", "2")

### Preprocessing Script

The dataset was generated using `genvqa-dataset.py`:

```python
# Key parameters
MIN_ANSWER_FREQ = 20          # Minimum answer occurrences
MAX_SAMPLES_PER_ANSWER = 600  # Class balancing limit
```
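The full script is not reproduced in this card. The snippet below is only a minimal sketch of how the documented filtering and balancing criteria could be applied to a list of question-answer records; the function name `filter_and_balance` and the record schema (dicts with `question` and `answer` keys) are illustrative assumptions, not the actual implementation in `genvqa-dataset.py`.

```python
from collections import Counter, defaultdict

MIN_ANSWER_FREQ = 20          # minimum answer occurrences kept
MAX_SAMPLES_PER_ANSWER = 600  # class-balancing cap

EXCLUDED_ANSWERS = {"yes", "no", "unknown", "none", "n/a", "cant tell", "not sure"}
EXCLUDED_QUESTIONS = {
    "what is in the image", "what is this", "what is that", "what do you see",
}


def filter_and_balance(records):
    """Apply the documented filters, then cap each answer class (illustrative sketch)."""
    # 1. Drop generic answers, vague questions, and overly long answers.
    kept = [
        r for r in records
        if r["answer"].lower() not in EXCLUDED_ANSWERS
        and r["question"].lower().rstrip("?").strip() not in EXCLUDED_QUESTIONS
        and len(r["answer"].split()) <= 5
        and len(r["answer"]) <= 30
    ]

    # 2. Keep only answers occurring at least MIN_ANSWER_FREQ times.
    freq = Counter(r["answer"] for r in kept)
    kept = [r for r in kept if freq[r["answer"]] >= MIN_ANSWER_FREQ]

    # 3. Cap each answer class at MAX_SAMPLES_PER_ANSWER samples.
    per_answer = defaultdict(int)
    balanced = []
    for r in kept:
        if per_answer[r["answer"]] < MAX_SAMPLES_PER_ANSWER:
            per_answer[r["answer"]] += 1
            balanced.append(r)
    return balanced
```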
---

## Intended Use

### Primary Use Cases

✅ **Training VQA Models**:
- Visual question answering systems
- Multimodal vision-language models
- Spatial reasoning research

✅ **Research Applications**:
- Evaluating spatial understanding in VQA
- Studying answer distribution bias
- Benchmarking ensemble architectures

### Out-of-Scope Use

❌ Medical diagnosis or safety-critical applications
❌ Surveillance or privacy-invasive systems
❌ Generating misleading or harmful content

---

## Dataset Creation

### Source Data

**VQA v2 Dataset**:
- **Paper**: [Making the V in VQA Matter](https://arxiv.org/abs/1612.00837)
- **Authors**: Goyal et al. (2017)
- **Images**: MSCOCO train2014
- **Original Size**: 443,757 question-answer pairs (train split)

### Curation Rationale

This curated subset addresses common VQA training challenges:

1. **Bias Reduction**: Limits over-represented answers
2. **Quality Control**: Removes ambiguous/uninformative samples
3. **Spatial Focus**: Retains questions requiring spatial reasoning
4. **Practical Constraints**: Focuses on concise, specific answers

### Annotations

Annotations are inherited from VQA v2:
- 10 answers per question from human annotators
- **Answer selection**: Most frequent answer among annotators
- **Consensus**: Majority voting for ground truth

---

## Considerations for Using the Data

### Social Impact

This dataset inherits biases from MSCOCO and VQA v2:
- **Geographic bias**: Primarily Western/North American scenes
- **Cultural bias**: Limited representation of global diversity
- **Object bias**: Common objects over-represented

### Limitations

⚠️ **Known Issues**:
- Answer distribution is still skewed toward frequent answers (e.g., "white", "2")
- Spatial reasoning questions may be underrepresented
- Some questions may have multiple valid answers

⚠️ **Not Suitable For**:
- Fine-grained visual reasoning (e.g., "How many stripes on the 3rd zebra?")
- Rare object recognition
- Non-English languages

---

## Citation

### BibTeX

```bibtex
@inproceedings{goyal2017making,
  title={Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering},
  author={Goyal, Yash and Khot, Tejas and Summers-Stay, Douglas and Batra, Dhruv and Parikh, Devi},
  booktitle={CVPR},
  year={2017}
}
```

### Original VQA v2 Dataset

- **Homepage**: https://visualqa.org/
- **Paper**: https://arxiv.org/abs/1612.00837
- **License**: CC BY 4.0

---

## Additional Information

### Dataset Curators

Curated from VQA v2 by [Your Name/Organization]

### Licensing

This dataset is released under **CC BY 4.0**, consistent with the original VQA v2 license.

### Contact

For questions or issues, please contact [your email/GitHub].

---

## Usage Example

### Loading the Dataset

```python
import json

import pandas as pd
from PIL import Image

# Load metadata
with open("gen_vqa_v2/qa_pairs.json", "r") as f:
    data = json.load(f)

# Or use the CSV
df = pd.read_csv("gen_vqa_v2/metadata.csv")

# Access a sample
sample = data[0]
image = Image.open(f"gen_vqa_v2/{sample['image_path']}")
question = sample['question']
answer = sample['answer']

print(f"Q: {question}")
print(f"A: {answer}")
```

### Training Split

```python
from sklearn.model_selection import train_test_split

# 80-20 train-val split
train_data, val_data = train_test_split(data, test_size=0.2, random_state=42)
```

A minimal PyTorch dataset wrapper built on top of this split is sketched at the end of this card.

---

## Acknowledgments

- **VQA v2 Team**: Goyal et al. for the original dataset
- **MSCOCO Team**: Lin et al. for the image dataset
- **Community**: Open-source VQA research community
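---

## Extended Usage Example: PyTorch Dataset Wrapper

As referenced in the Training Split section above, the following is a minimal sketch of a classification-style dataset wrapper. It assumes PyTorch and torchvision are installed; the class name `VQACuratedDataset`, the transform choices, and the answer-to-index vocabulary construction are illustrative, not part of the released dataset.

```python
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms


class VQACuratedDataset(Dataset):
    """Pairs images with questions and answer-class indices (illustrative sketch)."""

    def __init__(self, samples, root="gen_vqa_v2", answer2idx=None, image_size=224):
        self.samples = samples
        self.root = root
        # Build an answer vocabulary from this split if one is not supplied.
        if answer2idx is None:
            answers = sorted({s["answer"] for s in samples})
            answer2idx = {a: i for i, a in enumerate(answers)}
        self.answer2idx = answer2idx
        self.transform = transforms.Compose([
            transforms.Resize((image_size, image_size)),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]
        image = Image.open(f"{self.root}/{sample['image_path']}").convert("RGB")
        return {
            "image": self.transform(image),
            "question": sample["question"],
            "label": self.answer2idx[sample["answer"]],
        }


# Usage (assumes `train_data` from the train/val split above):
# train_ds = VQACuratedDataset(train_data)
# print(len(train_ds), train_ds[0]["label"])
```

For validation, pass the `answer2idx` built on the training split so class indices stay consistent across splits.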