# VQA v2 Curated Dataset for Spatial Reasoning

## Dataset Description

This is a **curated and balanced subset** of the VQA v2 (Visual Question Answering v2.0) dataset, specifically preprocessed for training visual question answering models with enhanced spatial reasoning capabilities.

### Dataset Summary

- **Source**: VQA v2 (MSCOCO train2014 split)
- **Task**: Visual Question Answering
- **Language**: English
- **License**: CC BY 4.0 (inherited from VQA v2)

### Key Features

✨ **Quality-Focused Curation**:
- Filtered out ambiguous yes/no questions
- Removed vague questions ("what is in the image", etc.)
- Answer length limited to 5 words / 30 characters
- Minimum answer frequency threshold (20 occurrences)

🎯 **Balanced Distribution**:
- Maximum 600 samples per answer class
- Prevents model bias toward common answers
- Ensures diverse question-answer coverage

📊 **Dataset Statistics**:
- **Total Q-A pairs**: ~[Your final count from running the script]
- **Unique answers**: ~[Number of unique answer classes]
- **Images**: MSCOCO train2014 subset
- **Format**: JSON + CSV metadata

---
## Dataset Structure

### Data Fields

Each sample contains:

```json
{
  "image_id": 123456,       // MSCOCO image ID
  "question_id": 789012,    // VQA v2 question ID
  "question": "What color is the car?",
  "answer": "red",          // Most frequent answer from annotators
  "image_path": "images/COCO_train2014_000000123456.jpg"
}
```

### Data Splits

- **Training**: Main dataset (recommend 80-90% for training)
- **Validation**: User-defined split (recommend 10-20% for validation)

### File Structure

```
gen_vqa_v2/
├── images/              # MSCOCO train2014 images
│   └── COCO_train2014_*.jpg
├── qa_pairs.json        # Question-answer pairs (JSON)
└── metadata.csv         # Same data in CSV format
```

---
## Data Preprocessing

### Filtering Criteria

**Excluded Answers**:
- Generic responses: `yes`, `no`, `unknown`, `none`, `n/a`, `cant tell`, `not sure`

**Excluded Questions**:
- Ambiguous queries: "what is in the image", "what is this", "what is that", "what do you see"

**Answer Constraints**:
- Maximum 5 words per answer
- Maximum 30 characters per answer
- Minimum frequency: 20 occurrences across the dataset

**Balancing Strategy**:
- Maximum 600 samples per answer class
- Prevents over-representation of common answers (e.g., "white", "2")

### Preprocessing Script

The dataset was generated using `genvqa-dataset.py`:

```python
# Key parameters
MIN_ANSWER_FREQ = 20          # Minimum answer occurrences
MAX_SAMPLES_PER_ANSWER = 600  # Class balancing limit
```
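The full curation pass described above can be sketched as follows. This is an illustrative reconstruction, not the actual `genvqa-dataset.py`; the `curate` function and its field names are assumptions based on the data fields listed in this card.

```python
from collections import Counter

# Assumed constants, mirroring the filtering criteria above.
MIN_ANSWER_FREQ = 20            # drop answers seen fewer than 20 times
MAX_SAMPLES_PER_ANSWER = 600    # per-class cap for balancing

EXCLUDED_ANSWERS = {"yes", "no", "unknown", "none", "n/a",
                    "cant tell", "not sure"}
EXCLUDED_QUESTIONS = {"what is in the image", "what is this",
                      "what is that", "what do you see"}

def curate(pairs):
    """pairs: list of dicts with 'question' and 'answer' keys."""
    # 1. Quality filters: generic answers, vague questions, long answers.
    kept = [
        p for p in pairs
        if p["answer"] not in EXCLUDED_ANSWERS
        and p["question"].lower().rstrip("?") not in EXCLUDED_QUESTIONS
        and len(p["answer"].split()) <= 5
        and len(p["answer"]) <= 30
    ]
    # 2. Minimum answer frequency across the dataset.
    freq = Counter(p["answer"] for p in kept)
    kept = [p for p in kept if freq[p["answer"]] >= MIN_ANSWER_FREQ]
    # 3. Cap each answer class at MAX_SAMPLES_PER_ANSWER.
    seen = Counter()
    balanced = []
    for p in kept:
        if seen[p["answer"]] < MAX_SAMPLES_PER_ANSWER:
            seen[p["answer"]] += 1
            balanced.append(p)
    return balanced
```

Note that the per-class cap keeps the *first* samples encountered for each answer; shuffling the input first would make the cap an unbiased subsample.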
---

## Intended Use

### Primary Use Cases

✅ **Training VQA Models**:
- Visual question answering systems
- Multimodal vision-language models
- Spatial reasoning research

✅ **Research Applications**:
- Evaluating spatial understanding in VQA
- Studying answer distribution bias
- Benchmarking ensemble architectures

### Out-of-Scope Use

❌ Medical diagnosis or safety-critical applications
❌ Surveillance or privacy-invasive systems
❌ Generating misleading or harmful content
## Dataset Creation

### Source Data

**VQA v2 Dataset**:
- **Paper**: [Making the V in VQA Matter](https://arxiv.org/abs/1612.00837)
- **Authors**: Goyal et al. (2017)
- **Images**: MSCOCO train2014
- **Original Size**: 443,757 question-answer pairs (train split)

### Curation Rationale

This curated subset addresses common VQA training challenges:

1. **Bias Reduction**: Limits over-represented answers
2. **Quality Control**: Removes ambiguous/uninformative samples
3. **Spatial Focus**: Retains questions requiring spatial reasoning
4. **Practical Constraints**: Focuses on concise, specific answers

### Annotations

Annotations are inherited from VQA v2:

- 10 answers per question from human annotators
- **Answer selection**: Most frequent answer among annotators
- **Consensus**: Majority voting for ground truth
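The majority-vote answer selection above reduces to a one-liner. A minimal sketch (the `select_answer` helper is illustrative; VQA v2's annotation files store the ten annotator answers as a list per question):

```python
from collections import Counter

def select_answer(annotator_answers):
    """Pick the most frequent of the ~10 annotator answers.
    Ties resolve to the answer seen first, per Counter.most_common."""
    return Counter(annotator_answers).most_common(1)[0][0]
```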
---

## Considerations for Using the Data

### Social Impact

This dataset inherits biases from MSCOCO and VQA v2:

- **Geographic bias**: Primarily Western/North American scenes
- **Cultural bias**: Limited representation of global diversity
- **Object bias**: Common objects over-represented

### Limitations

⚠️ **Known Issues**:
- Answer distribution is still skewed toward common objects and counts (e.g., "white", "2"), despite the per-class cap
- Spatial reasoning questions may be underrepresented
- Some questions may have multiple valid answers

⚠️ **Not Suitable For**:
- Fine-grained visual reasoning (e.g., "How many stripes on the 3rd zebra?")
- Rare object recognition
- Non-English languages
| ## Citation | |
| ### BibTeX | |
| ```bibtex | |
| @inproceedings{goyal2017making, | |
| title={Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering}, | |
| author={Goyal, Yash and Khot, Tejas and Summers-Stay, Douglas and Batra, Dhruv and Parikh, Devi}, | |
| booktitle={CVPR}, | |
| year={2017} | |
| } | |
| ``` | |
| ### Original VQA v2 Dataset | |
| - **Homepage**: https://visualqa.org/ | |
| - **Paper**: https://arxiv.org/abs/1612.00837 | |
| - **License**: CC BY 4.0 | |
| --- | |
| ## Additional Information | |
| ### Dataset Curators | |
| Curated from VQA v2 by [Your Name/Organization] | |
| ### Licensing | |
| This dataset is released under **CC BY 4.0**, consistent with the original VQA v2 license. | |
| ### Contact | |
| For questions or issues, please contact [your email/GitHub]. | |
| --- | |
| ## Usage Example | |
| ### Loading the Dataset | |
| ```python | |
| import json | |
| import pandas as pd | |
| from PIL import Image | |
| # Load metadata | |
| with open("gen_vqa_v2/qa_pairs.json", "r") as f: | |
| data = json.load(f) | |
| # Or use CSV | |
| df = pd.read_csv("gen_vqa_v2/metadata.csv") | |
| # Access a sample | |
| sample = data[0] | |
| image = Image.open(f"gen_vqa_v2/{sample['image_path']}") | |
| question = sample['question'] | |
| answer = sample['answer'] | |
| print(f"Q: {question}") | |
| print(f"A: {answer}") | |
| ``` | |
| ### Training Split | |
| ```python | |
| from sklearn.model_selection import train_test_split | |
| # 80-20 train-val split | |
| train_data, val_data = train_test_split(data, test_size=0.2, random_state=42) | |
| ``` | |
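Because every answer class occurs at least 20 times, a stratified split can preserve the balanced answer distribution across train and validation. A sketch (the `stratified_split` helper is an illustration, not part of the released scripts):

```python
from sklearn.model_selection import train_test_split

def stratified_split(data, test_size=0.2, seed=42):
    """Split so train/val keep the same answer distribution.
    Safe here because every answer class has >= 20 samples."""
    labels = [sample["answer"] for sample in data]
    return train_test_split(data, test_size=test_size,
                            random_state=seed, stratify=labels)
```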
---

## Acknowledgments

- **VQA v2 Team**: Goyal et al. for the original dataset
- **MSCOCO Team**: Lin et al. for the image dataset
- **Community**: Open-source VQA research community