# VQA v2 Curated Dataset for Spatial Reasoning

## Dataset Description

This is a **curated and balanced subset** of the VQA v2 (Visual Question Answering v2.0) dataset, specifically preprocessed for training visual question answering models with enhanced spatial reasoning capabilities.

### Dataset Summary

- **Source**: VQA v2 (MSCOCO train2014 split)
- **Task**: Visual Question Answering
- **Language**: English
- **License**: CC BY 4.0 (inherited from VQA v2)

### Key Features

✨ **Quality-Focused Curation**:
- Filtered out ambiguous yes/no questions
- Removed vague questions ("what is in the image", etc.)
- Answer length limited to 5 words / 30 characters
- Minimum answer frequency threshold (20 occurrences)

🎯 **Balanced Distribution**:
- Maximum 600 samples per answer class
- Prevents model bias toward common answers
- Ensures diverse question-answer coverage

📊 **Dataset Statistics**:
- **Total Q-A pairs**: ~[Your final count from running the script]
- **Unique answers**: ~[Number of unique answer classes]
- **Images**: MSCOCO train2014 subset
- **Format**: JSON + CSV metadata

---
## Dataset Structure

### Data Fields

Each sample contains:

```json
{
  "image_id": 123456,       // MSCOCO image ID
  "question_id": 789012,    // VQA v2 question ID
  "question": "What color is the car?",
  "answer": "red",          // Most frequent answer from annotators
  "image_path": "images/COCO_train2014_000000123456.jpg"
}
```

### Data Splits

- **Training**: Main dataset (recommend 80-90% for training)
- **Validation**: User-defined split (recommend 10-20% for validation)

### File Structure

```
gen_vqa_v2/
├── images/              # MSCOCO train2014 images
│   └── COCO_train2014_*.jpg
├── qa_pairs.json        # Question-answer pairs (JSON)
└── metadata.csv         # Same data in CSV format
```

---
## Data Preprocessing

### Filtering Criteria

**Excluded Answers**:
- Generic responses: `yes`, `no`, `unknown`, `none`, `n/a`, `cant tell`, `not sure`

**Excluded Questions**:
- Ambiguous queries: "what is in the image", "what is this", "what is that", "what do you see"

**Answer Constraints**:
- Maximum 5 words per answer
- Maximum 30 characters per answer
- Minimum frequency: 20 occurrences across the dataset

**Balancing Strategy**:
- Maximum 600 samples per answer class
- Prevents over-representation of common answers (e.g., "white", "2")

### Preprocessing Script

The dataset was generated using `genvqa-dataset.py`:

```python
# Key parameters
MIN_ANSWER_FREQ = 20          # Minimum answer occurrences
MAX_SAMPLES_PER_ANSWER = 600  # Class balancing limit
```
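The full curation pass described above can be sketched as follows. This is an illustrative reconstruction, not the actual `genvqa-dataset.py`; the `curate` function and its field names are assumptions based on the data fields listed in this card.

```python
from collections import Counter

# Assumed constants, mirroring the filtering criteria above.
MIN_ANSWER_FREQ = 20            # drop answers seen fewer than 20 times
MAX_SAMPLES_PER_ANSWER = 600    # per-class cap for balancing

EXCLUDED_ANSWERS = {"yes", "no", "unknown", "none", "n/a",
                    "cant tell", "not sure"}
EXCLUDED_QUESTIONS = {"what is in the image", "what is this",
                      "what is that", "what do you see"}

def curate(pairs):
    """pairs: list of dicts with 'question' and 'answer' keys."""
    # 1. Quality filters: generic answers, vague questions, long answers.
    kept = [
        p for p in pairs
        if p["answer"] not in EXCLUDED_ANSWERS
        and p["question"].lower().rstrip("?") not in EXCLUDED_QUESTIONS
        and len(p["answer"].split()) <= 5
        and len(p["answer"]) <= 30
    ]
    # 2. Minimum answer frequency across the dataset.
    freq = Counter(p["answer"] for p in kept)
    kept = [p for p in kept if freq[p["answer"]] >= MIN_ANSWER_FREQ]
    # 3. Cap each answer class at MAX_SAMPLES_PER_ANSWER.
    seen = Counter()
    balanced = []
    for p in kept:
        if seen[p["answer"]] < MAX_SAMPLES_PER_ANSWER:
            seen[p["answer"]] += 1
            balanced.append(p)
    return balanced
```

Note that the per-class cap keeps the *first* samples encountered for each answer; shuffling the input first would make the cap an unbiased subsample.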
---

## Intended Use

### Primary Use Cases

✅ **Training VQA Models**:
- Visual question answering systems
- Multimodal vision-language models
- Spatial reasoning research

✅ **Research Applications**:
- Evaluating spatial understanding in VQA
- Studying answer distribution bias
- Benchmarking ensemble architectures

### Out-of-Scope Use

❌ Medical diagnosis or safety-critical applications
❌ Surveillance or privacy-invasive systems
❌ Generating misleading or harmful content
## Dataset Creation

### Source Data

**VQA v2 Dataset**:
- **Paper**: [Making the V in VQA Matter](https://arxiv.org/abs/1612.00837)
- **Authors**: Goyal et al. (2017)
- **Images**: MSCOCO train2014
- **Original Size**: 443,757 question-answer pairs (train split)

### Curation Rationale

This curated subset addresses common VQA training challenges:

1. **Bias Reduction**: Limits over-represented answers
2. **Quality Control**: Removes ambiguous/uninformative samples
3. **Spatial Focus**: Retains questions requiring spatial reasoning
4. **Practical Constraints**: Focuses on concise, specific answers

### Annotations

Annotations are inherited from VQA v2:

- 10 answers per question from human annotators
- **Answer selection**: Most frequent answer among annotators
- **Consensus**: Majority voting for ground truth
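The majority-vote answer selection above reduces to a one-liner. A minimal sketch (the `select_answer` helper is illustrative; VQA v2's annotation files store the ten annotator answers as a list per question):

```python
from collections import Counter

def select_answer(annotator_answers):
    """Pick the most frequent of the ~10 annotator answers.
    Ties resolve to the answer seen first, per Counter.most_common."""
    return Counter(annotator_answers).most_common(1)[0][0]
```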
---

## Considerations for Using the Data

### Social Impact

This dataset inherits biases from MSCOCO and VQA v2:

- **Geographic bias**: Primarily Western/North American scenes
- **Cultural bias**: Limited representation of global diversity
- **Object bias**: Common objects over-represented

### Limitations

⚠️ **Known Issues**:
- Answer distribution is still skewed toward common objects and counts (e.g., "white", "2"), despite the per-class cap
- Spatial reasoning questions may be underrepresented
- Some questions may have multiple valid answers

⚠️ **Not Suitable For**:
- Fine-grained visual reasoning (e.g., "How many stripes on the 3rd zebra?")
- Rare object recognition
- Non-English languages
| ## Citation | |
| ### BibTeX | |
| ```bibtex | |
| @inproceedings{goyal2017making, | |
| title={Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering}, | |
| author={Goyal, Yash and Khot, Tejas and Summers-Stay, Douglas and Batra, Dhruv and Parikh, Devi}, | |
| booktitle={CVPR}, | |
| year={2017} | |
| } | |
| ``` | |
| ### Original VQA v2 Dataset | |
| - **Homepage**: https://visualqa.org/ | |
| - **Paper**: https://arxiv.org/abs/1612.00837 | |
| - **License**: CC BY 4.0 | |
| --- | |
| ## Additional Information | |
| ### Dataset Curators | |
| Curated from VQA v2 by [Your Name/Organization] | |
| ### Licensing | |
| This dataset is released under **CC BY 4.0**, consistent with the original VQA v2 license. | |
| ### Contact | |
| For questions or issues, please contact [your email/GitHub]. | |
| --- | |
| ## Usage Example | |
| ### Loading the Dataset | |
| ```python | |
| import json | |
| import pandas as pd | |
| from PIL import Image | |
| # Load metadata | |
| with open("gen_vqa_v2/qa_pairs.json", "r") as f: | |
| data = json.load(f) | |
| # Or use CSV | |
| df = pd.read_csv("gen_vqa_v2/metadata.csv") | |
| # Access a sample | |
| sample = data[0] | |
| image = Image.open(f"gen_vqa_v2/{sample['image_path']}") | |
| question = sample['question'] | |
| answer = sample['answer'] | |
| print(f"Q: {question}") | |
| print(f"A: {answer}") | |
| ``` | |
| ### Training Split | |
| ```python | |
| from sklearn.model_selection import train_test_split | |
| # 80-20 train-val split | |
| train_data, val_data = train_test_split(data, test_size=0.2, random_state=42) | |
| ``` | |
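Because every answer class occurs at least 20 times, a stratified split can preserve the balanced answer distribution across train and validation. A sketch (the `stratified_split` helper is an illustration, not part of the released scripts):

```python
from sklearn.model_selection import train_test_split

def stratified_split(data, test_size=0.2, seed=42):
    """Split so train/val keep the same answer distribution.
    Safe here because every answer class has >= 20 samples."""
    labels = [sample["answer"] for sample in data]
    return train_test_split(data, test_size=test_size,
                            random_state=seed, stratify=labels)
```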
---

## Acknowledgments

- **VQA v2 Team**: Goyal et al. for the original dataset
- **MSCOCO Team**: Lin et al. for the image dataset
- **Community**: Open-source VQA research community