---
license: mit
task_categories:
  - visual-question-answering
  - question-answering
language:
  - en
pretty_name: REVERSE Visual Instruct 1.3M
size_categories:
  - 100K<n<1M
---

# REVERSE Visual Instruct 1.3M

## Dataset Summary

**Dataset Type:**  
REVERSE Visual Instruct 1.3M is a GPT-generated instruction-following dataset designed for training hallucination-aware vision-language models (VLMs). It builds on the LLaVA Instruct 665K dataset and includes structured annotations to indicate model confidence. We introduce three special tokens:  
- `<SPAN>`: opens a key phrase  
- `</CN>`: closes a key phrase that is confident (grounded)  
- `</UN>`: closes a key phrase that is unconfident (potentially hallucinated)  

Roughly 50% of the examples retain correct (grounded) phrases from the original LLaVA dataset, while the remaining 50% contain hallucinated or incorrect phrases produced with GPT-4o-mini (2024-07-18 snapshot) and rule-based augmentations.
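
For illustration, the snippet below shows a hypothetical annotated response (the example text is invented, not drawn from the dataset) and how the three tags can be parsed:

```python
import re

# Hypothetical response in the REVERSE annotation format: each key phrase
# opens with <SPAN> and closes with </CN> (confident) or </UN> (unconfident).
response = (
    "The image shows <SPAN>a red stop sign</CN> mounted on "
    "<SPAN>a wooden pole</UN> near the intersection."
)

# Extract every (phrase, confidence tag) pair.
for phrase, tag in re.findall(r"<SPAN>(.*?)</(CN|UN)>", response):
    label = "grounded" if tag == "CN" else "potentially hallucinated"
    print(f"{phrase!r}: {label}")
```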

**Collection Date:**  
February 2025

**Data Generation:**  
Incorrect examples were synthesized via a combination of GPT-4o-mini prompts and deterministic, rule-based edits. Correct examples are drawn directly from the original LLaVA Visual Instruct dataset. Please refer to our [GitHub Repo](https://github.com/tsunghan-wu/reverse_vlm) for more details.
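
The rule-based half can be pictured as simple deterministic perturbations. Below is a minimal sketch assuming an object-swap rule; the helper `make_negative` and the distractor table are invented for illustration, and the actual pipeline in the repo may differ:

```python
# A minimal sketch of a deterministic, rule-based edit (assumed object-swap
# rule; the real augmentation pipeline may differ). Swapping a grounded
# object for a distractor turns a confident phrase into an unconfident
# (hallucinated) one.
DISTRACTORS = {"stop sign": "yield sign", "dog": "cat", "bicycle": "motorcycle"}

def make_negative(phrase: str) -> str:
    """Swap one known object for a distractor and close the span with </UN>."""
    for obj, swap in DISTRACTORS.items():
        if obj in phrase:
            return f"<SPAN>{phrase.replace(obj, swap)}</UN>"
    return f"<SPAN>{phrase}</CN>"  # nothing to swap: keep the grounded label

print(make_negative("a red stop sign"))  # -> <SPAN>a red yield sign</UN>
```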

**License:**  
- Augmented portion: MIT License  
- Base dataset (LLaVA): [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)  
- Note: Both the base and augmented data were generated in part with OpenAI APIs, so users should also comply with [OpenAI's Terms of Use](https://openai.com/policies/terms-of-use).

**Project Page:**  
[https://reverse-vlm.github.io](https://reverse-vlm.github.io)

**Support & Issues:**  
[GitHub Issues](https://github.com/tsunghan-wu/reverse_vlm/issues)

## Intended Use

**Primary Use Cases:**  
- Research on hallucination detection and mitigation in VLMs  
- Development and benchmarking of trustworthy vision-language assistants  
- Instruction tuning for multi-modal dialogue agents

**Target Users:**  
Researchers, practitioners, and hobbyists working in computer vision, natural language processing, and multi-modal AI.
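
If the dataset is hosted on the Hugging Face Hub, it can be loaded with the `datasets` library; the repository id below is a placeholder, so substitute the actual id from the project page:

```python
from datasets import load_dataset  # pip install datasets

# Placeholder repository id; replace with the real one from the project page.
ds = load_dataset("tsunghan-wu/REVERSE-Instruct-1.3M", split="train")
print(ds[0])  # one instruction-following example with confidence annotations
```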