---
license: mit
datasets:
- tsunghanwu/reverse-instruct-1.3m
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---
# REVERSE-LLaVA-MORE-8B
<a href="https://arxiv.org/abs/2504.13169">
<img src="https://img.shields.io/badge/arXiv-2504.13169-b31b1b.svg" alt="arXiv" />
</a>
## Model Summary
REVERSE-LLaVA-MORE-8B is an open-source vision-language model (VLM) that interleaves next-token prediction with self-verification and self-correction during generation. It is built upon LLaVA-MORE (LLaVA with LLaMA-3.1) and fine-tuned on the REVERSE Visual Instruct 1.3M dataset. The model is equipped with a retrospective resampling mechanism that detects and corrects hallucinations on the fly. Training was conducted in early March 2025.
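The exact detection and resampling procedure is defined in the paper and implemented in the GitHub repository. The snippet below is only a hypothetical sketch of the general idea, under the assumption that Ο„ (the value varied in the tables below) acts as the sensitivity threshold for the hallucination detector; the helper methods are illustrative, not the repo's API:

```python
# Schematic sketch only (not the official implementation). It illustrates
# threshold-triggered retrospective resampling: each proposed span gets a
# hallucination score, and spans scoring above tau are rolled back and
# regenerated. The real special tokens, detector, and backtracking rule
# live in https://github.com/tsunghan-wu/reverse_vlm.

def generate_with_retrospection(model, prompt_tokens, tau=0.003, max_retries=3):
    tokens = list(prompt_tokens)
    while not model.done(tokens):                      # hypothetical helper
        span, score = model.propose_span(tokens)       # hypothetical helper
        retries = 0
        # If the span looks hallucinated (score above tau), rewind and resample it.
        while score > tau and retries < max_retries:
            span, score = model.propose_span(tokens, resample=True)
            retries += 1
        tokens.extend(span)
    return tokens
```

Under this reading, a smaller Ο„ triggers correction more aggressively, which matches the trade-off in the tables below: stronger hallucination reduction at the cost of some coverage.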
## Performance
REVERSE-LLaVA-MORE-8B delivers **substantial reductions in hallucination** across multiple captioning and open-ended VQA benchmarks:
| Benchmark | Metric | Best Baseline | REVERSE (Ο„=0.003) | REVERSE (Ο„=0.0003) |
| ------------ | ----------------------------- | ---------------- | ----------------- | ------------------ |
| CHAIR-MSCOCO | CHAIR (↓) | DoLA (13.8) | 12.2 | **8.4** |
| | CHAIRs (↓) | DoLA (51.8) | 42.4 | **25.2** |
| AMBER-G | Hallucination (↓) | Woodpecker (7.4) | 6.5 | **5.1** |
| | Coverage (↑) | DoLA (53.1) | **54.8** | 38.9 |
| MMHal-Bench | Score (↑) | DoLA (2.54) | 2.28 | **2.93** |
| | Hallucination Rate (↓) | DoLA (0.51) | 0.54 | **0.40** |
| HaloQuest | Avg. Accuracy (↑) | DoLA (22.8) | 26.7 | **36.7** |
| | False Premise Acc. (↑) | DoLA (15.5) | 30.0 | **39.5** |
|              | Visually Challenging Acc. (↑) | **DoLA (45.1)**  | 31.3              | 30.9               |
| | Insufficient Context Acc. (↑) | DoLA (7.4) | 11.7 | **38.1** |
On discriminative tasks, REVERSE-LLaVA-MORE-8B remains competitive with its base VLM:
| Benchmark | Metric | LLaVA-MORE-8B | REVERSE (Ο„=0.5) |
| ------------ | ----------------------------- | ---------------- | ---------------- |
| AMBER-D | F1 Score (↑) | **71.6** | 69.3 |
| POPE | F1 Score (↑) | **85.1** | 84.4 |
| MME-Hall | Score (↑) | **678.3** | 657.6 |
## Usage
Please refer to the installation guide on GitHub to get started:
πŸ‘‰ [Installation Guide](https://github.com/tsunghan-wu/reverse_vlm)
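If you only need the checkpoint files locally (for example, to run the repo's inference or evaluation scripts), they can be fetched with `huggingface_hub`. The repository id below is assumed to match this model card; adjust it if needed:

```python
from huggingface_hub import snapshot_download

# Download the REVERSE-LLaVA-MORE-8B weights from the Hugging Face Hub.
# repo_id is an assumption based on this model card's location.
local_dir = snapshot_download(repo_id="tsunghanwu/reverse_llava_more")
print(f"Checkpoint downloaded to: {local_dir}")

# Inference and evaluation are run through the scripts in the GitHub repo
# (https://github.com/tsunghan-wu/reverse_vlm); see its installation guide.
```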
## Additional Resources
- πŸ“„ Project Page: [https://reverse-vlm.github.io/](https://reverse-vlm.github.io/)
- 🧾 Dataset: [REVERSE Visual Instruct 1.3M](https://huggingface.co/datasets/tsunghanwu/reverse-instruct-1.3m)
- πŸ”§ Ask Questions: [GitHub Issues](https://github.com/tsunghan-wu/reverse_vlm/issues)
## Intended Use
**Primary Use Cases:**
- Reducing hallucination in image captioning and open-ended VQA
- Evaluating hallucination-aware generation strategies
- Research on grounded and trustworthy multimodal reasoning
**Target Users:**
Researchers, developers, and students working on VLMs, hallucination mitigation, and vision-language alignment.