---
license: mit
datasets:
- tsunghanwu/reverse-instruct-1.3m
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---
# REVERSE-LLaVA-MORE-8B
<a href="https://arxiv.org/abs/2504.13169">
<img src="https://img.shields.io/badge/arXiv-2504.13169-b31b1b.svg" alt="arXiv" />
</a>
## Model Summary
REVERSE-LLaVA-MORE-8B is an open-source vision-language model (VLM) that performs both next-token prediction and self-verification / self-correction during generation. It is built upon LLaVA-MORE (LLaVA with LLaMA-3.1) and fine-tuned on the REVERSE Visual Instruct 1.3M dataset. The model is equipped with a retrospective resampling mechanism that detects and corrects hallucinations on the fly. Training was conducted in early March 2025.
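To give an intuition for threshold-triggered retrospective resampling, here is a minimal toy sketch (not the released inference code; `mock_step`, `tau`, and the unconfidence scores are illustrative assumptions): whenever a generated token's unconfidence exceeds the threshold τ, the step is rejected and resampled.

```python
# Toy sketch of threshold-triggered retrospective resampling.
# NOTE: this is an illustrative mock, not the model's actual decoding loop.
# A "step" returns (token, unconfidence); tokens whose unconfidence exceeds
# the threshold tau are rejected and resampled up to max_retries times.

def generate(step, prompt, tau, max_len=10, max_retries=5):
    """Generate up to max_len tokens, resampling low-confidence ones."""
    out = []
    while len(out) < max_len:
        for attempt in range(max_retries):
            token, unconf = step(prompt, out, attempt)
            if unconf <= tau:
                break  # confident enough: accept this token
        out.append(token)  # keep the last sample after retries
    return out

def mock_step(prompt, out, attempt):
    # Hypothetical generator: the first attempt at position 3 produces a
    # "hallucinated" token with high unconfidence; retries are confident.
    pos = len(out)
    if pos == 3 and attempt == 0:
        return ("purple_dog", 0.9)
    return (f"tok{pos}", 0.1)

# A strict threshold rejects the hallucinated token and resamples it;
# a very lenient threshold lets it through.
strict = generate(mock_step, "describe the image", tau=0.5)
lenient = generate(mock_step, "describe the image", tau=1.0)
```

Lowering τ makes rejection more aggressive, which matches the benchmark tables below: smaller τ trades some coverage for fewer hallucinations.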
## Performance
REVERSE-LLaVA-MORE-8B delivers **strong performance gains** in hallucination reduction across multiple captioning and open-ended VQA benchmarks:
| Benchmark | Metric | Best Baseline | REVERSE (τ=0.003) | REVERSE (τ=0.0003) |
| ------------ | ----------------------------- | ---------------- | ----------------- | ------------------ |
| CHAIR-MSCOCO | CHAIR (↓) | DoLA (13.8) | 12.2 | **8.4** |
| | CHAIRs (↓) | DoLA (51.8) | 42.4 | **25.2** |
| AMBER-G | Hallucination (↓) | Woodpecker (7.4) | 6.5 | **5.1** |
| | Coverage (↑) | DoLA (53.1) | **54.8** | 38.9 |
| MMHal-Bench | Score (↑) | DoLA (2.54) | 2.28 | **2.93** |
| | Hallucination Rate (↓) | DoLA (0.51) | 0.54 | **0.40** |
| HaloQuest | Avg. Accuracy (↑) | DoLA (22.8) | 26.7 | **36.7** |
| | False Premise Acc. (↑) | DoLA (15.5) | 30.0 | **39.5** |
| | Visual Challenging Acc. (↑) | **DoLA (45.1)** | 31.3 | 30.9 |
| | Insufficient Context Acc. (↑) | DoLA (7.4) | 11.7 | **38.1** |
On discriminative tasks, REVERSE-LLaVA-MORE performs competitively with its base VLM:
| Benchmark | Metric | LLaVA-MORE-8B | REVERSE (τ=0.5) |
| ------------ | ----------------------------- | ---------------- | ---------------- |
| AMBER-D | F1 Score (↑) | **71.6** | 69.3 |
| POPE | F1 Score (↑) | **85.1** | 84.4 |
| MME-Hall | Score (↑) | **678.3** | 657.6 |
## Usage
Please refer to the installation guide on GitHub to get started:
[Installation Guide](https://github.com/tsunghan-wu/reverse_vlm)
## Additional Resources
- Project Page: [https://reverse-vlm.github.io/](https://reverse-vlm.github.io/)
- Dataset: [REVERSE Visual Instruct 1.3M](https://huggingface.co/datasets/tsunghanwu/reverse-instruct-1.3m)
- Ask Questions: [GitHub Issues](https://github.com/tsunghan-wu/reverse_vlm/issues)
## Intended Use
**Primary Use Cases:**
- Reducing hallucination in image captioning and open-ended VQA
- Evaluating hallucination-aware generation strategies
- Research on grounded and trustworthy multimodal reasoning
**Target Users:**
Researchers, developers, and students working on VLMs, hallucination mitigation, and vision-language alignment.