---
license: mit
datasets:
- tsunghanwu/reverse-instruct-1.3m
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---

# REVERSE-LLaVA-MORE-8B

<a href="https://arxiv.org/abs/2504.13169">
  <img src="https://img.shields.io/badge/arXiv-2504.13169-b31b1b.svg" alt="arXiv" />
</a>

## Model Summary

REVERSE-LLaVA-MORE-8B is an open-source vision-language model (VLM) that interleaves next-token prediction with self-verification and self-correction during generation. It is built on LLaVA-MORE (LLaVA with LLaMA-3.1) and fine-tuned on the REVERSE Visual Instruct 1.3M dataset. A retrospective resampling mechanism detects likely hallucinations during decoding and rewinds generation to correct them on the fly. Training was conducted in early March 2025.
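The retrospective resampling idea can be pictured as a decode loop that scores each proposed token for visual grounding and resamples (or backtracks) when the score falls below a threshold Ο„. The sketch below is a hypothetical simplification, not the authors' implementation: `propose` and `verify` stand in for the model's generation and verification behavior, and the paper's exact definition of Ο„ may differ.

```python
def retrospective_resampling_decode(propose, verify, tau, max_len=64, max_retries=3):
    """Toy sketch of hallucination-aware decoding.

    propose(tokens)        -> candidate next token (str)
    verify(tokens, token)  -> confidence in [0, 1] that the token is grounded
    tau                    -> acceptance threshold (hypothetical stand-in)
    """
    tokens = []
    while len(tokens) < max_len:
        for _ in range(max_retries):
            candidate = propose(tokens)
            if verify(tokens, candidate) >= tau:
                break  # token passes the verifier: accept it
        else:
            # every retry was flagged: backtrack one token and resample from there
            if tokens:
                tokens.pop()
            continue
        if candidate == "<eos>":
            break
        tokens.append(candidate)
    return tokens
```

In this toy form, raising Ο„ rejects candidate tokens more aggressively; in the actual model, the verification signal is produced during generation itself rather than by a separate scoring function.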

## Performance

REVERSE-LLaVA-MORE-8B delivers **strong gains in hallucination reduction** across multiple captioning and open-ended VQA benchmarks:

| Benchmark    | Metric                        | Best Baseline    | REVERSE (Ο„=0.003) | REVERSE (Ο„=0.0003) |
| ------------ | ----------------------------- | ---------------- | ----------------- | ------------------ |
| CHAIR-MSCOCO | CHAIR (↓)                     | DoLA (13.8)      | 12.2              | **8.4**            |
|              | CHAIRs (↓)                    | DoLA (51.8)      | 42.4              | **25.2**           |
| AMBER-G      | Hallucination (↓)             | Woodpecker (7.4) | 6.5               | **5.1**            |
|              | Coverage (↑)                  | DoLA (53.1)      | **54.8**          | 38.9               |
| MMHal-Bench  | Score (↑)                     | DoLA (2.54)      | 2.28              | **2.93**           |
|              | Hallucination Rate (↓)        | DoLA (0.51)      | 0.54              | **0.40**           |
| HaloQuest    | Avg. Accuracy (↑)             | DoLA (22.8)      | 26.7              | **36.7**           |
|              | False Premise Acc. (↑)        | DoLA (15.5)      | 30.0              | **39.5**           |
|              | Visual Challenging Acc. (↑)   | **DoLA (45.1)**  | 31.3              | 30.9               |
|              | Insufficient Context Acc. (↑) | DoLA (7.4)       | 11.7              | **38.1**           |

On discriminative tasks, REVERSE-LLaVA-MORE-8B performs competitively with its base VLM:

| Benchmark    | Metric                        | LLaVA-MORE-8B    | REVERSE (Ο„=0.5) |
| ------------ | ----------------------------- | ---------------- | ---------------- |
| AMBER-D      | F1 Score (↑)                  | **71.6**         | 69.3             |
| POPE         | F1 Score (↑)                  | **85.1**         | 84.4             |
| MME-Hall     | Score (↑)                     | **678.3**        | 657.6            |

## Usage

Please refer to the installation guide on GitHub to get started:  
πŸ‘‰ [Installation Guide](https://github.com/tsunghan-wu/reverse_vlm)

## Additional Resources

- πŸ“„ Project Page: [https://reverse-vlm.github.io/](https://reverse-vlm.github.io/)
- 🧾 Dataset: [REVERSE Visual Instruct 1.3M](https://huggingface.co/datasets/tsunghanwu/reverse-instruct-1.3m)
- πŸ”§ Ask Questions: [GitHub Issues](https://github.com/tsunghan-wu/reverse_vlm/issues)

## Intended Use

**Primary Use Cases:**  
- Reducing hallucination in image captioning and open-ended VQA  
- Evaluating hallucination-aware generation strategies  
- Research on grounded and trustworthy multimodal reasoning

**Target Users:**  
Researchers, developers, and students working on VLMs, hallucination mitigation, and vision-language alignment.