--- |
|
|
license: mit |
|
|
datasets: |
|
|
- tsunghanwu/reverse-instruct-1.3m |
|
|
base_model: |
|
|
- meta-llama/Llama-3.1-8B-Instruct |
|
|
--- |
|
|
|
|
|
# REVERSE-LLaVA-MORE-8B |
|
|
|
|
|
<a href="https://arxiv.org/abs/2504.13169"> |
|
|
<img src="https://img.shields.io/badge/arXiv-2504.13169-b31b1b.svg" alt="arXiv" /> |
|
|
</a> |
|
|
|
|
|
## Model Summary |
|
|
|
|
|
REVERSE-LLaVA-MORE-8B is an open-source vision-language model (VLM) that performs both next-token prediction and self-verification/self-correction during generation. It is built upon LLaVA-MORE (LLaVA with LLaMA-3.1) and fine-tuned on the REVERSE Visual Instruct 1.3M dataset. The model is equipped with a retrospective resampling mechanism that detects and corrects hallucinations on the fly. Training was conducted in early March 2025.
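
The verification threshold τ reported in the tables below controls how aggressively suspect phrases are rejected and regenerated. As an illustration only (not the repository's actual API), the sketch below mimics the retrospective-resampling control flow with toy stub functions: generate a phrase, score how likely it is hallucinated via the model's learned verification signal, and resample when that score exceeds τ.

```python
import random

# Illustrative sketch only: the real model emits learned verification
# tokens during decoding; here both the sampler and the hallucination
# score are random stubs so the control flow runs standalone.
def sample_phrase(context: str) -> str:
    return random.choice(["a red car", "two dogs", "a crowded street"])

def hallucination_score(context: str, phrase: str) -> float:
    # Stand-in for the probability the model assigns to its
    # "unconfident" marker after generating the phrase.
    return random.random()

def generate_with_retrospective_resampling(prompt: str, tau: float = 0.003,
                                           num_phrases: int = 4,
                                           max_retries: int = 3) -> str:
    phrases = []
    for _ in range(num_phrases):
        context = prompt + " " + " ".join(phrases)
        phrase = sample_phrase(context)
        retries = 0
        # Smaller tau = stricter verification = more frequent resampling.
        while hallucination_score(context, phrase) > tau and retries < max_retries:
            phrase = sample_phrase(context)  # backtrack and try again
            retries += 1
        phrases.append(phrase)
    return " ".join(phrases)

print(generate_with_retrospective_resampling("Describe the image:"))
```

In this toy version the score is random, so resampling triggers often; in the actual model the signal comes from the verification behavior learned on the REVERSE Visual Instruct 1.3M data.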
|
|
|
|
|
## Performance |
|
|
|
|
|
REVERSE-LLaVA-MORE-8B delivers **strong performance gains** in hallucination reduction across multiple captioning and open-ended VQA benchmarks: |
|
|
|
|
|
| Benchmark | Metric | Best Baseline | REVERSE (τ=0.003) | REVERSE (τ=0.0003) |
| ------------ | ----------------------------- | ---------------- | ----------------- | ------------------ |
| CHAIR-MSCOCO | CHAIR (↓) | DoLA (13.8) | 12.2 | **8.4** |
| | CHAIRs (↓) | DoLA (51.8) | 42.4 | **25.2** |
| AMBER-G | Hallucination (↓) | Woodpecker (7.4) | 6.5 | **5.1** |
| | Coverage (↑) | DoLA (53.1) | **54.8** | 38.9 |
| MMHal-Bench | Score (↑) | DoLA (2.54) | 2.28 | **2.93** |
| | Hallucination Rate (↓) | DoLA (0.51) | 0.54 | **0.40** |
| HaloQuest | Avg. Accuracy (↑) | DoLA (22.8) | 26.7 | **36.7** |
| | False Premise Acc. (↑) | DoLA (15.5) | 30.0 | **39.5** |
| | Visual Challenging Acc. (↑) | **DoLA (45.1)** | 31.3 | 30.9 |
| | Insufficient Context Acc. (↑) | DoLA (7.4) | 11.7 | **38.1** |
|
|
|
|
|
On discriminative tasks, REVERSE-LLaVA-MORE-8B performs competitively with its base VLM:
|
|
|
|
|
| Benchmark | Metric | LLaVA-MORE-8B | REVERSE (τ=0.5) |
| ------------ | ----------------------------- | ---------------- | ---------------- |
| AMBER-D | F1 Score (↑) | **71.6** | 69.3 |
| POPE | F1 Score (↑) | **85.1** | 84.4 |
| MME-Hall | Score (↑) | **678.3** | 657.6 |
|
|
|
|
|
## Usage |
|
|
|
|
|
Please refer to the installation guide on GitHub to get started: |
|
|
[Installation Guide](https://github.com/tsunghan-wu/reverse_vlm)
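
As a minimal sketch (not taken from the official instructions), the checkpoint can also be fetched locally with `huggingface_hub` before running the GitHub codebase; the `repo_id` below is assumed from this card's title and may need to be adjusted.

```python
from huggingface_hub import snapshot_download

# Assumed repo id (derived from this card's title); replace it with the
# actual Hub id if it differs. The downloaded directory can then be
# used with the inference/evaluation scripts in the reverse_vlm repo.
local_dir = snapshot_download(repo_id="tsunghanwu/REVERSE-LLaVA-MORE-8B")
print(f"Checkpoint downloaded to: {local_dir}")
```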
|
|
|
|
|
## Additional Resources |
|
|
|
|
|
- Project Page: [https://reverse-vlm.github.io/](https://reverse-vlm.github.io/)
|
|
- Dataset: [REVERSE Visual Instruct 1.3M](https://huggingface.co/datasets/tsunghanwu/reverse-instruct-1.3m)
|
|
- Ask Questions: [GitHub Issues](https://github.com/tsunghan-wu/reverse_vlm/issues)
|
|
|
|
|
## Intended Use |
|
|
|
|
|
**Primary Use Cases:** |
|
|
- Reducing hallucination in image captioning and open-ended VQA |
|
|
- Evaluating hallucination-aware generation strategies |
|
|
- Research on grounded and trustworthy multimodal reasoning |
|
|
|
|
|
**Target Users:** |
|
|
Researchers, developers, and students working on VLMs, hallucination mitigation, and vision-language alignment. |