--- |
|
|
license: mit |
|
|
datasets: |
|
|
- tsunghanwu/reverse-instruct-1.3m |
|
|
base_model: |
|
|
- meta-llama/Llama-3.1-8B-Instruct |
|
|
--- |
|
|
|
|
|
# REVERSE-LLaVA-MORE-8B |
|
|
|
|
|
<a href="https://arxiv.org/abs/2504.13169"> |
|
|
<img src="https://img.shields.io/badge/arXiv-2504.13169-b31b1b.svg" alt="arXiv" /> |
|
|
</a> |
|
|
|
|
|
## Model Summary |
|
|
|
|
|
REVERSE-LLaVA-MORE-8B is an open-source vision-language model (VLM) that performs both next-token prediction and self-verification/self-correction during generation. It is built upon LLaVA-MORE (LLaVA with LLaMA-3.1) and fine-tuned on the REVERSE Visual Instruct 1.3M dataset. The model is equipped with a retrospective resampling mechanism that detects and corrects hallucinations on the fly. Training was conducted in early March 2025.
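
The verification threshold τ reported in the tables below controls how aggressively suspect phrases are rejected and regenerated. As an illustration only (not the repository's actual API), the sketch below mimics the retrospective-resampling control flow with toy stub functions: generate a phrase, score how likely it is hallucinated via the model's learned verification signal, and resample when that score exceeds τ.

```python
import random

# Illustrative sketch only: the real model emits learned verification
# tokens during decoding; here both the sampler and the hallucination
# score are random stubs so the control flow runs standalone.
def sample_phrase(context: str) -> str:
    return random.choice(["a red car", "two dogs", "a crowded street"])

def hallucination_score(context: str, phrase: str) -> float:
    # Stand-in for the probability the model assigns to its
    # "unconfident" marker after generating the phrase.
    return random.random()

def generate_with_retrospective_resampling(prompt: str, tau: float = 0.003,
                                           num_phrases: int = 4,
                                           max_retries: int = 3) -> str:
    phrases = []
    for _ in range(num_phrases):
        context = prompt + " " + " ".join(phrases)
        phrase = sample_phrase(context)
        retries = 0
        # Smaller tau = stricter verification = more frequent resampling.
        while hallucination_score(context, phrase) > tau and retries < max_retries:
            phrase = sample_phrase(context)  # backtrack and try again
            retries += 1
        phrases.append(phrase)
    return " ".join(phrases)

print(generate_with_retrospective_resampling("Describe the image:"))
```

In this toy version the score is random, so resampling triggers often; in the actual model the signal comes from the verification behavior learned on the REVERSE Visual Instruct 1.3M data.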
|
|
|
|
|
## Performance |
|
|
|
|
|
REVERSE-LLaVA-MORE-8B delivers **strong performance gains** in hallucination reduction across multiple captioning and open-ended VQA benchmarks: |
|
|
|
|
|
| Benchmark | Metric | Best Baseline | REVERSE (τ=0.003) | REVERSE (τ=0.0003) |
| ------------ | ----------------------------- | ---------------- | ----------------- | ------------------ |
| CHAIR-MSCOCO | CHAIR (↓) | DoLA (13.8) | 12.2 | **8.4** |
| | CHAIRs (↓) | DoLA (51.8) | 42.4 | **25.2** |
| AMBER-G | Hallucination (↓) | Woodpecker (7.4) | 6.5 | **5.1** |
| | Coverage (↑) | DoLA (53.1) | **54.8** | 38.9 |
| MMHal-Bench | Score (↑) | DoLA (2.54) | 2.28 | **2.93** |
| | Hallucination Rate (↓) | DoLA (0.51) | 0.54 | **0.40** |
| HaloQuest | Avg. Accuracy (↑) | DoLA (22.8) | 26.7 | **36.7** |
| | False Premise Acc. (↑) | DoLA (15.5) | 30.0 | **39.5** |
| | Visual Challenging Acc. (↑) | **DoLA (45.1)** | 31.3 | 30.9 |
| | Insufficient Context Acc. (↑) | DoLA (7.4) | 11.7 | **38.1** |
|
|
|
|
|
On discriminative tasks, REVERSE-LLaVA-MORE-8B performs competitively with its base VLM:
|
|
|
|
|
| Benchmark | Metric | LLaVA-MORE-8B | REVERSE (τ=0.5) |
| ------------ | ----------------------------- | ---------------- | ---------------- |
| AMBER-D | F1 Score (↑) | **71.6** | 69.3 |
| POPE | F1 Score (↑) | **85.1** | 84.4 |
| MME-Hall | Score (↑) | **678.3** | 657.6 |
|
|
|
|
|
## Usage |
|
|
|
|
|
Please refer to the installation guide on GitHub to get started: |
|
|
[Installation Guide](https://github.com/tsunghan-wu/reverse_vlm)
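
As a minimal sketch (not taken from the official instructions), the checkpoint can also be fetched locally with `huggingface_hub` before running the GitHub codebase; the `repo_id` below is assumed from this card's title and may need to be adjusted.

```python
from huggingface_hub import snapshot_download

# Assumed repo id (derived from this card's title); replace it with the
# actual Hub id if it differs. The downloaded directory can then be
# used with the inference/evaluation scripts in the reverse_vlm repo.
local_dir = snapshot_download(repo_id="tsunghanwu/REVERSE-LLaVA-MORE-8B")
print(f"Checkpoint downloaded to: {local_dir}")
```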
|
|
|
|
|
## Additional Resources |
|
|
|
|
|
- Project Page: [https://reverse-vlm.github.io/](https://reverse-vlm.github.io/)
|
|
- Dataset: [REVERSE Visual Instruct 1.3M](https://huggingface.co/datasets/tsunghanwu/reverse-instruct-1.3m)
|
|
- Ask Questions: [GitHub Issues](https://github.com/tsunghan-wu/reverse_vlm/issues)
|
|
|
|
|
## Intended Use |
|
|
|
|
|
**Primary Use Cases:** |
|
|
- Reducing hallucination in image captioning and open-ended VQA |
|
|
- Evaluating hallucination-aware generation strategies |
|
|
- Research on grounded and trustworthy multimodal reasoning |
|
|
|
|
|
**Target Users:** |
|
|
Researchers, developers, and students working on VLMs, hallucination mitigation, and vision-language alignment. |