---
license: mit
---

# REVERSE-LLaVA-MORE-8B

## Model Summary

REVERSE-LLaVA-MORE-8B is an open-source vision-language model (VLM) that performs both next-token prediction and self-verification / self-correction during generation. It is built upon LLaVA-MORE (LLaVA with LLaMA-3.1) and fine-tuned on the REVERSE Visual Instruct 1.3M dataset. The model is equipped with a retrospective resampling mechanism that detects and corrects hallucinations on the fly. Training was conducted in early March 2025.
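The retrospective resampling idea can be illustrated with a toy decoding loop. This is a minimal sketch, not the model's actual implementation: `toy_generate`, the confidence scores, and the backtracking policy below are all hypothetical stand-ins for the model's learned verification signal and threshold τ.

```python
import random


def toy_generate(step: int) -> tuple[str, float]:
    """Hypothetical stand-in for one decoding step: returns a token
    plus a self-verification confidence score in [0, 1]."""
    random.seed(step)  # deterministic for illustration only
    return f"tok{step}", random.random()


def decode_with_retrospective_resampling(
    n_tokens: int, tau: float, max_retries: int = 5
) -> list[str]:
    """Generate tokens; whenever self-verification confidence for the
    newest token falls below the threshold tau, discard it and resample."""
    out: list[str] = []
    step = 0
    retries = 0
    while len(out) < n_tokens:
        token, confidence = toy_generate(step)
        step += 1
        if confidence < tau and retries < max_retries:
            # Self-verification flagged a likely hallucination:
            # drop the token (a real system may drop a longer span)
            # and resample before committing to the output.
            retries += 1
            continue
        retries = 0
        out.append(token)
    return out
```

Lowering τ makes the verifier stricter (fewer hallucinations, at the cost of more resampling), which mirrors the trade-off visible in the benchmark tables below.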

## Performance

REVERSE-LLaVA-MORE-8B delivers **strong performance gains** in hallucination reduction across multiple captioning and open-ended VQA benchmarks:

| Benchmark    | Metric                        | Best Baseline    | REVERSE (τ=0.003) | REVERSE (τ=0.0003) |
| ------------ | ----------------------------- | ---------------- | ----------------- | ------------------ |
| CHAIR-MSCOCO | CHAIR (↓)                     | DoLA (13.8)      | 12.2              | **8.4**            |
|              | CHAIRs (↓)                    | DoLA (51.8)      | 42.4              | **25.2**           |
| AMBER-G      | Hallucination (↓)             | Woodpecker (7.4) | 6.5               | **5.1**            |
|              | Coverage (↑)                  | DoLA (53.1)      | **54.8**          | 38.9               |
| MMHal-Bench  | Score (↑)                     | DoLA (2.54)      | 2.28              | **2.93**           |
|              | Hallucination Rate (↓)        | DoLA (0.51)      | 0.54              | **0.40**           |
| HaloQuest    | Avg. Accuracy (↑)             | DoLA (22.8)      | 26.7              | **36.7**           |
|              | False Premise Acc. (↑)        | DoLA (15.5)      | 30.0              | **39.5**           |
|              | Visually Challenging Acc. (↑) | **DoLA (45.1)**  | 31.3              | 30.9               |
|              | Insufficient Context Acc. (↑) | DoLA (7.4)       | 11.7              | **38.1**           |

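As a quick sanity check on the numbers above, the relative reduction in sentence-level hallucination (CHAIRs) versus the best baseline works out as follows (scores taken directly from the table; lower is better):

```python
# CHAIRs scores on CHAIR-MSCOCO, from the table above (lower is better)
best_baseline = 51.8    # DoLA
reverse_low_tau = 25.2  # REVERSE at the stricter threshold

# Fractional improvement over the strongest baseline
relative_reduction = (best_baseline - reverse_low_tau) / best_baseline
print(f"{relative_reduction:.1%}")  # roughly a 51% relative reduction
```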
On discriminative tasks, REVERSE-LLaVA-MORE performs competitively with the base VLM:

| Benchmark | Metric       | LLaVA-MORE-8B | REVERSE (τ=0.5) |
| --------- | ------------ | ------------- | --------------- |
| AMBER-D   | F1 Score (↑) | **71.6**      | 69.3            |
| POPE      | F1 Score (↑) | **85.1**      | 84.4            |
| MME-Hall  | Score (↑)    | **678.3**     | 657.6           |

## Usage

Please refer to the installation guide on GitHub to get started:
[Installation Guide](https://github.com/tsunghan-wu/reverse_vlm)

## Additional Resources

- Project Page: [https://reverse-vlm.github.io/](https://reverse-vlm.github.io/)
- Dataset: [REVERSE Visual Instruct 1.3M](https://huggingface.co/datasets/tsunghanwu/reverse-instruct-1.3m)
- Ask Questions: [GitHub Issues](https://github.com/tsunghan-wu/reverse_vlm/issues)

## Intended Use

**Primary Use Cases:**
- Reducing hallucination in image captioning and open-ended VQA
- Evaluating hallucination-aware generation strategies
- Research on grounded and trustworthy multimodal reasoning

**Target Users:**
Researchers, developers, and students working on VLMs, hallucination mitigation, and vision-language alignment.