---
license: mit
datasets:
  - tsunghanwu/reverse-instruct-1.3m
base_model:
  - meta-llama/Llama-3.1-8B-Instruct
---

# REVERSE-LLaVA-MORE-8B

## Model Summary

REVERSE-LLaVA-MORE-8B is an open-source vision-language model (VLM) that performs both next-token prediction and self-verification / self-correction during generation. It is built upon LLaVA-MORE (LLaVA with LLaMA-3.1) and fine-tuned on the REVERSE Visual Instruct 1.3M dataset. The model is equipped with a retrospective resampling mechanism that detects and corrects hallucinations on the fly. Training was conducted in early March 2025.

## Performance

REVERSE-LLaVA-MORE-8B delivers **strong gains** in hallucination reduction across multiple captioning and open-ended VQA benchmarks. Here τ is the hallucination-detection threshold used by the retrospective resampling mechanism (a schematic illustration appears at the end of this card).

| Benchmark    | Metric                        | Best Baseline    | REVERSE (τ=0.003) | REVERSE (τ=0.0003) |
| ------------ | ----------------------------- | ---------------- | ----------------- | ------------------ |
| CHAIR-MSCOCO | CHAIR (↓)                     | DoLA (13.8)      | 12.2              | **8.4**            |
|              | CHAIRs (↓)                    | DoLA (51.8)      | 42.4              | **25.2**           |
| AMBER-G      | Hallucination (↓)             | Woodpecker (7.4) | 6.5               | **5.1**            |
|              | Coverage (↑)                  | DoLA (53.1)      | **54.8**          | 38.9               |
| MMHal-Bench  | Score (↑)                     | DoLA (2.54)      | 2.28              | **2.93**           |
|              | Hallucination Rate (↓)        | DoLA (0.51)      | 0.54              | **0.40**           |
| HaloQuest    | Avg. Accuracy (↑)             | DoLA (22.8)      | 26.7              | **36.7**           |
|              | False Premise Acc. (↑)        | DoLA (15.5)      | 30.0              | **39.5**           |
|              | Visual Challenging Acc. (↑)   | **DoLA (45.1)**  | 31.3              | 30.9               |
|              | Insufficient Context Acc. (↑) | DoLA (7.4)       | 11.7              | **38.1**           |

On discriminative tasks, REVERSE-LLaVA-MORE-8B performs competitively with its base VLM:

| Benchmark | Metric       | LLaVA-MORE-8B | REVERSE (τ=0.5) |
| --------- | ------------ | ------------- | --------------- |
| AMBER-D   | F1 Score (↑) | **71.6**      | 69.3            |
| POPE      | F1 Score (↑) | **85.1**      | 84.4            |
| MME-Hall  | Score (↑)    | **678.3**     | 657.6           |

## Usage

Please refer to the installation guide on GitHub to get started:
👉 [Installation Guide](https://github.com/tsunghan-wu/reverse_vlm)

An unofficial loading sketch is also provided at the end of this card.

## Additional Resources

- 📄 Project Page: [https://reverse-vlm.github.io/](https://reverse-vlm.github.io/)
- 🧾 Dataset: [REVERSE Visual Instruct 1.3M](https://huggingface.co/datasets/tsunghanwu/reverse-instruct-1.3m)
- 🔧 Ask Questions: [GitHub Issues](https://github.com/tsunghan-wu/reverse_vlm/issues)

## Intended Use

**Primary Use Cases:**

- Reducing hallucination in image captioning and open-ended VQA
- Evaluating hallucination-aware generation strategies
- Research on grounded and trustworthy multimodal reasoning

**Target Users:**
Researchers, developers, and students working on VLMs, hallucination mitigation, and vision-language alignment.
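
## Schematic: Role of the τ Threshold

The sketch below is an illustrative, unofficial schematic of threshold-triggered retrospective resampling, intended only to convey how τ trades hallucination reduction against coverage in the tables above. The functions `generate_span` and `rejection_score` are hypothetical stand-ins (the real model reads its own verification signal during decoding); refer to the GitHub repository for the actual implementation.

```python
"""Illustrative sketch only -- NOT the repository's implementation.

`generate_span` stands in for decoding one span of tokens from the VLM, and
`rejection_score` stands in for the model's self-verification signal for that
span (higher = more likely hallucinated). The point is to show how the
threshold tau controls how easily a span is discarded and regenerated.
"""

import random


def generate_span(prefix: str) -> str:
    # Stand-in for decoding one span (e.g., one phrase) from the model.
    candidates = ["a red kite", "a dog on the grass", "two people walking"]
    return random.choice(candidates)


def rejection_score(prefix: str, span: str) -> float:
    # Stand-in for the self-verification signal attached to the span.
    return random.random() * 0.01


def decode_with_retrospection(prompt: str, tau: float, max_retries: int = 5) -> str:
    """Generate spans; when the rejection score exceeds tau, drop the span
    and resample it (retrospective resampling) instead of keeping it."""
    output = prompt
    for _ in range(3):  # generate a few spans for the sketch
        span = generate_span(output)
        for _ in range(max_retries):
            if rejection_score(output, span) <= tau:
                break  # span accepted
            span = generate_span(output)  # backtrack: discard and resample
        output += " " + span
    return output


# Smaller tau -> rejections trigger more easily -> fewer suspect spans survive,
# but coverage can drop (matching the AMBER-G trend in the table above).
print(decode_with_retrospection("The image shows", tau=0.003))
print(decode_with_retrospection("The image shows", tau=0.0003))
```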
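
## Example Usage (Unofficial Sketch)

The supported setup is the GitHub installation guide linked above. As a rough orientation only, the snippet below shows how a checkpoint *could* be loaded if it were exported in the standard `transformers` LLaVA format; the model id, prompt template, and image path are assumptions, not part of the official release.

```python
# Unofficial sketch: assumes a transformers-compatible LLaVA export.
# The model id and prompt template are assumptions; follow the GitHub
# installation guide for the supported path.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "tsunghanwu/reverse_llava_more"  # hypothetical Hub id; verify before use

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")  # any local image
# Generic LLaVA-style prompt; the exact template depends on the checkpoint.
prompt = "USER: <image>\nDescribe the image in detail. ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```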