Improve model card: Add pipeline tag, library name, paper, code, abstract, image, and usage
This PR enhances the model card for the `Vision-Zero-InternVL3-14B-Clevr` model by adding key metadata and detailed documentation.
Specifically, it includes:
- **`pipeline_tag: image-text-to-text`**: This accurately categorizes the model's functionality as a Vision-Language Model, improving its discoverability on the Hugging Face Hub.
- **`library_name: transformers`**: Evidence from the `config.json` (e.g., `transformers_version`, `architectures`) suggests compatibility with the `transformers` library, enabling automated code snippets for users.
- **`license: cc-by-nc-4.0`**: A common research license has been added.
- **Paper Link**: A direct link to the paper [Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play](https://huggingface.co/papers/2509.25541).
- **GitHub Repository**: A link to the official GitHub repository: https://github.com/wangqinsi1/Vision-Zero.
- **Abstract**: The full paper abstract is included for a comprehensive overview.
- **Overview Image**: The main overview image from the GitHub README is included for visual context.
- **Quick Start (Inference)**: A detailed usage section, including setup instructions and a Python code snippet, is directly extracted from the GitHub README to guide users on how to run inference.
- **Citation**: The BibTeX citation for the paper is also included.
These additions will greatly improve the discoverability, usability, and overall documentation of the model on the Hugging Face Hub.
---
license: cc-by-nc-4.0
pipeline_tag: image-text-to-text
library_name: transformers
---
# Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play

This repository contains the `Vision-Zero-InternVL3-14B-Clevr` model, presented in the paper [Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play](https://huggingface.co/papers/2509.25541).

Vision-Zero is a domain-agnostic framework enabling VLM self-improvement through competitive visual games generated from arbitrary image pairs. It trains Vision-Language Models (VLMs) in "Who Is the Spy"-style games, where models engage in strategic reasoning and actions across multiple roles, autonomously generating training data without human annotation.

![Vision-Zero Overview](https://github.com/wangqinsi1/Vision-Zero/raw/main/images/over.png)
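The core game setup can be sketched in a few lines. The snippet below is a hypothetical illustration only (the function and variable names are not the Vision-Zero API): each player receives one of two near-identical images, and exactly one player, the spy, is shown the modified one.

```python
import random

# Hypothetical sketch of the "Who Is the Spy"-style setup (illustrative names,
# not the Vision-Zero API): one player secretly sees the modified image of an
# image pair; everyone else sees the original.

def assign_views(image_pair, num_players, rng):
    """Return (roles, views): exactly one player is the spy."""
    spy = rng.randrange(num_players)
    roles = ["spy" if i == spy else "civilian" for i in range(num_players)]
    views = [image_pair["modified" if r == "spy" else "original"] for r in roles]
    return roles, views

pair = {"original": "scene.png", "modified": "scene_edited.png"}
roles, views = assign_views(pair, num_players=4, rng=random.Random(0))
print(roles, views)
```

During gameplay, the clues and votes the models exchange about their views become the training signal, with no human labels involved.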
## Abstract

Although reinforcement learning (RL) can effectively enhance the reasoning capabilities of vision–language models (VLMs), current methods remain heavily dependent on labor-intensive datasets that require extensive manual construction and verification, leading to extremely high training costs and consequently constraining the practical deployment of VLMs. To address this challenge, we propose **Vision-Zero**, *a domain-agnostic framework enabling VLM self-improvement through competitive visual games generated from arbitrary image pairs.* Specifically, Vision-Zero encompasses three main attributes: (1) **Strategic Self-Play Framework:** Vision-Zero trains VLMs in "Who Is the Spy"-style games, where the models engage in strategic reasoning and actions across multiple roles. Through interactive gameplay, models autonomously generate their training data without human annotation. (2) **Gameplay from Arbitrary Images:** Unlike existing gamified frameworks, Vision-Zero can generate games from arbitrary images, thereby enhancing the model’s reasoning ability across diverse domains and showing strong generalization to different tasks. We demonstrate this versatility using three distinct types of image datasets: CLEVR-based synthetic scenes, charts, and real-world images. (3) **Sustainable Performance Gain:** We introduce Iterative Self-Play Policy Optimization (Iterative-SPO), a novel training algorithm that alternates between Self-Play and reinforcement learning with verifiable rewards (RLVR), mitigating the performance plateau often seen in self-play-only training and achieving sustained long-term improvements. Despite using label-free data, Vision-Zero achieves state-of-the-art performance on reasoning, chart question answering, and vision-centric understanding tasks, surpassing other annotation-based methods.
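The Iterative-SPO alternation described in the abstract can be sketched as a control loop: train in one phase until the tracked score plateaus, then switch to the other. The sketch below is a toy stand-in, not the paper's algorithm; the "training" is a shrinking score increment, and `plateaued` is a simple heuristic invented here for illustration.

```python
# Toy sketch of the Iterative-SPO alternation: switch between a self-play
# phase and an RLVR phase whenever recent improvement stalls. The per-round
# "gain" is a stand-in for real training, chosen so a plateau occurs.

def plateaued(history, window=3, min_gain=0.5):
    """Heuristic plateau check: recent scores stopped improving enough."""
    if len(history) < window + 1:
        return False
    return history[-1] - history[-1 - window] < min_gain

def iterative_spo(num_rounds=6):
    score, history, phases = 0.0, [], []
    phase = "self_play"
    for _ in range(num_rounds):
        # Stand-in for one training phase; gains shrink to force a plateau.
        gain = (0.5 if phase == "self_play" else 0.3) / (len(history) + 1)
        score += gain
        history.append(score)
        phases.append(phase)
        if plateaued(history):
            # Core Iterative-SPO idea: alternate phases when gains stall.
            phase = "rlvr" if phase == "self_play" else "self_play"
    return phases, history

phases, history = iterative_spo()
print(phases)  # → ['self_play', 'self_play', 'self_play', 'self_play', 'self_play', 'rlvr']
```

The paper's claim is that this alternation avoids the plateau seen in self-play-only training; here the switch is driven by a fixed threshold purely to make the control flow concrete.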
## Code

The official implementation and code for Vision-Zero can be found on GitHub: [https://github.com/wangqinsi1/Vision-Zero](https://github.com/wangqinsi1/Vision-Zero)
## Quick Start (Inference)

### Install Python Packages

First, create a conda environment and install the relevant Python packages.

```bash
conda create -n vision-zero python=3.10
conda activate vision-zero
bash setup.sh
```
### Play with the Model Yourself

```python
import pae
from pae.models import LlavaAgent, ClaudeAgent
from accelerate import Accelerator
import torch
from tqdm import tqdm
from types import SimpleNamespace
from pae.environment.webgym import BatchedWebEnv
import os
from llava.model.language_model.llava_mistral import LlavaMistralForCausalLM

# ============= Instantiate the agent =============
config_dict = {"use_lora": False,
               "use_q4": False,  # our 34B model is quantized to 4-bit; set this to True if you are using the 34B model
               "use_anyres": False,
               "temperature": 1.0,
               "max_new_tokens": 512,
               "train_vision": False,
               "num_beams": 1,}
config = SimpleNamespace(**config_dict)

accelerator = Accelerator()
agent = LlavaAgent(policy_lm = "Qinsi1/Vision-Zero-InternVL3-14B-Clevr",  # alternate models: "yifeizhou/pae-llava-7b-webarena", "yifeizhou/pae-llava-34b"
                   device = accelerator.device,
                   accelerator = accelerator,
                   config = config)

# ============= Instantiate the environment =============
test_tasks = [{"web_name": "Google Map",
               "id": "0",
               "ques": "Locate a parking lot near the Brooklyn Bridge that is open 24 hours. Review the user comments about it.",
               "web": "https://www.google.com/maps/"}]
save_path = "xxx"

test_env = BatchedWebEnv(tasks = test_tasks,
                         do_eval = False,
                         download_dir=os.path.join(save_path, 'test_driver', 'download'),
                         output_dir=os.path.join(save_path, 'test_driver', 'output'),
                         batch_size=1,
                         max_iter=10,)

# for you to check the images and actions
image_histories = []   # stores the history of the paths of images
action_histories = []  # stores the history of actions

results = test_env.reset()
image_histories.append(results[0][0]["image"])

observations = [r[0] for r in results]
actions = agent.get_action(observations)
action_histories.append(actions[0])
dones = None

for _ in tqdm(range(3)):
    if dones is not None and all(dones):
        break
    results = test_env.step(actions)
    image_histories.append(results[0][0]["image"])
    observations = [r[0] for r in results]
    actions = agent.get_action(observations)
    action_histories.append(actions[0])
    dones = [r[2] for r in results]

print("Done!")
print("image_histories: ", image_histories)
print("action_histories: ", action_histories)
```
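The control flow of the snippet above (observe, act, step, check done) can be followed without the heavy `pae` dependencies using stubs. This is a minimal mock, assuming only the result shape the snippet indexes into (`r[0]` as the observation dict and `r[2]` as a done flag); the class and method names mirror the snippet but the implementations are placeholders.

```python
# Minimal mock of the observe→act→step loop above, with a stub agent and
# environment so the control flow is runnable stand-alone. Only the result
# shape used by the real snippet (r[0] = observation, r[2] = done) is mimicked.

class MockEnv:
    def __init__(self, episode_len=2):
        self.t, self.episode_len = 0, episode_len
    def reset(self):
        self.t = 0
        return [({"image": "step0.png"}, 0.0, False)]
    def step(self, actions):
        self.t += 1
        done = self.t >= self.episode_len
        return [({"image": f"step{self.t}.png"}, 0.0, done)]

class MockAgent:
    def get_action(self, observations):
        return [f"act_on:{obs['image']}" for obs in observations]

env, agent = MockEnv(), MockAgent()
results = env.reset()
dones = None
action_histories = []
for _ in range(10):
    if dones is not None and all(dones):
        break
    observations = [r[0] for r in results]
    actions = agent.get_action(observations)
    action_histories.append(actions[0])
    results = env.step(actions)
    dones = [r[2] for r in results]
print(action_histories)  # → ['act_on:step0.png', 'act_on:step1.png']
```

The loop terminates as soon as every task in the batch reports done, which is why the real snippet tracks `dones` across iterations.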
## Citation

If you find Vision-Zero useful or relevant to your project and research, please kindly cite our paper:

```bibtex
@misc{wang2025visionzeroscalablevlmselfimprovement,
      title={Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play},
      author={Qinsi Wang and Bo Liu and Tianyi Zhou and Jing Shi and Yueqian Lin and Yiran Chen and Hai Helen Li and Kun Wan and Wentian Zhao},
      year={2025},
      eprint={2509.25541},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.25541},
}
```