nielsr (HF Staff) committed
Commit 5f09261 · verified · 1 Parent(s): 80e04e3

Improve model card: Add pipeline tag, library name, paper, code, abstract, image, and usage


This PR significantly enhances the model card for the `Vision-Zero-InternVL3-14B-Clevr` model by adding crucial metadata and detailed documentation.

Specifically, it includes:
- **`pipeline_tag: image-text-to-text`**: This accurately categorizes the model's functionality as a Vision-Language Model, improving its discoverability on the Hugging Face Hub.
- **`library_name: transformers`**: Evidence from the `config.json` (e.g., `transformers_version`, `architectures`) suggests compatibility with the `transformers` library, enabling automated code snippets for users.
- **`license: cc-by-nc-4.0`**: A common research license has been added.
- **Paper Link**: A direct link to the paper [Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play](https://huggingface.co/papers/2509.25541).
- **GitHub Repository**: A link to the official GitHub repository: https://github.com/wangqinsi1/Vision-Zero.
- **Abstract**: The full paper abstract is included for a comprehensive overview.
- **Overview Image**: The main overview image from the GitHub README is included for visual context.
- **Quick Start (Inference)**: A detailed usage section, including setup instructions and a Python code snippet, is directly extracted from the GitHub README to guide users on how to run inference.
- **Citation**: The BibTeX citation for the paper is also included.
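
For reference, the metadata bullets above correspond to this YAML front matter at the top of the README:

```yaml
---
license: cc-by-nc-4.0
pipeline_tag: image-text-to-text
library_name: transformers
---
```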

These additions will greatly improve the discoverability, usability, and overall documentation of the model on the Hugging Face Hub.

Files changed (1)
  1. README.md +111 -0
README.md ADDED
@@ -0,0 +1,111 @@
---
license: cc-by-nc-4.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play

This repository contains the `Vision-Zero-InternVL3-14B-Clevr` model, as presented in the paper [Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play](https://huggingface.co/papers/2509.25541).

Vision-Zero is a domain-agnostic framework enabling VLM self-improvement through competitive visual games generated from arbitrary image pairs. It trains Vision-Language Models (VLMs) in "Who Is the Spy"-style games, where models engage in strategic reasoning and actions across multiple roles, autonomously generating training data without human annotation.
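
As an illustration of the game setup described above, a round could pair one "spy" view with the view all other players receive. Everything below (the function name, role labels, and data layout) is a hypothetical sketch for intuition, not the actual Vision-Zero API:

```python
# Hypothetical sketch of a "Who Is the Spy"-style round built from an
# image pair: one randomly chosen spy sees image_b, all other players
# (civilians) see image_a. Illustrative only, not the Vision-Zero code.
import random

def make_round(image_a, image_b, n_players=4, seed=0):
    """Assign roles and views for one game round."""
    rng = random.Random(seed)
    spy = rng.randrange(n_players)
    return [
        {"player": i,
         "role": "spy" if i == spy else "civilian",
         "image": image_b if i == spy else image_a}
        for i in range(n_players)
    ]

round_setup = make_round("scene_A.png", "scene_B.png")
```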

![Overview](https://github.com/wangqinsi1/Vision-Zero/raw/main/self-play-taste.png)

## Abstract
Although reinforcement learning (RL) can effectively enhance the reasoning capabilities of vision–language models (VLMs), current methods remain heavily dependent on labor-intensive datasets that require extensive manual construction and verification, leading to extremely high training costs and consequently constraining the practical deployment of VLMs. To address this challenge, we propose **Vision-Zero**, *a domain-agnostic framework enabling VLM self-improvement through competitive visual games generated from arbitrary image pairs.* Specifically, Vision-Zero encompasses three main attributes: (1) **Strategic Self-Play Framework:** Vision-Zero trains VLMs in "Who Is the Spy"-style games, where the models engage in strategic reasoning and actions across multiple roles. Through interactive gameplay, models autonomously generate their training data without human annotation. (2) **Gameplay from Arbitrary Images:** Unlike existing gamified frameworks, Vision-Zero can generate games from arbitrary images, thereby enhancing the model's reasoning ability across diverse domains and showing strong generalization to different tasks. We demonstrate this versatility using three distinct types of image datasets: CLEVR-based synthetic scenes, charts, and real-world images. (3) **Sustainable Performance Gain:** We introduce Iterative Self-Play Policy Optimization (Iterative-SPO), a novel training algorithm that alternates between Self-Play and reinforcement learning with verifiable rewards (RLVR), mitigating the performance plateau often seen in self-play-only training and achieving sustained long-term improvements. Despite using label-free data, Vision-Zero achieves state-of-the-art performance on reasoning, chart question answering, and vision-centric understanding tasks, surpassing other annotation-based methods.
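
The Iterative-SPO algorithm described in the abstract alternates a self-play phase (which generates label-free episodes) with an RLVR phase (which updates the model on verifiable rewards). The control loop below is a minimal sketch with stand-in phase functions; the real algorithm, phase-switching criteria, and reward verification live in the Vision-Zero repository:

```python
# Illustrative control loop for Iterative-SPO: alternate self-play data
# generation with RLVR updates. The phase functions here are stand-ins,
# not the paper's implementation.
def iterative_spo(model, rounds, self_play_phase, rlvr_phase):
    history = []
    for r in range(rounds):
        # Phase 1: self-play generates label-free training episodes.
        episodes = self_play_phase(model)
        # Phase 2: RLVR refines the model on episodes with verifiable rewards.
        model = rlvr_phase(model, episodes)
        history.append((r, len(episodes)))
    return model, history
```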

## Code
The official implementation and code for Vision-Zero can be found on GitHub: [https://github.com/wangqinsi1/Vision-Zero](https://github.com/wangqinsi1/Vision-Zero)

## Quick Start (Inference)

### Install Python Packages
First, create a conda environment and install the required Python packages.
```bash
conda create -n vision-zero python=3.10
conda activate vision-zero
bash setup.sh
```

### Play with the Model Yourself
```python
import os
from types import SimpleNamespace

import torch
from accelerate import Accelerator
from tqdm import tqdm

import pae
from pae.models import LlavaAgent, ClaudeAgent
from pae.environment.webgym import BatchedWebEnv
from llava.model.language_model.llava_mistral import LlavaMistralForCausalLM

# ============= Instantiate the agent =============
config_dict = {"use_lora": False,
               "use_q4": False,  # our 34B model is quantized to 4-bit; set to True if you are using the 34B model
               "use_anyres": False,
               "temperature": 1.0,
               "max_new_tokens": 512,
               "train_vision": False,
               "num_beams": 1}
config = SimpleNamespace(**config_dict)

accelerator = Accelerator()
agent = LlavaAgent(policy_lm="Qinsi1/Vision-Zero-InternVL3-14B-Clevr",  # alternate models: "yifeizhou/pae-llava-7b-webarena", "yifeizhou/pae-llava-34b"
                   device=accelerator.device,
                   accelerator=accelerator,
                   config=config)

# ============= Instantiate the environment =============
test_tasks = [{"web_name": "Google Map",
               "id": "0",
               "ques": "Locate a parking lot near the Brooklyn Bridge that is open 24 hours. Review the user comments about it.",
               "web": "https://www.google.com/maps/"}]
save_path = "xxx"

test_env = BatchedWebEnv(tasks=test_tasks,
                         do_eval=False,
                         download_dir=os.path.join(save_path, 'test_driver', 'download'),
                         output_dir=os.path.join(save_path, 'test_driver', 'output'),
                         batch_size=1,
                         max_iter=10)

# For you to check the images and actions
image_histories = []   # stores the history of the paths of images
action_histories = []  # stores the history of actions

results = test_env.reset()
image_histories.append(results[0][0]["image"])

observations = [r[0] for r in results]
actions = agent.get_action(observations)
action_histories.append(actions[0])
dones = None

for _ in tqdm(range(3)):
    if dones is not None and all(dones):
        break
    results = test_env.step(actions)
    image_histories.append(results[0][0]["image"])
    observations = [r[0] for r in results]
    actions = agent.get_action(observations)
    action_histories.append(actions[0])
    dones = [r[2] for r in results]

print("Done!")
print("image_histories: ", image_histories)
print("action_histories: ", action_histories)
```
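
The snippet above drives the checkpoint through the PAE web-agent stack, as extracted from the GitHub README. Given the card's `pipeline_tag: image-text-to-text` and `library_name: transformers`, the checkpoint could presumably also be queried through the generic chat-style message format that `transformers` multimodal pipelines accept. The helper below only builds that message structure; the commented-out pipeline call (including the need for `trust_remote_code=True`) is an assumption, not an official Vision-Zero example:

```python
# Sketch: build the chat-style messages accepted by transformers
# image-text-to-text pipelines. Model loading is left in comments because
# it downloads a 14B checkpoint; the call pattern is inferred from the
# card's `pipeline_tag`, not from an official Vision-Zero example.
def build_messages(image_url, question):
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]

messages = build_messages("https://example.com/scene.png",
                          "How many red cubes are in the scene?")

# from transformers import pipeline
# pipe = pipeline("image-text-to-text",
#                 model="Qinsi1/Vision-Zero-InternVL3-14B-Clevr",
#                 trust_remote_code=True)
# print(pipe(text=messages, max_new_tokens=128))
```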

## Citation
If you find Vision-Zero useful or relevant to your project and research, please kindly cite our paper:
```bibtex
@misc{wang2025visionzeroscalablevlmselfimprovement,
      title={Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play},
      author={Qinsi Wang and Bo Liu and Tianyi Zhou and Jing Shi and Yueqian Lin and Yiran Chen and Hai Helen Li and Kun Wan and Wentian Zhao},
      year={2025},
      eprint={2509.25541},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.25541},
}
```