---
license: apache-2.0
datasets:
- kolerk/TON-AITZ-SFT
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
---
# TON-AITZ

TON is a series of large vision-language models, built on Qwen2.5-VL, trained with our efficient algorithm that automatically decides whether to think or not.
We apply Group Relative Policy Optimization (GRPO) for reinforcement learning, with "thought dropout" supervised fine-tuning as a preliminary step.
## Introduction

Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO), a recent prominent method, encourages models to generate complete reasoning traces before answering, which increases token usage and computational cost. Inspired by the human thinking process, where people skip reasoning for easy questions but think carefully when needed, we explore how to enable VLMs to first decide *when reasoning is necessary*. To realize this, we propose *TON*, a two-stage training strategy:

1. A supervised fine-tuning (SFT) stage with a simple yet effective "**thought dropout**" operation, where reasoning traces are randomly replaced with empty thoughts (a minimal sketch follows this list). This introduces a think-or-not format that serves as a cold start for selective reasoning.
2. A GRPO stage that lets the model freely explore when to think or not, while maximizing task-aware outcome rewards (see the group-relative advantage sketch below).
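
The thought-dropout operation from stage 1 can be pictured with a short Python sketch. This is illustrative only: the `<think>...</think>` tags, the 50% dropout rate, and the `thought_dropout` helper are assumptions for the sketch, not the exact training code.

```python
import random
import re

def thought_dropout(completion: str, p: float = 0.5) -> str:
    """Randomly replace the reasoning trace with an empty thought (sketch)."""
    if random.random() < p:
        # Keep the format tags so the model learns that skipping the
        # reasoning content is itself a valid, well-formed output.
        return re.sub(r'<think>.*?</think>', '<think>\n\n</think>',
                      completion, flags=re.DOTALL)
    return completion

sft_example = "<think>Count the objects one by one...</think><answer>7</answer>"
print(thought_dropout(sft_example))
```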

Experimental results show that *TON* can *reduce completion length by up to **90%** compared to vanilla GRPO, without sacrificing performance and in some cases improving it*. Further evaluations across diverse vision-language tasks, covering a range of reasoning difficulties with both 3B and 7B models, consistently show that the *model progressively learns to bypass unnecessary reasoning steps as training advances*. These findings shed light on the path toward human-like reasoning patterns in reinforcement learning approaches.
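For stage 2, the core of GRPO is that each sampled completion is scored relative to the other completions drawn for the same prompt, so no learned value network is needed. Below is a minimal sketch of the group-relative advantage computation, with an assumed group size and toy reward values; under TON, empty-thought and full-reasoning completions can land in the same group, so the outcome reward alone decides when thinking is worth the tokens.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    # rewards: (num_prompts, G) outcome rewards for G sampled completions per prompt.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    # Normalize each completion's reward against its own group.
    return (rewards - mean) / (std + eps)

# Toy example: G = 4 completions per prompt, accuracy-style 0/1 rewards.
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0]])
print(group_relative_advantages(rewards))
```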

## Quickstart

The snippet below loads the model with the standard Qwen2.5-VL classes (`Qwen2_5_VLForConditionalGeneration` and `AutoProcessor`) and uses the `qwen-vl-utils` package (`pip install qwen-vl-utils`) to prepare the image inputs.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

example = {
    'image': 'path/to/your/image.png',
    'problem': 'How many items are there in the image?'
}

def make_conversation_image(example):
    # Build a chat-style message list with one image and one text turn.
    return [{
        'role': 'user',
        'content': [
            {'type': 'image', 'image': example['image']},
            {'type': 'text', 'text': example['problem']}
        ]
    }]

model_name = "kolerk/TON-3B-AITZ"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

messages = make_conversation_image(example)

# Render the chat template and collect the image inputs.
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt"
).to(model.device)

generated_ids = model.generate(
    **inputs,
    max_new_tokens=4096,
    do_sample=True,
    top_p=0.95,
    top_k=1,
    temperature=0.6
)
# Keep only the newly generated tokens, dropping the prompt.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]

response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
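
If the checkpoint follows the think-or-not format described above, the completion may contain an empty or non-empty thought before the final answer. A hypothetical parsing helper, assuming `<think>...</think>` and `<answer>...</answer>` tags (the exact tags may differ by checkpoint):

```python
import re

def parse_response(response: str):
    # Split the completion into an optional thought and the final answer.
    think = re.search(r'<think>(.*?)</think>', response, re.DOTALL)
    answer = re.search(r'<answer>(.*?)</answer>', response, re.DOTALL)
    thought = think.group(1).strip() if think else ''
    final = answer.group(1).strip() if answer else response.strip()
    return thought, final

thought, answer = parse_response(response)
print('thought:', thought if thought else '(skipped)')
print('answer:', answer)
```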

## Evaluation

See the evaluation script in the [code repository](https://github.com/kokolerk/TON/blob/main/src/eval/test_qwen25vl_counting_superclevr.py) for details.

## Citation

If you find our work helpful, feel free to cite us:

```bibtex
@misc{wang2025think,
      title={Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models},
      author={Jiaqi Wang and Kevin Qinghong Lin and James Cheng and Mike Zheng Shou},
      year={2025},
      eprint={2505.16854},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}
```