kolerk committed (verified) · Commit 82cd49f · Parent(s): 93c2634

Update README.md

Files changed (1): README.md (+2 −12)
```diff
@@ -11,19 +11,9 @@ base_model:
 pipeline_tag: image-text-to-text
 ---
 
----
-license: apache-2.0
-datasets:
-- kolerk/TON-AITZ-SFT
-language:
-- en
-base_model:
-- Qwen/Qwen2.5-VL-3B-Instruct
-pipeline_tag: image-text-to-text
----
-# TON-AITZ
+# TON-CLEVR
 TON is a series of large language models trained using our efficient algorithm, which automatically decides whether to think or not, based on Qwen2.5-VL.
-We apply Group Relative Policy Optimization (GRPO) for reinforcement learning with "thought dropout" supervised finetuning as a prelimary step.
+We apply Group Relative Policy Optimization (GRPO) for reinforcement learning with "thought dropout" supervised finetuning as a preliminary step.
 ## Introduction
 
 Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision–language models (VLMs). Group Relative Policy Optimization (GRPO) is a recent prominent method that encourages models to generate complete reasoning traces before answering, leading to increased token usage and computational cost. Inspired by the human-like thinking process—where people skip reasoning for easy questions but think carefully when needed—we explore how to enable VLMs to first decide *when reasoning is necessary*. To realize this, we propose *TON*, a two-stage training strategy:
```
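The "thought dropout" step described above could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes SFT targets wrap reasoning in `<think>…</think>` tags (a common convention, not confirmed by this card), and the dropout probability `p` is a hypothetical parameter.

```python
import random
from typing import Optional

# Assumed reasoning-trace delimiters; the actual TON data format may differ.
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"
EMPTY_THOUGHT = THINK_OPEN + "\n\n" + THINK_CLOSE  # "skip thinking" target

def thought_dropout(response: str, p: float = 0.5,
                    rng: Optional[random.Random] = None) -> str:
    """With probability p, replace the reasoning trace in an SFT target
    with an empty thought, so the model learns that answering without
    explicit reasoning is a valid option."""
    rng = rng or random.Random()
    start = response.find(THINK_OPEN)
    end = response.find(THINK_CLOSE)
    if start == -1 or end == -1 or rng.random() >= p:
        return response  # no trace found, or keep the full reasoning
    answer = response[end + len(THINK_CLOSE):]
    return response[:start] + EMPTY_THOUGHT + answer
```

After finetuning on such mixed targets, the GRPO stage can then reward the model for choosing between the empty and full thought formats based on question difficulty.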