Update README.md

README.md CHANGED

@@ -11,19 +11,9 @@ base_model:
 pipeline_tag: image-text-to-text
 ---
 
-
-license: apache-2.0
-datasets:
-- kolerk/TON-AITZ-SFT
-language:
-- en
-base_model:
-- Qwen/Qwen2.5-VL-3B-Instruct
-pipeline_tag: image-text-to-text
----
-# TON-AITZ
+# TON-CLEVR
 TON is a series of large language models trained using our efficient algorithm, which automatically decides whether to think or not, based on Qwen2.5-VL.
-We apply Group Relative Policy Optimization (GRPO) for reinforcement learning with "thought dropout" supervised finetuning as a
+We apply Group Relative Policy Optimization (GRPO) for reinforcement learning with "thought dropout" supervised finetuning as a preliminary step.
 ## Introduction
 
 Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision–language models (VLMs). Group Relative Policy Optimization (GRPO) is a recent prominent method that encourages models to generate complete reasoning traces before answering, leading to increased token usage and computational cost. Inspired by the human-like thinking process—where people skip reasoning for easy questions but think carefully when needed—we explore how to enable VLMs to first decide *when reasoning is necessary*. To realize this, we propose *TON*, a two-stage training strategy:
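The "thought dropout" supervised-finetuning step mentioned above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual training code: the `<think>` tag names, the empty-thought replacement, and the dropout probability `p` are assumptions for the sake of the example.

```python
import random

def thought_dropout(example: str, p: float = 0.5, rng=random) -> str:
    """With probability p, replace the reasoning trace between <think> tags
    with an empty thought, so SFT teaches the model that skipping reasoning
    is a valid option (hypothetical sketch of the preliminary step)."""
    start = example.find("<think>")
    end = example.find("</think>")
    if start == -1 or end == -1 or rng.random() >= p:
        return example  # keep the full reasoning trace unchanged
    # Drop the thought content but keep the tags as an explicit
    # "no thinking needed" marker.
    return example[:start] + "<think>\n\n</think>" + example[end + len("</think>"):]

sample = "<think>Count the red cubes: 2 + 1 = 3.</think><answer>3</answer>"
print(thought_dropout(sample, p=1.0))  # thought always dropped
print(thought_dropout(sample, p=0.0))  # thought always kept
```

After finetuning on a mixture of full and emptied thoughts, the GRPO stage can then reward the model for choosing well between the two modes.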