Update README.md

README.md CHANGED

@@ -11,19 +11,9 @@ base_model:
 pipeline_tag: image-text-to-text
 ---
 
-
-license: apache-2.0
-datasets:
-- kolerk/TON-AITZ-SFT
-language:
-- en
-base_model:
-- Qwen/Qwen2.5-VL-3B-Instruct
-pipeline_tag: image-text-to-text
----
-# TON-AITZ
+# TON-CLEVR
 TON is a series of large language models trained using our efficient algorithm, which automatically decides whether to think or not, based on Qwen2.5-VL.
-We apply Group Relative Policy Optimization (GRPO) for reinforcement learning with "thought dropout" supervised finetuning as a
+We apply Group Relative Policy Optimization (GRPO) for reinforcement learning with "thought dropout" supervised finetuning as a preliminary step.
 ## Introduction
 
 Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision–language models (VLMs). Group Relative Policy Optimization (GRPO) is a recent prominent method that encourages models to generate complete reasoning traces before answering, leading to increased token usage and computational cost. Inspired by the human-like thinking process—where people skip reasoning for easy questions but think carefully when needed—we explore how to enable VLMs to first decide *when reasoning is necessary*. To realize this, we propose *TON*, a two-stage training strategy:
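The "thought dropout" supervised-finetuning step mentioned above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual training code: the `<think>` tag names, the empty-thought replacement, and the dropout probability `p` are assumptions for the sake of the example.

```python
import random

def thought_dropout(example: str, p: float = 0.5, rng=random) -> str:
    """With probability p, replace the reasoning trace between <think> tags
    with an empty thought, so SFT teaches the model that skipping reasoning
    is a valid option (hypothetical sketch of the preliminary step)."""
    start = example.find("<think>")
    end = example.find("</think>")
    if start == -1 or end == -1 or rng.random() >= p:
        return example  # keep the full reasoning trace unchanged
    # Drop the thought content but keep the tags as an explicit
    # "no thinking needed" marker.
    return example[:start] + "<think>\n\n</think>" + example[end + len("</think>"):]

sample = "<think>Count the red cubes: 2 + 1 = 3.</think><answer>3</answer>"
print(thought_dropout(sample, p=1.0))  # thought always dropped
print(thought_dropout(sample, p=0.0))  # thought always kept
```

After finetuning on a mixture of full and emptied thoughts, the GRPO stage can then reward the model for choosing well between the two modes.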