Add pipeline tag and library name to model card
#1
by nielsr HF Staff - opened

README.md CHANGED
@@ -1,5 +1,7 @@
 ---
 license: apache-2.0
+library_name: transformers
+pipeline_tag: image-text-to-text
 ---
 
 # VST-7B-RL
@@ -34,7 +36,7 @@ We introduce **Visual Spatial Tuning (VST)**, a comprehensive framework designed
 ✨ **VST-P**: 4.1M samples across 19 skills, spanning single images, multi-image scenarios, and videos—boosting spatial perception in VLMs.
 ✨ **VST-R**: 135K curated samples that teach models to reason in space, including step-by-step reasoning and rule-based data for reinforcement learning.
 ✨ **Progressive Training Pipeline**: Start with supervised fine-tuning to build foundational spatial knowledge, then reinforce spatial reasoning abilities via RL. VST achieves state-of-the-art results on spatial benchmarks (34.8% on MMSI-Bench, 61.2% on VSIBench) without compromising general capabilities.
-✨ **Vision-Language-Action Models Enhanced**: The VST paradigm significantly strengthens
+✨ **Vision-Language-Action Models Enhanced**: The VST paradigm significantly strengthens robotic learning, paving the way for more physically grounded AI.
 
 
 
@@ -149,4 +151,4 @@ If you find our work helpful, feel free to give us a cite.
 journal={arXiv preprint arXiv:2511.05491},
 year={2025}
 }
-```
+```
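For reference, a minimal sketch of what the added metadata enables: with `library_name: transformers` and `pipeline_tag: image-text-to-text`, the checkpoint can be loaded through the Transformers `image-text-to-text` pipeline. The repo id, image URL, and prompt below are placeholders (assumptions), not values taken from the model card.

```python
# Minimal sketch, assuming the checkpoint exposes the standard Transformers
# image-text-to-text interface. The repo id "<org>/VST-7B-RL" and the image
# URL are hypothetical placeholders.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",        # matches the pipeline_tag added in this PR
    model="<org>/VST-7B-RL",     # hypothetical Hub repo id
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/room.jpg"},
            {"type": "text", "text": "How far is the chair from the table?"},
        ],
    }
]

# Run one chat-style spatial question over the image and print the answer.
print(pipe(text=messages, max_new_tokens=64))
```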