Update README.md
README.md
@@ -5,14 +5,18 @@ language:
 - zh
 base_model:
 - Qwen/Qwen2.5-VL-3B-Instruct
+datasets:
+- justairr/VQA-Verify
 ---
 
+This is the official implementation of the paper *SATORI-R1: Incentivizing Multimodal Reasoning with Spatial Grounding and Verifiable Rewards* ([arXiv](https://arxiv.org/abs/2505.19094)).
 
 SATORI is a vision-language model fine-tuned from Qwen2.5-VL to perform structured visual reasoning for Visual Question Answering (VQA). It generates:
 1. A concise image caption describing the overall scene.
 2. Coordinates of relevant bounding boxes that support reasoning.
 3. A final answer to the user's question.
 
+
 ### Inference Example
 
 ```python
@@ -79,5 +83,4 @@ output_text = processor.batch_decode(
     trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
 )
 print(output_text)
-```
-
+```
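The hunk above shows only the tail of the inference example, so the decoding step appears without its setup. For orientation, here is a minimal, self-contained sketch of the standard Qwen2.5-VL inference pipeline that this tail is consistent with. It is not the exact code from the README: the checkpoint id, image path, and question below are placeholders.

```python
# Hedged sketch of a Qwen2.5-VL-style inference loop; not the README's exact code.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"  # placeholder: substitute the SATORI checkpoint id

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# One image plus one question, formatted with the chat template.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/example.jpg"},  # placeholder image
            {"type": "text", "text": "What is the person in the image holding?"},  # placeholder question
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate, then strip the prompt tokens so only the model's reply is decoded.
generated_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

Per the model description above, the decoded `output_text` is expected to contain the image caption, the supporting bounding-box coordinates, and the final answer.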