justairr committed on
Commit df48927 · verified · 1 Parent(s): 2cdeeaa

Update README.md

Files changed (1): README.md (+5 −2)

README.md CHANGED
@@ -5,14 +5,18 @@ language:
 - zh
 base_model:
 - Qwen/Qwen2.5-VL-3B-Instruct
+datasets:
+- justairr/VQA-Verify
 ---
 
+This is the official implementation from the paper *SATORI-R1: Incentivizing Multimodal Reasoning with Spatial Grounding and Verifiable Rewards*. [Arxiv Here.](https://arxiv.org/abs/2505.19094)
 
 SATORI is a vision-language model fine-tuned from Qwen2.5-VL to perform structured visual reasoning for Visual Question Answering (VQA). It generates:
 1. A concise image caption describing the overall scene.
 2. Coordinates of relevant bounding boxes that support reasoning.
 3. A final answer to the user’s question.
 
+
 ### Inference Example
 
 ```python
@@ -79,5 +83,4 @@ output_text = processor.batch_decode(
     trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
 )
 print(output_text)
-```
-
+```
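The README describes a three-part structured response (caption, bounding boxes, final answer). That decoded text could be post-processed with a small parser. Below is a minimal sketch, assuming the model emits tagged sections such as `<caption>…</caption>`, `<bbox>…</bbox>`, and `<answer>…</answer>`; the exact output format is not shown in this diff, so the tags and the `parse_satori_output` helper are hypothetical illustrations, not the repository's API.

```python
import re


def parse_satori_output(text: str) -> dict:
    """Split a SATORI-style response into caption, bounding boxes, and answer.

    Assumes hypothetical <caption>/<bbox>/<answer> tags; the real model's
    output format may differ.
    """

    def grab(tag: str):
        # Return the stripped contents of the first <tag>…</tag> pair, if any.
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        return m.group(1).strip() if m else None

    # Each <bbox> is read as a list of integers, e.g. <bbox>34, 80, 210, 240</bbox>.
    boxes = [
        [int(v) for v in re.findall(r"-?\d+", body)]
        for body in re.findall(r"<bbox>(.*?)</bbox>", text, re.DOTALL)
    ]
    return {"caption": grab("caption"), "boxes": boxes, "answer": grab("answer")}


example = (
    "<caption>A dog sits next to a red ball on grass.</caption>"
    "<bbox>34, 80, 210, 240</bbox>"
    "<answer>a red ball</answer>"
)
result = parse_satori_output(example)
print(result["caption"])  # A dog sits next to a red ball on grass.
print(result["boxes"])    # [[34, 80, 210, 240]]
print(result["answer"])   # a red ball
```

Keeping the parse step separate from generation makes it easy to swap in the model's actual delimiters once they are known.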