Update README.md
README.md
@@ -5,14 +5,18 @@ language:
 - zh
 base_model:
 - Qwen/Qwen2.5-VL-3B-Instruct
+datasets:
+- justairr/VQA-Verify
 ---
 
+This is the official implementation of the paper *SATORI-R1: Incentivizing Multimodal Reasoning with Spatial Grounding and Verifiable Rewards* ([arXiv](https://arxiv.org/abs/2505.19094)).
 
 SATORI is a vision-language model fine-tuned from Qwen2.5-VL to perform structured visual reasoning for Visual Question Answering (VQA). It generates:
 1. A concise image caption describing the overall scene.
 2. Coordinates of relevant bounding boxes that support reasoning.
 3. A final answer to the user's question.
 
+
 ### Inference Example
 
 ```python
@@ -79,5 +83,4 @@ output_text = processor.batch_decode(
     trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
 )
 print(output_text)
-```
-
+```
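The hunk above shows only the tail of the inference example, so the decoding step appears without its setup. For orientation, here is a minimal, self-contained sketch of the standard Qwen2.5-VL inference pipeline that this tail is consistent with. It is not the exact code from the README: the checkpoint id, image path, and question below are placeholders.

```python
# Hedged sketch of a Qwen2.5-VL-style inference loop; not the README's exact code.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"  # placeholder: substitute the SATORI checkpoint id

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# One image plus one question, formatted with the chat template.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/example.jpg"},  # placeholder image
            {"type": "text", "text": "What is the person in the image holding?"},  # placeholder question
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate, then strip the prompt tokens so only the model's reply is decoded.
generated_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

Per the model description above, the decoded `output_text` is expected to contain the image caption, the supporting bounding-box coordinates, and the final answer.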