It is a fine-tune of **Qwen 2.5-VL-7B** using ~10k synthetic doc-to-Reasoning-trace pairs.
## Training

1. **SFT**: One-epoch supervised fine-tune on synthetic reasoning traces generated from public PDFs (10K input/output pairs).
2. **RL (GRPO)**: RL phase using a structure-aware reward (5K difficult image examples).

**The model before GRPO loses 80% of the time against the post-GRPO model.**

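The structure-aware reward used in the GRPO phase is not specified in this card. As an illustration only, a reward of this kind typically scores whether a completion parses into the expected output structure. The sketch below is a hypothetical example (the function name, JSON schema, and field names `reasoning`/`answer` are assumptions, not the actual reward used in training):

```python
import json

def structure_aware_reward(completion: str) -> float:
    """Hypothetical structure-aware reward: score a completion by
    whether it parses as JSON and contains assumed trace fields."""
    try:
        obj = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0  # unparseable output gets no reward
    score = 0.5  # base reward for structurally valid output
    # Assumed schema; the real reward's fields are not documented here.
    for field in ("reasoning", "answer"):
        if field in obj:
            score += 0.25
    return score
```

A reward shaped like this can be passed as a reward function to a GRPO trainer, which ranks groups of sampled completions against each other.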
## Quick start: 🤗 Transformers