It is a fine-tune of **Qwen 2.5-VL-7B** using ~10k synthetic doc-to-Reasoning-trace pairs.
## Training

1. **SFT**: One-epoch supervised fine-tune on synthetic reasoning traces generated from public PDFs (10K input/output pairs).
2. **RL (GRPO)**: RL phase using a structure-aware reward (5K difficult image examples).

**The model before GRPO loses 80% of the time against the post-GRPO model.**

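The structure-aware reward used in the GRPO phase is not specified in this card. As an illustration only, a reward of this kind typically scores whether a completion parses into the expected output structure. The sketch below is a hypothetical example (the function name, JSON schema, and field names `reasoning`/`answer` are assumptions, not the actual reward used in training):

```python
import json

def structure_aware_reward(completion: str) -> float:
    """Hypothetical structure-aware reward: score a completion by
    whether it parses as JSON and contains assumed trace fields."""
    try:
        obj = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0  # unparseable output gets no reward
    score = 0.5  # base reward for structurally valid output
    # Assumed schema; the real reward's fields are not documented here.
    for field in ("reasoning", "answer"):
        if field in obj:
            score += 0.25
    return score
```

A reward shaped like this can be passed as a reward function to a GRPO trainer, which ranks groups of sampled completions against each other.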
## Quick start: 🤗 Transformers