Update README.md
It is a fine-tune of **Qwen 2.5-VL-7B** using ~10 k synthetic doc-to-Reasoning-t…
## Training

1. **SFT**: One-epoch supervised fine-tune on synthetic reasoning traces generated from public PDFs (10K input/output pairs).

2. **RL (GRPO)**: RL phase using a structure-aware reward (5K difficult image examples).
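A structure-aware reward can be thought of as comparing structural features of the generated Markdown (headings, table rows, list items) against a reference. The heuristics below are an illustrative sketch under that assumption — `structure_reward` and its feature set are hypothetical, not the reward actually used in training:

```python
import re


def structure_reward(prediction: str, reference: str) -> float:
    """Score in [0, 1]: how closely the predicted Markdown matches
    the reference's structural feature counts (hypothetical sketch)."""

    def features(md: str) -> dict:
        # Count coarse structural elements of a Markdown string.
        return {
            "headings": len(re.findall(r"^#+\s", md, flags=re.M)),
            "table_rows": len(re.findall(r"^\|.*\|$", md, flags=re.M)),
            "list_items": len(re.findall(r"^\s*[-*]\s", md, flags=re.M)),
        }

    f_pred, f_ref = features(prediction), features(reference)
    score = 0.0
    for key in f_ref:
        a, b = f_pred[key], f_ref[key]
        # Penalize the relative mismatch of each structural feature.
        score += 1.0 - abs(a - b) / max(a, b, 1)
    return score / len(f_ref)
```

A reward of this shape gives partial credit for partially correct structure, which is what GRPO needs to rank sampled completions against each other.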
**The pre-GRPO model loses 80% of the time against the post-GRPO model (see the win-rate matrix).**
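The 80% figure corresponds to a pairwise win rate over head-to-head judgments between the two checkpoints, with ties excluded. A minimal sketch of that computation (illustrative only, not the actual evaluation code):

```python
def win_rate(outcomes: list[str]) -> float:
    """Fraction of non-tie comparisons won by model A.

    outcomes: per-example judgments, each "A", "B", or "tie".
    """
    wins = outcomes.count("A")
    losses = outcomes.count("B")
    decided = wins + losses
    # Ties carry no preference signal, so they are excluded.
    return wins / decided if decided else 0.0
```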
Transformers example:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Numind/NuMarkdown-reasoning"

processor = AutoProcessor.from_pretrained(
    model_id,
    # ...
)
```
vLLM example:

```python
from PIL import Image
from vllm import LLM, SamplingParams
from transformers import AutoProcessor

model_id = "Numind/NuMarkdown-reasoning"

llm = LLM(
    model=model_id,
    # ...
)
```