Alexandre-Numind committed on
Commit
333c874
·
verified ·
1 Parent(s): 6f5a5b4

Update README.md

  1. README.md +60 -105
README.md CHANGED
@@ -1,58 +1,68 @@
1
  ---
2
-
3
  license: mit
4
  base_model: Qwen/Qwen2.5-VL-7B
 
 
 
 
 
 
 
5
  model_name: NuMarkdown-Qwen2.5-VL
6
-
 
 
 
 
7
  ---
8
 
9
- # NuMarkdown-Qwen2.5-VL 🖋️📄 → 📝
10
 
11
- **NuMarkdown-Qwen2.5-VL** is the **first reasoning vision-language model** that converts semi-structured **documents and PDF scans into clean GitHub-flavoured Markdown**, with layout preserved and an optional chain-of-thought explaining each step.
 
12
 
13
- > *“From messy scans to tidy `.md` in one shot.”*
14
 
15
  ---
 
16
 
17
- ## Overview
18
 
19
- * **Architecture:** fine-tune of [Qwen 2.5-VL-7B](https://huggingface.co/Qwen/Qwen2.5-VL-7B).
20
- * **Training data:** 10k synthetic doc-to-Markdown pairs + 5k challenging images.
21
- * **Reasoning tokens:** during inference the number of thinking tokens ranges from ~20% to 2× the number of tokens in the final answer.
22
- * **License:** MIT – free for commercial & research use.
23
 
24
- ---
 
 
 
 
 
 
 
 
25
 
26
- ## Results
27
 
28
- ### 🏆 Arena ranking: *TrueSkill-2 (μ − 3σ)*
29
 
30
- | Rank | Model | μ | σ | μ − 3σ |
31
- | ---- | -------------------------------------- | ----- | ---- | ------ |
32
- | 🥇 1 | **gemini-flash-reasoning** | 26.75 | 0.80 | 24.35 |
33
- | 🥈 2 | **NuMarkdown-reasoning** | 26.10 | 0.79 | 23.72 |
34
- | 🥉 3 | **NuMarkdown-reasoning-w/o reasoning** | 25.32 | 0.80 | 22.93 |
35
- | 4 | **OCRFlux-3B** | 24.63 | 0.80 | 22.22 |
36
- | 5 | **gpt-4o** | 24.48 | 0.80 | 22.08 |
37
- | 6 | **gemini-flash-w/o reasoning** | 24.11 | 0.79 | 21.74 |
38
- | 7 | **RolmoOCR** | 23.53 | 0.82 | 21.07 |
39
 
40
- ### Win-rate plots
41
 
42
- | | |
43
- | :----------------------------------------------: | :---------------------------------------: |
44
- | ![Bar-plot of pairwise win-rate](bar%20plot.png) | ![Matrix win-rate heat-map](matrix.png) |
 
 
45
 
46
  ---
47
 
48
- ## Training procedure
 
 
 
49
 
50
- 1. **Supervised fine-tuning (SFT)** – one epoch on 10k synthetic pairs generated from public PDFs.
51
- 2. **Reinforcement Learning (GRPO)** – 5k difficult images with a **structure-aware** reward focusing on layout fidelity.
52
 
53
- ---
54
 
55
- ## Quick start: 🤗 Transformers
56
 
57
  ```python
58
  from __future__ import annotations
@@ -61,11 +71,11 @@ import torch
61
  from PIL import Image
62
  from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
63
 
64
- model_id = "NM-dev/NuMarkdown-Qwen2.5-VL"
65
 
66
  processor = AutoProcessor.from_pretrained(
67
  model_id,
68
- trust_remote_code=True,
69
  )
70
 
71
  model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
@@ -77,92 +87,37 @@ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
77
  )
78
 
79
  img = Image.open("invoice_scan.png").convert("RGB")
80
- messages = [
81
- {
82
- "role": "user",
83
- "content": [{"type": "image"}],
84
- }
85
- ]
86
-
87
- prompt = processor.apply_chat_template(
88
- messages,
89
- tokenize=False,
90
- add_generation_prompt=True,
91
- )
92
-
93
- inputs = processor(
94
- text=prompt,
95
- images=[img],
96
- return_tensors="pt",
97
- ).to(model.device)
98
 
99
  with torch.no_grad():
100
-     outputs = model.generate(**inputs, max_new_tokens=5_000)
101
-
102
- print(
103
-     processor.decode(
104
-         outputs[0],
105
-         skip_special_tokens=True,
106
-     )
107
-     .split("<answer>")[1]
108
-     .split("</answer>")[0]
109
- )
110
- ```
111
 
112
- ---
 
113
 
114
- ## Quick start: vLLM
115
 
 
116
  ```python
117
  from PIL import Image
118
  from vllm import LLM, SamplingParams
119
  from transformers import AutoProcessor
120
 
121
  model_id = "NM-dev/NuMarkdown-Qwen2.5-VL"
122
-
123
  llm = LLM(model=model_id, trust_remote_code=True, dtype="bfloat16")
124
  proc = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
125
 
126
  img = Image.open("invoice_scan.png")
 
 
127
 
128
- messages = [{"role": "user", "content": [
129
-     {"type": "image"},
130
-     {"type": "text", "text": "Convert this to Markdown with reasoning."},
131
- ]}]
132
- prompt = proc.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
133
-
134
- params = SamplingParams(
135
-     max_tokens=1_024,
136
-     temperature=0.8,
137
-     top_p=0.95,
138
- )
139
-
140
- result = (
141
-     llm.generate([{"prompt": prompt, "multi_modal_data": {"image": img}}], params)[0]
142
-     .outputs[0]
143
-     .text.split("<answer>")[1]
144
-     .split("</answer>")[0]
145
- )
146
-
147
  print(result)
148
- ```
149
-
150
- ---
151
-
152
- ## Citation
153
-
154
- If you use **NuMarkdown-Qwen2.5-VL** in your research, please cite the model:
155
-
156
- ```bibtex
157
- @software{NuMarkdown-Qwen2.5-VL,
158
- title = {NuMarkdown-Qwen2.5-VL: Vision-language reasoning model for doc-to-Markdown},
159
- author = {NM-dev},
160
- year = 2025,
161
- url = {https://huggingface.co/NM-dev/NuMarkdown-Qwen2.5-VL},
162
- license = {MIT}
163
- }
164
- ```
165
-
166
- ---
167
-
168
- *Last updated: 2025-08-04*
 
1
  ---
 
2
  license: mit
3
  base_model: Qwen/Qwen2.5-VL-7B
4
+ tags:
5
+ - vision-language
6
+ - document-to-markdown
7
+ - reinforcement-learning
8
+ - grpo
9
+ - qwen2.5
10
+ - markdown
11
  model_name: NuMarkdown-Qwen2.5-VL
12
+ datasets:
13
+ - NM-dev/markdown-input_output-v3
14
+ - NM-dev/markdown-grpo-images3
15
+ library_name: transformers
16
+ pipeline_tag: text-generation
17
  ---
18
 
19
+ # NuMarkdown-Qwen2.5-VL 🖋️📄 → 📝
20
 
21
+ **NuMarkdown-Qwen2.5-VL** is the first reasoning vision-language model trained to convert documents into clean GitHub-flavoured Markdown.
22
+ It is a fine-tune of **Qwen 2.5-VL-7B** on ~10k synthetic doc-to-Markdown pairs, followed by an RL phase (GRPO) with a layout-centric reward.
23
 
24
+ *(Note: the number of thinking tokens can range from 20% to 2× the number of tokens in the final answer.)*
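
The note above has a practical consequence when choosing `max_new_tokens`: the thinking tokens come on top of the answer. The sketch below turns the stated 20% to 2× range into a rough total-token estimate. The function name `generation_budget` and its parameters are illustrative assumptions, not part of the model's API.

```python
def generation_budget(answer_tokens: int, min_pct: int = 20, max_pct: int = 200) -> tuple[int, int]:
    """Estimate (low, high) total generated tokens: thinking tokens plus answer tokens.

    min_pct / max_pct encode the README's "20% to 2x" thinking-token range
    as integer percentages of the answer length (hypothetical helper).
    """
    if answer_tokens < 0:
        raise ValueError("answer_tokens must be non-negative")

    def ceil_pct(pct: int) -> int:
        # Integer ceiling division avoids floating-point rounding surprises.
        return -(-answer_tokens * pct // 100)

    low = answer_tokens + ceil_pct(min_pct)
    high = answer_tokens + ceil_pct(max_pct)
    return low, high


if __name__ == "__main__":
    # A page whose Markdown answer is about 1,500 tokens long.
    print(generation_budget(1500))
```

Under these assumptions, a 1,500-token answer needs somewhere between 1,800 and 4,500 generated tokens in total, so a small `max_new_tokens` may cut the output off before the answer is complete.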
25
 
26
  ---
27
+ ## Results
28
 
29
+ (We plan to release a Markdown arena, similar to llmArena, for complex document-to-Markdown tasks.)
30
 
31
+ ### Arena ranking (using the TrueSkill-2 rating system)
 
 
 
32
 
33
+ | Rank | Model | μ | σ | μ − 3σ |
34
+ | ---- | --------------------------------------- | ----- | ---- | ------ |
35
+ | 🥇 1 | **gemini-flash-reasoning** | 26.75 | 0.80 | 24.35 |
36
+ | 🥈 2 | **NuMarkdown-reasoning** | 26.10 | 0.79 | 23.72 |
37
+ | 🥉 3 | **NuMarkdown-reasoning-w/o\_reasoning** | 25.32 | 0.80 | 22.93 |
38
+ | 4 | **OCRFlux-3B** | 24.63 | 0.80 | 22.22 |
39
+ | 5 | **gpt-4o** | 24.48 | 0.80 | 22.08 |
40
+ | 6 | **gemini-flash-w/o\_reasoning** | 24.11 | 0.79 | 21.74 |
41
+ | 7 | **RolmoOCR** | 23.53 | 0.82 | 21.07 |
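
The last column of the table is the conservative skill estimate μ − 3σ: the mean rating minus three standard deviations. As a small worked check, the sketch below recomputes it from the μ and σ columns. The values are copied from the table; the helper name `conservative_score` is an assumption made for illustration.

```python
# (mu, sigma, reported mu - 3*sigma) for each model, copied from the table above.
ROWS = {
    "gemini-flash-reasoning": (26.75, 0.80, 24.35),
    "NuMarkdown-reasoning": (26.10, 0.79, 23.72),
    "NuMarkdown-reasoning-w/o_reasoning": (25.32, 0.80, 22.93),
    "OCRFlux-3B": (24.63, 0.80, 22.22),
    "gpt-4o": (24.48, 0.80, 22.08),
    "gemini-flash-w/o_reasoning": (24.11, 0.79, 21.74),
    "RolmoOCR": (23.53, 0.82, 21.07),
}


def conservative_score(mu: float, sigma: float, k: float = 3.0) -> float:
    """Lower-bound skill estimate: mean rating minus k standard deviations."""
    return mu - k * sigma


if __name__ == "__main__":
    for name, (mu, sigma, reported) in ROWS.items():
        recomputed = conservative_score(mu, sigma)
        print(f"{name:36s} recomputed={recomputed:.2f} reported={reported:.2f}")
```

Recomputing from the rounded columns matches the reported values to within 0.01. The small differences on a few rows are presumably because the table was produced from unrounded ratings.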
42
 
 
43
 
44
+ ### Win-rate of our model against other models:
45
 
46
+ <img src="bar plot.png" width="500"/>
 
 
 
 
 
 
 
 
47
 
48
+ ### Matrix Win-rate:
49
 
50
+ <img src="matrix.png" width="500"/>
51
+
52
+ ### GRPO:
53
+
54
+ The GRPO model has an 80% win rate against the model trained only with SFT.
55
 
56
  ---
57
 
58
+ ## Training
59
+
60
+ 1. **SFT**: One-epoch supervised fine-tune on synthetic reasoning traces generated from public PDFs (10K input/output pairs).
61
+ 2. **RL (GRPO)**: RL phase using a structure-aware reward (5K difficult image examples).
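
To make the RL step above more concrete, here is a minimal sketch of the group-relative advantage computation that GRPO is built on: several completions are sampled for the same input, each is scored, and rewards are normalised within the group. The README does not describe the actual structure-aware reward, so `toy_structure_reward` below is a purely hypothetical stand-in that only compares the sequence of Markdown line types; it is not the reward used to train this model.

```python
from difflib import SequenceMatcher
from statistics import mean, pstdev


def line_kind(line: str) -> str:
    """Classify a Markdown line by its structural role (toy heuristic)."""
    s = line.strip()
    if not s:
        return "blank"
    if s.startswith("#"):
        return "heading"
    if s.startswith("|"):
        return "table"
    if s.startswith(("- ", "* ", "+ ")):
        return "list"
    return "text"


def toy_structure_reward(pred: str, ref: str) -> float:
    """HYPOTHETICAL reward in [0, 1]: similarity of the two line-kind sequences."""
    a = [line_kind(l) for l in pred.splitlines()]
    b = [line_kind(l) for l in ref.splitlines()]
    return SequenceMatcher(None, a, b).ratio()


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalise rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sd = pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


if __name__ == "__main__":
    reference = "# Invoice\n| item | price |\n| --- | --- |\n| pen | 2 |"
    samples = [
        "# Invoice\n| item | price |\n| --- | --- |\n| pen | 2 |",  # layout kept
        "# Invoice\nitem price\npen 2",                             # table lost
        "Invoice item price pen 2",                                 # all structure lost
    ]
    rewards = [toy_structure_reward(s, reference) for s in samples]
    print(group_relative_advantages(rewards))
```

Completions that preserve more of the reference layout get a positive advantage and the others a negative one, which is the signal the policy update would use.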
62
 
 
 
63
 
 
64
 
65
+ ## Quick start: 🤗 Transformers
66
 
67
  ```python
68
  from __future__ import annotations
 
71
  from PIL import Image
72
  from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
73
 
74
+ model_id = "NM-dev/NuMarkdown-Qwen2.5-VL"
75
 
76
  processor = AutoProcessor.from_pretrained(
77
  model_id,
78
+ trust_remote_code=True,
79
  )
80
 
81
  model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
 
87
  )
88
 
89
  img = Image.open("invoice_scan.png").convert("RGB")
90
+ messages = [{
91
+ "role": "user",
92
+ "content": [
93
+ {"type": "image"},
94
+ ],
95
+ }]
96
+ prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
97
+ enc = processor(text=prompt, images=[img], return_tensors="pt").to(model.device)
 
 
 
 
 
 
 
 
 
 
98
 
99
  with torch.no_grad():
100
+ out = model.generate(**enc, max_new_tokens=5000)
 
 
 
 
 
 
 
 
 
 
101
 
102
+ print(processor.decode(out[0], skip_special_tokens=True).split("<answer>")[1].split("</answer>")[0])
103
+ ```
104
 
 
105
 
106
+ ## Quick start: vLLM
107
  ```python
108
  from PIL import Image
109
  from vllm import LLM, SamplingParams
110
  from transformers import AutoProcessor
111
 
112
  model_id = "NM-dev/NuMarkdown-Qwen2.5-VL"
 
113
  llm = LLM(model=model_id, trust_remote_code=True, dtype="bfloat16")
114
  proc = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
115
 
116
  img = Image.open("invoice_scan.png")
117
+ messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Convert this to Markdown with reasoning."}]}]
118
+ prompt = proc.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)  # text prompt with an image placeholder
119
 
120
+ params = SamplingParams(max_tokens=1024, temperature=0.8, top_p=0.95)
121
+ result = llm.generate([{"prompt": prompt, "multi_modal_data": {"image": img}}], params)[0].outputs[0].text.split("<answer>")[1].split("</answer>")[0]
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
122
  print(result)
123
+ ```
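
Both snippets slice the output with `.split("<answer>")[1]`, which raises `IndexError` if the tag is missing, for example when generation stops before the answer starts. A more forgiving variant is sketched below, assuming only that the final Markdown is wrapped in `<answer>` tags as in the snippets above. The name `extract_answer` and its fallback behaviour are illustrative choices, not part of the model's tooling.

```python
import re

# Non-greedy match so only the first <answer>...</answer> pair is captured.
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(generated: str) -> str:
    """Return the Markdown inside <answer>...</answer>, with graceful fallbacks."""
    match = _ANSWER_RE.search(generated)
    if match:
        return match.group(1).strip()
    # Opening tag present but never closed (e.g. output truncated by the token limit).
    _, sep, tail = generated.partition("<answer>")
    if sep:
        return tail.strip()
    # No tags at all: return the raw text rather than failing.
    return generated.strip()


if __name__ == "__main__":
    print(extract_answer("some reasoning...<answer>\n# Invoice\n</answer>"))
```

Returning the raw text when no tag is found is a design choice; raising an error instead may be preferable in a pipeline that must never store reasoning text as output.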