Update README.md
Browse files
README.md
CHANGED
@@ -1,58 +1,68 @@
---
license: mit
base_model: Qwen/Qwen2.5-VL-7B
model_name: NuMarkdown-Qwen2.5-VL
---

# NuMarkdown

**NuMarkdown-Qwen2.5-VL** is a reasoning vision-language model that converts documents into clean GitHub-flavoured Markdown.

---

* **Training data:** 10k synthetic doc-to-Markdown pairs + 5k challenging images.
* **Reasoning tokens:** during inference the model thinks ~20%–2× more tokens than its final answer.
* **License:** MIT – free for commercial & research use.

## Results

### Arena ranking

| Rank | Model | μ | σ | μ − 3σ |
| ---- | -------------------------------------- | ----- | ---- | ------ |
| 🥇 1 | **gemini-flash-reasoning** | 26.75 | 0.80 | 24.35 |
| 🥈 2 | **NuMarkdown-reasoning** | 26.10 | 0.79 | 23.72 |
| 🥉 3 | **NuMarkdown-reasoning-w/o reasoning** | 25.32 | 0.80 | 22.93 |
| 4 | **OCRFlux-3B** | 24.63 | 0.80 | 22.22 |
| 5 | **gpt-4o** | 24.48 | 0.80 | 22.08 |
| 6 | **gemini-flash-w/o reasoning** | 24.11 | 0.79 | 21.74 |
| 7 | **RolmoOCR** | 23.53 | 0.82 | 21.07 |

### Win-rate

---

## Training

1. **Supervised fine-tuning (SFT)** – one epoch on 10k synthetic pairs generated from public PDFs.
2. **Reinforcement Learning (GRPO)** – 5k difficult images with a **structure-aware** reward focusing on layout fidelity.

---

## Quick start

```python
from __future__ import annotations

import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "NM-dev/NuMarkdown-Qwen2.5-VL"

processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumed defaults; the original arguments are not shown in the diff
    device_map="auto",
    trust_remote_code=True,
)

img = Image.open("invoice_scan.png").convert("RGB")
messages = [
    {"role": "user", "content": [{"type": "image"}]},
]

prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = processor(
    text=prompt,
    images=[img],
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=5000)

# Decode first, then keep only the text between the <answer> tags.
decoded = processor.decode(outputs[0], skip_special_tokens=True)
print(decoded.split("<answer>")[1].split("</answer>")[0])
```

## Quick start – vLLM

```python
from PIL import Image
from vllm import LLM, SamplingParams
from transformers import AutoProcessor

model_id = "NM-dev/NuMarkdown-Qwen2.5-VL"

llm = LLM(model=model_id, trust_remote_code=True, dtype="bfloat16")
proc = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

img = Image.open("invoice_scan.png")

# Build the chat prompt as text; the image itself is passed to vLLM via multi_modal_data.
messages = [{"role": "user", "content": [{"type": "image"}]}]
prompt = proc.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

params = SamplingParams(
    max_tokens=1_024,
    temperature=0.8,
    top_p=0.95,
)

result = (
    llm.generate([{"prompt": prompt, "multi_modal_data": {"image": img}}], params)[0]
    .outputs[0]
    .text.split("<answer>")[1]
    .split("</answer>")[0]
)

print(result)
```

---

## Citation

If you use **NuMarkdown-Qwen2.5-VL** in your research, please cite the model:

```bibtex
@software{NuMarkdown-Qwen2.5-VL,
  title   = {NuMarkdown-Qwen2.5-VL: Vision-language reasoning model for doc-to-Markdown},
  author  = {NM-dev},
  year    = 2025,
  url     = {https://huggingface.co/NM-dev/NuMarkdown-Qwen2.5-VL},
  license = {MIT}
}
```

---

*Last updated: 2025-08-04*
---
license: mit
base_model: Qwen/Qwen2.5-VL-7B
tags:
- vision-language
- document-to-markdown
- reinforcement-learning
- grpo
- qwen2.5
- markdown
model_name: NuMarkdown-Qwen2.5-VL
datasets:
- NM-dev/markdown-input_output-v3
- NM-dev/markdown-grpo-images3
library_name: transformers
pipeline_tag: text-generation
---

# NuMarkdown-Qwen2.5-VL

**NuMarkdown-Qwen2.5-VL** is the first reasoning vision-language model trained to convert documents into clean GitHub-flavoured Markdown.
It is a fine-tune of **Qwen 2.5-VL-7B** using ~10k synthetic doc-to-Markdown pairs, followed by an RL phase (GRPO) with a layout-centric reward.

*(Note: the number of thinking tokens can vary from 20% to 2× the number of tokens in the final answer.)*
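
To make the note above concrete, here is a small helper (not from the model card, purely illustrative) that estimates the thinking-to-answer token ratio for one generation. It assumes only that the decoded output wraps the final Markdown in `<answer>...</answer>` tags, as in the quick-start snippets below, and that a Hugging Face tokenizer (e.g. `processor.tokenizer`) is available.

```python
# Rough, hypothetical sketch: estimate the thinking/answer token ratio of one generation.
# Everything before <answer> is treated as the reasoning trace.
def thinking_ratio(decoded: str, tokenizer) -> float:
    reasoning, _, rest = decoded.partition("<answer>")
    answer = rest.split("</answer>")[0]
    n_think = len(tokenizer.encode(reasoning))
    n_answer = len(tokenizer.encode(answer))
    return n_think / max(n_answer, 1)

# Example: thinking_ratio(decoded_text, processor.tokenizer)
# Per the note above, values roughly between 0.2 and 2.0 are expected.
```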

---

## Results

(We plan to release a Markdown arena, similar to llmArena, for the complex document-to-Markdown task.)

### Arena ranking (using the TrueSkill-2 rating system)

| Rank | Model | μ | σ | μ − 3σ |
| ---- | -------------------------------------- | ----- | ---- | ------ |
| 🥇 1 | **gemini-flash-reasoning** | 26.75 | 0.80 | 24.35 |
| 🥈 2 | **NuMarkdown-reasoning** | 26.10 | 0.79 | 23.72 |
| 🥉 3 | **NuMarkdown-reasoning-w/o reasoning** | 25.32 | 0.80 | 22.93 |
| 4 | **OCRFlux-3B** | 24.63 | 0.80 | 22.22 |
| 5 | **gpt-4o** | 24.48 | 0.80 | 22.08 |
| 6 | **gemini-flash-w/o reasoning** | 24.11 | 0.79 | 21.74 |
| 7 | **RolmoOCR** | 23.53 | 0.82 | 21.07 |
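
The last column is the conservative TrueSkill-style score μ − 3σ used for the ranking; a quick sanity check against the table (the displayed μ and σ are rounded, which explains a ±0.01 drift on some rows):

```python
# Conservative rating used for the ranking: mu - 3 * sigma.
def conservative(mu: float, sigma: float) -> float:
    return mu - 3 * sigma

print(round(conservative(26.75, 0.80), 2))  # gemini-flash-reasoning -> 24.35 (matches the table)
print(round(conservative(26.10, 0.79), 2))  # NuMarkdown-reasoning   -> 23.73 (table shows 23.72 after rounding)
```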

### Win rate of our model against other models

<img src="bar plot.png" width="500"/>

### Win-rate matrix

<img src="matrix.png" width="500"/>

### GRPO

The GRPO-trained model wins 80% of head-to-head comparisons against the model trained with SFT only.

---

## Training

1. **SFT**: a one-epoch supervised fine-tune on synthetic reasoning traces generated from public PDFs (10k input/output pairs).
2. **RL (GRPO)**: an RL phase using a structure-aware reward on 5k difficult image examples (a toy sketch of such a reward is shown below).
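
The exact reward is not published in this card; purely as an illustration, the sketch below scores a hypothetical notion of layout fidelity by comparing the heading levels and table shapes of the predicted and reference Markdown (all names and weights are invented for the sketch):

```python
import re

def structure_reward(pred_md: str, ref_md: str) -> float:
    """Toy layout-fidelity reward: compare heading levels and table row widths."""
    def skeleton(md: str):
        headings = [len(m.group(1)) for m in re.finditer(r"^(#+)\s", md, re.MULTILINE)]
        table_rows = [line.count("|") for line in md.splitlines() if line.strip().startswith("|")]
        return headings, table_rows

    (h_pred, t_pred), (h_ref, t_ref) = skeleton(pred_md), skeleton(ref_md)

    def overlap(a, b):
        if not a and not b:
            return 1.0
        matches = sum(1 for x, y in zip(a, b) if x == y)
        return matches / max(len(a), len(b))

    # Equal weight on heading structure and table structure; returns a value in [0, 1].
    return 0.5 * overlap(h_pred, h_ref) + 0.5 * overlap(t_pred, t_ref)
```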

## Quick start: 🤗 Transformers

```python
from __future__ import annotations

import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "NM-dev/NuMarkdown-Qwen2.5-VL"

processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumed defaults; the original arguments are not shown in the diff
    device_map="auto",
    trust_remote_code=True,
)

img = Image.open("invoice_scan.png").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
enc = processor(text=prompt, images=[img], return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**enc, max_new_tokens=5000)

# Decode first, then keep only the text between the <answer> tags.
decoded = processor.decode(out[0], skip_special_tokens=True)
print(decoded.split("<answer>")[1].split("</answer>")[0])
```
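
A common follow-up is converting a whole PDF rather than a single image. The sketch below is not part of the model card: it reuses the `processor`, `model`, and `prompt` objects from the snippet above and assumes the third-party `pdf2image` package for page rendering.

```python
# Hypothetical helper: convert every page of a PDF by rendering pages to PIL images.
from pdf2image import convert_from_path

def pdf_to_markdown(pdf_path: str) -> list[str]:
    pages = convert_from_path(pdf_path)  # one PIL.Image per page
    markdown_pages = []
    for page in pages:
        enc = processor(text=prompt, images=[page.convert("RGB")], return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model.generate(**enc, max_new_tokens=5000)
        decoded = processor.decode(out[0], skip_special_tokens=True)
        markdown_pages.append(decoded.split("<answer>")[1].split("</answer>")[0])
    return markdown_pages
```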

## Quick start: vLLM

```python
from PIL import Image
from vllm import LLM, SamplingParams
from transformers import AutoProcessor

model_id = "NM-dev/NuMarkdown-Qwen2.5-VL"
llm = LLM(model=model_id, trust_remote_code=True, dtype="bfloat16")
proc = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

img = Image.open("invoice_scan.png")

# Build the chat prompt as text; the image itself is passed to vLLM via multi_modal_data.
messages = [{"role": "user", "content": [{"type": "image"}]}]
prompt = proc.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

params = SamplingParams(max_tokens=1024, temperature=0.8, top_p=0.95)
out = llm.generate([{"prompt": prompt, "multi_modal_data": {"image": img}}], params)[0]
result = out.outputs[0].text.split("<answer>")[1].split("</answer>")[0]
print(result)
```
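
Since vLLM batches requests efficiently, several pages can be submitted in one call. This is a minimal sketch reusing the `llm`, `prompt`, and `params` objects from the snippet above; the file names are placeholders.

```python
# Sketch: batch several page images in a single generate call.
pages = [Image.open(p) for p in ["page_1.png", "page_2.png"]]
requests = [{"prompt": prompt, "multi_modal_data": {"image": page}} for page in pages]
outputs = llm.generate(requests, params)
markdown = [o.outputs[0].text.split("<answer>")[1].split("</answer>")[0] for o in outputs]
```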