Upload folder using huggingface_hub

- README.md +87 -96
- model.safetensors +3 -0
- tokenizer.json +2 -2
- tokenizer.model +2 -2

README.md CHANGED
@@ -40,26 +40,24 @@ PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document
 [](https://x.com/PaddlePaddle)
 [](./LICENSE)
 
-**🔥
-**📝 [Technical Report](https://arxiv.org/pdf/2510.14528)
+**🔥 Official Website**: [Baidu AI Studio](https://aistudio.baidu.com/paddleocr) |
+**📝 arXiv**: [Technical Report](https://arxiv.org/pdf/2510.14528)
 
 </div>
 
 <div align="center">
-<img src="https://
+<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/allmetric.png" width="800"/>
 </div>
 
-
-
 ## Introduction
 
-**PaddleOCR-VL-1.5 is an upgraded model achieving a new SOTA accuracy of 94.5% on OmniDocBench v1.5**. To rigorously evaluate robustness against real-world physical distortions—including scanning artifacts,
+**PaddleOCR-VL-1.5 is an upgraded model achieving a new SOTA accuracy of 94.5% on OmniDocBench v1.5**. To rigorously evaluate robustness against real-world physical distortions—including scanning artifacts, skewing, curving, screen-photo capture, and light variations—we propose the Real5-OmniDocBench benchmark. Experimental results demonstrate that this enhanced model attains SOTA performance on the newly curated benchmark. Furthermore, we extend the model’s capabilities by incorporating seal recognition and text spotting tasks, while remaining a 0.9B ultra-compact VLM with high efficiency.
 
 ### **Key Capabilities of PaddleOCR-VL-1.5**
 
-1. With a **parameter size of 0.9B**, PaddleOCR-VL-1.5 **achieves
+1. With a **parameter size of 0.9B**, PaddleOCR-VL-1.5 **achieves 94.5% accuracy on OmniDocBench v1.5**, surpassing the previous SOTA model PaddleOCR-VL. Significant improvements are observed in **table, formula, and text understanding.**
 
-2. **It introduces an innovative approach to document parsing by supporting irregular-shaped localization**, enabling accurate polygonal detection under skewed and
+2. **It introduces an innovative approach to document parsing by supporting irregular-shaped localization**, enabling accurate polygonal detection under skewed and curved document conditions. Evaluations across five real-world scenarios—scanning, curving, skewing, screen-photo capture, and light variation—demonstrate superior performance over mainstream open-source and proprietary models.
 
 3. The model introduces **text spotting (text-line localization and recognition)**, along with **seal recognition**, with all corresponding metrics **setting new SOTA results** in their respective tasks.
 
@@ -71,13 +69,13 @@ PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document
 ### **Model Architecture**
 
 <div align="center">
-<img src="https://
+<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/paddleocrvl.png" width="800"/>
 </div>
 
 
 ## News
 
-* ```2026.01.29``` 🚀 We release [PaddleOCR-VL-1.5](https://
+* ```2026.01.29``` 🚀 We release [PaddleOCR-VL-1.5](https://github.com/PaddlePaddle/PaddleOCR-1.5), a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing.
 
 ## Usage
 
@@ -177,7 +175,7 @@ from transformers import AutoProcessor, AutoModelForImageTextToText
 # ---- Settings ----
 model_path = "PaddlePaddle/PaddleOCR-VL-1.5"
 image_path = "test.png"
-task = "ocr" # Options: 'ocr' | 'table' | 'chart' | 'formula' | 'spotting'
+task = "ocr" # Options: 'ocr' | 'table' | 'chart' | 'formula' | 'spotting'
 # ------------------
 
 DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
@@ -187,28 +185,11 @@ PROMPTS = {
     "formula": "Formula Recognition:",
     "chart": "Chart Recognition:",
     "spotting": "Spotting:",
-    "seal": "Seal Recognition:",
 }
 
-model = AutoModelForImageTextToText.from_pretrained(model_path,
+model = AutoModelForImageTextToText.from_pretrained(model_path, dtype="bfloat16").to(DEVICE).eval()
 processor = AutoProcessor.from_pretrained(model_path)
-
-# ---- Image Preprocessing ----
 image = Image.open(image_path).convert("RGB")
-orig_w, orig_h = image.size
-spotting_upscale_threshold = 1500
-
-if task == "spotting" and orig_w < spotting_upscale_threshold and orig_h < spotting_upscale_threshold:
-    process_w, process_h = orig_w * 2, orig_h * 2
-    try:
-        resample_filter = Image.Resampling.LANCZOS
-    except AttributeError:
-        resample_filter = Image.LANCZOS
-    image = image.resize((process_w, process_h), resample_filter)
-
-# Set max_pixels: use 1605632 for spotting, otherwise use default ~1M pixels
-max_pixels = 2048 * 28 * 28 if task == "spotting" else 1280 * 28 * 28
-
 messages = [
     {
         "role": "user",
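The diff elides the body of the user message (old lines 215–217) between this hunk and the next, so the exact content list is not visible in this commit view. A purely hypothetical sketch of what such a message typically looks like with this processor family, pairing the loaded image with the prompt selected by `task`:

```python
# Hypothetical reconstruction; the elided message body is not shown in this diff.
prompt = PROMPTS[task]  # e.g. "Spotting:" when task == "spotting"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }
]
```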
@@ -218,20 +199,15 @@ messages = [
         ]
     }
 ]
-
 inputs = processor.apply_chat_template(
-    messages,
-    add_generation_prompt=True,
-    tokenize=True,
-    return_dict=True,
-    return_tensors="pt",
-    image_processor_kwargs={
-        "max_pixels": max_pixels,
-        "min_pixels": 144 * 28 * 28
-    }
+    messages,
+    add_generation_prompt=True,
+    tokenize=True,
+    return_dict=True,
+    return_tensors="pt",
 ).to(model.device)
 
-outputs = model.generate(**inputs, max_new_tokens=
+outputs = model.generate(**inputs, max_new_tokens=100)
 result = processor.decode(outputs[0][inputs["input_ids"].shape[-1]:-1])
 print(result)
 ```
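This commit removes the spotting-specific image preprocessing and the `image_processor_kwargs` overrides from both usage examples. For readers who still want the old behavior, here is a minimal standalone sketch reconstructed from the deleted lines (names such as `image_path` and `task` come from the surrounding example):

```python
from PIL import Image

image = Image.open(image_path).convert("RGB")
orig_w, orig_h = image.size

# The removed code upscaled small inputs 2x for the spotting task
# (threshold: both sides shorter than 1500 px), using LANCZOS resampling.
if task == "spotting" and orig_w < 1500 and orig_h < 1500:
    try:
        resample_filter = Image.Resampling.LANCZOS  # Pillow >= 9.1
    except AttributeError:
        resample_filter = Image.LANCZOS  # older Pillow releases
    image = image.resize((orig_w * 2, orig_h * 2), resample_filter)

# 2048 * 28 * 28 = 1,605,632 pixels for spotting; 1280 * 28 * 28 ~= 1M otherwise.
max_pixels = 2048 * 28 * 28 if task == "spotting" else 1280 * 28 * 28
```

The deleted call then forwarded these bounds via `image_processor_kwargs={"max_pixels": max_pixels, "min_pixels": 144 * 28 * 28}` in `processor.apply_chat_template(...)`; whether that kwarg is still accepted depends on your installed transformers version.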
@@ -252,7 +228,7 @@ from transformers import AutoProcessor, AutoModelForImageTextToText
 # ---- Settings ----
 model_path = "PaddlePaddle/PaddleOCR-VL-1.5"
 image_path = "test.png"
-task = "ocr" # Options: 'ocr' | 'table' | 'chart' | 'formula' | 'spotting'
+task = "ocr" # Options: 'ocr' | 'table' | 'chart' | 'formula' | 'spotting'
 # ------------------
 
 DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
@@ -262,28 +238,11 @@ PROMPTS = {
     "formula": "Formula Recognition:",
     "chart": "Chart Recognition:",
     "spotting": "Spotting:",
-    "seal": "Seal Recognition:",
 }
 
 model = AutoModelForImageTextToText.from_pretrained(model_path, dtype="bfloat16", attn_implementation="flash_attention_2").to(DEVICE).eval()
 processor = AutoProcessor.from_pretrained(model_path)
-
-# ---- Image Preprocessing ----
 image = Image.open(image_path).convert("RGB")
-orig_w, orig_h = image.size
-spotting_upscale_threshold = 1500
-
-if task == "spotting" and orig_w < spotting_upscale_threshold and orig_h < spotting_upscale_threshold:
-    process_w, process_h = orig_w * 2, orig_h * 2
-    try:
-        resample_filter = Image.Resampling.LANCZOS
-    except AttributeError:
-        resample_filter = Image.LANCZOS
-    image = image.resize((process_w, process_h), resample_filter)
-
-# Set max_pixels: use 1605632 for spotting, otherwise use default ~1M pixels
-max_pixels = 2048 * 28 * 28 if task == "spotting" else 1280 * 28 * 28
-
 messages = [
     {
         "role": "user",
@@ -299,13 +258,9 @@ inputs = processor.apply_chat_template(
     tokenize=True,
     return_dict=True,
     return_tensors="pt",
-    image_processor_kwargs={
-        "max_pixels": max_pixels,
-        "min_pixels": 144 * 28 * 28
-    }
 ).to(model.device)
 
-outputs = model.generate(**inputs, max_new_tokens=
+outputs = model.generate(**inputs, max_new_tokens=100)
 result = processor.decode(outputs[0][inputs["input_ids"].shape[-1]:-1])
 print(result)
 ```
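The example above hard-codes `attn_implementation="flash_attention_2"`, which requires the `flash-attn` package and a supported NVIDIA GPU. A hedged sketch of a graceful fallback, assuming a transformers version that also accepts `attn_implementation="sdpa"` (names taken from the example):

```python
# Prefer FlashAttention-2 when the flash-attn package is importable,
# otherwise fall back to PyTorch's built-in scaled-dot-product attention.
try:
    import flash_attn  # noqa: F401
    attn_impl = "flash_attention_2"
except ImportError:
    attn_impl = "sdpa"

model = (
    AutoModelForImageTextToText.from_pretrained(
        model_path, dtype="bfloat16", attn_implementation=attn_impl
    )
    .to(DEVICE)
    .eval()
)
```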
@@ -314,109 +269,145 @@ print(result)
 
 ## Performance
 
-### Document Parsing
+### Page-Level Document Parsing
 
 
 #### 1. OmniDocBench v1.5
 
-##### PaddleOCR-VL
+##### PaddleOCR-VL achieves SOTA performance on the overall, text, formula, table, and reading-order metrics of OmniDocBench v1.5
 
 <div align="center">
-<img src="https://
+<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/omni15.png" width="800"/>
 </div>
 
 
-> **Notes:**
-> - Performance metrics are cited from the [OmniDocBench official leaderboard](https://opendatalab.com/omnidocbench), except for Gemini-3 Pro, Qwen3-VL-235B-A22B-Instruct and our model, which were evaluated independently.
-
 
-#### 2.
+#### 2. OmniDocBench v1.0
 
-#####
+##### PaddleOCR-VL achieves SOTA performance on almost all overall, text, formula, table, and reading-order metrics of OmniDocBench v1.0
 
 
 <div align="center">
-<img src="https://
+<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/omni10.png" width="800"/>
 </div>
 
+
 > **Notes:**
-> -
+> - The metrics are from [MinerU](https://github.com/opendatalab/MinerU), [OmniDocBench](https://github.com/opendatalab/OmniDocBench), and our own internal evaluations.
+
+
+### Element-level Recognition
+
+#### 1. Text
+
+**Comparison of OmniDocBench-OCR-block Performance**
+
+PaddleOCR-VL demonstrates robust and versatile capability in handling diverse document types, establishing it as the leading method in the OmniDocBench-OCR-block performance evaluation.
 
 <div align="center">
-<img src="https://
+<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/omnibenchocr.png" width="800"/>
 </div>
 
-> **Notes:**
-> - End-to-End Inference Performance Comparison on OmniDocBench v1.5. PDF documents were processed in batches of 512 on a single NVIDIA A100 GPU. The reported end-to-end runtime includes both PDF rendering and Markdown generation. All methods rely on their built-in PDF parsing modules and default DPI settings to reflect out-of-the-box performance.
 
+**Comparison of In-house-OCR Performance**
+
+In-house-OCR provides an evaluation of performance across multiple languages and text types. Our model demonstrates outstanding accuracy, with the lowest edit distances in all evaluated scripts.
+
+<div align="center">
+<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/inhouseocr.png" width="800"/>
+</div>
+
+
+#### 2. Table
+
+**Comparison of In-house-Table Performance**
+
+Our self-built evaluation set contains diverse types of table images, such as Chinese, English, and mixed Chinese-English tables, with various characteristics like full, partial, or no borders, book/manual formats, lists, academic papers, merged cells, as well as low-quality and watermarked tables. PaddleOCR-VL achieves remarkable performance across all categories.
+
+<div align="center">
+<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/inhousetable.png" width="600"/>
+</div>
+
+#### 3. Formula
+
+**Comparison of In-house-Formula Performance**
+
+The In-house-Formula evaluation set contains simple prints, complex prints, camera scans, and handwritten formulas. PaddleOCR-VL demonstrates the best performance in every category.
+
+<div align="center">
+<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/inhouse-formula.png" width="500"/>
+</div>
+
+
+#### 4. Chart
+
+**Comparison of In-house-Chart Performance**
+
+The evaluation set is broadly categorized into 11 chart categories, including bar-line hybrid, pie, 100% stacked bar, area, bar, bubble, histogram, line, scatterplot, stacked area, and stacked bar. PaddleOCR-VL not only outperforms expert OCR VLMs but also surpasses some 72B-level multimodal language models.
+
+<div align="center">
+<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/inhousechart.png" width="400"/>
+</div>
 
-## Visualization
 
+
+## Visualization
 
-###
+
+### Comprehensive Document Parsing
 
 <div align="center">
-<img src="https://
+<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/overview1.jpg" width="600"/>
+<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/overview2.jpg" width="600"/>
+<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/overview3.jpg" width="600"/>
+<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/overview4.jpg" width="600"/>
 </div>
 
-
+
+### Text
 
 <div align="center">
-<img src="https://
+<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/text_english_arabic.jpg" width="300" style="display: inline-block;"/>
+<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/text_handwriting_02.jpg" width="300" style="display: inline-block;"/>
 </div>
 
 
-####
+### Table
 
 <div align="center">
-<img src="https://
+<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/table_01.jpg" width="300" style="display: inline-block;"/>
+<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/table_02.jpg" width="300" style="display: inline-block;"/>
 </div>
 
 
-
-<div align="center">
-<img src="https://
-</div>
-
-
-###
-
-<div align="center">
-<img src="https://
-</div>
-
-
-###
+### Formula
+
+<div align="center">
+<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/formula_EN.jpg" width="300" style="display: inline-block;"/>
+<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/formula_ZH.jpg" width="300" style="display: inline-block;"/>
+</div>
+
+
+### Chart
 
 <div align="center">
-<img src="https://
+<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/chart_01.jpg" width="300" style="display: inline-block;"/>
+<img src="https://huggingface.co/datasets/PaddlePaddle/PaddleOCR-VL_demo/resolve/main/imgs/chart_02.jpg" width="300" style="display: inline-block;"/>
 </div>
 
 
 ## Acknowledgments
 
-We would like to thank [
+We would like to thank [ERNIE](https://github.com/PaddlePaddle/ERNIE), [Keye](https://github.com/Kwai-Keye/Keye), [MinerU](https://github.com/opendatalab/MinerU), and [OmniDocBench](https://github.com/opendatalab/OmniDocBench) for providing valuable code, model weights, and benchmarks. We also appreciate everyone's contribution to this open-source project!
 
 ## Citation
 
-If you find PaddleOCR-VL
+If you find PaddleOCR-VL helpful, feel free to give us a star and citation.
 
 ```bibtex
 coming soon
model.safetensors ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d557c9d8997ae57ed3b1b33bdf347be878cc335687f32ca105341c16973f8958
+size 1917255968
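The model.safetensors entry above and the two tokenizer entries below are Git LFS pointer files: the repository tracks only the spec version, a sha256 `oid`, and the blob `size` in bytes, while the actual bytes live in LFS storage. A small sketch, assuming the real `model.safetensors` has been downloaded locally, for checking a file against the pointer's `oid`:

```python
import hashlib

def sha256_of(path: str) -> str:
    """Stream the file in 1 MiB chunks so multi-GB weights need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# oid taken from the pointer file added in this commit.
expected = "d557c9d8997ae57ed3b1b33bdf347be878cc335687f32ca105341c16973f8958"
assert sha256_of("model.safetensors") == expected
```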
tokenizer.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:45ed88f7769781f2251cbfc8d7f162dd458eddf23ab99051a9db4448c09b5c33
+size 133
tokenizer.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:3614d0fee7e3b9a9a00b978752c9cf87bc85984a38ab08f4d7dd6bb8d2e3be83
+size 132