|
|
--- |
|
|
license: other |
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- Qwen/Qwen2.5-VL-3B-Instruct |
|
|
pipeline_tag: image-text-to-text |
|
|
library_name: transformers |
|
|
tags: |
|
|
- text-generation-inference |
|
|
- document-ai |
|
|
- table-extraction |
|
|
- layouts |
|
|
- markdown |
|
|
- html-markdown |
|
|
- document-retrieval |
|
|
- visual-grounding |
|
|
- pdf-ocr |
|
|
- layout-analysis |
|
|
- xml |
|
|
- yaml |
|
|
- json |
|
|
--- |
|
|
|
|
|
 |
|
|
|
|
|
# **epsilon-ocr-d.markdown-post3.0.m** |
|
|
|
|
|
> **epsilon-ocr-d.markdown-post3.0.m** is an experimental multimodal document AI model fine-tuned on top of **Qwen2.5-VL-3B-Instruct**, optimized for OCR-driven document reconstruction and dynamic Markdown generation. It converts documents into structured **Markdown**, **HTML-Markdown**, and hybrid technical-documentation formats with inline code adaptation. Built for efficient model scaling, it offers strong performance at reduced compute requirements.
|
|
|
|
|
# Key Enhancements |
|
|
|
|
|
* **Dynamic Markdown and Layout Reconstruction** |
|
|
Converts multi-page and complex-layout documents into structured Markdown or HTML-Markdown with preserved hierarchy, formatting, headings, and semantic reading order (see the prompt sketch after this list).
|
|
|
|
|
* **Inline Programming Language Support** |
|
|
Automatically embeds LaTeX, Python, JavaScript, and shell code blocks within reconstructed documentation for research and technical writing. |
|
|
|
|
|
* **High Accuracy OCR and Visual Parsing** |
|
|
Extracts text from structured, semi-structured, and unstructured formats, and supports multi-page input with contextual alignment.
|
|
|
|
|
* **Complex Structure Understanding** |
|
|
Parses tables, forms, graphs, diagrams, multi-column layouts, and mathematical expressions without structural loss.
|
|
|
|
|
* **Document Retrieval and Semantic Linking** |
|
|
Performs cross-page reasoning and content referencing for enterprise document workflows.
|
|
|
|
|
* **Multimodal Long Document Reasoning** |
|
|
Supports long-form content comprehension for slides, scanned books, handwritten pages, and research papers.
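The reconstruction target is steered entirely through the text prompt. As a rough illustration (the exact phrasings below are assumptions, not a documented prompt set for this checkpoint), different instructions request different output formats:

```python
# Illustrative task prompts; the exact wording is an assumption, not a
# documented prompt set for this checkpoint. Swap a string into the "text"
# content part of the chat message shown in the Quick Start below.
TASK_PROMPTS = {
    "markdown": "Convert this document to Markdown.",
    "html_markdown": "Convert this document to HTML-Markdown, preserving the layout.",
    "tables": "Extract every table in this document as a Markdown table.",
    "math": "Transcribe all mathematical expressions as LaTeX.",
}
```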
|
|
|
|
|
--- |
|
|
|
|
|
> 👉 This model is a stage-progression checkpoint and may still contain output artifacts.
|
|
|
|
|
--- |
|
|
|
|
|
# Quick Start with Transformers |
|
|
|
|
|
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the fine-tuned model and its processor.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/epsilon-ocr-d.markdown-post3.0.m", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("prithivMLmods/epsilon-ocr-d.markdown-post3.0.m")

# A single-image request: the text prompt steers the reconstruction task.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Convert to Markdown."},
        ],
    }
]

# Build the chat prompt and collect the vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, then strip the prompt tokens before decoding.
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
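For multi-page PDFs, one workable pattern is to render each page to an image and run the same pipeline page by page. The sketch below is an illustration, not an official recipe for this model: it reuses `model` and `processor` from the Quick Start and assumes `pdf2image` (which requires Poppler) is installed.

```python
from pdf2image import convert_from_path  # assumed dependency; needs Poppler installed

def pdf_to_markdown(pdf_path: str, dpi: int = 200) -> str:
    """Render each PDF page to a PIL image and convert it with the model above."""
    pages = convert_from_path(pdf_path, dpi=dpi)
    chunks = []
    for page in pages:
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": page},  # qwen_vl_utils accepts PIL images
                    {"type": "text", "text": "Convert to Markdown."},
                ],
            }
        ]
        text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        image_inputs, video_inputs = process_vision_info(messages)
        inputs = processor(
            text=[text], images=image_inputs, videos=video_inputs,
            padding=True, return_tensors="pt",
        ).to(model.device)
        generated_ids = model.generate(**inputs, max_new_tokens=2048)
        trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
        chunks.append(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
    return "\n\n".join(chunks)
```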
|
|
|
|
|
# Intended Use |
|
|
|
|
|
* OCR-to-Markdown or HTML-Markdown conversion
|
|
* Document reconstruction for manuals, books, and research materials |
|
|
* Table extraction and structural transformation |
|
|
* Multi-page document retrieval and question answering
|
|
* Mathematical OCR and LaTeX generation |
|
|
* Form extraction and structured entity mapping (see the parsing sketch after this list)
|
|
* Documentation rebuilding for enterprise knowledge systems |
|
|
* Automation of digitization and archival systems |
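For form extraction and entity mapping, a practical pattern is to prompt for a machine-readable format (e.g. "Extract all form fields as a flat JSON object") and parse the reply defensively, since the model may wrap the JSON in a code fence or add surrounding prose. A minimal parsing sketch, with the prompt wording as an assumption:

```python
import json
import re

def parse_json_reply(reply: str) -> dict:
    """Pull the first JSON object out of a model reply, tolerating code fences."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        return {}
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
```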
|
|
|
|
|
# Limitations |
|
|
|
|
|
* Accuracy may drop on heavily damaged or very low-resolution images
|
|
* Reduced performance on very long document reasoning compared with larger VL models
|
|
* Language coverage varies for low-resource scripts
|
|
* Very complex forms may require secondary refinement |
|
|
|
|
|
## References |
|
|
|
|
|
* Qwen2.5-VL Technical Report
|
|
[https://huggingface.co/papers/2502.13923](https://huggingface.co/papers/2502.13923) |
|
|
|
|
|
* DocVLM: Make Your VLM an Efficient Reader
|
|
[https://arxiv.org/pdf/2412.08746v1](https://arxiv.org/pdf/2412.08746v1) |
|
|
|
|
|
* YaRN: Efficient Context Window Extension of Large Language Models
|
|
[https://arxiv.org/pdf/2309.00071](https://arxiv.org/pdf/2309.00071) |
|
|
|
|
|
* Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
|
|
[https://arxiv.org/pdf/2409.12191](https://arxiv.org/pdf/2409.12191) |
|
|
|
|
|
* Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
|
|
[https://arxiv.org/pdf/2308.12966](https://arxiv.org/pdf/2308.12966) |
|
|
|
|
|
* OCR Benchmark for Multimodal Models |
|
|
[https://arxiv.org/pdf/2412.02210](https://arxiv.org/pdf/2412.02210) |