|
|
--- |
|
|
license: other |
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- Qwen/Qwen2.5-VL-3B-Instruct |
|
|
pipeline_tag: image-text-to-text |
|
|
library_name: transformers |
|
|
tags: |
|
|
- text-generation-inference |
|
|
- document-ai |
|
|
- table-extraction |
|
|
- layouts |
|
|
- markdown |
|
|
- html-markdown |
|
|
- document-retrieval |
|
|
- visual-grounding |
|
|
- pdf-ocr |
|
|
- layout-analysis |
|
|
- xml |
|
|
- yaml |
|
|
- json |
|
|
--- |
|
|
|
|
|
 |
|
|
|
|
|
# **epsilon-ocr-d.markdown-post3.0.m** |
|
|
|
|
|
> **epsilon-ocr-d.markdown-post3.0.m** is an experimental multimodal document AI model fine-tuned on top of **Qwen2.5-VL-3B-Instruct**, optimized for OCR-driven document reconstruction and dynamic Markdown generation. It converts documents into structured **Markdown**, **HTML-Markdown**, and hybrid technical-documentation formats with inline code adaptation. Built for efficient model scaling, it offers strong performance at reduced compute requirements.
|
|
|
|
|
# Key Enhancements |
|
|
|
|
|
* **Dynamic Markdown and Layout Reconstruction** |
|
|
Converts multi-page and complex-layout documents into structured Markdown or HTML-Markdown with preserved hierarchy, formatting, headings, and semantic reading order (see the prompt sketch after this list).
|
|
|
|
|
* **Inline Programming Language Support** |
|
|
Automatically embeds LaTeX, Python, JavaScript, and shell code blocks within reconstructed documentation for research and technical writing. |
|
|
|
|
|
* **High Accuracy OCR and Visual Parsing** |
|
|
Extracts text from structured, semi-structured, and unstructured formats, and supports multi-page input with contextual alignment.
|
|
|
|
|
* **Complex Structure Understanding** |
|
|
Parses tables, forms, graphs, diagrams, multi-column layouts, and mathematical expressions without structural loss.
|
|
|
|
|
* **Document Retrieval and Semantic Linking** |
|
|
Performs cross-page reasoning and content referencing for enterprise document workflows.
|
|
|
|
|
* **Multimodal Long Document Reasoning** |
|
|
Supports long-form content comprehension for slides, scanned books, handwritten pages, and research papers.
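The reconstruction target is steered entirely through the text prompt. As a rough illustration (the exact phrasings below are assumptions, not a documented prompt set for this checkpoint), different instructions request different output formats:

```python
# Illustrative task prompts; the exact wording is an assumption, not a
# documented prompt set for this checkpoint. Swap a string into the "text"
# content part of the chat message shown in the Quick Start below.
TASK_PROMPTS = {
    "markdown": "Convert this document to Markdown.",
    "html_markdown": "Convert this document to HTML-Markdown, preserving the layout.",
    "tables": "Extract every table in this document as a Markdown table.",
    "math": "Transcribe all mathematical expressions as LaTeX.",
}
```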
|
|
|
|
|
--- |
|
|
|
|
|
> 👉 This model is a stage-progression checkpoint and may still contain output artifacts.
|
|
|
|
|
--- |
|
|
|
|
|
# Quick Start with Transformers |
|
|
|
|
|
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the fine-tuned model and its processor.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/epsilon-ocr-d.markdown-post3.0.m", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("prithivMLmods/epsilon-ocr-d.markdown-post3.0.m")

# A single-image request: the text prompt steers the reconstruction task.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Convert to Markdown."},
        ],
    }
]

# Build the chat prompt and collect the vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, then strip the prompt tokens before decoding.
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
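For multi-page PDFs, one workable pattern is to render each page to an image and run the same pipeline page by page. The sketch below is an illustration, not an official recipe for this model: it reuses `model` and `processor` from the Quick Start and assumes `pdf2image` (which requires Poppler) is installed.

```python
from pdf2image import convert_from_path  # assumed dependency; needs Poppler installed

def pdf_to_markdown(pdf_path: str, dpi: int = 200) -> str:
    """Render each PDF page to a PIL image and convert it with the model above."""
    pages = convert_from_path(pdf_path, dpi=dpi)
    chunks = []
    for page in pages:
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": page},  # qwen_vl_utils accepts PIL images
                    {"type": "text", "text": "Convert to Markdown."},
                ],
            }
        ]
        text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        image_inputs, video_inputs = process_vision_info(messages)
        inputs = processor(
            text=[text], images=image_inputs, videos=video_inputs,
            padding=True, return_tensors="pt",
        ).to(model.device)
        generated_ids = model.generate(**inputs, max_new_tokens=2048)
        trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
        chunks.append(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
    return "\n\n".join(chunks)
```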
|
|
|
|
|
# Intended Use |
|
|
|
|
|
* OCR-to-Markdown or HTML-Markdown conversion
|
|
* Document reconstruction for manuals, books, and research materials |
|
|
* Table extraction and structural transformation |
|
|
* Multi-page document retrieval and question answering
|
|
* Mathematical OCR and LaTeX generation |
|
|
* Form extraction and structured entity mapping (see the parsing sketch after this list)
|
|
* Documentation rebuilding for enterprise knowledge systems |
|
|
* Automation of digitization and archival systems |
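For form extraction and entity mapping, a practical pattern is to prompt for a machine-readable format (e.g. "Extract all form fields as a flat JSON object") and parse the reply defensively, since the model may wrap the JSON in a code fence or add surrounding prose. A minimal parsing sketch, with the prompt wording as an assumption:

```python
import json
import re

def parse_json_reply(reply: str) -> dict:
    """Pull the first JSON object out of a model reply, tolerating code fences."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        return {}
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
```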
|
|
|
|
|
# Limitations |
|
|
|
|
|
* Accuracy may drop on heavily damaged or very low-resolution images
|
|
* Reduced performance on very long document reasoning compared with larger VL models
|
|
* Language coverage varies for low-resource scripts
|
|
* Very complex forms may require secondary refinement |
|
|
|
|
|
## References |
|
|
|
|
|
* Qwen2.5-VL Technical Report
|
|
[https://huggingface.co/papers/2502.13923](https://huggingface.co/papers/2502.13923) |
|
|
|
|
|
* DocVLM: Make Your VLM an Efficient Reader
|
|
[https://arxiv.org/pdf/2412.08746v1](https://arxiv.org/pdf/2412.08746v1) |
|
|
|
|
|
* YaRN: Efficient Context Window Extension of Large Language Models
|
|
[https://arxiv.org/pdf/2309.00071](https://arxiv.org/pdf/2309.00071) |
|
|
|
|
|
* Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
|
|
[https://arxiv.org/pdf/2409.12191](https://arxiv.org/pdf/2409.12191) |
|
|
|
|
|
* Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
|
|
[https://arxiv.org/pdf/2308.12966](https://arxiv.org/pdf/2308.12966) |
|
|
|
|
|
* OCR Benchmark for Multimodal Models |
|
|
[https://arxiv.org/pdf/2412.02210](https://arxiv.org/pdf/2412.02210) |