---
license: apache-2.0
base_model: unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit
tags:
- vision
- ocr
- document-understanding
- qwen2.5-vl
- lora
- latex
- handwriting
- invoice
---
|
|
|
|
|
# CernisOCR |
|
|
|
|
|
A vision-language OCR model fine-tuned from Qwen2.5-VL-7B-Instruct that handles mathematical formulas, handwritten text, and structured documents in a single model.
|
|
|
|
|
## Model Description |
|
|
|
|
|
CernisOCR is a vision-language model optimized for diverse OCR tasks across multiple document domains. Unlike domain-specific OCR models, it unifies three traditionally separate OCR tasks in a single, efficient model:
|
|
|
|
|
- **Mathematical LaTeX conversion**: Converts handwritten or printed mathematical formulas to LaTeX notation |
|
|
- **Handwritten text transcription**: Transcribes cursive and printed handwriting |
|
|
- **Structured document extraction**: Extracts structured data from invoices and receipts |
|
|
|
|
|
**Key Features:** |
|
|
- Multi-domain capability in a single model |
|
|
- Handles varied image types, layouts, and text styles |
|
|
- Extracts both raw text and structured information |
|
|
- Robust to noise and variable image quality |
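Because the three capabilities live in one instruction-following model, the task is selected entirely by the prompt. The prompts used in this card's usage examples, one per domain (illustrative wording; the `PROMPTS` mapping below is just a convenience, not part of the model's API):

```python
# One prompt per supported domain; the task is chosen by the prompt alone.
PROMPTS = {
    "latex":       "Write the LaTeX representation for this image.",
    "handwriting": "Transcribe the handwritten text in this image.",
    "invoice":     "Extract and structure all text content from this invoice/receipt image.",
}
print(sorted(PROMPTS))  # ['handwriting', 'invoice', 'latex']
```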
|
|
|
|
|
## Training Details |
|
|
|
|
|
- **Base Model**: Qwen2.5-VL-7B-Instruct |
|
|
- **Training Data**: 9,999 samples across three domains:
|
|
- LaTeX OCR: 3,978 samples (mathematical notation) |
|
|
- Invoices & Receipts: 2,043 samples (structured documents) |
|
|
- Handwritten Text: 3,978 samples (handwriting transcription) |
|
|
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation) |
|
|
- **Training Loss**: Reduced from 4.802 to 0.116 (97.6% improvement) |
|
|
- **Training Time**: ~8.7 minutes on RTX 5090 |
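The reported improvement figure follows directly from the two loss values; a quick check of the arithmetic:

```python
# Relative reduction from the initial to the final training loss
initial_loss, final_loss = 4.802, 0.116
improvement = (initial_loss - final_loss) / initial_loss * 100
print(f"{improvement:.1f}%")  # 97.6%
```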
|
|
|
|
|
## Intended Use |
|
|
|
|
|
This model is designed for: |
|
|
- Mathematical formula recognition and LaTeX conversion |
|
|
- Handwritten text transcription |
|
|
- Invoice and receipt data extraction |
|
|
- Multi-domain document processing workflows |
|
|
- Applications requiring unified OCR across different document types |
|
|
|
|
|
## How to Use |
|
|
|
|
|
```python
from unsloth import FastVisionModel
from PIL import Image

# Load model and tokenizer (Unsloth returns both from a single call)
model, tokenizer = FastVisionModel.from_pretrained(
    "coolAI/cernis-ocr",  # or "coolAI/cernis-vision-ocr" for the merged model
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)

# Example 1: LaTeX conversion
image = Image.open("formula.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Write the LaTeX representation for this image."},
    ],
}]

# Example 2: Handwritten transcription
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Transcribe the handwritten text in this image."},
    ],
}]

# Example 3: Invoice extraction
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Extract and structure all text content from this invoice/receipt image."},
    ],
}]

# Generate: render the chat template to text, then tokenize it together with the image
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=2048, use_cache=True)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
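Note that `model.generate` returns the prompt tokens followed by the completion, so the decoded string above echoes the prompt; slicing off the prompt length keeps only the model's answer. A minimal, model-free sketch of the idea (plain lists stand in for the tensors `inputs["input_ids"][0]` and `outputs[0]`):

```python
# Stand-ins for the real tensors produced by the example above
prompt_ids = [101, 102, 103]              # tokens fed into generate()
output_ids = [101, 102, 103, 7, 8, 9]     # generate() echoes the prompt first
completion_ids = output_ids[len(prompt_ids):]  # keep only the new tokens
print(completion_ids)  # [7, 8, 9]
```

With the real tensors the same slice is `outputs[0][inputs["input_ids"].shape[-1]:]`, decoded with `tokenizer.decode(..., skip_special_tokens=True)`.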
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex
@misc{cernis-ocr,
  title={CernisOCR: A Unified Multi-Domain OCR Model},
  author={Cernis AI},
  year={2025},
  howpublished={\url{https://huggingface.co/coolAI/cernis-ocr}}
}
```
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
Built using [Unsloth](https://github.com/unslothai/unsloth) for efficient fine-tuning. Training data sourced from publicly available OCR datasets on Hugging Face. |
|
|
|
|
|
|