---
license: apache-2.0
base_model: unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit
tags:
- vision
- ocr
- document-understanding
- qwen2.5-vl
- lora
- latex
- handwriting
- invoice
---
|
|
|
|
|
# CernisOCR |
|
|
|
|
|
A vision-language OCR model fine-tuned from Qwen2.5-VL-7B-Instruct that handles mathematical formulas, handwritten text, and structured documents in a single model.
|
|
|
|
|
## Model Description |
|
|
|
|
|
CernisOCR is a vision-language model optimized for diverse OCR tasks across multiple document domains. Unlike domain-specific OCR models, it unifies three traditionally separate OCR tasks in a single, efficient model:
|
|
|
|
|
- **Mathematical LaTeX conversion**: Converts handwritten or printed mathematical formulas to LaTeX notation |
|
|
- **Handwritten text transcription**: Transcribes cursive and printed handwriting |
|
|
- **Structured document extraction**: Extracts structured data from invoices and receipts |
|
|
|
|
|
**Key Features:** |
|
|
- Multi-domain capability in a single model |
|
|
- Handles varied image types, layouts, and text styles |
|
|
- Extracts both raw text and structured information |
|
|
- Robust to noise and variable image quality |
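Because the three capabilities live in one instruction-following model, the task is selected entirely by the prompt. The prompts used in this card's usage examples, one per domain (illustrative wording; the `PROMPTS` mapping below is just a convenience, not part of the model's API):

```python
# One prompt per supported domain; the task is chosen by the prompt alone.
PROMPTS = {
    "latex":       "Write the LaTeX representation for this image.",
    "handwriting": "Transcribe the handwritten text in this image.",
    "invoice":     "Extract and structure all text content from this invoice/receipt image.",
}
print(sorted(PROMPTS))  # ['handwriting', 'invoice', 'latex']
```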
|
|
|
|
|
## Training Details |
|
|
|
|
|
- **Base Model**: Qwen2.5-VL-7B-Instruct |
|
|
- **Training Data**: 9,999 samples across three domains:
|
|
- LaTeX OCR: 3,978 samples (mathematical notation) |
|
|
- Invoices & Receipts: 2,043 samples (structured documents) |
|
|
- Handwritten Text: 3,978 samples (handwriting transcription) |
|
|
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation) |
|
|
- **Training Loss**: Reduced from 4.802 to 0.116 (97.6% improvement) |
|
|
- **Training Time**: ~8.7 minutes on RTX 5090 |
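The reported improvement figure follows directly from the two loss values; a quick check of the arithmetic:

```python
# Relative reduction from the initial to the final training loss
initial_loss, final_loss = 4.802, 0.116
improvement = (initial_loss - final_loss) / initial_loss * 100
print(f"{improvement:.1f}%")  # 97.6%
```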
|
|
|
|
|
## Intended Use |
|
|
|
|
|
This model is designed for: |
|
|
- Mathematical formula recognition and LaTeX conversion |
|
|
- Handwritten text transcription |
|
|
- Invoice and receipt data extraction |
|
|
- Multi-domain document processing workflows |
|
|
- Applications requiring unified OCR across different document types |
|
|
|
|
|
## How to Use |
|
|
|
|
|
```python
from unsloth import FastVisionModel
from PIL import Image

# Load model and tokenizer (Unsloth returns both from a single call)
model, tokenizer = FastVisionModel.from_pretrained(
    "coolAI/cernis-ocr",  # or "coolAI/cernis-vision-ocr" for the merged model
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)

# Example 1: LaTeX conversion
image = Image.open("formula.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Write the LaTeX representation for this image."},
    ],
}]

# Example 2: Handwritten transcription
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Transcribe the handwritten text in this image."},
    ],
}]

# Example 3: Invoice extraction
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Extract and structure all text content from this invoice/receipt image."},
    ],
}]

# Generate: render the chat template to text, then tokenize it together with the image
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=2048, use_cache=True)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
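Note that `model.generate` returns the prompt tokens followed by the completion, so the decoded string above echoes the prompt; slicing off the prompt length keeps only the model's answer. A minimal, model-free sketch of the idea (plain lists stand in for the tensors `inputs["input_ids"][0]` and `outputs[0]`):

```python
# Stand-ins for the real tensors produced by the example above
prompt_ids = [101, 102, 103]              # tokens fed into generate()
output_ids = [101, 102, 103, 7, 8, 9]     # generate() echoes the prompt first
completion_ids = output_ids[len(prompt_ids):]  # keep only the new tokens
print(completion_ids)  # [7, 8, 9]
```

With the real tensors the same slice is `outputs[0][inputs["input_ids"].shape[-1]:]`, decoded with `tokenizer.decode(..., skip_special_tokens=True)`.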
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex
@misc{cernis-ocr,
  title={CernisOCR: A Unified Multi-Domain OCR Model},
  author={Cernis AI},
  year={2025},
  howpublished={\url{https://huggingface.co/coolAI/cernis-ocr}}
}
```
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
Built using [Unsloth](https://github.com/unslothai/unsloth) for efficient fine-tuning. Training data sourced from publicly available OCR datasets on Hugging Face. |
|
|
|
|
|
|