---
license: apache-2.0
base_model: unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit
tags:
- vision
- ocr
- document-understanding
- qwen2.5-vl
- lora
- latex
- handwriting
- invoice
---

# CernisOCR

A vision-language OCR model fine-tuned from Qwen2.5-VL-7B-Instruct to handle mathematical formulas, handwritten text, and structured documents in a single model.

## Model Description

CernisOCR is a vision-language model optimized for diverse OCR tasks across multiple document domains. Unlike domain-specific OCR models, CernisOCR unifies three traditionally separate OCR tasks into a single, efficient model:

- **Mathematical LaTeX conversion**: converts handwritten or printed mathematical formulas to LaTeX notation
- **Handwritten text transcription**: transcribes cursive and printed handwriting
- **Structured document extraction**: extracts structured data from invoices and receipts

**Key Features:**

- Multi-domain capability in a single model
- Handles varied image types, layouts, and text styles
- Extracts both raw text and structured information
- Robust to noise and variable image quality

## Training Details

- **Base Model**: Qwen2.5-VL-7B-Instruct
- **Training Data**: 10,000 samples from three domains:
  - LaTeX OCR: 3,978 samples (mathematical notation)
  - Invoices & Receipts: 2,043 samples (structured documents)
  - Handwritten Text: 3,978 samples (handwriting transcription)
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation)
- **Training Loss**: reduced from 4.802 to 0.116 (a 97.6% reduction)
- **Training Time**: ~8.7 minutes on an RTX 5090

## Intended Use

This model is designed for:

- Mathematical formula recognition and LaTeX conversion
- Handwritten text transcription
- Invoice and receipt data extraction
- Multi-domain document processing workflows
- Applications requiring unified OCR across different document types

## How to Use

```python
from unsloth import FastVisionModel
from PIL import Image

# Load model and tokenizer
model, tokenizer = FastVisionModel.from_pretrained(
    "coolAI/cernis-ocr",  # or "coolAI/cernis-vision-ocr" for the merged model
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)

# Example 1: LaTeX conversion
image = Image.open("formula.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Write the LaTeX representation for this image."},
    ],
}]

# Example 2: handwritten transcription (same structure, different prompt)
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Transcribe the handwritten text in this image."},
    ],
}]

# Example 3: invoice extraction
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Extract and structure all text content from this invoice/receipt image."},
    ],
}]

# Generate (each assignment above overwrites `messages`, so only the last prompt runs here)
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(
    image,                     # the image must be passed alongside the templated text
    input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=2048, temperature=0.7)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
```

## Citation

If you use this model, please cite:

```bibtex
@misc{cernis-ocr,
  title={CernisOCR: A Unified Multi-Domain OCR Model},
  author={Cernis AI},
  year={2025},
  howpublished={\url{https://huggingface.co/coolAI/cernis-ocr}}
}
```

## Acknowledgments

Built using [Unsloth](https://github.com/unslothai/unsloth) for efficient fine-tuning. Training data sourced from publicly available OCR datasets on Hugging Face.
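For the invoice/receipt prompt, the model returns plain text. Below is a minimal post-processing sketch that collects the output into a dict, assuming the model emits one `Key: Value` field per line; that format (and the `parse_invoice_fields` helper name) is an illustrative assumption, not a documented guarantee of the model's output.

```python
import re

def parse_invoice_fields(text: str) -> dict:
    """Collect `Key: Value` lines from model output into a dict.

    Assumes one field per line; lines without a colon are skipped.
    The field-per-line format is an assumption about the model's
    output, not a documented guarantee.
    """
    fields = {}
    for line in text.splitlines():
        match = re.match(r"\s*([^:]+?)\s*:\s*(.+)", line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields

# Hypothetical model response for illustration:
sample = "Vendor: Acme Corp\nInvoice Number: INV-1042\nTotal: 42.00"
parsed = parse_invoice_fields(sample)
# parsed["Vendor"] -> "Acme Corp"
```

In practice you would pass the decoded `text` from the generation step above instead of the hard-coded sample, and validate the resulting fields before use.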