# FullyOCR-2 Model Card
FullyOCR-2 is a specialized Vision-Language Model (VLM) optimized for high-fidelity Optical Character Recognition. Built using the Unsloth framework, it excels at transforming document images into structured Markdown, preserving headers, lists, and layout formatting with high efficiency.
## Model Details
- Developed by: sapkotapraful
- Model Type: Vision-Language Model (VLM)
- Base Architecture: FastVisionModel (Unsloth-optimized)
- License: Apache 2.0
- Finetuning Trigger: `<|MD|>`
## Intended Use
### Primary Use Case
FullyOCR-2 is designed for Document-to-Markdown conversion. It is ideal for:
- Extracting text from complex layouts (multi-column documents).
- Converting handwritten or typed notes into digital formats.
- Preserving structural elements like bold text, headers, and bullet points.
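For a concrete sense of the target, here is a hypothetical Markdown string of the kind FullyOCR-2 is expected to produce for a scanned memo (illustrative only, not actual model output):

```python
# Hypothetical Markdown output for a scanned memo (illustrative only;
# not actual model output).
expected_markdown = """# Quarterly Review

**Date:** 2024-03-01

- Revenue grew quarter over quarter
- Two new hires joined the tooling team
"""

# Structural elements survive as Markdown syntax rather than flat text.
assert expected_markdown.startswith("# ")
assert "**Date:**" in expected_markdown
```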
## Technical Specifications
### Optimization
The model utilizes Unsloth optimizations to achieve:
- 4-bit Quantization: Native support for `bitsandbytes`, allowing the model to run on GPUs with limited VRAM (e.g., 8-12 GB).
- Inference Speed: Significantly reduced latency during token generation compared to standard Transformers.
- Memory Efficiency: Efficient KV caching for long document processing.
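As a rough back-of-the-envelope check (our own sketch, not a figure from the model card): 4-bit weights cost about half a byte per parameter, which is why the quoted 8-12 GB budget is plausible even for multi-billion-parameter models.

```python
def estimate_weight_vram_gb(n_params: int, bits_per_weight: int = 4) -> float:
    """Rough lower bound on VRAM for the weights alone, in GiB.

    Ignores activations, the KV cache, and quantization overhead,
    so real usage will be somewhat higher.
    """
    return n_params * bits_per_weight / 8 / (1024 ** 3)

# A hypothetical 7B-parameter model in 4-bit: roughly 3.3 GiB of weights,
# leaving headroom on an 8 GB GPU for activations and the KV cache.
print(round(estimate_weight_vram_gb(7_000_000_000), 1))  # 3.3
```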
### Training Details
- Prompt Template: Chat-based vision template.
- Instruction Token: The model specifically looks for the `<|MD|>` token to trigger the OCR extraction flow.
## Installation
Ensure you have the latest versions of the Unsloth ecosystem installed:

```bash
pip install unsloth torch pillow
```
## Quick Start (Inference)
```python
from unsloth import FastVisionModel
import torch
from PIL import Image

# 1. Load the model + tokenizer (4-bit weights via bitsandbytes)
model, tokenizer = FastVisionModel.from_pretrained(
    "sapkotapraful/FullyOCR-2",
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)  # switch Unsloth into inference mode

# 2. Set up the device and load the document image
device = "cuda" if torch.cuda.is_available() else "cpu"
if device == "cuda":
    model = model.to(device)
image = Image.open("document.jpg")

# 3. Prepare the prompt using the <|MD|> trigger token
instruction = "<|MD|>"
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction},
    ]},
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# 4. Generate (greedy decoding, KV cache enabled)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to(device)
with torch.no_grad(), torch.amp.autocast(device_type="cuda", enabled=(device == "cuda")):
    output_ids = model.generate(
        **inputs,
        max_new_tokens=1024,
        use_cache=True,
        num_beams=1,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
    )

# 5. Extract the Markdown that follows the trigger token
decoded = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
extracted = decoded.split(instruction)[-1].strip()
print(extracted)
```
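Note that step 5 assumes the `<|MD|>` trigger survives decoding; if it is registered as a special token, `skip_special_tokens=True` may strip it. A slightly more defensive extraction can be sketched as follows (`extract_markdown` is our own helper name, not part of the model's API):

```python
def extract_markdown(decoded: str, instruction: str = "<|MD|>") -> str:
    """Return the text after the last trigger token, or the whole
    (stripped) string if the token was removed during decoding."""
    idx = decoded.rfind(instruction)
    if idx == -1:
        return decoded.strip()
    return decoded[idx + len(instruction):].strip()

print(extract_markdown("user prompt <|MD|># Hello"))  # prints "# Hello"
print(extract_markdown("# Hello"))                    # token already stripped
```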
## Performance & Limitations
### Strengths
- Format Integrity: Specifically tuned to output valid Markdown.
- Resource Friendly: Can be deployed on consumer hardware using 4-bit weights.
### Limitations
- Resolution Sensitivity: Very small text on low-resolution scans may result in "hallucinated" characters.
- Context Window: Designed for single-page extraction; very long documents should be processed page-by-page.
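Since long documents should be processed page-by-page, the per-page outputs need stitching back together. A minimal sketch (the `merge_pages` helper is hypothetical; each list element stands in for one page's extracted Markdown):

```python
def merge_pages(page_markdowns: list[str], separator: str = "\n\n---\n\n") -> str:
    """Join per-page Markdown outputs with a horizontal rule,
    skipping pages that came back blank."""
    return separator.join(md.strip() for md in page_markdowns if md.strip())

# Each element stands in for one page's model output.
pages = ["# Page 1\n- intro bullet", "", "## Page 2\nMore detail."]
print(merge_pages(pages))
```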
## Ethical Considerations
When using FullyOCR-2, ensure compliance with data privacy laws (such as the GDPR) when processing documents that contain Personally Identifiable Information (PII).