FullyOCR-2 Model Card

FullyOCR-2 is a specialized Vision-Language Model (VLM) optimized for high-fidelity Optical Character Recognition. Built using the Unsloth framework, it excels at transforming document images into structured Markdown, preserving headers, lists, and layout formatting with high efficiency.

Model Details

  • Developed by: sapkotapraful
  • Model Type: Vision-Language Model (VLM)
  • Base Architecture: FastVisionModel (Unsloth-optimized)
  • License: Apache 2.0
  • Finetuning Trigger: <|MD|>

Intended Use

Primary Use Case

FullyOCR-2 is designed for Document-to-Markdown conversion. It is ideal for:

  • Extracting text from complex layouts (multi-column documents).
  • Converting handwritten or typed notes into digital formats.
  • Preserving structural elements like bold text, headers, and bullet points.
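An illustrative (hypothetical, not actual model output) example of the kind of structured Markdown the model targets when given a scanned memo:

```
# Quarterly Report

**Summary:** Revenue grew in all regions.

- North America
- Europe
- Asia-Pacific
```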

Technical Specifications

Optimization

The model utilizes Unsloth optimizations to achieve:

  • 4-bit Quantization: Native support for bitsandbytes to run on GPUs with limited VRAM (e.g., 8-12 GB).
  • Inference Speed: Significantly reduced latency during token generation compared to standard Transformers.
  • Memory Efficiency: Efficient KV caching for long document processing.
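To make the VRAM savings concrete, here is a rough back-of-the-envelope estimate of weight memory at different bit widths. The 7B parameter count is an assumption for illustration; the model card does not state FullyOCR-2's exact size, and this excludes activations and the KV cache.

```python
# Approximate weight memory for a model at a given quantization level.
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Return approximate weight memory in GiB (weights only)."""
    return num_params * bits_per_param / 8 / (1024 ** 3)

params = 7_000_000_000  # hypothetical 7B-parameter model
print(f"fp16: {weight_memory_gb(params, 16):.1f} GiB")  # fp16: 13.0 GiB
print(f"4-bit: {weight_memory_gb(params, 4):.1f} GiB")  # 4-bit: 3.3 GiB
```

At 4 bits per weight, a model of this scale fits comfortably in the 8-12 GB VRAM range mentioned above.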

Training Details

  • Prompt Template: Chat-based vision template.
  • Instruction Token: The model specifically looks for the <|MD|> token to trigger the OCR extraction flow.
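A minimal sketch of how the <|MD|> instruction token slots into the chat-based vision template. The message structure mirrors the Quick Start below; the actual prompt string is rendered by tokenizer.apply_chat_template at inference time.

```python
# Build the chat messages that trigger the OCR extraction flow.
INSTRUCTION_TOKEN = "<|MD|>"

def build_messages(instruction: str = INSTRUCTION_TOKEN) -> list:
    """Return a single-turn vision message carrying the instruction token."""
    return [
        {"role": "user", "content": [
            {"type": "image"},                      # image placeholder slot
            {"type": "text", "text": instruction},  # OCR trigger token
        ]}
    ]

messages = build_messages()
print(messages[0]["content"][1]["text"])  # <|MD|>
```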

Installation

Ensure you have the latest versions of the Unsloth ecosystem installed:

pip install unsloth torch pillow

Quick Start (Inference)

from unsloth import FastVisionModel
import torch
from PIL import Image

# 1. Load model + tokenizer (4-bit weights for low-VRAM GPUs)
model, tokenizer = FastVisionModel.from_pretrained(
    "sapkotapraful/FullyOCR-2",
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)  # enable Unsloth's fast inference mode
model.eval()

# 2. Select device and load the input image
device = "cuda" if torch.cuda.is_available() else "cpu"
if device == "cuda":
    model = model.to(device)
image = Image.open("document.jpg")

# 3. Prepare Prompt
instruction = "<|MD|>"
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]}
]

input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# 4. Generate
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to(device)

with torch.no_grad(), torch.amp.autocast(device_type="cuda", enabled=(device=="cuda")):
    output_ids = model.generate(
        **inputs,
        max_new_tokens=1024,
        use_cache=True,
        num_beams=1,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
    )

# 5. Extract Markdown (decode only the newly generated tokens,
#    so the prompt and instruction token are excluded)
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
extracted = tokenizer.batch_decode(new_tokens, skip_special_tokens=True)[0].strip()
print(extracted)

Performance & Limitations

Strengths

  • Format Integrity: Specifically tuned to output valid Markdown.
  • Resource Friendly: Can be deployed on consumer hardware using 4-bit weights.

Limitations

  • Resolution Sensitivity: Very small text on low-resolution scans may result in "hallucinated" characters.
  • Context Window: Designed for single-page extraction; very long documents should be processed page-by-page.
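Because the model targets single-page extraction, multi-page documents are best handled with a simple page loop. A sketch, where ocr_page is a placeholder for the Quick Start generate-and-decode logic (stubbed out here for illustration):

```python
from pathlib import Path

def ocr_page(image_path: Path) -> str:
    # Placeholder: run the Quick Start inference on one page image
    # and return the extracted Markdown for that page.
    return f"## {image_path.name}"

def ocr_document(page_paths: list) -> str:
    """Run OCR per page and join the results with horizontal rules."""
    return "\n\n---\n\n".join(ocr_page(p) for p in page_paths)

pages = [Path("page_1.jpg"), Path("page_2.jpg")]
print(ocr_document(pages))
```

Keeping each call to a single page also keeps the generated sequence well within the model's context window.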

Ethical Considerations

When using FullyOCR-2, ensure compliance with data-privacy laws (such as the GDPR) when processing documents that contain Personally Identifiable Information (PII).
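One practical safeguard is to post-process the extracted Markdown before storing it. The snippet below is a simple illustration that masks e-mail addresses; the regex is not an exhaustive PII detector, and real deployments should use a dedicated PII-scrubbing pipeline.

```python
import re

# Naive e-mail pattern for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def mask_emails(markdown: str) -> str:
    """Replace e-mail addresses in extracted Markdown with a redaction marker."""
    return EMAIL_RE.sub("[REDACTED EMAIL]", markdown)

print(mask_emails("Contact: jane.doe@example.com"))  # Contact: [REDACTED EMAIL]
```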
