FullyOCR-2 Model Card

FullyOCR-2 is a specialized Vision-Language Model (VLM) optimized for high-fidelity Optical Character Recognition. Built using the Unsloth framework, it excels at transforming document images into structured Markdown, preserving headers, lists, and layout formatting with high efficiency.

Model Details

  • Developed by: sapkotapraful
  • Model Type: Vision-Language Model (VLM)
  • Base Architecture: FastVisionModel (Unsloth-optimized)
  • License: Apache 2.0
  • Finetuning Trigger: <|MD|>

Intended Use

Primary Use Case

FullyOCR-2 is designed for Document-to-Markdown conversion. It is ideal for:

  • Extracting text from complex layouts (multi-column documents).
  • Converting handwritten or typed notes into digital formats.
  • Preserving structural elements like bold text, headers, and bullet points.
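An illustrative (hypothetical, not actual model output) example of the kind of structured Markdown the model targets when given a scanned memo:

```
# Quarterly Report

**Summary:** Revenue grew in all regions.

- North America
- Europe
- Asia-Pacific
```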

Technical Specifications

Optimization

The model utilizes Unsloth optimizations to achieve:

  • 4-bit Quantization: Native support for bitsandbytes to run on GPUs with limited VRAM (e.g., 8-12 GB).
  • Inference Speed: Significantly reduced latency during token generation compared to standard Transformers.
  • Memory Efficiency: Efficient KV caching for long document processing.
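To make the VRAM savings concrete, here is a rough back-of-the-envelope estimate of weight memory at different bit widths. The 7B parameter count is an assumption for illustration; the model card does not state FullyOCR-2's exact size, and this excludes activations and the KV cache.

```python
# Approximate weight memory for a model at a given quantization level.
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Return approximate weight memory in GiB (weights only)."""
    return num_params * bits_per_param / 8 / (1024 ** 3)

params = 7_000_000_000  # hypothetical 7B-parameter model
print(f"fp16: {weight_memory_gb(params, 16):.1f} GiB")  # fp16: 13.0 GiB
print(f"4-bit: {weight_memory_gb(params, 4):.1f} GiB")  # 4-bit: 3.3 GiB
```

At 4 bits per weight, a model of this scale fits comfortably in the 8-12 GB VRAM range mentioned above.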

Training Details

  • Prompt Template: Chat-based vision template.
  • Instruction Token: The model specifically looks for the <|MD|> token to trigger the OCR extraction flow.
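A minimal sketch of how the <|MD|> instruction token slots into the chat-based vision template. The message structure mirrors the Quick Start below; the actual prompt string is rendered by tokenizer.apply_chat_template at inference time.

```python
# Build the chat messages that trigger the OCR extraction flow.
INSTRUCTION_TOKEN = "<|MD|>"

def build_messages(instruction: str = INSTRUCTION_TOKEN) -> list:
    """Return a single-turn vision message carrying the instruction token."""
    return [
        {"role": "user", "content": [
            {"type": "image"},                      # image placeholder slot
            {"type": "text", "text": instruction},  # OCR trigger token
        ]}
    ]

messages = build_messages()
print(messages[0]["content"][1]["text"])  # <|MD|>
```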

Installation

Ensure you have the latest versions of the Unsloth ecosystem installed:

pip install unsloth torch pillow

Quick Start (Inference)

from unsloth import FastVisionModel
import torch
from PIL import Image

# 1. Load model + tokenizer (4-bit weights for low-VRAM GPUs)
model, tokenizer = FastVisionModel.from_pretrained(
    "sapkotapraful/FullyOCR-2",
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)  # enable Unsloth's fast inference mode
model.eval()

# 2. Select device and load the input image
device = "cuda" if torch.cuda.is_available() else "cpu"
if device == "cuda":
    model = model.to(device)
image = Image.open("document.jpg")

# 3. Prepare Prompt
instruction = "<|MD|>"
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]}
]

input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# 4. Generate
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to(device)

with torch.no_grad(), torch.amp.autocast(device_type="cuda", enabled=(device=="cuda")):
    output_ids = model.generate(
        **inputs,
        max_new_tokens=1024,
        use_cache=True,
        num_beams=1,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
    )

# 5. Extract Markdown (decode only the newly generated tokens,
#    so the prompt and instruction token are excluded)
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
extracted = tokenizer.batch_decode(new_tokens, skip_special_tokens=True)[0].strip()
print(extracted)

Performance & Limitations

Strengths

  • Format Integrity: Specifically tuned to output valid Markdown.
  • Resource Friendly: Can be deployed on consumer hardware using 4-bit weights.

Limitations

  • Resolution Sensitivity: Very small text on low-resolution scans may result in "hallucinated" characters.
  • Context Window: Designed for single-page extraction; very long documents should be processed page-by-page.
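Because the model targets single-page extraction, multi-page documents are best handled with a simple page loop. A sketch, where ocr_page is a placeholder for the Quick Start generate-and-decode logic (stubbed out here for illustration):

```python
from pathlib import Path

def ocr_page(image_path: Path) -> str:
    # Placeholder: run the Quick Start inference on one page image
    # and return the extracted Markdown for that page.
    return f"## {image_path.name}"

def ocr_document(page_paths: list) -> str:
    """Run OCR per page and join the results with horizontal rules."""
    return "\n\n---\n\n".join(ocr_page(p) for p in page_paths)

pages = [Path("page_1.jpg"), Path("page_2.jpg")]
print(ocr_document(pages))
```

Keeping each call to a single page also keeps the generated sequence well within the model's context window.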

Ethical Considerations

When using FullyOCR-2, ensure compliance with data-privacy laws (such as the GDPR) when processing documents that contain Personally Identifiable Information (PII).
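One practical safeguard is to post-process the extracted Markdown before storing it. The snippet below is a simple illustration that masks e-mail addresses; the regex is not an exhaustive PII detector, and real deployments should use a dedicated PII-scrubbing pipeline.

```python
import re

# Naive e-mail pattern for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def mask_emails(markdown: str) -> str:
    """Replace e-mail addresses in extracted Markdown with a redaction marker."""
    return EMAIL_RE.sub("[REDACTED EMAIL]", markdown)

print(mask_emails("Contact: jane.doe@example.com"))  # Contact: [REDACTED EMAIL]
```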
