Sarashina2.2-OCR:
End-to-end OCR Model for Japanese Document Parsing

About the model

Sarashina2.2-OCR is an end-to-end 3B-parameter OCR model developed by SB Intuitions, specifically tailored for parsing Japanese and English documents.

The model is refined by human preference optimization to enable intuitive document parsing, and excels at converting a wide range of documents—including vertical Japanese text—into Markdown format while maintaining a natural reading order.

Key Features

Beyond standard text extraction, Sarashina2.2-OCR reconstructs documents into naturally structured Markdown, accurately converting complex elements into the following formats:

  • 📊 Tables: Reconstructs tabular layouts into plain HTML format.

  • 📐 Math Formulas: Transcribes mathematical equations directly into standard LaTeX format.

  • 🖼️ Graphics: Detects visual components (e.g., images, charts) and indicates their positions using bounding boxes in the format <bbox>[(x1, y1), (x2, y2)]</bbox>, using normalized integer coordinates (0–1000) with a top-left origin.
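
Given the documented `<bbox>[(x1, y1), (x2, y2)]</bbox>` format, the normalized coordinates can be mapped back to pixel space with a small helper. The regex and function name below are illustrative, not part of the model's API:

```python
import re

def bbox_to_pixels(ocr_output: str, width: int, height: int):
    """Convert <bbox>[(x1, y1), (x2, y2)]</bbox> tags (normalized 0-1000,
    top-left origin) into pixel-space (x1, y1, x2, y2) rectangles."""
    pattern = r"<bbox>\[\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\)\]</bbox>"
    boxes = []
    for x1, y1, x2, y2 in re.findall(pattern, ocr_output):
        boxes.append((
            int(x1) * width // 1000,
            int(y1) * height // 1000,
            int(x2) * width // 1000,
            int(y2) * height // 1000,
        ))
    return boxes

# Hypothetical output snippet for a 2000x1000-pixel page
sample = "Some text <bbox>[(100, 200), (500, 800)]</bbox> more text"
print(bbox_to_pixels(sample, width=2000, height=1000))  # [(200, 200, 1000, 800)]
```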

Training Summary

Sarashina2.2-OCR integrates a SigLIP2-based vision encoder with the Sarashina2.2-3B-Instruct language model, and was trained through the following pipeline:

1. Pre-training on a general image-text dataset:

To build basic image understanding and prepare the model for the high-resolution document parsing that drives downstream OCR performance, we expanded Sarashina2.2-Vision-3B's pre-training into the following four-substage pipeline, upscaling the input resolution to ~2.5M pixels:

  1. 🔌 Projector warmup: bridging the gap between the embedding spaces of the LLM and vision encoder.

  2. 👁️ Vision encoder pre-training: enhancing image comprehension, especially for understanding Japan-specific images and text.

  3. 🔥 Full-parameter pre-training: enhancing the model's unified understanding of images and language using interleaved data.

  4. 🔍 High-resolution continual pre-training: expanding the maximum resolution to capture fine-grained details and dense text in complex documents.

We also mixed in a large amount of OCR and grounding data from the start to build basic document understanding early on.

2. Supervised fine-tuning (SFT) on a large-scale OCR dataset:

We fine-tuned the model on diverse Japanese and English documents so it can recognize complex layouts and output naturally structured Markdown. We heavily used synthetic datasets alongside open OCR data, keeping the resolution at ~2.5M pixels.

3. Preference Optimization with manually annotated OCR datasets:

Finally, we applied Mixed Preference Optimization (MPO) to achieve more natural reading order understanding for documents with complex layouts. We identified two major challenges when applying it to long-context OCR tasks:

  1. 💸 High human-annotation costs:
    High-quality OCR data with dense, complex layouts requires a lot of time to annotate manually, making large-scale data collection difficult.

  2. 🧩 Lack of effective data for late-stage errors:
    A common approach to prepare negative examples is to sample directly from the model. In long-context OCR tasks, an early mistake in reading order can corrupt the entire remaining prediction, making it difficult to obtain preference pairs that target late-stage errors.

To overcome these, we augmented the negative side of preference pairs by feeding the model the first N ground-truth paragraphs and having it complete the remainder of the document. For each human-annotated sample, we applied this at every paragraph position N, yielding multiple negative examples per document.
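
The prefix-based augmentation can be sketched as follows. This is a minimal illustration, assuming paragraphs are separated by blank lines; the actual delimiter and prompt construction in the training pipeline are not specified here:

```python
def prefix_prompts(ground_truth_markdown: str) -> list[str]:
    """Sketch of the negative-example augmentation: for each cut point N,
    the model is fed the first N ground-truth paragraphs and asked to
    complete the rest; its completion becomes the rejected response.
    The blank-line paragraph delimiter here is an assumption."""
    paragraphs = ground_truth_markdown.split("\n\n")
    # One prompt per cut point N: keep at least one paragraph,
    # leave at least one for the model to complete.
    return ["\n\n".join(paragraphs[:n]) for n in range(1, len(paragraphs))]

doc = "# Title\n\nFirst paragraph.\n\nSecond paragraph.\n\nThird paragraph."
prefixes = prefix_prompts(doc)
print(len(prefixes))  # 3 cut points for a 4-paragraph document
```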

Benchmark Results

Sarashina2.2-OCR delivers highly competitive overall performance among end-to-end OCR models, achieving the best scores on VJRODa despite its compact 3B-parameter size, while remaining competitive in the Math and Table categories of olmOCR-bench.

VJRODa

VJRODa evaluates OCR capabilities for Japanese documents, particularly focusing on complex layouts and vertical text reading order.

| Model | CER (↓) | BLEU (↑) |
|---|---|---|
| gpt-5-mini-2025-08-07 | 72.4 | 23.6 |
| Qwen3.5-4B (non-thinking) | 86.1 | 47.8 |
| KARAKURI VL 32B Instruct 2507 | 280 | 14.1 |
| LightOnOCR-2-1B | 158 | 28.9 |
| dots.ocr | 40.1 | 71.5 |
| Sarashina2.2-OCR | 22.6 | 79.9 |
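
For reference, CER is conventionally the Levenshtein edit distance between prediction and reference, normalized by reference length (which is why it can exceed 100). The exact normalization used by VJRODa may differ; this is the standard textbook definition:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate (%): Levenshtein distance between hypothesis
    and reference, divided by the reference length."""
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution
        prev = curr
    return 100.0 * prev[n] / m

print(cer("日本語の文書", "日本語の文章"))  # one substitution over 6 characters
```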

olmOCR-bench

A comprehensive benchmark designed to evaluate document parsing capabilities across diverse and complex structures, such as mathematical equations, multi-column layouts, and tables.

| Model | arXiv Math | Headers Footers | Long Tiny Text | Multi Column | Old Scans | Old Scans Math | Table Tests | Overall |
|---|---|---|---|---|---|---|---|---|
| LightOnOCR-2-1B | 0.890 | 0.204 | 0.889 | 0.848 | 0.418 | 0.865 | 0.887 | 0.773 |
| dots.ocr | 0.674 | 0.849 | 0.921 | 0.803 | 0.411 | 0.603 | 0.823 | 0.722 |
| Sarashina2.2-OCR | 0.778 | 0.291 | 0.846 | 0.752 | 0.323 | 0.528 | 0.829 | 0.683 |

Usage with Transformers

```sh
pip install transformers==4.57.1 torch torchvision pillow protobuf sentencepiece accelerate
```

```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, TextStreamer, set_seed

# Load model and processor
model_path = "sbintuitions/sarashina2.2-ocr"
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
set_seed(42)

# Prepare inputs
image_url = "https://huggingface.co/sbintuitions/sarashina2.2-ocr/resolve/main/assets/sample1.jpeg"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
message = [
    {
        "role": "user",
        "content": [{"type": "image", "image": image}],
    }
]
inputs = processor.apply_chat_template(
    message,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)
streamer = TextStreamer(processor, skip_prompt=True, skip_special_tokens=True)

# Generate outputs in streaming mode
output_ids = model.generate(
    **inputs,
    max_new_tokens=6000,
    temperature=0.0,
    top_p=0.95,
    repetition_penalty=1.2,
    use_cache=True,
    streamer=streamer,
)
```

Examples

1. Vertical Japanese document parsing

Input image OCR result

The following image visualizes the output bounding boxes in red:

Detected figures

*https://warp.ndl.go.jp/web/20200609034301/www.town.suo-oshima.lg.jp/data/open/cnt/3/669/1/R01.10.P11.pdf?20200413190829
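
A visualization like the one above can be reproduced with a few lines of Pillow, given pixel-space boxes. The helper below is a sketch; the function name and box source are illustrative:

```python
from PIL import Image, ImageDraw

def draw_bboxes(image: Image.Image, boxes) -> Image.Image:
    """Draw pixel-space (x1, y1, x2, y2) rectangles in red on a copy of the image."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for box in boxes:
        draw.rectangle(box, outline="red", width=3)
    return annotated

# Example with a blank canvas standing in for a document page
img = Image.new("RGB", (400, 300), "white")
out = draw_bboxes(img, [(40, 30, 200, 150)])
```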

2. Complex business slide parsing

Input image OCR result

*https://www.aec.go.jp/kettei/kettei/20230220_3.pdf

3. Tabular layout parsing

Input image OCR result

*https://mhcc.maryland.gov/mhcc/pages/home/workgroups/documents/cardiac/Standing%20Advisory%20Committee%20Members%209-20-19.pdf

4. Mathematical formula parsing

Input image OCR result

*https://arxiv.org/pdf/2503.09208

LICENSE

MIT License

Citation

```bibtex
@misc{sarashinaOCR2026,
  title  = {Sarashina2.2-OCR: End-to-end OCR Model for Japanese Document Parsing},
  author = {Takumi Takada and Toshiyuki Tanaka and Kohei Uehara and Mikihiro Tanaka and Alexis Vallet and Aman Jain and Ryuichiro Hataya and Seitaro Shinagawa and Yuto Imai and Teppei Suzuki},
  year   = {2026},
  url    = {https://huggingface.co/sbintuitions/sarashina2.2-ocr}
}
```