Sarashina2.2-OCR:
End-to-end OCR Model for Japanese Document Parsing

About the model

Sarashina2.2-OCR is an end-to-end 3B-parameter OCR model developed by SB Intuitions, specifically tailored for parsing Japanese and English documents.

The model is refined by human preference optimization to enable intuitive document parsing, and excels at converting a wide range of documents—including vertical Japanese text—into Markdown format while maintaining a natural reading order.

Key Features

Beyond standard text extraction, Sarashina2.2-OCR reconstructs documents into naturally structured Markdown, accurately converting complex elements into the following formats:

  • 📊 Tables: Reconstructs tabular layouts into plain HTML format.

  • 📐 Math Formulas: Transcribes mathematical equations directly into standard LaTeX format.

  • 🖼️ Graphics: Detects visual components (e.g., images, charts) and indicates their positions using bounding boxes in the format <bbox>[(x1, y1), (x2, y2)]</bbox>, using normalized integer coordinates (0–1000) with a top-left origin.
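
Given the documented `<bbox>[(x1, y1), (x2, y2)]</bbox>` format, the normalized coordinates can be mapped back to pixel space with a small helper. The regex and function name below are illustrative, not part of the model's API:

```python
import re

def bbox_to_pixels(ocr_output: str, width: int, height: int):
    """Convert <bbox>[(x1, y1), (x2, y2)]</bbox> tags (normalized 0-1000,
    top-left origin) into pixel-space (x1, y1, x2, y2) rectangles."""
    pattern = r"<bbox>\[\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\)\]</bbox>"
    boxes = []
    for x1, y1, x2, y2 in re.findall(pattern, ocr_output):
        boxes.append((
            int(x1) * width // 1000,
            int(y1) * height // 1000,
            int(x2) * width // 1000,
            int(y2) * height // 1000,
        ))
    return boxes

# Hypothetical output snippet for a 2000x1000-pixel page
sample = "Some text <bbox>[(100, 200), (500, 800)]</bbox> more text"
print(bbox_to_pixels(sample, width=2000, height=1000))  # [(200, 200, 1000, 800)]
```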

Training Summary

Sarashina2.2-OCR integrates a SigLIP2-based vision encoder with the Sarashina2.2-3B-Instruct language model, and was trained through the following pipeline:

1. Pre-training on a general image-text dataset:

To build basic image understanding and prepare the model for the high-resolution document parsing that drives downstream OCR performance, we expanded Sarashina2.2-Vision-3B's pre-training into the following four-substage pipeline, upscaling the input resolution to ~2.5M pixels:

  1. 🔌 Projector warmup: bridging the gap between the embedding spaces of the LLM and vision encoder.

  2. 👁️ Vision encoder pre-training: enhancing image comprehension, especially for understanding Japan-specific images and text.

  3. 🔥 Full-parameter pre-training: enhancing the model's unified understanding of images and language using interleaved data.

  4. 🔍 High-resolution continual pre-training: expanding the maximum resolution to capture fine-grained details and dense text in complex documents.

We also mixed in a large amount of OCR and grounding data from the start to build basic document understanding early on.

2. Supervised fine-tuning (SFT) on a large-scale OCR dataset:

We fine-tuned the model on diverse Japanese and English documents so it can recognize complex layouts and output naturally structured Markdown. We heavily used synthetic datasets alongside open OCR data, keeping the resolution at ~2.5M pixels.

3. Preference Optimization with manually annotated OCR datasets:

Finally, we applied Mixed Preference Optimization (MPO) to achieve more natural reading order understanding for documents with complex layouts. We identified two major challenges when applying it to long-context OCR tasks:

  1. 💸 High human-annotation costs:
    High-quality OCR data with dense, complex layouts requires a lot of time to annotate manually, making large-scale data collection difficult.

  2. 🧩 Lack of effective data for late-stage errors:
    A common approach to prepare negative examples is to sample directly from the model. In long-context OCR tasks, an early mistake in reading order can corrupt the entire remaining prediction, making it difficult to obtain preference pairs that target late-stage errors.

To overcome these, we augmented the negative side of preference pairs by feeding the model the first N ground-truth paragraphs and having it complete the remainder of the document. For each human-annotated sample, we applied this at every paragraph position N, yielding multiple negative examples per document.
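
The prefix-based augmentation can be sketched as follows. This is a minimal illustration, assuming paragraphs are separated by blank lines; the actual delimiter and prompt construction in the training pipeline are not specified here:

```python
def prefix_prompts(ground_truth_markdown: str) -> list[str]:
    """Sketch of the negative-example augmentation: for each cut point N,
    the model is fed the first N ground-truth paragraphs and asked to
    complete the rest; its completion becomes the rejected response.
    The blank-line paragraph delimiter here is an assumption."""
    paragraphs = ground_truth_markdown.split("\n\n")
    # One prompt per cut point N: keep at least one paragraph,
    # leave at least one for the model to complete.
    return ["\n\n".join(paragraphs[:n]) for n in range(1, len(paragraphs))]

doc = "# Title\n\nFirst paragraph.\n\nSecond paragraph.\n\nThird paragraph."
prefixes = prefix_prompts(doc)
print(len(prefixes))  # 3 cut points for a 4-paragraph document
```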

Benchmark Results

Sarashina2.2-OCR delivers highly competitive overall performance among end-to-end OCR models, achieving the best scores on VJRODa despite its compact 3B-parameter size, while remaining competitive in the Math and Table categories of olmOCR-bench.

VJRODa

VJRODa evaluates OCR capabilities for Japanese documents, particularly focusing on complex layouts and vertical text reading order.

| Model | CER (↓) | BLEU (↑) |
|---|---|---|
| gpt-5-mini-2025-08-07 | 72.4 | 23.6 |
| Qwen3.5-4B (non-thinking) | 86.1 | 47.8 |
| KARAKURI VL 32B Instruct 2507 | 280 | 14.1 |
| LightOnOCR-2-1B | 158 | 28.9 |
| dots.ocr | 40.1 | 71.5 |
| Sarashina2.2-OCR | 22.6 | 79.9 |
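
For reference, CER is conventionally the Levenshtein edit distance between prediction and reference, normalized by reference length (which is why it can exceed 100). The exact normalization used by VJRODa may differ; this is the standard textbook definition:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate (%): Levenshtein distance between hypothesis
    and reference, divided by the reference length."""
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution
        prev = curr
    return 100.0 * prev[n] / m

print(cer("日本語の文書", "日本語の文章"))  # one substitution over 6 characters
```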

olmOCR-bench

A comprehensive benchmark designed to evaluate document parsing capabilities across diverse and complex structures, such as mathematical equations, multi-column layouts, and tables.

| Model | arXiv Math | Headers Footers | Long Tiny Text | Multi Column | Old Scans | Old Scans Math | Table Tests | Overall |
|---|---|---|---|---|---|---|---|---|
| LightOnOCR-2-1B | 0.890 | 0.204 | 0.889 | 0.848 | 0.418 | 0.865 | 0.887 | 0.773 |
| dots.ocr | 0.674 | 0.849 | 0.921 | 0.803 | 0.411 | 0.603 | 0.823 | 0.722 |
| Sarashina2.2-OCR | 0.778 | 0.291 | 0.846 | 0.752 | 0.323 | 0.528 | 0.829 | 0.683 |

Usage with Transformers

```sh
pip install transformers==4.57.1 torch torchvision pillow protobuf sentencepiece accelerate
```

```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, TextStreamer, set_seed

# Load model and processor
model_path = "sbintuitions/sarashina2.2-ocr"
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
set_seed(42)

# Prepare inputs
image_url = "https://huggingface.co/sbintuitions/sarashina2.2-ocr/resolve/main/assets/sample1.jpeg"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
message = [
    {
        "role": "user",
        "content": [{"type": "image", "image": image}],
    }
]
inputs = processor.apply_chat_template(
    message,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)
streamer = TextStreamer(processor, skip_prompt=True, skip_special_tokens=True)

# Generate outputs in streaming mode
output_ids = model.generate(
    **inputs,
    max_new_tokens=6000,
    temperature=0.0,
    top_p=0.95,
    repetition_penalty=1.2,
    use_cache=True,
    streamer=streamer,
)
```

Examples

1. Vertical Japanese document parsing

Input image OCR result

The following image visualizes the output bounding boxes in red:

Detected figures

*https://warp.ndl.go.jp/web/20200609034301/www.town.suo-oshima.lg.jp/data/open/cnt/3/669/1/R01.10.P11.pdf?20200413190829
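
A visualization like the one above can be reproduced with a few lines of Pillow, given pixel-space boxes. The helper below is a sketch; the function name and box source are illustrative:

```python
from PIL import Image, ImageDraw

def draw_bboxes(image: Image.Image, boxes) -> Image.Image:
    """Draw pixel-space (x1, y1, x2, y2) rectangles in red on a copy of the image."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for box in boxes:
        draw.rectangle(box, outline="red", width=3)
    return annotated

# Example with a blank canvas standing in for a document page
img = Image.new("RGB", (400, 300), "white")
out = draw_bboxes(img, [(40, 30, 200, 150)])
```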

2. Complex business slide parsing

Input image OCR result

*https://www.aec.go.jp/kettei/kettei/20230220_3.pdf

3. Tabular layout parsing

Input image OCR result

*https://mhcc.maryland.gov/mhcc/pages/home/workgroups/documents/cardiac/Standing%20Advisory%20Committee%20Members%209-20-19.pdf

4. Mathematical formula parsing

Input image OCR result

*https://arxiv.org/pdf/2503.09208

LICENSE

MIT License

Citation

```bibtex
@misc{sarashinaOCR2026,
  title  = {Sarashina2.2-OCR: End-to-end OCR Model for Japanese Document Parsing},
  author = {Takumi Takada and Toshiyuki Tanaka and Kohei Uehara and Mikihiro Tanaka and Alexis Vallet and Aman Jain and Ryuichiro Hataya and Seitaro Shinagawa and Yuto Imai and Teppei Suzuki},
  year   = {2026},
  url    = {https://huggingface.co/sbintuitions/sarashina2.2-ocr}
}
```