
Overview

Chitrapathak-2 (Chitra: Image; Pathak: Reader) is a VLM-based multilingual Optical Character Recognition (OCR) system designed specifically for the linguistic diversity and document complexity of the Indian ecosystem. Trained for high-fidelity OCR on Indic language book pages, Chitrapathak-2 demonstrates strong generalization across 10 major Indian languages and English. It is the second model in the Chitrapathak OCR series, continuing the effort to build robust OCR systems for Indic scripts and multilingual documents.

Model Summary

Architecture: Vision encoder + 3B decoder LLM
Languages: Hindi, Sanskrit, Bengali, Telugu, Tamil, Marathi, Kannada, Malayalam, Odia, Punjabi, and English
Use Cases: OCR for printed text in multilingual books, PDF documents, etc.
Frameworks: TRL 0.22.1, Transformers 4.56.0, PyTorch 2.6.0+cu124
Training Strategy: Supervised fine-tuning (SFT), mixed precision (FP16 / bfloat16), multi-node training, DeepSpeed ZeRO-2 optimization

Usage

Using transformers

from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModelForImageTextToText

model_path = "krutrim-ai-labs/Chitrapathak-2"

model = AutoModelForImageTextToText.from_pretrained(
    model_path, 
    torch_dtype="auto", 
    device_map="auto", 
    attn_implementation="flash_attention_2"
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)


def perform_ocr(image_path, model, processor, max_new_tokens=4096):
    image = Image.open(image_path)
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image", "image": f"file://{image_path}"},
            {"type": "text", "text": "Perform OCR on this image and transcribe all visible text exactly as it appears."},
        ]},
    ]
    # Render the chat template, then tokenize the text and image together
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt")
    inputs = inputs.to(model.device)

    # Greedy decoding; strip the prompt tokens from each generated sequence
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    generated_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]

    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    return output_text[0]

image_path = "/path/to/your/document.jpg"
result = perform_ocr(image_path, model, processor, max_new_tokens=15000)
print(result)
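The latency figures reported later in this card assume input images of roughly 1024x1024, so very large scans can be downscaled before OCR. A minimal sketch (the helper name and the 1024 px target are illustrative choices, not part of the model's API):

```python
from PIL import Image

def load_for_ocr(image_path, max_side=1024):
    """Open an image and shrink it so its longer side is at most
    `max_side`, preserving aspect ratio. Illustrative helper, not
    part of the model's API."""
    image = Image.open(image_path).convert("RGB")
    if max(image.size) > max_side:
        image.thumbnail((max_side, max_side), Image.LANCZOS)
    return image
```

The resized image can then be passed to the processor in place of the raw `Image.open(image_path)` call.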

Using vLLM

  1. Start the vLLM server:
vllm serve krutrim-ai-labs/Chitrapathak-2
  2. Predict with the model:
from openai import OpenAI
import base64

client = OpenAI(api_key="123", base_url="http://localhost:8000/v1")

model = "krutrim-ai-labs/Chitrapathak-2"

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

def perform_ocr(img_base64):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{img_base64}"},
                    },
                    {
                        "type": "text",
                        "text": "Perform OCR on this image and transcribe all visible text exactly as it appears.",
                    },
                ],
            }
        ],
        temperature=0.0,
        max_tokens=15000
    )
    return response.choices[0].message.content

test_img_path = "/path/to/your/document.jpg"
img_base64 = encode_image(test_img_path)
print(perform_ocr(img_base64))
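For multi-page workloads, the same client can be looped over a directory of page images. A hedged sketch reusing the `client` and `model` defined above (`ocr_directory` is an illustrative helper, not part of the released tooling; the `image/png` data URL mirrors the example above):

```python
import base64
from pathlib import Path

def ocr_directory(client, model, directory, exts=(".jpg", ".jpeg", ".png")):
    """Run OCR over every image file in `directory` via an
    OpenAI-compatible client (e.g. the vLLM server above).
    Returns a {filename: transcription} mapping."""
    results = {}
    for path in sorted(Path(directory).iterdir()):
        if path.suffix.lower() not in exts:
            continue  # skip non-image files
        b64 = base64.b64encode(path.read_bytes()).decode("utf-8")
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": "Perform OCR on this image and transcribe all visible text exactly as it appears."},
            ]}],
            temperature=0.0,
            max_tokens=15000,
        )
        results[path.name] = response.choices[0].message.content
    return results
```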

Evaluation Results

Indic OCR Performance

Model Bn Word ↓ Bn Char ↓ Hi Word ↓ Hi Char ↓ Kn Word ↓ Kn Char ↓ Ml Word ↓ Ml Char ↓ Mr Word ↓ Mr Char ↓ Or Word ↓ Or Char ↓ Pa Word ↓ Pa Char ↓ Ta Word ↓ Ta Char ↓ Te Word ↓ Te Char ↓
Maya 99.42 95.77 99.7 94.91 - - - - - - - - - - - - - -
PALO 96.3 91.15 99.26 91.98 - - - - - - - - - - - - - -
Pangea 94.66 80.33 99.53 91.5 - - - - - - - - - - 99.44 84.13 99.95 89.91
Chitrarth-1 96.16 84.65 98.56 89.81 99.58 85.29 99.62 94.77 99.66 86.58 99.99 93.21 99.16 90.17 99.1 89.94 99.86 89.02
LLaMA-4 maverick 31.52 13.21 25.73 11.91 36.9 11.17 75.5 45.75 20.94 8.05 97.51 86.78 29.77 12.68 31.36 10.79 57.07 18.72
Gemma-3 27B 42.15 24.41 46.47 29.5 84.22 54.24 92.06 72.64 50.4 31.06 92.67 70.72 70.88 42.65 39.52 16.51 86.76 54.14
GPT-4o 55.51 32.68 54.62 35.54 94.33 69.79 94.67 78.47 63.44 37.93 94.61 73.46 68.88 40.71 74.35 43.39 95.97 70.08
Nanonets-OCR2-3B 28.56 12.42 32.26 16.78 99.38 93.07 97.24 89.81 40.97 15.92 99.82 97.11 98.70 82.84 95.25 78.83 99.42 89.39
Chitrapathak-1 17.14 7.03 25.55 13.74 26.24 8.78 71.97 48.19 15.68 6.09 50.72 31.62 17.7 7.87 19.25 5.81 38.79 11
Chitrapathak-2 14.51 5.47 19.87 8.36 18.8 4.81 64.47 34.7 9.82 2.27 44.74 21.83 15.24 7.06 17.66 5.68 31.81 6.69
Gemini-2.5 Flash 11.3 4.04 16.01 5.88 17.18 4.38 59.64 30.6 8.06 1.79 41.7 18.6 14.56 4.98 15.26 3.01 33.32 7.16
  • The above table shows the performance comparison of models on the IndicVisionBench-OCR benchmark.
  • The metrics used above are word-level ANLS and character-level ANLS.
  • The best value in each column is in bold and the second best is underlined.
  • Chitrapathak-2 delivers SOTA OCR accuracy in Telugu and is remarkably close to Gemini-2.5 in other Indic languages, with an average difference of only 2.21 (word) and 1.83 (char) across the nine Indic languages, and a maximum gap of just 4.83 (word) and 4.10 (char).
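The ↓ arrows indicate that these ANLS scores are error rates, so lower is better. One common way such scores are computed is as an edit distance normalized by the reference length; the sketch below illustrates the idea at both word and character level (the benchmark's exact scoring implementation may differ):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def normalized_error(prediction, reference, level="char"):
    """Edit distance normalized by reference length, in percent
    (0 = perfect transcription). `level` switches between character-
    and word-level scoring, mirroring the table's two columns."""
    if level == "word":
        prediction, reference = prediction.split(), reference.split()
    if not reference:
        return 0.0 if not prediction else 100.0
    return 100.0 * levenshtein(prediction, reference) / len(reference)
```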

English OCR Performance

Model Synthdog ANLS-Word ↓ Synthdog ANLS-Char ↓ SROIE %Match ↑
Gemma-3 27B 61.56 30.29 68.37
Llama-4 maverick 29.37 14.09 70.32
GPT-4o 82.22 73.65 36.09
Nanonets-OCR2-3B 23.9 10.8 72.33
Chitrapathak-2 24.9 20.2 68.95
Gemini-2.5 Flash 22.43 15.33 70.1
  • The above table shows the performance comparison of models on Synthdog and SROIE benchmarks.
  • The metrics for Synthdog are ANLS-word and ANLS-char. In these metrics, lower is better.
  • The metric for SROIE is %Match, indicating the percentage of fields that matched exactly with the ground truth. Here, higher is better.
  • As we see, Chitrapathak-2 retains much of the English OCR capabilities of its base model Nanonets-OCR2-3B.
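The SROIE %Match metric described above can be sketched as an exact comparison over extracted fields (the field names and dictionary structure here are illustrative; the benchmark's official scorer may differ):

```python
def exact_match_rate(predicted_fields, gold_fields):
    """Percentage of ground-truth fields whose predicted value matches
    exactly. Missing predictions count as mismatches."""
    if not gold_fields:
        return 0.0
    hits = sum(predicted_fields.get(k) == v for k, v in gold_fields.items())
    return 100.0 * hits / len(gold_fields)
```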

We also evaluated the model on the Old books OCR dataset, which consists of scanned book pages in English.

Model ANLS-Word ↓ ANLS-Char ↓
Nanonets-OCR2-3B 4.33 2.96
Chitrapathak-2 4.49 1.89
Gemini-2.5 Flash 4.36 1.86

Token Efficiency and Latency Breakdown of Chitrapathak-2

Metric bn hi kn ml mr or pa ta te en
Tokens / Word 5.9 4.8 11.2 12.6 6.5 11.7 6.9 9.4 13.2 1.4
Tokens (200 words) 1174.8 951.4 2242.2 2514.0 1292.4 2334.2 1387.2 1873.6 2646.6 280.0
Latency (200 words) 4.9s 4.0s 9.2s 10.3s 5.3s 9.5s 5.7s 7.7s 10.8s 1.3s

Note that the above latency values are calculated assuming the size of the input image to be ~1024x1024.

Observations

  • TTFT (Time-to-First-Token): ~125 ms
  • Inter-token latency: ~4 ms per token
  • Language impact: Latency varies with tokenization efficiency.
    • English and Hindi → Lower latency due to compact token-to-word ratios.
    • Telugu and Malayalam → Higher latency due to fragmented tokenization (larger number of tokens per word).
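The per-language latencies in the table are roughly consistent with a simple model: total latency ≈ TTFT + generated tokens × inter-token latency. A back-of-the-envelope sketch using the figures above (an estimate only, not a substitute for profiling on your own hardware):

```python
def estimate_latency(tokens_per_word, words, ttft_ms=125.0, per_token_ms=4.0):
    """Rough end-to-end generation latency in seconds, from the measured
    TTFT (~125 ms) and inter-token latency (~4 ms/token) above."""
    return (ttft_ms + tokens_per_word * words * per_token_ms) / 1000.0

# English, 200 words at 1.4 tokens/word: 0.125 + 280 * 0.004 ≈ 1.25 s
# Telugu, 200 words at 13.2 tokens/word: 0.125 + 2640 * 0.004 ≈ 10.7 s
```

Both estimates land close to the 1.3 s and 10.8 s reported in the table.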

Example OCR Outputs

Model outputs from Chitrapathak-2 for two sample book pages, one English and one Hindi (the input images are not reproduced in this text version):
CHAPTER XIV
MILT GODDARD returned from Pancake that night,
bringing letters for Taylor.

Sitting on the deacon's bench in the men's shanty John
opened them. One was from his father. The address was
typewritten, but within was a scant page of Luke's scrawl.
It had been years since the old man had touched pen to
paper for his son and that fact was thrilling!

"You are crazy to talk of that much pine. It can't be
done. Don't believe everything they tell you up there
just because you're a gullible cub. I'm sending Rowe to
Pancake Monday night just to see how big a fool you are.
Your mother is well. Yours, etc. L. Taylor."

John breathed deeply and smiled and scratched his
head and re-read the crabbed sentences. Beneath their
crustiness was genuine interest, a willingness, after Luke's
manner, to take him seriously at last, an indication that
the favors he had asked two months before and which had
drawn only a cruel trick now were his.

Yesterday he would have tried to calculate the profit
that might accrue to him from Luke Taylor's aid; tonight
he saw only in that note a promise that the burden on
Helen Foraker's shoulders would be lightened. She had
helped him, she had shaped him, she had taught him;
and now, perhaps, he could repay some of that obligation.
He could not know what waited just over the horizon
of time!

The other letter was in a smudged, scrawled envelope,
140
हिन्दू मत और मसीही मत ।
१ ईश्वर |
भूमिका |
इन व्याख्याओं में हमारा विशेष अभिप्राय यह है
कि हम हिन्दू मत और मसीही मत के मुख्य सिद्धान्तों
पर सोच विचार करके निर्णय करें कि वे कहां लो
समान हैं और कहां लो उन में भिन्नता पाई जाती है।
यह नहीं समझना चाहिये कि मसीही और हिन्दू
मत हर एक बात में विरोधी हैं और कभी यह नहीं
समझना चाहिये कि हिन्दू और मसीही आपस में शत्रु
हैं। मेरा आसरा है कि यह बात प्रगट होगी कि दोनों
मतों की मनसा और अभिप्राय एक है और दोनों में
कई एक सिद्धान्त हैं जो कुछ समान हैं तौभी बहुत सी
बातें हैं जिन में विरुद्धता और भिन्नता पाई जाती है।
हर एक प्रकार से हमारे लिये यह लाभदायक बात
होगी कि हम किसी प्रकार की समानता पाके आनन्दित
होवें और भिन्नता देखके निरूपण करें कि कौन २
सिद्धान्त यथार्थ और उत्तम और स्वीकार करने के
योग्य हैं।
कभी न भूलना चाहिये कि मसीहियों के लिये यह
बात काफी नहीं है कि वे इस बात को स्थापित करें
कि अमुक २ सिद्धान्त बैबल में हैं क्योंकि हिन्दू नह
मानते हैं कि बैबल प्रामाणिक और ईश्वरीय पुस्तक है

* The example images shown above were obtained from publicly available internet sources and are used only for demonstration of OCR outputs.


Highlights

  • Supports 9 major Indic languages + English
  • Optimized for printed book pages, PDF documents, and scanned text
  • Strong layout robustness — handles multi-column, multi-font, and dense paragraphs with ease
  • Second-best overall model on IndicVisionBench-OCR, ranking just behind Gemini-2.5, and the best-performing model for Telugu
  • Compatible with vLLM and Hugging Face inference pipelines
  • Multilingual generalization across diverse font styles, layouts, and page qualities

Limitations

  • The model performs OCR only; it is not intended for document-intelligence use cases.
  • Given an image, the model returns only the OCR transcription in its default output format.
  • Performance may drop on handwritten, noisy, or low-resolution images.
  • Some degradation may occur on rare Indic/English scripts or non-book domains (e.g., forms).
  • Performance degrades on index pages and other complex page layouts.

License

This model is distributed under the Krutrim Community License Agreement v1.0.
Ensure compliance before any commercial or redistributed usage.

Update: This model is released for research purposes only, since the underlying Qwen-2.5-VL model has moved from the Apache license to a more restrictive one.

Citation

If you use Chitrapathak-2 in your research, please cite:

@misc{faraz2026indicocr,
      title={Designing Production-Scale OCR for India: Multilingual and Domain-Specific Systems}, 
      author={Ali Faraz and Raja Kolla and Ashish Kulkarni and Shubham Agarwal},
      year={2026},
      eprint={2602.16430},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.16430}, 
}

Acknowledgements

Chitrapathak-2 builds upon the foundations of the following projects and open-source efforts:

Model size: 4B params · Tensor type: BF16 (Safetensors)
