## Overview
Chitrapathak-2 (Chitra: Image; Pathak: Reader) is a VLM-based multilingual Optical Character Recognition (OCR) system designed specifically for the linguistic diversity and document complexity of the Indian ecosystem. Trained for high-fidelity OCR on Indic language book pages, Chitrapathak-2 demonstrates strong generalization across 10 major Indian languages and English. It is the second model in the Chitrapathak OCR series, continuing the effort to build robust OCR systems for Indic scripts and multilingual documents.
## Model Summary
| Property | Details |
|---|---|
| Architecture | Vision-Encoder + 3B Decoder LLM |
| Languages | Hindi, Sanskrit, Bengali, Telugu, Tamil, Marathi, Kannada, Malayalam, Odia, Punjabi and English |
| Use Cases | OCR for printed text in multilingual books, PDF documents, etc. |
| Frameworks | TRL 0.22.1, Transformers 4.56.0, PyTorch 2.6.0+cu124 |
| Training Strategy | Supervised Fine-Tuning (SFT), mixed precision (FP16 / bfloat16), multinode training, DeepSpeed ZeRO-2 optimization |
## Usage

### Using transformers

```python
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModelForImageTextToText

model_path = "krutrim-ai-labs/Chitrapathak-2"

model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2"
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)

def perform_ocr(image_path, model, processor, max_new_tokens=4096):
    image = Image.open(image_path)
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image", "image": f"file://{image_path}"},
            {"type": "text", "text": "Perform OCR on this image and transcribe all visible text exactly as it appears."},
        ]},
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt")
    inputs = inputs.to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Strip the prompt tokens so only the generated transcription is decoded
    generated_ids = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)]
    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    return output_text[0]

image_path = "/path/to/your/document.jpg"
result = perform_ocr(image_path, model, processor, max_new_tokens=15000)
print(result)
```
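The latency figures later in this card assume an input of roughly 1024×1024 pixels, so very large scans can be downscaled before OCR. A minimal sketch (the helper name and the 1024 px long-side target are assumptions, not part of the model's API):

```python
from PIL import Image

def prepare_page(img: Image.Image, max_side: int = 1024) -> Image.Image:
    """Downscale so the longer side is at most max_side, preserving aspect ratio.

    Assumption: ~1024 px on the long side is sufficient for printed-text OCR;
    images already smaller than max_side are returned unchanged in size.
    """
    img = img.copy()  # thumbnail() resizes in place, so work on a copy
    img.thumbnail((max_side, max_side), Image.LANCZOS)
    return img
```

The resized image can then be saved and passed to `perform_ocr` as usual.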
### Using vLLM

- Start the vLLM server:

```shell
vllm serve krutrim-ai-labs/Chitrapathak-2
```

- Predict with the model:

```python
from openai import OpenAI
import base64

client = OpenAI(api_key="123", base_url="http://localhost:8000/v1")
model = "krutrim-ai-labs/Chitrapathak-2"

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

def perform_ocr(img_base64):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{img_base64}"},
                    },
                    {
                        "type": "text",
                        "text": "Perform OCR on this image and transcribe all visible text exactly as it appears.",
                    },
                ],
            }
        ],
        temperature=0.0,
        max_tokens=15000
    )
    return response.choices[0].message.content

test_img_path = "/path/to/your/document.jpg"
img_base64 = encode_image(test_img_path)
print(perform_ocr(img_base64))
```
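The example above hardcodes `image/png` in the data URL even for JPEG inputs. Most servers tolerate this, but the MIME type can be derived from the filename; a small sketch using only the standard library (the helper name is an assumption for illustration):

```python
import base64
import mimetypes

def to_data_url(image_path: str, image_bytes: bytes) -> str:
    """Build a data URL for the chat-completions image_url field.

    The MIME type is guessed from the file extension, falling back to PNG.
    """
    mime = mimetypes.guess_type(image_path)[0] or "image/png"
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime};base64,{b64}"
```

Usage: replace the hardcoded f-string with `to_data_url(test_img_path, open(test_img_path, "rb").read())`.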
## Evaluation Results

### Indic OCR Performance
| Model | Bn Word ↓ | Bn Char ↓ | Hi Word ↓ | Hi Char ↓ | Kn Word ↓ | Kn Char ↓ | Ml Word ↓ | Ml Char ↓ | Mr Word ↓ | Mr Char ↓ | Or Word ↓ | Or Char ↓ | Pa Word ↓ | Pa Char ↓ | Ta Word ↓ | Ta Char ↓ | Te Word ↓ | Te Char ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Maya | 99.42 | 95.77 | 99.7 | 94.91 | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| PALO | 96.3 | 91.15 | 99.26 | 91.98 | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Pangea | 94.66 | 80.33 | 99.53 | 91.5 | - | - | - | - | - | - | - | - | - | - | 99.44 | 84.13 | 99.95 | 89.91 |
| Chitrarth-1 | 96.16 | 84.65 | 98.56 | 89.81 | 99.58 | 85.29 | 99.62 | 94.77 | 99.66 | 86.58 | 99.99 | 93.21 | 99.16 | 90.17 | 99.1 | 89.94 | 99.86 | 89.02 |
| LLaMA-4 maverick | 31.52 | 13.21 | 25.73 | 11.91 | 36.9 | 11.17 | 75.5 | 45.75 | 20.94 | 8.05 | 97.51 | 86.78 | 29.77 | 12.68 | 31.36 | 10.79 | 57.07 | 18.72 |
| Gemma-3 27B | 42.15 | 24.41 | 46.47 | 29.5 | 84.22 | 54.24 | 92.06 | 72.64 | 50.4 | 31.06 | 92.67 | 70.72 | 70.88 | 42.65 | 39.52 | 16.51 | 86.76 | 54.14 |
| GPT-4o | 55.51 | 32.68 | 54.62 | 35.54 | 94.33 | 69.79 | 94.67 | 78.47 | 63.44 | 37.93 | 94.61 | 73.46 | 68.88 | 40.71 | 74.35 | 43.39 | 95.97 | 70.08 |
| Nanonets-OCR2-3B | 28.56 | 12.42 | 32.26 | 16.78 | 99.38 | 93.07 | 97.24 | 89.81 | 40.97 | 15.92 | 99.82 | 97.11 | 98.70 | 82.84 | 95.25 | 78.83 | 99.42 | 89.39 |
| Chitrapathak-1 | 17.14 | 7.03 | 25.55 | 13.74 | 26.24 | 8.78 | 71.97 | 48.19 | 15.68 | 6.09 | 50.72 | 31.62 | 17.7 | 7.87 | 19.25 | 5.81 | 38.79 | 11 |
| Chitrapathak-2 | <u>14.51</u> | <u>5.47</u> | <u>19.87</u> | <u>8.36</u> | <u>18.8</u> | <u>4.81</u> | <u>64.47</u> | <u>34.7</u> | <u>9.82</u> | <u>2.27</u> | <u>44.74</u> | <u>21.83</u> | <u>15.24</u> | <u>7.06</u> | <u>17.66</u> | <u>5.68</u> | **31.81** | **6.69** |
| Gemini-2.5 Flash | **11.3** | **4.04** | **16.01** | **5.88** | **17.18** | **4.38** | **59.64** | **30.6** | **8.06** | **1.79** | **41.7** | **18.6** | **14.56** | **4.98** | **15.26** | **3.01** | <u>33.32</u> | <u>7.16</u> |
- The above table shows the performance comparison of models on the IndicVisionBench-OCR benchmark.
- The metrics are word-level ANLS and character-level ANLS; as the ↓ arrows indicate, lower is better.
- The best value in each column is in bold and the second best is underlined.
- Chitrapathak-2 delivers SOTA OCR accuracy in Telugu and is remarkably close to Gemini-2.5 in other Indic languages, with an average difference of only 2.21 (word) and 1.83 (char) across the nine Indic languages, and a maximum gap of just 4.83 (word) and 4.10 (char).
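The benchmark's exact scoring code is not reproduced here, but the character- and word-level scores above behave like normalized Levenshtein distance reported as a percentage (lower is better). A minimal sketch under that assumption:

```python
def levenshtein(a, b) -> int:
    """Edit distance via the classic two-row DP; works on strings or word lists."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def char_score(pred: str, ref: str) -> float:
    """Character-level normalized edit distance, as a percentage (lower is better)."""
    if not pred and not ref:
        return 0.0
    return 100.0 * levenshtein(pred, ref) / max(len(pred), len(ref))

def word_score(pred: str, ref: str) -> float:
    """Word-level normalized edit distance, as a percentage (lower is better)."""
    p, r = pred.split(), ref.split()
    if not p and not r:
        return 0.0
    return 100.0 * levenshtein(p, r) / max(len(p), len(r))
```

This is a sketch of the metric family, not the benchmark's official implementation.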
### English OCR Performance

| Model | Synthdog ANLS-Word ↓ | Synthdog ANLS-Char ↓ | SROIE % Match ↑ |
|---|---|---|---|
| Gemma-3 27B | 61.56 | 30.29 | 68.37 |
| Llama-4 maverick | 29.37 | 14.09 | 70.32 |
| GPT-4o | 82.22 | 73.65 | 36.09 |
| Nanonets-OCR2-3B | 23.9 | 10.8 | 72.33 |
| Chitrapathak-2 | 24.9 | 20.2 | 68.95 |
| Gemini-2.5 Flash | 22.43 | 15.33 | 70.1 |
- The above table shows the performance comparison of models on Synthdog and SROIE benchmarks.
- The metrics for Synthdog are ANLS-word and ANLS-char. In these metrics, lower is better.
- The metric for SROIE is %Match, indicating the percentage of fields that matched exactly with the ground truth. Here, higher is better.
- As the table shows, Chitrapathak-2 retains much of the English OCR capability of its base model, Nanonets-OCR2-3B.
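Scoring conventions for SROIE vary between implementations; a minimal sketch of the exact-field-match metric described above, where the field names (`company`, `total`, etc.) are illustrative assumptions:

```python
def percent_match(predictions, ground_truths) -> float:
    """Percentage of fields that exactly match the ground truth (higher is better).

    predictions / ground_truths: parallel lists of dicts mapping field name -> value.
    """
    total = correct = 0
    for pred, gt in zip(predictions, ground_truths):
        for field, value in gt.items():
            total += 1
            correct += (pred.get(field, "").strip() == value.strip())
    return 100.0 * correct / total if total else 0.0
```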
We also evaluated the model on the Old books OCR dataset, which consists of scanned book pages in English.
| Model | ANLS-Word ↓ | ANLS-Char ↓ |
|---|---|---|
| Nanonets-OCR2-3B | 4.33 | 2.96 |
| Chitrapathak-2 | 4.49 | 1.89 |
| Gemini-2.5 Flash | 4.36 | 1.86 |
### Token Efficiency and Latency Breakdown of Chitrapathak-2
| Metric | bn | hi | kn | ml | mr | or | pa | ta | te | en |
|---|---|---|---|---|---|---|---|---|---|---|
| Tokens / Word | 5.9 | 4.8 | 11.2 | 12.6 | 6.5 | 11.7 | 6.9 | 9.4 | 13.2 | 1.4 |
| Tokens (200 words) | 1174.8 | 951.4 | 2242.2 | 2514.0 | 1292.4 | 2334.2 | 1387.2 | 1873.6 | 2646.6 | 280.0 |
| Latency (200 words) | 4.9s | 4.0s | 9.2s | 10.3s | 5.3s | 9.5s | 5.7s | 7.7s | 10.8s | 1.3s |
Note that the above latency values assume an input image of roughly 1024×1024 pixels.
#### Observations
- TTFT (Time-to-First-Token): ~125 ms
- Inter-token latency: ~4 ms per token
- Language impact: Latency varies with tokenization efficiency.
  - English and Hindi → lower latency due to compact token-to-word ratios.
  - Telugu and Malayalam → higher latency due to fragmented tokenization (more tokens per word).
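The per-language latencies in the table are roughly consistent with a simple linear model built from the TTFT and inter-token figures above. A sketch (the helper name is an assumption; values are approximate):

```python
# Approximate constants from the observations above.
TTFT_S = 0.125   # time-to-first-token, ~125 ms
ITL_S = 0.004    # inter-token latency, ~4 ms per generated token

def estimate_latency(tokens_per_word: float, n_words: int = 200) -> float:
    """Estimated generation latency in seconds: TTFT + output tokens * inter-token latency."""
    return TTFT_S + tokens_per_word * n_words * ITL_S
```

For example, Bengali at 5.9 tokens/word gives roughly 4.8 s for 200 words, close to the 4.9 s reported in the table.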
### Example OCR Outputs
| Input Image | Model Output (Chitrapathak-2) |
|---|---|
| *(scanned English book page)* | CHAPTER XIV MILT GODDARD returned from Pancake that night, bringing letters for Taylor. Sitting on the deacon's bench in the men's shanty John opened them. One was from his father. The address was typewritten, but within was a scant page of Luke's scrawl. It had been years since the old man had touched pen to paper for his son and that fact was thrilling! "You are crazy to talk of that much pine. It can't be done. Don't believe everything they tell you up there just because you're a gullible cub. I'm sending Rowe to Pancake Monday night just to see how big a fool you are. Your mother is well. Yours, etc. L. Taylor." John breathed deeply and smiled and scratched his head and re-read the crabbed sentences. Beneath their crustiness was genuine interest, a willingness, after Luke's manner, to take him seriously at last, an indication that the favors he had asked two months before and which had drawn only a cruel trick now were his. Yesterday he would have tried to calculate the profit that might accrue to him from Luke Taylor's aid; tonight he saw only in that note a promise that the burden on Helen Foraker's shoulders would be lightened. She had helped him, she had shaped him, she had taught him; and now, perhaps, he could repay some of that obligation. He could not know what waited just over the horizon of time! The other letter was in a smudged, scrawled envelope, 140 |
| *(scanned Hindi book page)* | हिन्दू मत और मसीही मत । १ ईश्वर | भूमिका | इन व्याख्याओं में हमारा विशेष अभिप्राय यह है कि हम हिन्दू मत और मसीही मत के मुख्य सिद्धान्तों पर सोच विचार करके निर्णय करें कि वे कहां लो समान हैं और कहां लो उन में भिन्नता पाई जाती है। यह नहीं समझना चाहिये कि मसीही और हिन्दू मत हर एक बात में विरोधी हैं और कभी यह नहीं समझना चाहिये कि हिन्दू और मसीही आपस में शत्रु हैं। मेरा आसरा है कि यह बात प्रगट होगी कि दोनों मतों की मनसा और अभिप्राय एक है और दोनों में कई एक सिद्धान्त हैं जो कुछ समान हैं तौभी बहुत सी बातें हैं जिन में विरुद्धता और भिन्नता पाई जाती है। हर एक प्रकार से हमारे लिये यह लाभदायक बात होगी कि हम किसी प्रकार की समानता पाके आनन्दित होवें और भिन्नता देखके निरूपण करें कि कौन २ सिद्धान्त यथार्थ और उत्तम और स्वीकार करने के योग्य हैं। कभी न भूलना चाहिये कि मसीहियों के लिये यह बात काफी नहीं है कि वे इस बात को स्थापित करें कि अमुक २ सिद्धान्त बैबल में हैं क्योंकि हिन्दू नह मानते हैं कि बैबल प्रामाणिक और ईश्वरीय पुस्तक है |
* The example images shown above were obtained from publicly available internet sources and are used only for demonstration of OCR outputs.
## Highlights
- Supports 10 major Indic languages + English
- Optimized for printed book pages, PDF documents, and scanned text
- Strong layout robustness — handles multi-column, multi-font, and dense paragraphs with ease
- Second-best overall model on IndicVisionBench-OCR, ranking just behind Gemini-2.5, and the best-performing model for Telugu
- Compatible with vLLM and Hugging Face inference pipelines
- Multilingual generalization across diverse font styles, layouts, and page qualities
## Limitations
- The model performs OCR only and is not intended for document-intelligence use cases.
- Given an image, the model only returns the OCR transcription in its default output format.
- Performance may drop on handwritten, noisy, or low-resolution images.
- Some degradation may be observed for rare Indic/English scripts or non-book domains (e.g. forms).
- Performance degradation is observed on Index-page layouts and other complicated page layouts.
## License
This model is distributed under the Krutrim Community License Agreement v1.0.
Ensure compliance before any commercial or redistributed usage.
Update: This model is released for research purposes only, since the license of the underlying Qwen-2.5-VL model has changed from Apache 2.0 to a more restrictive license.
## Citation
If you use Chitrapathak-2 in your research, please cite:
```bibtex
@misc{faraz2026indicocr,
  title={Designing Production-Scale OCR for India: Multilingual and Domain-Specific Systems},
  author={Ali Faraz and Raja Kolla and Ashish Kulkarni and Shubham Agarwal},
  year={2026},
  eprint={2602.16430},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.16430},
}
```
## Acknowledgements
Chitrapathak-2 builds upon the foundations of the following projects and open-source efforts:
- Nanonets-OCR2-3B (base model)
- Qwen-2.5-VL (underlying VLM)