1. Introduction
Chitrapathak-1 (Chitra: Image; Pathak: Reader) is a multilingual OCR system built using a Vision–Language Model (VLM) architecture designed specifically for the linguistic diversity and document complexity of the Indian ecosystem. The model formulates OCR as an image-to-text generation task, allowing direct transcription of document images into text across multiple Indic languages. This model is part of the Chitrapathak OCR series and follows a LLaVA-style multimodal architecture.
Chitrapathak-1 integrates a vision encoder with a multilingual large language model, enabling end-to-end OCR generation without relying on traditional multi-stage pipelines. The model supports 10 Indic languages (Hindi, Sanskrit, Bengali, Marathi, Tamil, Telugu, Kannada, Malayalam, Punjabi, and Odia) and is trained on a large corpus of multilingual printed documents.
2. Model Architecture & Training
Key Features
- Architecture: Vision–Language Model (LLaVA-style)
- Vision Encoder: CLIP ViT-L/14-336
- Language Model: Krutrim-1 7B multilingual LLM
- Projection Module: 2-layer MLP connecting visual embeddings to the language model token space
- Languages Supported: Hindi, Sanskrit, Bengali, Marathi, Tamil, Telugu, Kannada, Malayalam, Punjabi and Odia
- Task: Multilingual Optical Character Recognition (OCR)
- Input: Document images
- Output: Free-form text transcription
Chitrapathak-1 follows a LLaVA-style vision–language architecture where document images are first encoded using a CLIP ViT-L/14-336 vision encoder. The resulting visual embeddings are mapped into the token space of the Krutrim-1 7B multilingual language model through a two-layer MLP projection module, after which the language model autoregressively generates the OCR transcription.
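The projection step described above can be sketched numerically. CLIP ViT-L/14 at 336-pixel resolution yields 576 patch embeddings (a 24×24 grid) of width 1024; the hidden size of the 7B language model is assumed here to be 4096, and the two-layer MLP is shown with a ReLU for simplicity (LLaVA-style projectors typically use GELU):

```python
import numpy as np

VISION_DIM = 1024   # CLIP ViT-L/14 patch embedding width
LLM_DIM = 4096      # assumed hidden size of the 7B language model

rng = np.random.default_rng(0)
W1 = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.02
b1 = np.zeros(LLM_DIM)
W2 = rng.standard_normal((LLM_DIM, LLM_DIM)) * 0.02
b2 = np.zeros(LLM_DIM)

def project(patch_embeddings: np.ndarray) -> np.ndarray:
    """Map (num_patches, VISION_DIM) visual features into the LLM token space."""
    h = np.maximum(patch_embeddings @ W1 + b1, 0.0)  # ReLU stand-in for GELU
    return h @ W2 + b2

patches = rng.standard_normal((576, VISION_DIM))  # 24x24 patches at 336 px / 14
tokens = project(patches)
print(tokens.shape)  # (576, 4096)
```

The projected sequence is then prepended (or interleaved) with the text prompt tokens, and the language model decodes the transcription autoregressively.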
To improve recognition on dense document pages, the model uses dynamic image cropping, an aspect-ratio-aware tiling strategy that decomposes each page into multiple crops plus a global view before visual encoding.
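One plausible version of this tiling strategy can be sketched as follows; the exact tile size and grid rule are not documented, so this assumes 336-pixel tiles (matching the encoder's input resolution) and a ceil-based grid that stretches tiles slightly to cover the page exactly:

```python
import math

def tile_page(width: int, height: int, tile: int = 336):
    """Aspect-ratio-aware tiling: grid crops covering the page, plus a global view.

    Returns a list of (left, top, right, bottom) boxes; each crop (and the
    final full-page view) would be resized to the encoder resolution.
    """
    cols = max(1, math.ceil(width / tile))
    rows = max(1, math.ceil(height / tile))
    tw, th = width / cols, height / rows  # stretched tile size
    crops = [(round(c * tw), round(r * th), round((c + 1) * tw), round((r + 1) * th))
             for r in range(rows) for c in range(cols)]
    crops.append((0, 0, width, height))  # global view of the whole page
    return crops

boxes = tile_page(1000, 1400)
print(len(boxes))  # 3 cols x 5 rows = 15 crops, plus 1 global view -> 16
```

Each crop is encoded independently, and the resulting patch embeddings are concatenated before projection into the language model.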
The model is trained in two stages. During multimodal pretraining, only the projection layer is optimized while the vision encoder and language model remain frozen to stabilize multimodal alignment. This is followed by supervised fine-tuning, where the projection layer and language model are jointly trained on multilingual OCR data while keeping the vision encoder frozen.
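The freeze schedule above can be written down as a small helper. Module names here are illustrative; in an actual training framework one would toggle `requires_grad` on the corresponding parameter groups:

```python
MODULES = ("vision_encoder", "projector", "language_model")

def trainable(stage: str) -> dict:
    """Which modules receive gradients in each training stage."""
    if stage == "pretrain":   # multimodal alignment: projector only
        live = {"projector"}
    elif stage == "sft":      # supervised fine-tuning on multilingual OCR data
        live = {"projector", "language_model"}
    else:
        raise ValueError(f"unknown stage: {stage}")
    return {m: (m in live) for m in MODULES}

print(trainable("pretrain"))
print(trainable("sft"))
```

Note that the vision encoder stays frozen in both stages; only the projector is updated throughout.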
3. Inference Code

```shell
# Clone the repository
git clone https://github.com/ola-krutrim/Chitrapathak.git

# Create and activate the environment
conda create --name chitrapathak python=3.10 -y
conda activate chitrapathak

# Install dependencies
cd Chitrapathak
pip install -r requirements.txt
pip install -e .

# Run inference on a sample image
python chitrapathak/inference.py --model-path "krutrim-ai-labs/Chitrapathak-1" --image-file "assets/hin1.png"
```
4. Evaluation Results
Chitrapathak-1 is evaluated on IndicVisionBench-OCR, a multilingual benchmark for OCR performance across Indic scripts.
The benchmark measures Average Normalized Levenshtein Distance (ANLS) at both the word level and character level, where lower scores indicate better OCR quality.
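A common way to compute such a score is the edit distance normalized by the longer sequence length, averaged over samples; the benchmark's exact normalization may differ, so treat this as a sketch:

```python
def levenshtein(a, b) -> int:
    """Classic dynamic-programming edit distance over two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls_distance(preds, refs, level: str = "char") -> float:
    """Average Normalized Levenshtein Distance; lower is better."""
    total = 0.0
    for p, r in zip(preds, refs):
        if level == "word":
            p, r = p.split(), r.split()
        total += levenshtein(p, r) / max(len(p), len(r), 1)
    return total / len(preds)
```

For example, `anls_distance(["abc"], ["abd"])` gives 1/3 at the character level, since one of three characters differs.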
Chitrapathak-1 demonstrates strong OCR performance across multiple Indic languages, validating the effectiveness of end-to-end vision-language training for multilingual document transcription.
5. License
This code repository and the model weights are licensed under the Krutrim Community License Agreement Version 1.0.
6. Citation
If you use Chitrapathak-1 in your research, please cite:
```bibtex
@misc{faraz2026indicocr,
  title={Designing Production-Scale OCR for India: Multilingual and Domain-Specific Systems},
  author={Ali Faraz and Raja Kolla and Ashish Kulkarni and Shubham Agarwal},
  year={2026},
  eprint={2602.16430},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.16430},
}
```
7. Contact
Contributions are welcome! If you have any improvements or suggestions, feel free to submit a pull request on GitHub.
8. Acknowledgement
Chitrapathak is built with reference to the code of several open-source projects; we thank their authors for their great work!