1. Introduction
Chitrapathak-1 (Chitra: Image; Pathak: Reader) is a multilingual OCR system built using a Vision–Language Model (VLM) architecture designed specifically for the linguistic diversity and document complexity of the Indian ecosystem. The model formulates OCR as an image-to-text generation task, allowing direct transcription of document images into text across multiple Indic languages. This model is part of the Chitrapathak OCR series and follows a LLaVA-style multimodal architecture.
Chitrapathak-1 integrates a vision encoder with a multilingual large language model, enabling end-to-end OCR generation without relying on traditional multi-stage pipelines. The model supports 10 Indic languages (Hindi, Sanskrit, Bengali, Marathi, Tamil, Telugu, Kannada, Malayalam, Punjabi, and Odia) and is trained on a large corpus of multilingual printed documents.
2. Model Architecture & Training
Key Features
- Architecture: Vision–Language Model (LLaVA-style)
- Vision Encoder: CLIP ViT-L/14-336
- Language Model: Krutrim-1 7B multilingual LLM
- Projection Module: 2-layer MLP connecting visual embeddings to the language model token space
- Languages Supported: Hindi, Sanskrit, Bengali, Marathi, Tamil, Telugu, Kannada, Malayalam, Punjabi and Odia
- Task: Multilingual Optical Character Recognition (OCR)
- Input: Document images
- Output: Free-form text transcription
Chitrapathak-1 follows a LLaVA-style vision–language architecture where document images are first encoded using a CLIP ViT-L/14-336 vision encoder. The resulting visual embeddings are mapped into the token space of the Krutrim-1 7B multilingual language model through a two-layer MLP projection module, after which the language model autoregressively generates the OCR transcription.
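The projection step described above can be sketched numerically. CLIP ViT-L/14 at 336-pixel resolution yields 576 patch embeddings (a 24×24 grid) of width 1024; the hidden size of the 7B language model is assumed here to be 4096, and the two-layer MLP is shown with a ReLU for simplicity (LLaVA-style projectors typically use GELU):

```python
import numpy as np

VISION_DIM = 1024   # CLIP ViT-L/14 patch embedding width
LLM_DIM = 4096      # assumed hidden size of the 7B language model

rng = np.random.default_rng(0)
W1 = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.02
b1 = np.zeros(LLM_DIM)
W2 = rng.standard_normal((LLM_DIM, LLM_DIM)) * 0.02
b2 = np.zeros(LLM_DIM)

def project(patch_embeddings: np.ndarray) -> np.ndarray:
    """Map (num_patches, VISION_DIM) visual features into the LLM token space."""
    h = np.maximum(patch_embeddings @ W1 + b1, 0.0)  # ReLU stand-in for GELU
    return h @ W2 + b2

patches = rng.standard_normal((576, VISION_DIM))  # 24x24 patches at 336 px / 14
tokens = project(patches)
print(tokens.shape)  # (576, 4096)
```

The projected sequence is then prepended (or interleaved) with the text prompt tokens, and the language model decodes the transcription autoregressively.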
To improve recognition on dense document pages, the model uses dynamic image cropping, an aspect-ratio-aware tiling strategy that decomposes each page into multiple crops plus a global view before visual encoding.
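One plausible version of this tiling strategy can be sketched as follows; the exact tile size and grid rule are not documented, so this assumes 336-pixel tiles (matching the encoder's input resolution) and a ceil-based grid that stretches tiles slightly to cover the page exactly:

```python
import math

def tile_page(width: int, height: int, tile: int = 336):
    """Aspect-ratio-aware tiling: grid crops covering the page, plus a global view.

    Returns a list of (left, top, right, bottom) boxes; each crop (and the
    final full-page view) would be resized to the encoder resolution.
    """
    cols = max(1, math.ceil(width / tile))
    rows = max(1, math.ceil(height / tile))
    tw, th = width / cols, height / rows  # stretched tile size
    crops = [(round(c * tw), round(r * th), round((c + 1) * tw), round((r + 1) * th))
             for r in range(rows) for c in range(cols)]
    crops.append((0, 0, width, height))  # global view of the whole page
    return crops

boxes = tile_page(1000, 1400)
print(len(boxes))  # 3 cols x 5 rows = 15 crops, plus 1 global view -> 16
```

Each crop is encoded independently, and the resulting patch embeddings are concatenated before projection into the language model.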
The model is trained in two stages. During multimodal pretraining, only the projection layer is optimized while the vision encoder and language model remain frozen to stabilize multimodal alignment. This is followed by supervised fine-tuning, where the projection layer and language model are jointly trained on multilingual OCR data while keeping the vision encoder frozen.
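The freeze schedule above can be written down as a small helper. Module names here are illustrative; in an actual training framework one would toggle `requires_grad` on the corresponding parameter groups:

```python
MODULES = ("vision_encoder", "projector", "language_model")

def trainable(stage: str) -> dict:
    """Which modules receive gradients in each training stage."""
    if stage == "pretrain":   # multimodal alignment: projector only
        live = {"projector"}
    elif stage == "sft":      # supervised fine-tuning on multilingual OCR data
        live = {"projector", "language_model"}
    else:
        raise ValueError(f"unknown stage: {stage}")
    return {m: (m in live) for m in MODULES}

print(trainable("pretrain"))
print(trainable("sft"))
```

Note that the vision encoder stays frozen in both stages; only the projector is updated throughout.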
3. Inference Code

```shell
# Clone the repository
git clone https://github.com/ola-krutrim/Chitrapathak.git

# Create and activate the environment
conda create --name chitrapathak python=3.10 -y
conda activate chitrapathak

# Install dependencies
cd Chitrapathak
pip install -r requirements.txt
pip install -e .

# Run inference on a sample image
python chitrapathak/inference.py --model-path "krutrim-ai-labs/Chitrapathak-1" --image-file "assets/hin1.png"
```
4. Evaluation Results
Chitrapathak-1 is evaluated on IndicVisionBench-OCR, a multilingual benchmark for OCR performance across Indic scripts.
The benchmark measures Average Normalized Levenshtein Distance (ANLS) at both the word level and character level, where lower scores indicate better OCR quality.
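A common way to compute such a score is the edit distance normalized by the longer sequence length, averaged over samples; the benchmark's exact normalization may differ, so treat this as a sketch:

```python
def levenshtein(a, b) -> int:
    """Classic dynamic-programming edit distance over two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls_distance(preds, refs, level: str = "char") -> float:
    """Average Normalized Levenshtein Distance; lower is better."""
    total = 0.0
    for p, r in zip(preds, refs):
        if level == "word":
            p, r = p.split(), r.split()
        total += levenshtein(p, r) / max(len(p), len(r), 1)
    return total / len(preds)
```

For example, `anls_distance(["abc"], ["abd"])` gives 1/3 at the character level, since one of three characters differs.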
Chitrapathak-1 demonstrates strong OCR performance across multiple Indic languages, validating the effectiveness of end-to-end vision-language training for multilingual document transcription.
5. License
This code repository and the model weights are licensed under the Krutrim Community License Agreement Version 1.0.
6. Citation
If you use Chitrapathak-1 in your research, please cite:
```bibtex
@misc{faraz2026indicocr,
  title={Designing Production-Scale OCR for India: Multilingual and Domain-Specific Systems},
  author={Ali Faraz and Raja Kolla and Ashish Kulkarni and Shubham Agarwal},
  year={2026},
  eprint={2602.16430},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.16430},
}
```
7. Contact
Contributions are welcome! If you have any improvements or suggestions, feel free to submit a pull request on GitHub.
8. Acknowledgement
Chitrapathak is built with reference to the code of several open-source projects; we thank their authors for their great work!