Chitranuvad

1. Introduction

Chitranuvad (चित्रानुवाद; Chitra = Image, Anuvad = Translation) is a multimodal translation model that leverages both images and text to perform translation from English to Indic languages. The model is designed to improve translation quality by grounding language understanding in visual context. Chitranuvad was developed for the English-to-Low-Resource Multimodal Translation Task at WMT 2024, focusing on translation into Indic languages including Hindi, Bengali, and Malayalam, and is a fine-tuned version of Chitrarth-1. The model integrates a vision encoder with a multilingual large language model, allowing it to resolve ambiguities in text using visual information.

Figure: Example of multimodal translation where the meaning of words like "court" and "right" depends on visual context, demonstrating how visual input helps disambiguate translations.

Supported Tasks

  • Text-only Translation: Translate English text into Indic languages without image input.
  • Image Captioning: Generate captions from images in Indic languages.
  • Multimodal Translation: Translate English text while conditioning on the corresponding image.

These tasks were evaluated as part of the WMT 2024 Multimodal Machine Translation shared task.
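The three tasks differ mainly in how the model is prompted. As an illustration, here is a minimal prompt builder; the templates for text-only translation and captioning are assumptions for illustration, while the multimodal translation template follows the query string used in the inference example in Section 3.

```python
def build_prompt(task, text=None, target_lang="Hindi"):
    """Build a prompt string for one of the three supported tasks.

    Templates are illustrative; only the multimodal translation template
    mirrors the query used in the repository's inference example.
    """
    if task == "text_translation":
        return f"Translate the following sentence to {target_lang} language:\n{text}"
    if task == "captioning":
        return f"Describe the given image in {target_lang} language."
    if task == "multimodal_translation":
        return f"Translate the caption below for the given image to {target_lang} language:\n{text}"
    raise ValueError(f"Unknown task: {task}")

# Example: the prompt used in the Section 3 inference command.
prompt = build_prompt("multimodal_translation", "A Tennis player in a court.")
```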

2. Model Architecture and Training

Chitranuvad is a multimodal translation model that integrates visual grounding with multilingual language modeling. The model processes the input image using a vision encoder to extract visual features, which are then projected onto the embedding space of the language model. These visual embeddings are fused with tokenized text embeddings and processed by the multilingual LLM to generate the translation.
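The projection-and-fusion step described above can be sketched as follows. This is a pure-Python stand-in with toy dimensions, not the actual model code: the image features are mapped through a linear modality projector into the LLM embedding space, then prepended to the text token embeddings before the LLM consumes the sequence.

```python
def matvec(matrix, vec):
    """Multiply a (d_out x d_in) matrix by a d_in-dimensional vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def fuse(visual_feats, text_embeds, projector):
    """Project each visual feature into the LLM embedding space,
    then prepend the projected sequence to the text embeddings."""
    projected = [matvec(projector, v) for v in visual_feats]
    return projected + text_embeds  # fused sequence fed to the multilingual LLM

# Toy sizes: 2 visual patches of dim 3 projected to LLM dim 2, plus 2 text tokens.
projector = [[1.0, 0.0, 0.0],
             [0.0, 1.0, 0.0]]           # (2 x 3) modality projector
visual = [[0.5, 0.2, 0.9], [0.1, 0.8, 0.3]]
text = [[0.4, 0.4], [0.7, 0.1]]
fused = fuse(visual, text, projector)   # 4 embeddings of dim 2
```

In the real model the projector is a learned module and the embeddings are high-dimensional tensors, but the data flow is the same: project, concatenate, decode.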

Figure: Overview of the Chitranuvad architecture and training pipeline.

Chitranuvad is trained in three stages to progressively align visual and textual representations and specialize the model for multimodal translation.

Stage 1 — Feature Alignment

The vision encoder and modality projector are trained to align image features with textual embeddings. This stage teaches the model to associate visual objects with their textual representations.

Stage 2 — Multimodal Instruction Tuning

The model is instruction-tuned on multimodal prompts involving both text and images. This enables the system to follow instructions such as captioning images or translating text while considering visual context.

Stage 3 — Task-Specific Finetuning

The final stage fine-tunes the model specifically for translation tasks using multimodal translation datasets containing images, English source sentences, and their translations in Indic languages.
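One way to summarize the three stages is by which components are updated at each step. The schedule below is a sketch inferred from the stage descriptions above (Stage 1 trains the vision encoder and projector; Stages 2 and 3 tune the language model on multimodal data); the module names and the exact freezing policy are assumptions, not the repository's actual configuration.

```python
# Illustrative stage schedule: True = trainable, False = frozen.
STAGES = {
    "stage1_feature_alignment":  {"vision_encoder": True,  "projector": True, "llm": False},
    "stage2_instruction_tuning": {"vision_encoder": False, "projector": True, "llm": True},
    "stage3_task_finetuning":    {"vision_encoder": False, "projector": True, "llm": True},
}

def trainable_modules(stage):
    """Return the names of modules updated during the given stage."""
    return [name for name, trainable in STAGES[stage].items() if trainable]
```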

3. Inference code

# Clone the repository
git clone https://github.com/ola-krutrim/Chitranuvad.git
cd Chitranuvad

# Create and activate a conda environment
conda create --name chitranuvad python=3.10 -y
conda activate chitranuvad

# Install dependencies and the package itself
pip install -r requirements.txt
pip install -e .

# Run multimodal translation inference
python chitranuvad/inference.py \
    --model-path "krutrim-ai-labs/Chitranuvad" \
    --image-file "assets/player_tennis_court.jpg" \
    --query "Translate the caption below for the given image to Hindi language:\nA Tennis player in a court."

4. Example outputs

Figure: Example translations into Hindi, Bengali, and Malayalam conditioned on visual input. The dataset is enriched with labels for all objects identified in each image.

5. Results

Chitranuvad was the winning entry among participating submissions for the English to Low-Resource Multimodal Machine Translation task at the Workshop on Machine Translation, EMNLP 2024. The shared task comprised three tracks: Text-only Translation, Multimodal Translation, and Image Captioning, covering the languages Hindi, Bengali, and Malayalam. Model performance was evaluated using the BLEU and RIBES metrics.
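For intuition about the primary metric, here is a minimal pure-Python BLEU sketch: the geometric mean of modified n-gram precisions up to 4-grams, scaled by a brevity penalty. This is an unsmoothed teaching implementation, not the official scorer used in the shared task evaluation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Unsmoothed sentence BLEU for whitespace-tokenized strings."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precision = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((h & r).values())          # clipped n-gram matches
        total = max(sum(h.values()), 1)
        if overlap == 0:
            return 0.0                           # no smoothing in this sketch
        log_precision += math.log(overlap / total) / max_n
    # Brevity penalty discourages overly short hypotheses.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(log_precision)
```

RIBES, the task's other metric, additionally rewards correct word order via rank correlation, which matters for English-to-Indic translation where word order diverges substantially.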

6. License

This code repository and the model weights are licensed under the Krutrim Community License Agreement Version 1.0.

7. Citation

@inproceedings{khan-etal-2024-chitranuvad,
    title = "Chitranuvad: Adapting Multi-lingual {LLM}s for Multimodal Translation",
    author = "Khan, Shaharukh  and
      Tarun, Ayush  and
      Faraz, Ali  and
      Kamble, Palash  and
      Dahiya, Vivek  and
      Pokala, Praveen  and
      Kulkarni, Ashish  and
      Khatri, Chandra  and
      Ravi, Abhinav  and
      Agarwal, Shubham",
    editor = "Haddow, Barry  and
      Kocmi, Tom  and
      Koehn, Philipp  and
      Monz, Christof",
    booktitle = "Proceedings of the Ninth Conference on Machine Translation",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.wmt-1.80/",
    doi = "10.18653/v1/2024.wmt-1.80",
    pages = "839--851"
}

8. Contact

Contributions are welcome! If you have any improvements or suggestions, feel free to submit a pull request on GitHub.

9. Acknowledgements

Chitranuvad is built with reference to the code of several open-source projects. We thank their authors for their great work!
