1. Introduction
Chitranuvad (चित्रानुवाद; Chitra = Image, Anuvad = Translation) is a multimodal translation model that leverages both images and text to translate from English to Indic languages, grounding language understanding in visual context to improve translation quality. The model was developed for the English-to-Low-Resource Multimodal Translation Task at WMT 2024, covering Hindi, Bengali, and Malayalam, and is a fine-tuned version of Chitrarth-1. It integrates a vision encoder with a multilingual large language model, allowing it to resolve ambiguities in text using visual information.
Figure: Example of multimodal translation where the meanings of words like "court" and "right" depend on visual context, demonstrating how visual input helps disambiguate translations.
Supported Tasks
- Text-only Translation: Translate English text into Indic languages without image input.
- Image Captioning: Generate captions from images in Indic languages.
- Multimodal Translation: Translate English text while conditioning on the corresponding image.
These tasks were evaluated as part of the WMT 2024 Multimodal Machine Translation shared task.
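The three tasks differ mainly in how the model is queried. A minimal sketch of plausible query strings follows; the multimodal template matches the example in the inference command below, while the text-only and captioning templates are illustrative assumptions (see the repository's inference script for the templates actually used):

```python
def build_prompt(task: str, target_lang: str = "Hindi", caption: str = "") -> str:
    """Build a query string for one of the three supported tasks.

    The "multimodal" template mirrors the example inference command;
    the other two templates are assumptions for illustration.
    """
    if task == "multimodal":
        return (f"Translate the caption below for the given image to "
                f"{target_lang} language:\n{caption}")
    if task == "text_only":
        return f"Translate the sentence below to {target_lang} language:\n{caption}"
    if task == "captioning":
        return f"Describe the given image in {target_lang} language."
    raise ValueError(f"unknown task: {task}")
```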
2. Model Architecture and Training
Chitranuvad is a multimodal translation model that integrates visual grounding with multilingual language modeling. The model processes the input image using a vision encoder to extract visual features, which are then projected onto the embedding space of the language model. These visual embeddings are fused with tokenized text embeddings and processed by the multilingual LLM to generate the translation.
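The project-and-fuse step can be sketched in PyTorch as follows. This is a minimal sketch, not the released architecture: the dimensions, the two-layer MLP projector, and prepending visual tokens before text tokens are all assumptions in the style of common LLaVA-like designs.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Sketch of Chitranuvad's fusion step: project vision-encoder features
    into the LLM embedding space, then concatenate with text embeddings.
    Dimensions and module choices are illustrative assumptions."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Modality projector: maps visual features onto the LLM's
        # token-embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_patches, vision_dim) from the vision encoder
        # text_embeds: (batch, seq_len, llm_dim) from the LLM embedding table
        visual_embeds = self.projector(image_feats)
        # Prepend projected visual tokens; the fused sequence is then
        # processed by the multilingual LLM to generate the translation.
        return torch.cat([visual_embeds, text_embeds], dim=1)
```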
Chitranuvad is trained in three stages to progressively align visual and textual representations and specialize the model for multimodal translation.
Stage 1 — Feature Alignment
The vision encoder and modality projector are trained to align image features with textual embeddings. This stage teaches the model to associate visual objects with their textual representations.
Stage 2 — Multimodal Instruction Tuning
The model is instruction-tuned on multimodal prompts involving both text and images. This enables the system to follow instructions such as captioning images or translating text while considering visual context.
Stage 3 — Task-Specific Finetuning
The final stage fine-tunes the model specifically for translation tasks using multimodal translation datasets containing images, English source sentences, and their translations in Indic languages.
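The staged schedule amounts to choosing which components receive gradient updates at each stage. A small sketch: Stage 1 follows the description above (vision encoder and projector), while the Stage 2 and Stage 3 trainable sets are assumptions based on common multimodal fine-tuning recipes, not the released configuration.

```python
# Components updated per training stage. Stage 1 matches the text above;
# the Stage 2/3 sets are illustrative assumptions.
STAGE_TRAINABLE = {
    "stage1_feature_alignment": {"vision_encoder", "projector"},
    "stage2_instruction_tuning": {"projector", "llm"},
    "stage3_task_finetuning": {"projector", "llm"},
}

def trainable_flags(components: list[str], stage: str) -> dict[str, bool]:
    """Map each component name to whether it is updated in `stage`."""
    active = STAGE_TRAINABLE[stage]
    return {name: name in active for name in components}
```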
3. Inference code
```shell
# Clone the repository and set up the environment
git clone https://github.com/ola-krutrim/Chitranuvad.git
conda create --name chitranuvad python=3.10 -y
conda activate chitranuvad
cd Chitranuvad
pip install -r requirements.txt
pip install -e .

# Run multimodal translation on an example image
python chitranuvad/inference.py \
  --model-path "krutrim-ai-labs/Chitranuvad" \
  --image-file "assets/player_tennis_court.jpg" \
  --query "Translate the caption below for the given image to Hindi language:\nA Tennis player in a court."
```
4. Example outputs
Example translations into Hindi, Bengali, and Malayalam conditioned on visual input. We also enrich the dataset with labels for all objects identified in each image.
5. Results
Chitranuvad was the winning entry in the English-to-Low-Resource Multimodal Machine Translation task at the Ninth Conference on Machine Translation (WMT), EMNLP 2024. The shared task comprised three tracks: Text-only Translation, Multimodal Translation, and Image Captioning, covering Hindi, Bengali, and Malayalam. Model performance was evaluated using the BLEU and RIBES metrics.
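To illustrate the BLEU metric used in evaluation, here is a minimal single-reference, sentence-level BLEU with uniform n-gram weights and the standard brevity penalty. This is a pedagogical sketch; shared-task scoring uses standard tooling (e.g. sacrebleu), not this implementation.

```python
import math
from collections import Counter

def ngrams(tokens: list[str], n: int) -> list[tuple]:
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Minimal single-reference BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times the brevity penalty. Returns 0.0 if
    any n-gram order has no overlap."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped counts: each hypothesis n-gram credited at most as
        # often as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if overlap == 0:
            return 0.0
        precisions.append(overlap / max(sum(hyp_counts.values()), 1))
    # Brevity penalty punishes hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

For example, a perfect match scores 1.0, while a hypothesis missing higher-order n-gram overlap drops sharply, which is why BLEU rewards fluent multi-word agreement rather than isolated word matches.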
6. License
This code repository and the model weights are licensed under the Krutrim Community License Agreement Version 1.0.
7. Citation
@inproceedings{khan-etal-2024-chitranuvad,
title = "Chitranuvad: Adapting Multi-lingual {LLM}s for Multimodal Translation",
author = "Khan, Shaharukh and
Tarun, Ayush and
Faraz, Ali and
Kamble, Palash and
Dahiya, Vivek and
Pokala, Praveen and
Kulkarni, Ashish and
Khatri, Chandra and
Ravi, Abhinav and
Agarwal, Shubham",
editor = "Haddow, Barry and
Kocmi, Tom and
Koehn, Philipp and
Monz, Christof",
booktitle = "Proceedings of the Ninth Conference on Machine Translation",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.wmt-1.80/",
doi = "10.18653/v1/2024.wmt-1.80",
pages = "839--851"
}
8. Contact
Contributions are welcome! If you have any improvements or suggestions, feel free to submit a pull request on GitHub.
9. Acknowledgements
Chitranuvad is built with reference to the code of several open-source projects; we thank their authors for their great work!