| # IndicTrans2 HF Compatible Models | |
| [](https://colab.research.google.com/github/AI4Bharat/IndicTrans2/blob/main/huggingface_interface/colab_inference.ipynb) | |
| In this section, we provide details on how to use our [IndicTrans2](https://github.com/AI4Bharat/IndicTrans2) models which were originally trained with the [fairseq](https://github.com/facebookresearch/fairseq) to [HuggingFace transformers](https://huggingface.co/docs/transformers/index) for inference purpose. Our scripts for HuggingFace compatible models are adapted from [M2M100 repository](https://github.com/huggingface/transformers/tree/main/src/transformers/models/m2m_100). | |
| > Note: We have migrated IndicTrans2 tokenizer for HF compatible IndicTrans2 models to [IndicTransToolkit](https://github.com/VarunGumma/IndicTransToolkit) and will be maintained separately there from now onwards. This is automatically installed when you call `install.sh` script in `huggingface_interface`. | |
| ### Setup | |
| To get started, follow these steps to set up the environment: | |
| ``` | |
| # Clone the github repository and navigate to the project directory. | |
| git clone https://github.com/AI4Bharat/IndicTrans2 | |
| cd IndicTrans2/huggingface_interface | |
| # Install all the dependencies and requirements associated with the project for running HF compatible models. | |
| source install.sh | |
| ``` | |
| > Note: The `install.sh` script in this directory is specifically for running HF compatible models for inference. | |
| ### Converting | |
| In order to convert the fairseq checkpoint to a PyTorch checkpoint that is compatible with HuggingFace Transformers, use the following command: | |
| ```bash | |
| python3 convert_indictrans_checkpoint_to_pytorch.py --fairseq_path <fairseq_checkpoint_best.pt> --pytorch_dump_folder_path <hf_output_dir> | |
| ``` | |
| - `<fairseq_checkpoint_best.pt>`: path to the fairseq `checkpoint_best.pt` that needs to be converted to HF compatible models | |
| - `<hf_output_dir>`: path to the output directory where the HF compatible models will be saved | |
| ### Models | |
| | Model | π€ HuggingFace Checkpoints | | |
| | -------------------------------- | ----------------------------------------------------------------------------------------------------------------- | | |
| | En-Indic | [ai4bharat/indictrans2-en-indic-1B](https://huggingface.co/ai4bharat/indictrans2-en-indic-1B) | | |
| | Indic-En | [ai4bharat/indictrans2-indic-en-1B](https://huggingface.co/ai4bharat/indictrans2-indic-en-1B) | | |
| | Distilled En-Indic | [ai4bharat/indictrans2-en-indic-dist-200M](https://huggingface.co/ai4bharat/indictrans2-en-indic-dist-200M) | | |
| | Distilled Indic-En | [ai4bharat/indictrans2-indic-en-dist-200M](https://huggingface.co/ai4bharat/indictrans2-indic-en-dist-200M) | | |
| | Indic-Indic (Stitched) | [ai4bharat/indictrans2-indic-indic-1B](https://huggingface.co/ai4bharat/indictrans2-indic-indic-1B) | | |
| | Distilled Indic-Indic (Stitched) | [ai4bharat/indictrans2-indic-indic-dist-320M](https://huggingface.co/ai4bharat/indictrans2-indic-indic-dist-320M) | | |
| ### Inference | |
| With the conversion complete, you can now perform inference using the HuggingFace Transformers. | |
| You can start with the provided `example.py` script and customize it for your specific translation use case: | |
| ```bash | |
| python3 example.py | |
| ``` | |
| Feel free to modify the `example.py` script to suit your translation needs. | |
| ### Fine-tuning with LoRA | |
| Before starting with fine-tuning IndicTrans2 models, you will need to restructure the training data in the following format. | |
| ``` | |
| en-indic-exp | |
| βββ train | |
| β βββ eng_Latn-asm_Beng | |
| β β βββ train.eng_Latn | |
| β β βββ train.asm_Beng | |
| β βββ eng_Latn-ben_Beng | |
| β β βββ ... | |
| β βββ {src_lang}-{tgt_lang} | |
| β βββ train.{src_lang} | |
| β βββ train.{tgt_lang} | |
| βββ dev | |
| βββ eng_Latn-asm_Beng | |
| β βββ dev.eng_Latn | |
| β βββ dev.asm_Beng | |
| βββ eng_Latn-ben_Beng | |
| β βββ ... | |
| βββ {src_lang}-{tgt_lang} | |
| βββ dev.{src_lang} | |
| βββ dev.{tgt_lang} | |
| ``` | |
| Once you have data ready in above specified format, use the following command. | |
| ```bash | |
| bash train_lora.sh <data_dir> <model_name> <output_dir> <direction> <src_lang_list> <tgt_lang_list> | |
| ``` | |
| We recommend you to refer to `train_lora.sh` for defaults arguments for fine-tuning. Please note that the specified hyperparameters may not be optimal and might require tuning for your use case. | |
| ### Inference with LoRA | |
| You can load the LoRA adapters with the base model for inference by modifying the model initialization in `example.py` script. | |
| ```python | |
| from transformers import AutoModelForSeq2SeqLM | |
| from peft import PeftConfig, PeftModel | |
| base_ckpt_dir = "ai4bharat/indictrans2-en-indic-1B" # you will need to change as per your use case | |
| base_model = AutoModelForSeq2SeqLM.from_pretrained(base_ckpt_dir, trust_remote_code=True) | |
| lora_model = PeftModel.from_pretrained(base_model, lora_ckpt_dir) | |
| ``` | |
| > Note: Please feel free to open issues on the GitHub repo in case of any queries/issues. | |
| ### Citation | |
| ```bibtex | |
| @article{gala2023indictrans, | |
| title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages}, | |
| author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan}, | |
| journal={Transactions on Machine Learning Research}, | |
| issn={2835-8856}, | |
| year={2023}, | |
| url={https://openreview.net/forum?id=vfT4YuzAYA}, | |
| note={} | |
| } | |
| ``` | |