---
license: mit
tags:
- machine-translation
- mbart
- multilingual
- huggingface
- peft
- lora
- english
- telugu
datasets:
- HackHedron/English_Telugu_Parallel_Corpus
language:
- en
- te
library_name: peft
inference: false
widget:
- text: "Hello, how are you?"
---

# 🌍 LoRA-mBART50: English ↔ Telugu Translation (Few-shot)

This model is a parameter-efficient fine-tuned version of [facebook/mbart-large-50-many-to-many-mmt](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt), adapted with [LoRA (Low-Rank Adaptation)](https://arxiv.org/abs/2106.09685) via the Hugging Face PEFT library. It was fine-tuned in a **few-shot setting** on the [HackHedron English-Telugu Parallel Corpus](https://huggingface.co/datasets/HackHedron/English_Telugu_Parallel_Corpus), using just **1% of the data (~4.3k sentence pairs)**.

---

## 🧠 Model Details

- **Base model**: `facebook/mbart-large-50-many-to-many-mmt`
- **Languages**: `en_XX` ↔ `te_IN`
- **Technique**: LoRA (r=8, α=32, dropout=0.1)
- **Training regime**: 3 epochs, batch size 8, learning rate 5e-4
- **Libraries**: 🤗 PEFT (`peft`), `transformers`, `datasets`

---

## 📚 Dataset

- **Source**: [HackHedron/English_Telugu_Parallel_Corpus](https://huggingface.co/datasets/HackHedron/English_Telugu_Parallel_Corpus)
- **Size used**: 4,338 sentence pairs (~1%)
- **Format**:
  - `english`: source text
  - `telugu`: target translation

---

## 💻 Usage

### Load Adapter with Base mBART

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
from peft import PeftModel

# Load the base model and tokenizer
base_model = MBartForConditionalGeneration.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt"
)
tokenizer = MBart50TokenizerFast.from_pretrained("your-username/lora-mbart-en-te")

# Attach the LoRA adapter to the frozen base model
model = PeftModel.from_pretrained(base_model, "your-username/lora-mbart-en-te")

# Set source and target languages
tokenizer.src_lang = "en_XX"
tokenizer.tgt_lang = "te_IN"

# Translate a sentence; forced_bos_token_id makes generation start in Telugu
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
generated_ids = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["te_IN"],
)
translation = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(translation)
```

---

## 🔧 Training Configuration

| Setting         | Value    |
| --------------- | -------- |
| Base Model      | mBART-50 |
| LoRA r          | 8        |
| LoRA Alpha      | 32       |
| Dropout         | 0.1      |
| Optimizer       | AdamW    |
| Learning Rate   | 5e-4     |
| Batch Size      | 8        |
| Epochs          | 3        |
| Mixed Precision | fp16     |

---

## 🚀 Applications

* English ↔ Telugu translation for low-resource settings
* Mobile/edge inference with minimal memory (the adapter can be merged into the base weights; see the sketch at the end of this card)
* Foundation for further multilingual LoRA adapters

---

## ⚠️ Limitations

* Trained on limited data (a 1% subset)
* Translation quality may vary on unseen or complex sentences
* Only supports `en_XX` (English) and `te_IN` (Telugu) at this stage

---

## 📎 Citation

If you use this model, please cite the base model:

```bibtex
@article{liu2020mbart,
  title={Multilingual Denoising Pre-training for Neural Machine Translation},
  author={Liu, Yinhan and Gu, Jiatao and Goyal, Naman and Li, Xian and Edunov, Sergey and Ghazvininejad, Marjan and Lewis, Mike and Zettlemoyer, Luke},
  journal={Transactions of the Association for Computational Linguistics},
  volume={8},
  pages={726--742},
  year={2020}
}
```

---

## 🧑‍💻 Author

Fine-tuned by **Koushik Reddy**, ML & DL Enthusiast | NLP | LoRA | mBART | Hugging Face

Connect: [Hugging Face](https://huggingface.co/Koushim)
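
---

## 🔬 Reproduction Sketches (unofficial)

The snippets below are illustrative sketches, not the original training script. Anything not stated elsewhere on this card — the train split name, shuffle seed, LoRA target modules, and output paths — is an assumption.

### Loading the ~1% subset

A minimal sketch of drawing the 4,338 pairs from the corpus with 🤗 `datasets`; the `train` split name and the shuffle seed are assumptions:

```python
from datasets import load_dataset

# Load the full parallel corpus (assumes a "train" split exists)
dataset = load_dataset("HackHedron/English_Telugu_Parallel_Corpus", split="train")

# Take a ~1% few-shot subset; the shuffle seed is arbitrary
subset = dataset.shuffle(seed=42).select(range(4338))

# Each example has an `english` source and a `telugu` target
print(subset[0]["english"], "→", subset[0]["telugu"])
```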
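### Configuring the LoRA adapter and trainer arguments

A sketch of how the hyperparameters in the table above could be wired up with `peft` and `transformers`. The `target_modules` choice (query/value projections, a common default for mBART-style attention) and the output directory are assumptions:

```python
from transformers import MBartForConditionalGeneration, Seq2SeqTrainingArguments
from peft import LoraConfig, TaskType, get_peft_model

# Frozen base model
base = MBartForConditionalGeneration.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt"
)

# LoRA hyperparameters from this card: r=8, alpha=32, dropout=0.1
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # assumption: attention query/value projections
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the low-rank adapter weights train

# Trainer arguments matching the table: batch size 8, lr 5e-4, 3 epochs, fp16
training_args = Seq2SeqTrainingArguments(
    output_dir="./lora-mbart-en-te",  # placeholder path
    per_device_train_batch_size=8,
    learning_rate=5e-4,
    num_train_epochs=3,
    fp16=True,
)
```

AdamW, the optimizer in the table, is the `transformers` Trainer default; tokenization and the `Seq2SeqTrainer` call are omitted for brevity.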
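### Merging the adapter for edge deployment

For the mobile/edge use case listed above, the adapter can be folded back into the base weights so inference needs neither `peft` nor a separate adapter download. A sketch, assuming `model` is the `PeftModel` from the usage example and the save path is a placeholder:

```python
# Fold the LoRA deltas into the base weights; the result is a plain
# MBartForConditionalGeneration with no runtime PEFT dependency
merged = model.merge_and_unload()

merged.save_pretrained("./mbart-en-te-merged")   # placeholder path
tokenizer.save_pretrained("./mbart-en-te-merged")
```

Merging trades the adapter's small on-disk footprint for a simpler single-checkpoint deployment; output quality is unchanged, since the merged weights are mathematically equivalent to base-plus-adapter.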