---
license: mit
tags:
- machine-translation
- mbart
- multilingual
- huggingface
- peft
- lora
- english
- telugu
datasets:
- HackHedron/English_Telugu_Parallel_Corpus
language:
- en
- te
library_name: peft
inference: false
widget:
- text: "Hello, how are you?"
---
# 🌍 LoRA-mBART50: English ↔ Telugu Translation (Few-shot)
This model is a parameter-efficient fine-tuned version of [facebook/mbart-large-50-many-to-many-mmt](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt) using [LoRA (Low-Rank Adaptation)](https://arxiv.org/abs/2106.09685) via the Hugging Face PEFT library.
It is fine-tuned in a **few-shot setting** on the [HackHedron English-Telugu Parallel Corpus](https://huggingface.co/datasets/HackHedron/English_Telugu_Parallel_Corpus) using just **1% of the data (~4.3k pairs)**.
---
## 🧠 Model Details
- **Base model**: `facebook/mbart-large-50-many-to-many-mmt`
- **Languages**: `en_XX` ↔ `te_IN`
- **Technique**: LoRA (r=8, α=32, dropout=0.1)
- **Training regime**: 3 epochs, batch size 8, learning rate 5e-4
- **Library**: πŸ€— PEFT (`peft`), `transformers`, `datasets`
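The adapter setup above takes only a few lines of PEFT. Below is a minimal sketch using the hyperparameters from this card; `target_modules` is an assumption, since the card does not state which projections were adapted (`q_proj`/`v_proj` is a common choice for mBART attention):

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import MBartForConditionalGeneration

# Base seq2seq checkpoint
base_model = MBartForConditionalGeneration.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt"
)

# Hyperparameters from this card; target_modules is an assumption
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # assumed: common choice for mBART
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the low-rank matrices are trainable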
---
## πŸ“š Dataset
- **Source**: [HackHedron/English_Telugu_Parallel_Corpus](https://huggingface.co/datasets/HackHedron/English_Telugu_Parallel_Corpus)
- **Size used**: 4338 sentence pairs (~1%)
- **Format**:
- `english`: Source text
- `telugu`: Target translation
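A minimal sketch of loading the corpus and carving out the few-shot subset; the `train` split name and the shuffling seed are assumptions, as the exact sampling procedure is not recorded in this card:

```python
from datasets import load_dataset

# Load the parallel corpus (assuming a "train" split)
dataset = load_dataset("HackHedron/English_Telugu_Parallel_Corpus", split="train")

# Few-shot subset of 4338 pairs (~1%); the seed here is an arbitrary choice
subset = dataset.shuffle(seed=42).select(range(4338))

print(subset[0]["english"], "->", subset[0]["telugu"])
```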
---
## πŸ’» Usage
### Load Adapter with Base mBART
```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
from peft import PeftModel
# Load base model & tokenizer
base_model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("Koushim/lora-mbart-en-te")
# Load LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "Koushim/lora-mbart-en-te")
# Set source and target languages
tokenizer.src_lang = "en_XX"
tokenizer.tgt_lang = "te_IN"
# Prepare input
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
generated_ids = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["te_IN"])
translation = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(translation)
```
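The same adapter covers the reverse direction as well. Reusing the `model` and `tokenizer` from the snippet above, swap the source language and the forced BOS token (the Telugu sample sentence is purely illustrative):

```python
# Telugu -> English: swap source language and forced BOS token
tokenizer.src_lang = "te_IN"
inputs = tokenizer("మీరు ఎలా ఉన్నారు?", return_tensors="pt")
generated_ids = model.generate(
    **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"]
)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```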
---
## πŸ”§ Training Configuration
| Setting | Value |
| --------------- | -------- |
| Base Model | mBART-50 |
| LoRA r | 8 |
| LoRA Alpha | 32 |
| Dropout         | 0.1      |
| Learning Rate   | 5e-4     |
| Optimizer       | AdamW    |
| Batch Size | 8 |
| Epochs | 3 |
| Mixed Precision | fp16 |
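The training script itself is not part of this card, but the table above maps naturally onto `Seq2SeqTrainingArguments`. A hypothetical sketch (`output_dir` is a placeholder):

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical mapping of the table above onto Trainer arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="lora-mbart-en-te",   # placeholder
    per_device_train_batch_size=8,
    learning_rate=5e-4,
    num_train_epochs=3,
    fp16=True,
    optim="adamw_torch",             # AdamW
)
```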
---
## πŸš€ Applications
* English ↔ Telugu translation for low-resource settings
* Mobile/Edge inference with minimal memory (see the merge sketch after this list)
* Foundation for multilingual LoRA adapters
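For edge deployment, the LoRA weights can be folded back into the base model so inference no longer needs the `peft` dependency at runtime. A sketch using PEFT's `merge_and_unload` (output paths are placeholders):

```python
# Fold LoRA weights into the base model; returns a plain MBart model
merged_model = model.merge_and_unload()
merged_model.save_pretrained("mbart-en-te-merged")   # placeholder path
tokenizer.save_pretrained("mbart-en-te-merged")
```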
---
## ⚠️ Limitations
* Trained on limited data (1% subset)
* Translation quality may vary on unseen or complex sentences
* Only supports English (`en_XX`) and Telugu (`te_IN`) at this stage
---
## πŸ“Ž Citation
If you use this model, please cite the base model:
```bibtex
@article{liu2020mbart,
  title={Multilingual Denoising Pre-training for Neural Machine Translation},
  author={Liu, Yinhan and Gu, Jiatao and Goyal, Naman and Li, Xian and Edunov, Sergey and Ghazvininejad, Marjan and Lewis, Mike and Zettlemoyer, Luke},
  journal={Transactions of the Association for Computational Linguistics},
  volume={8},
  year={2020}
}
```
---
## πŸ§‘β€πŸ’» Author
Fine-tuned by **Koushik Reddy**, ML & DL Enthusiast | NLP | LoRA | mBART | Hugging Face
Connect: [Hugging Face](https://huggingface.co/Koushim)