---
license: mit
tags:
- machine-translation
- mbart
- multilingual
- huggingface
- peft
- lora
- english
- telugu
datasets:
- HackHedron/English_Telugu_Parallel_Corpus
language:
- en
- te
library_name: peft
inference: false
widget:
- text: "Hello, how are you?"
---

# LoRA-mBART50: English → Telugu Translation (Few-shot)

This model is a parameter-efficient fine-tuned version of [facebook/mbart-large-50-many-to-many-mmt](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt) using [LoRA (Low-Rank Adaptation)](https://arxiv.org/abs/2106.09685) via the Hugging Face PEFT library.

It is fine-tuned in a **few-shot setting** on the [HackHedron English-Telugu Parallel Corpus](https://huggingface.co/datasets/HackHedron/English_Telugu_Parallel_Corpus) using just **1% of the data (~4.3k pairs)**.

---

## Model Details

- **Base model**: `facebook/mbart-large-50-many-to-many-mmt`
- **Languages**: `en_XX` → `te_IN`
- **Technique**: LoRA (r=8, α=32, dropout=0.1); a configuration sketch follows below
- **Training regime**: 3 epochs, batch size 8, learning rate 5e-4
- **Libraries**: Hugging Face `peft`, `transformers`, `datasets`
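
For reference, the hyperparameters above correspond to a PEFT setup along these lines (a minimal sketch; the `target_modules` shown are a common choice for mBART attention layers and are an assumption, not taken from the actual training script):

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import MBartForConditionalGeneration

base_model = MBartForConditionalGeneration.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt"
)

# r, alpha, and dropout as listed above; target_modules is an assumption
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # assumed, not documented in this card
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapter matrices train
```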

---

## Dataset

- **Source**: [HackHedron/English_Telugu_Parallel_Corpus](https://huggingface.co/datasets/HackHedron/English_Telugu_Parallel_Corpus)
- **Size used**: 4,338 sentence pairs (~1% of the corpus; see the sampling sketch below)
- **Format**:
  - `english`: source text
  - `telugu`: target translation
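
One way to reproduce a ~1% few-shot subset with `datasets` (a sketch only: the split name, seed, and sampling strategy are assumptions, since the original selection procedure is not documented here):

```python
from datasets import load_dataset

# Load the parallel corpus (assuming a single "train" split)
dataset = load_dataset("HackHedron/English_Telugu_Parallel_Corpus", split="train")

# Keep ~1% of the pairs for few-shot fine-tuning (seed is illustrative)
subset = dataset.train_test_split(test_size=0.99, seed=42)["train"]

print(len(subset))  # roughly 4.3k pairs
print(subset[0]["english"], "->", subset[0]["telugu"])
```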

---

## Usage

### Load Adapter with Base mBART

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
from peft import PeftModel

# Load base model and tokenizer
base_model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("your-username/lora-mbart-en-te")

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "your-username/lora-mbart-en-te")

# Set source and target languages
tokenizer.src_lang = "en_XX"
tokenizer.tgt_lang = "te_IN"

# Prepare input and translate
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
generated_ids = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["te_IN"])
translation = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(translation)
```
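
For deployment without a `peft` dependency, the adapter can be merged into the base weights (a minimal sketch; the output path is illustrative):

```python
# Fold the LoRA weights into the base model and drop the adapter wrappers
merged_model = model.merge_and_unload()

# Save a plain mBART checkpoint that loads without `peft`
merged_model.save_pretrained("mbart50-en-te-merged")  # illustrative path
tokenizer.save_pretrained("mbart50-en-te-merged")
```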

---

## Training Configuration

| Setting         | Value    |
| --------------- | -------- |
| Base Model      | mBART-50 |
| LoRA r          | 8        |
| LoRA Alpha      | 32       |
| Dropout         | 0.1      |
| Optimizer       | AdamW    |
| Learning Rate   | 5e-4     |
| Batch Size      | 8        |
| Epochs          | 3        |
| Mixed Precision | fp16     |
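
These settings map onto a `transformers` training setup roughly as follows (a sketch, assuming a `Seq2SeqTrainer`-style script; `output_dir` and `logging_steps` are illustrative):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="lora-mbart-en-te",   # illustrative
    per_device_train_batch_size=8,   # batch size from the table
    learning_rate=5e-4,
    num_train_epochs=3,
    fp16=True,                       # mixed precision
    logging_steps=50,                # assumed
)
# AdamW is the default optimizer in transformers' Trainer
```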

---

## Applications

* English → Telugu translation in low-resource settings
* Mobile/edge-oriented deployment: the adapter itself is tiny, though the base mBART model is still required at inference time
* Foundation for further multilingual LoRA adapters

---

## Limitations

* Trained on limited data (a 1% subset, ~4.3k pairs)
* Translation quality may vary on unseen or complex sentences
* Only supports the `en_XX` → `te_IN` (English to Telugu) direction at this stage

---

## Citation

If you use this model, please cite the base model:

```bibtex
@article{liu2020mbart,
  title={Multilingual Denoising Pre-training for Neural Machine Translation},
  author={Liu, Yinhan and Gu, Jiatao and Goyal, Naman and Li, Xian and Edunov, Sergey and Ghazvininejad, Marjan and Lewis, Mike and Zettlemoyer, Luke},
  journal={Transactions of the Association for Computational Linguistics},
  volume={8},
  year={2020}
}
```

---

## Author

Fine-tuned by **Koushik Reddy**, ML & DL Enthusiast | NLP | LoRA | mBART | Hugging Face

Connect: [Hugging Face](https://huggingface.co/Koushim)
|