---
license: mit
tags:
- machine-translation
- mbart
- multilingual
- huggingface
- peft
- lora
- english
- telugu
datasets:
- HackHedron/English_Telugu_Parallel_Corpus
language:
- en
- te
library_name: peft
inference: false
widget:
- text: "Hello, how are you?"
---
# 🌍 LoRA-mBART50: English ↔ Telugu Translation (Few-shot)
This model is a parameter-efficient fine-tuned version of [facebook/mbart-large-50-many-to-many-mmt](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt) using [LoRA (Low-Rank Adaptation)](https://arxiv.org/abs/2106.09685) via the Hugging Face PEFT library.
It is fine-tuned in a **few-shot setting** on the [HackHedron English-Telugu Parallel Corpus](https://huggingface.co/datasets/HackHedron/English_Telugu_Parallel_Corpus) using just **1% of the data (~4.3k pairs)**.
---
## 🧠 Model Details
- **Base model**: `facebook/mbart-large-50-many-to-many-mmt`
- **Languages**: `en_XX` ↔ `te_IN`
- **Technique**: LoRA (r=8, α=32, dropout=0.1)
- **Training regime**: 3 epochs, batch size 8, learning rate 5e-4
- **Library**: πŸ€— PEFT (`peft`), `transformers`, `datasets`
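The adapter setup above takes only a few lines of PEFT. Below is a minimal sketch using the hyperparameters from this card; `target_modules` is an assumption, since the card does not state which projections were adapted (`q_proj`/`v_proj` is a common choice for mBART attention):

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import MBartForConditionalGeneration

# Base seq2seq checkpoint
base_model = MBartForConditionalGeneration.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt"
)

# Hyperparameters from this card; target_modules is an assumption
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # assumed: common choice for mBART
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the low-rank matrices are trainable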
---
## πŸ“š Dataset
- **Source**: [HackHedron/English_Telugu_Parallel_Corpus](https://huggingface.co/datasets/HackHedron/English_Telugu_Parallel_Corpus)
- **Size used**: 4338 sentence pairs (~1%)
- **Format**:
- `english`: Source text
- `telugu`: Target translation
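A minimal sketch of loading the corpus and carving out the few-shot subset; the `train` split name and the shuffling seed are assumptions, as the exact sampling procedure is not recorded in this card:

```python
from datasets import load_dataset

# Load the parallel corpus (assuming a "train" split)
dataset = load_dataset("HackHedron/English_Telugu_Parallel_Corpus", split="train")

# Few-shot subset of 4338 pairs (~1%); the seed here is an arbitrary choice
subset = dataset.shuffle(seed=42).select(range(4338))

print(subset[0]["english"], "->", subset[0]["telugu"])
```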
---
## πŸ’» Usage
### Load Adapter with Base mBART
```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
from peft import PeftModel
# Load base model & tokenizer
base_model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("Koushim/lora-mbart-en-te")
# Load LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "Koushim/lora-mbart-en-te")
# Set source and target languages
tokenizer.src_lang = "en_XX"
tokenizer.tgt_lang = "te_IN"
# Prepare input
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
generated_ids = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["te_IN"])
translation = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(translation)
```
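The same adapter covers the reverse direction as well. Reusing the `model` and `tokenizer` from the snippet above, swap the source language and the forced BOS token (the Telugu sample sentence is purely illustrative):

```python
# Telugu -> English: swap source language and forced BOS token
tokenizer.src_lang = "te_IN"
inputs = tokenizer("మీరు ఎలా ఉన్నారు?", return_tensors="pt")
generated_ids = model.generate(
    **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"]
)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```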
---
## πŸ”§ Training Configuration
| Setting | Value |
| --------------- | -------- |
| Base Model | mBART-50 |
| LoRA r | 8 |
| LoRA Alpha | 32 |
| Dropout         | 0.1      |
| Learning Rate   | 5e-4     |
| Optimizer       | AdamW    |
| Batch Size | 8 |
| Epochs | 3 |
| Mixed Precision | fp16 |
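The training script itself is not part of this card, but the table above maps naturally onto `Seq2SeqTrainingArguments`. A hypothetical sketch (`output_dir` is a placeholder):

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical mapping of the table above onto Trainer arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="lora-mbart-en-te",   # placeholder
    per_device_train_batch_size=8,
    learning_rate=5e-4,
    num_train_epochs=3,
    fp16=True,
    optim="adamw_torch",             # AdamW
)
```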
---
## πŸš€ Applications
* English ↔ Telugu translation for low-resource settings
* Mobile/Edge inference with minimal memory (see the merge sketch after this list)
* Foundation for multilingual LoRA adapters
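For edge deployment, the LoRA weights can be folded back into the base model so inference no longer needs the `peft` dependency at runtime. A sketch using PEFT's `merge_and_unload` (output paths are placeholders):

```python
# Fold LoRA weights into the base model; returns a plain MBart model
merged_model = model.merge_and_unload()
merged_model.save_pretrained("mbart-en-te-merged")   # placeholder path
tokenizer.save_pretrained("mbart-en-te-merged")
```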
---
## ⚠️ Limitations
* Trained on limited data (1% subset)
* Translation quality may vary on unseen or complex sentences
* Only supports English (`en_XX`) and Telugu (`te_IN`) at this stage
---
## πŸ“Ž Citation
If you use this model, please cite the base model:
```bibtex
@article{liu2020mbart,
  title={Multilingual Denoising Pre-training for Neural Machine Translation},
  author={Liu, Yinhan and Gu, Jiatao and Goyal, Naman and Li, Xian and Edunov, Sergey and Ghazvininejad, Marjan and Lewis, Mike and Zettlemoyer, Luke},
  journal={Transactions of the Association for Computational Linguistics},
  volume={8},
  year={2020}
}
```
---
## πŸ§‘β€πŸ’» Author
Fine-tuned by **Koushik Reddy**, ML & DL Enthusiast | NLP | LoRA | mBART | Hugging Face
Connect: [Hugging Face](https://huggingface.co/Koushim)