# English → Telugu Translation (mBART)
This repository contains an English to Telugu neural machine translation model based on mBART-50, fine-tuned and deployed using the Hugging Face ecosystem.
The model translates English text (en) into Telugu (te) and is suitable for research, learning, and demo purposes.
## Live Demo
You can try the model live in the Hugging Face Space below:
https://huggingface.co/spaces/Yaser77/mbart-en-te-demo
## Model Details
- Model name: `mbart-en-te`
- Base model: `facebook/mbart-large-50-many-to-many-mmt`
- Task: machine translation (English → Telugu)
- Framework: Hugging Face Transformers
- Model size: ~600M parameters
- Precision: FP32
- Tokenizer: SentencePiece (mBART-50 tokenizer)
## How It Works
- Source language: English (`en_XX`)
- Target language: Telugu (`te_IN`)
- Decoding forces the BOS token to the target-language code, which mBART-based translation models require.
## Intended Uses
### Direct Use
- Translating English text into Telugu
- Educational demos and learning projects
- NLP experimentation with Indic languages
### Downstream Use
- Can be fine-tuned further on domain-specific parallel corpora
- Can be integrated into web apps or APIs for translation services (a minimal Gradio sketch appears under Technical Specifications below)
### Out-of-Scope Use
- Not intended for real-time production systems without optimization
- Not suitable for legal, medical, or safety-critical translations
## Limitations & Risks
- Translation quality depends heavily on sentence complexity
- May struggle with idioms, slang, or highly technical language
- Biases present in the original training data may be reflected in translations
- Inference is slow on CPU because of the model's size (a mitigation sketch follows this list)
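One common mitigation, assuming a CUDA GPU is available, is to load the weights in half precision. This is a sketch of that option, not part of the original deployment:

```python
import torch
from transformers import MBartForConditionalGeneration

# Assumes a CUDA device is available; FP16 roughly halves memory use
# and noticeably speeds up generation compared with FP32 on CPU.
model = MBartForConditionalGeneration.from_pretrained(
    "Yaser77/mbart-en-te",
    torch_dtype=torch.float16,
).to("cuda")
```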
## How to Use the Model (Code)
```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model_id = "Yaser77/mbart-en-te"

# Load the fine-tuned model and its SentencePiece tokenizer
tokenizer = MBart50TokenizerFast.from_pretrained(model_id)
model = MBartForConditionalGeneration.from_pretrained(model_id)

# Mark the input as English and force the decoder to start in Telugu
tokenizer.src_lang = "en_XX"
forced_bos_token_id = tokenizer.lang_code_to_id["te_IN"]

text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")

outputs = model.generate(
    **inputs,
    forced_bos_token_id=forced_bos_token_id,
    max_length=128,
)

print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```
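The high-level `pipeline` API should also work here, since the translation pipeline accepts mBART-50 language codes via `src_lang` and `tgt_lang`; this shorter variant is a sketch rather than part of the original card:

```python
from transformers import pipeline

# Convenience wrapper around the same model; src_lang / tgt_lang select
# the mBART-50 language codes used in the example above.
translator = pipeline(
    "translation",
    model="Yaser77/mbart-en-te",
    src_lang="en_XX",
    tgt_lang="te_IN",
)

print(translator("Hello, how are you?")[0]["translation_text"])
```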
## Training Details
### Training Data
- Based on multilingual translation data used by mBART
- No custom dataset is claimed beyond fine-tuning and experimentation
### Training Method
- Sequence-to-sequence Transformer
- Teacher forcing during training
- Cross-entropy loss
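To make these points concrete, here is a minimal, hypothetical fine-tuning sketch with `Seq2SeqTrainer`. The toy sentence pair, batch size, learning rate, and epoch count are illustrative assumptions, not the recipe actually used; passing tokenized targets as `labels` is what gives teacher forcing with a cross-entropy loss inside the model.

```python
from datasets import Dataset
from transformers import (
    DataCollatorForSeq2Seq,
    MBart50TokenizerFast,
    MBartForConditionalGeneration,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

base_id = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(
    base_id, src_lang="en_XX", tgt_lang="te_IN"
)
model = MBartForConditionalGeneration.from_pretrained(base_id)

# Toy stand-in for a real English-Telugu parallel corpus.
raw = Dataset.from_dict({
    "en": ["Hello, how are you?"],
    "te": ["హలో, మీరు ఎలా ఉన్నారు?"],
})

def preprocess(batch):
    # `text_target` tokenizes the Telugu side as labels; during training
    # the model computes cross-entropy over them with teacher forcing.
    return tokenizer(batch["en"], text_target=batch["te"],
                     truncation=True, max_length=128)

train_dataset = raw.map(preprocess, batched=True, remove_columns=["en", "te"])

args = Seq2SeqTrainingArguments(
    output_dir="mbart-en-te",
    per_device_train_batch_size=8,  # illustrative hyperparameters
    learning_rate=3e-5,
    num_train_epochs=3,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```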
## Evaluation
- Qualitative evaluation using common English sentences
- No official BLEU or other automatic metric scores are reported
- Intended mainly as a demo and learning project
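If you want to attach a number, BLEU can be computed with `sacrebleu`; the sentences below are placeholders, since no reference set is published with this model:

```python
import sacrebleu

# Placeholder outputs and references for illustration only;
# identical strings here would simply yield BLEU = 100.
hypotheses = ["మీరు ఎలా ఉన్నారు?"]
references = [["మీరు ఎలా ఉన్నారు?"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```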
## Environmental Impact
- Hardware: CPU (Hugging Face Spaces, Basic tier)
- Compute Region: Managed by Hugging Face
- Precision: FP32
- Carbon Emissions: Not explicitly measured
## Technical Specifications
- Architecture: Transformer encoder-decoder (mBART)
- Libraries: `transformers`, `torch`, `sentencepiece`, `gradio`
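Since `gradio` is part of the stack, here is a minimal sketch of how the demo Space could wrap the model; it is close in spirit to the deployed app but is an assumption, not necessarily its exact code:

```python
import gradio as gr
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model_id = "Yaser77/mbart-en-te"
tokenizer = MBart50TokenizerFast.from_pretrained(model_id)
model = MBartForConditionalGeneration.from_pretrained(model_id)
tokenizer.src_lang = "en_XX"

def translate(text: str) -> str:
    # Same generate call as in the usage example above
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.lang_code_to_id["te_IN"],
        max_length=128,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

gr.Interface(fn=translate, inputs="text", outputs="text",
             title="English → Telugu (mBART)").launch()
```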
## References
- mBART Paper: https://arxiv.org/abs/2001.08210
- Hugging Face Transformers: https://huggingface.co/docs/transformers
## Acknowledgements
This project was developed as part of a hands-on learning workshop conducted by Vijender P at Alumnx AI Labs during the "GPU Hours" session on fine-tuning and deploying large AI models.
The deployment, debugging, Hugging Face integration, and demo Space were independently implemented by the author as a learning exercise.
## Author
T Mohamed Yaser
Computer Science Engineering student interested in NLP, ML deployment, and real-world AI applications.
## Contact
For questions, feedback, or collaboration:
- Hugging Face: https://huggingface.co/Yaser77