English → Telugu Translation (mBART)

This repository contains an English to Telugu neural machine translation model based on mBART-50, fine-tuned and deployed using the Hugging Face ecosystem.

The model translates English text (en) into Telugu (te) and is suitable for research, learning, and demo purposes.


🔎 Live Demo

You can test the model live using the Hugging Face Space below:

👉 https://huggingface.co/spaces/Yaser77/mbart-en-te-demo


🧠 Model Details

  • Model name: mbart-en-te
  • Base model: facebook/mbart-large-50-many-to-many-mmt
  • Task: Machine Translation (English → Telugu)
  • Framework: 🤗 Transformers
  • Model size: ~600M parameters
  • Precision: FP32
  • Tokenizer: SentencePiece (mBART tokenizer)

📌 How It Works

  • Source language: English (en_XX)
  • Target language: Telugu (te_IN)
  • The model uses forced BOS token decoding: the decoder's first generated token is forced to be the target-language code, which mBART-50 checkpoints require (see the sketch below)
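
The sketch below (using the base checkpoint's tokenizer, whose vocabulary the fine-tuned model shares) shows how these language codes behave: `en_XX` is prepended to the encoded source, and the id of `te_IN` is what later gets passed to `generate()` as `forced_bos_token_id`.

```python
from transformers import MBart50TokenizerFast

# Tokenizer of the base checkpoint; the fine-tuned model shares its vocabulary.
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

tokenizer.src_lang = "en_XX"  # the source language code is prepended to the input ids
te_id = tokenizer.convert_tokens_to_ids("te_IN")

print(tokenizer("Hello")["input_ids"])  # first id is the en_XX language token
print(te_id)                            # later passed to generate() as forced_bos_token_id
```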

🚀 Intended Uses

✅ Direct Use

  • Translating English text into Telugu
  • Educational demos and learning projects
  • NLP experimentation with Indic languages

🔄 Downstream Use

  • Can be fine-tuned further on domain-specific parallel corpora
  • Can be integrated into web apps or APIs for translation services
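
As an illustration of the second point, a minimal Gradio app could look like the sketch below. This is in the spirit of the linked demo Space, not its actual source; the function name and UI settings are illustrative.

```python
import gradio as gr
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model_id = "Yaser77/mbart-en-te"
tokenizer = MBart50TokenizerFast.from_pretrained(model_id)
model = MBartForConditionalGeneration.from_pretrained(model_id)

tokenizer.src_lang = "en_XX"
bos_id = tokenizer.convert_tokens_to_ids("te_IN")

def translate(text: str) -> str:
    # Encode, generate with the target-language token forced first, then decode.
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**inputs, forced_bos_token_id=bos_id, max_length=128)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

gr.Interface(fn=translate, inputs="text", outputs="text",
             title="English → Telugu (mBART)").launch()
```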

❌ Out-of-Scope Use

  • Not intended for real-time production systems without optimization
  • Not suitable for legal, medical, or safety-critical translations

⚠️ Limitations & Risks

  • Translation quality depends heavily on sentence complexity
  • May struggle with idioms, slang, or highly technical language
  • Biases present in the original training data may be reflected
  • Slower inference on CPU due to model size
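
On the last point, if a GPU is available, loading the weights in half precision is a common mitigation. This is a sketch, not part of the original deployment (which runs FP32 on CPU):

```python
import torch
from transformers import MBartForConditionalGeneration

# Half-precision weights roughly halve memory use and speed up GPU inference.
model = MBartForConditionalGeneration.from_pretrained(
    "Yaser77/mbart-en-te",
    torch_dtype=torch.float16,  # assumption: a CUDA GPU is available
).to("cuda")
```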

🧪 How to Use the Model (Code)

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model_id = "Yaser77/mbart-en-te"

# Load the fine-tuned checkpoint and its SentencePiece tokenizer.
tokenizer = MBart50TokenizerFast.from_pretrained(model_id)
model = MBartForConditionalGeneration.from_pretrained(model_id)

# mBART-50 language codes: English source, Telugu target.
tokenizer.src_lang = "en_XX"
forced_bos_token_id = tokenizer.lang_code_to_id["te_IN"]

text = "Hello, how are you?"

inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(
    **inputs,
    forced_bos_token_id=forced_bos_token_id,  # decoding starts with the te_IN token
    max_length=128,
)

print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```
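
The same `tokenizer`, `model`, and `forced_bos_token_id` can translate a batch of sentences; the `num_beams` value below is illustrative, not a tuned setting:

```python
def translate_batch(sentences, num_beams=4):
    # Pad the batch so sentences of different lengths share one tensor.
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=forced_bos_token_id,
        num_beams=num_beams,  # beam search; 1 = greedy decoding
        max_length=128,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(translate_batch(["Good morning!", "Where is the bus station?"]))
```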

📊 Training Details

Training Data

  • Based on the multilingual translation data used to train mBART-50
  • No custom dataset is claimed beyond the fine-tuning and experimentation described in this card

Training Method

  • Sequence-to-sequence Transformer
  • Teacher forcing during training
  • Cross-entropy loss
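
For concreteness, one training step could look like the sketch below (assumptions: the base checkpoint and a toy sentence pair; the actual fine-tuning script is not published in this card). Passing `labels` makes the model build its decoder inputs by shifting the labels right, i.e. teacher forcing, and return the token-level cross-entropy loss:

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt",
    src_lang="en_XX", tgt_lang="te_IN",
)
model = MBartForConditionalGeneration.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt"
)

batch = tokenizer(
    "Hello, how are you?",
    text_target="హలో, మీరు ఎలా ఉన్నారు?",  # toy reference Telugu translation
    return_tensors="pt",
)

# With `labels` present, the model applies teacher forcing internally and
# returns the cross-entropy loss over target tokens.
loss = model(**batch).loss
loss.backward()  # gradients for an optimizer step (e.g., AdamW)
```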

🧾 Evaluation

  • Qualitative evaluation using common English sentences
  • No official BLEU or automatic metrics reported
  • Intended mainly as a demo and learning project

🌍 Environmental Impact

  • Hardware: CPU (Hugging Face Spaces – Basic tier)
  • Compute Region: Managed by Hugging Face
  • Precision: FP32
  • Carbon Emissions: Not explicitly measured

🧩 Technical Specifications

  • Architecture: Transformer encoder–decoder (mBART)

  • Libraries:

    • transformers
    • torch
    • sentencepiece
    • gradio
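
A typical environment for the snippets in this card can be installed with (versions unpinned; any reasonably recent releases should work):

```bash
pip install transformers torch sentencepiece gradio
```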

📖 References

  • Tang et al., "Multilingual Translation with Extensible Multilingual Pretraining and Finetuning" (arXiv:2008.00401), the mBART-50 paper
  • Base model card: https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt

Acknowledgements

This project was developed as part of a hands-on learning workshop conducted by Vijender P at Alumnx AI Labs during the "GPU Hours" session on fine-tuning and deploying large AI models.

The deployment, debugging, Hugging Face integration, and demo Space were independently implemented by the author as a learning exercise.


👤 Author

T Mohamed Yaser
Computer Science Engineering student
Interested in NLP, ML deployment, and real-world AI applications


📬 Contact

For questions, feedback, or collaboration:
