English → Telugu Translation (mBART)

This repository contains an English to Telugu neural machine translation model based on mBART-50, fine-tuned and deployed using the Hugging Face ecosystem.

The model translates English text (en) into Telugu (te) and is suitable for research, learning, and demo purposes.


🔎 Live Demo

You can test the model live using the Hugging Face Space below:

👉 https://huggingface.co/spaces/Yaser77/mbart-en-te-demo


🧠 Model Details

  • Model name: mbart-en-te
  • Base model: facebook/mbart-large-50-many-to-many-mmt
  • Task: Machine Translation (English → Telugu)
  • Framework: 🤗 Transformers
  • Model size: ~600M parameters
  • Precision: FP32
  • Tokenizer: SentencePiece (mBART tokenizer)

📌 How It Works

  • Source language: English (en_XX)
  • Target language: Telugu (te_IN)
  • The model uses forced BOS token decoding: the decoder's first generated token is forced to be the target-language code, which mBART-50 checkpoints require (see the sketch below)
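
The sketch below (using the base checkpoint's tokenizer, whose vocabulary the fine-tuned model shares) shows how these language codes behave: `en_XX` is prepended to the encoded source, and the id of `te_IN` is what later gets passed to `generate()` as `forced_bos_token_id`.

```python
from transformers import MBart50TokenizerFast

# Tokenizer of the base checkpoint; the fine-tuned model shares its vocabulary.
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

tokenizer.src_lang = "en_XX"  # the source language code is prepended to the input ids
te_id = tokenizer.convert_tokens_to_ids("te_IN")

print(tokenizer("Hello")["input_ids"])  # first id is the en_XX language token
print(te_id)                            # later passed to generate() as forced_bos_token_id
```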

🚀 Intended Uses

✅ Direct Use

  • Translating English text into Telugu
  • Educational demos and learning projects
  • NLP experimentation with Indic languages

🔄 Downstream Use

  • Can be fine-tuned further on domain-specific parallel corpora
  • Can be integrated into web apps or APIs for translation services
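
As an illustration of the second point, a minimal Gradio app could look like the sketch below. This is in the spirit of the linked demo Space, not its actual source; the function name and UI settings are illustrative.

```python
import gradio as gr
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model_id = "Yaser77/mbart-en-te"
tokenizer = MBart50TokenizerFast.from_pretrained(model_id)
model = MBartForConditionalGeneration.from_pretrained(model_id)

tokenizer.src_lang = "en_XX"
bos_id = tokenizer.convert_tokens_to_ids("te_IN")

def translate(text: str) -> str:
    # Encode, generate with the target-language token forced first, then decode.
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**inputs, forced_bos_token_id=bos_id, max_length=128)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

gr.Interface(fn=translate, inputs="text", outputs="text",
             title="English → Telugu (mBART)").launch()
```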

❌ Out-of-Scope Use

  • Not intended for real-time production systems without optimization
  • Not suitable for legal, medical, or safety-critical translations

⚠️ Limitations & Risks

  • Translation quality depends heavily on sentence complexity
  • May struggle with idioms, slang, or highly technical language
  • Biases present in the original training data may be reflected
  • Slower inference on CPU due to model size
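
On the last point, if a GPU is available, loading the weights in half precision is a common mitigation. This is a sketch, not part of the original deployment (which runs FP32 on CPU):

```python
import torch
from transformers import MBartForConditionalGeneration

# Half-precision weights roughly halve memory use and speed up GPU inference.
model = MBartForConditionalGeneration.from_pretrained(
    "Yaser77/mbart-en-te",
    torch_dtype=torch.float16,  # assumption: a CUDA GPU is available
).to("cuda")
```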

🧪 How to Use the Model (Code)

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model_id = "Yaser77/mbart-en-te"

# Load the fine-tuned checkpoint and its SentencePiece tokenizer.
tokenizer = MBart50TokenizerFast.from_pretrained(model_id)
model = MBartForConditionalGeneration.from_pretrained(model_id)

# mBART-50 language codes: English source, Telugu target.
tokenizer.src_lang = "en_XX"
forced_bos_token_id = tokenizer.lang_code_to_id["te_IN"]

text = "Hello, how are you?"

inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(
    **inputs,
    forced_bos_token_id=forced_bos_token_id,  # decoding starts with the te_IN token
    max_length=128,
)

print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```
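
The same `tokenizer`, `model`, and `forced_bos_token_id` can translate a batch of sentences; the `num_beams` value below is illustrative, not a tuned setting:

```python
def translate_batch(sentences, num_beams=4):
    # Pad the batch so sentences of different lengths share one tensor.
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=forced_bos_token_id,
        num_beams=num_beams,  # beam search; 1 = greedy decoding
        max_length=128,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(translate_batch(["Good morning!", "Where is the bus station?"]))
```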

📊 Training Details

Training Data

  • Based on the multilingual translation data used to train mBART-50
  • No custom dataset is claimed beyond the fine-tuning and experimentation described in this card

Training Method

  • Sequence-to-sequence Transformer
  • Teacher forcing during training
  • Cross-entropy loss
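
For concreteness, one training step could look like the sketch below (assumptions: the base checkpoint and a toy sentence pair; the actual fine-tuning script is not published in this card). Passing `labels` makes the model build its decoder inputs by shifting the labels right, i.e. teacher forcing, and return the token-level cross-entropy loss:

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt",
    src_lang="en_XX", tgt_lang="te_IN",
)
model = MBartForConditionalGeneration.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt"
)

batch = tokenizer(
    "Hello, how are you?",
    text_target="హలో, మీరు ఎలా ఉన్నారు?",  # toy reference Telugu translation
    return_tensors="pt",
)

# With `labels` present, the model applies teacher forcing internally and
# returns the cross-entropy loss over target tokens.
loss = model(**batch).loss
loss.backward()  # gradients for an optimizer step (e.g., AdamW)
```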

🧾 Evaluation

  • Qualitative evaluation using common English sentences
  • No official BLEU or automatic metrics reported
  • Intended mainly as a demo and learning project

🌍 Environmental Impact

  • Hardware: CPU (Hugging Face Spaces – Basic tier)
  • Compute Region: Managed by Hugging Face
  • Precision: FP32
  • Carbon Emissions: Not explicitly measured

🧩 Technical Specifications

  • Architecture: Transformer encoder–decoder (mBART)

  • Libraries:

    • transformers
    • torch
    • sentencepiece
    • gradio
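
A typical environment for the snippets in this card can be installed with (versions unpinned; any reasonably recent releases should work):

```bash
pip install transformers torch sentencepiece gradio
```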

📖 References

  • Tang et al., "Multilingual Translation with Extensible Multilingual Pretraining and Finetuning" (arXiv:2008.00401), the mBART-50 paper
  • Base model card: https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt

Acknowledgements

This project was developed as part of a hands-on learning workshop conducted by Vijender P at Alumnx AI Labs during the "GPU Hours" session on fine-tuning and deploying large AI models.

The deployment, debugging, Hugging Face integration, and demo Space were independently implemented by the author as a learning exercise.


👤 Author

T Mohamed Yaser
Computer Science Engineering student
Interested in NLP, ML deployment, and real-world AI applications


📬 Contact

For questions, feedback, or collaboration:
